Metadata-Version: 2.1
Name: pyarrow-ops
Version: 0.0.1
Summary: Useful data crunching tools for Apache Arrow in Python
Home-page: https://github.com/TomScheffers/pyarrow_ops
Author: Tom Scheffers
Author-email: tom@youngbulls.nl 
License: APACHE
Download-URL: https://pypi.org/project/pyarrow-ops/
Keywords: arrow,pyarrow,data
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: pyarrow (>=3.0.0)
Requires-Dist: numpy (>=1.16.6)

# Pyarrow ops
Pyarrow ops is a Python library for data crunching operations directly on the pyarrow.Table class, using only numpy. For convenience, function naming and behavior try to replicate those of the Pandas API. Performance is currently on par with pandas, but it could be improved significantly by using pyarrow.compute functions or better numpy algorithms.

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install pyarrow_ops.

```bash
pip install pyarrow_ops
```

## Usage
See test_func.py for a full, runnable example.

```python
import pyarrow as pa 
from pyarrow_ops import join, filters, groupby, head, drop_duplicates

# Create pyarrow.Table
t = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.]
})
head(t) # Use head to print, like df.head()

# Drop duplicates based on column values
d = drop_duplicates(t, on=['Animal'], keep='first')

# Groupby iterable
for key, value in groupby(t, ['Animal']):
    print(key)
    head(value)

# Group by aggregate functions
g = groupby(t, ['Animal']).median()
g = groupby(t, ['Animal']).sum()
g = groupby(t, ['Animal']).min()
g = groupby(t, ['Animal']).agg({'Max Speed': 'max'})

# Group by window functions

# Filter with predicates given as a tuple, or a list of tuples, of the form (column, operation, value)
f = filters(t, ('Animal', '=', 'Falcon'))
f = filters(t, [('Animal', 'not in', ['Falcon', 'Duck']), ('Max Speed', '<', 25)])

# Join operations (currently performs inner join)
t2 = pa.Table.from_pydict({
    'Animal': ['Falcon', 'Parrot'],
    'Age': [10, 20]
})
j = join(t, t2, on=['Animal'])
```
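
To make the expected result of the join above concrete, here is a minimal plain-Python sketch of inner-join semantics on the same data. It is not the pyarrow_ops implementation: `inner_join` is a hypothetical helper that works on dicts of lists, and the column order of the library's actual output may differ.

```python
# Illustration only: inner-join semantics on dict-of-lists "tables".
# `inner_join` is a hypothetical helper, not part of pyarrow_ops.
from collections import defaultdict


def inner_join(left, right, on):
    """Keep only rows whose `on` key appears in both tables."""
    # Index the right table: key tuple -> positions of matching rows
    index = defaultdict(list)
    n_right = len(next(iter(right.values())))
    for j in range(n_right):
        index[tuple(right[c][j] for c in on)].append(j)

    extra = [c for c in right if c not in on]  # columns only in `right`
    out = {c: [] for c in list(left) + extra}
    n_left = len(next(iter(left.values())))
    for i in range(n_left):
        key = tuple(left[c][i] for c in on)
        for j in index.get(key, []):  # left rows without a match are dropped
            for c in left:
                out[c].append(left[c][i])
            for c in extra:
                out[c].append(right[c][j])
    return out


left = {
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26., 24.],
}
right = {'Animal': ['Falcon', 'Parrot'], 'Age': [10, 20]}

joined = inner_join(left, right, on=['Animal'])
print(joined['Age'])  # [10, 10, 20, 20, 20]
```

Every row of the left table finds exactly one match here, so the result has five rows, each carrying the `Age` of its animal.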

## Relation to pyarrow
In the future many of these functions might be made obsolete by enhancements in the pyarrow package, but for now this library is a convenient alternative to switching back and forth between pyarrow and pandas.

## Contributing
Pull requests are very welcome. However, I believe in getting 80% of the utility from 20% of the code, and I personally get lost in the trenches of the pandas source code. If you would like to seriously improve this work, please let me know!

