Metadata-Version: 2.1
Name: data-toolset
Version: 0.1.2
Summary: 
Home-page: https://github.com/luminousmen/data-toolset
License: MIT
Author: Kirill Bobrov
Author-email: miaplanedo@gmail.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: arrow (>=1.2.3,<2.0.0)
Requires-Dist: cython (>=3.0.2,<4.0.0)
Requires-Dist: duckdb (>=0.8.1,<0.9.0)
Requires-Dist: fastavro (>=1.7.3,<2.0.0)
Requires-Dist: pandas (>=2.0.3,<3.0.0)
Requires-Dist: pyarrow (>=13.0.0,<14.0.0)
Requires-Dist: python-snappy (>=0.6.1,<0.7.0)
Requires-Dist: tox (>=4.11.1,<5.0.0)
Project-URL: Repository, https://github.com/luminousmen/data-toolset
Description-Content-Type: text/markdown

[![Master](https://github.com/luminousmen/data-toolset/actions/workflows/master.yml/badge.svg?branch=master)](https://github.com/luminousmen/data-toolset/actions/workflows/master.yml)
[![codecov](https://codecov.io/gh/luminousmen/data-toolset/branch/master/graph/badge.svg?token=6V9IPSRCB0)](https://codecov.io/gh/luminousmen/data-toolset)

# data-toolset

data-toolset aims to simplify data processing by offering a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. This Python package handles data file formats, including Avro and Parquet, through a simple command-line interface.

## Installation

Python 3.9 and 3.10 are supported and partially tested.

```bash
pip install --user data-toolset
```

## Usage

```bash
$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv} ...

positional arguments:
  {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv}
                        commands
    head                Print the first N records from a file
    tail                Print the last N records from a file
    meta                Print a file's metadata
    schema              Print the Avro schema for a file
    stats               Print statistics about a file
    query               Query a file
    validate            Validate a file
    merge               Merge multiple files into one
    count               Count the number of records in a file
    to_json             Convert a file to JSON format
    to_csv              Convert a file to CSV format

optional arguments:
  -h, --help            show this help message and exit
```

## Examples

Print the first 10 records of a Parquet file:

```bash
data-toolset head my_data.parquet -n 10
```

Query a Parquet file using a SQL-like expression:

```bash
data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE age > 25"
```

Merge multiple Avro files into one:

```bash
data-toolset merge file1.avro file2.avro file3.avro merged_file.avro
```
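
The same three commands can also be scripted from Python. The following is a minimal sketch, not part of data-toolset itself: the helper names (`head_cmd`, `query_cmd`, `merge_cmd`, `run`) are hypothetical, and the argument order is assumed to match the examples above exactly. It uses only the standard library and requires the `data-toolset` executable to be on `PATH` when `run` is called.

```python
import shlex
import shutil
import subprocess
from typing import List, Sequence

# Hypothetical helpers: they only assemble the argument lists shown in the
# examples above; no other flags or subcommand options are assumed.


def head_cmd(path: str, n: int = 10) -> List[str]:
    """Arguments for: data-toolset head <path> -n <n>"""
    return ["data-toolset", "head", path, "-n", str(n)]


def query_cmd(path: str, sql: str) -> List[str]:
    """Arguments for: data-toolset query <path> "<sql>" """
    return ["data-toolset", "query", path, sql]


def merge_cmd(inputs: Sequence[str], output: str) -> List[str]:
    """Arguments for: data-toolset merge <input>... <output>"""
    if not inputs:
        raise ValueError("merge needs at least one input file")
    return ["data-toolset", "merge", *inputs, output]


def run(cmd: List[str]) -> str:
    """Run the command and return its standard output as text."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH; install it with pip first")
    # check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Show the shell command a helper would execute, without running it.
    print(shlex.join(head_cmd("my_data.parquet", 10)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with the SQL text given to `query`.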

## Contributing

Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.

## TODO

- [ ] proper online documentation
- [ ] update README
- [ ] add tests for merge
- [ ] create random_sample function
- [ ] create schema_evolution function
- [ ] mature create_sample function
- [ ] optimizations TBD
- [ ] support 3.11+
