Metadata-Version: 2.2
Name: data-science-toolkit
Version: 0.1.67
Summary: Data Science Toolkit (DST) is a Python library that helps implement data science projects with ease.
Home-page: https://github.com/elhachimi-ch/dst
Author: EL HACHIMI CHOUAIB
Author-email: elhachimi.ch@gmail.com
Project-URL: Bug Tracker, https://github.com/elhachimi-ch/dst/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.5
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: setuptools>=42
Requires-Dist: wheel
Requires-Dist: pandas
Requires-Dist: numpy==1.25
Requires-Dist: tensorflow
Requires-Dist: scikit-learn
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: wordcloud
Requires-Dist: keras
Requires-Dist: plotly
Requires-Dist: xgboost
Requires-Dist: opencv-python
Requires-Dist: scikit-image
Requires-Dist: unidecode
Requires-Dist: emoji>=1.7
Requires-Dist: textblob
Requires-Dist: nltk
Requires-Dist: optuna
Requires-Dist: xarray
Requires-Dist: openpyxl
Requires-Dist: catboost
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Data Science Toolkit (DST)

[![Docs](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://data-science-toolkit.readthedocs.io)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Reproducible Capsule](https://img.shields.io/static/v1?label=&message=code+ocean&color=blue)](https://codeocean.com/capsule/1309232/tree)

Data Science Toolkit (DST) is a Python library that helps implement data science projects with ease: from data ingestion and preprocessing to modeling, geospatial analysis, computer vision, text vectorization, and reinforcement learning.

It bundles practical, production-friendly utilities and higher-level abstractions so you can move faster while keeping control over the details.

## Key Features

- **Data handling**: `DataFrame` for loading CSV/JSON/Excel/Parquet, cleaning, transforming, and streaming large datasets.
- **Modeling**: `Model` for traditional ML and deep learning training, cross-validation, metrics, and GPU helpers.
- **Text & NLP**: `Vectorizer` for bag-of-words/TF-IDF, tokenization, cosine similarity, and projections.
- **Charts**: `Chart` utilities for quick exploratory visuals with Matplotlib/Seaborn/Plotly.
- **GIS**: `GIS` for geospatial data layers, joins, CRS transforms, area/perimeter, and exports.
- **Computer Vision**: `ImageFactory` for resizing, cropping, contour detection, blending, and basic filters.
- **Reinforcement Learning**: `Environment` and `R3` tools to explore policies and custom environments.
- **Crop Simulation**: `CSM` modules for crop water requirement, ET simulations, and monitoring pipelines.
- **Utilities**: `Lib` with climate, math, text processing, IO helpers, and more.

## Installation

DST is published as `data-science-toolkit`.

```bash
pip install data-science-toolkit
```

If you’re installing from source (for development):

```bash
git clone https://github.com/elhachimi-ch/dst.git
cd dst
pip install -e .
```

Notes:
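- The package metadata declares Python 3.5+, but the pinned `numpy==1.25` dependency requires Python 3.9 or newer, so use Python 3.9+ in practice.
- Some features (e.g., deep learning, GIS, CV) pull in heavy dependencies (TensorFlow, CatBoost, OpenCV, a geospatial stack), so installation can be slow.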
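
Because the heavier dependencies can fail to install on some platforms, it can help to check what is importable before running an example. The sketch below uses only the standard library. The module names are the usual import names for some packages in the dependency list (an assumption, since import names can differ from distribution names), and `check_optional_modules` is a hypothetical helper, not part of DST.

```python
import importlib.util
import sys


def check_optional_modules(module_names):
    """Return {module_name: True/False} depending on whether each module can be found."""
    status = {}
    for name in module_names:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    # Assumed import names for some of the heavier dependencies
    heavy = ["tensorflow", "cv2", "catboost", "xgboost"]
    for name, found in check_optional_modules(heavy).items():
        print(f"{name:12s} {'ok' if found else 'missing'}")
```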

## Quickstart

```python
from data_science_toolkit.dataframe import DataFrame
from data_science_toolkit.model import Model

# Load a toy dataset
data = DataFrame()
data.load_dataset('iris')
y = data.get_column('target')
data.drop_column('target')

# Fit a decision tree
model = Model(data_x=data.get_dataframe(), data_y=y, model_type='dt', training_percent=0.8)
model.train()
model.report()          # classification metrics
model.cross_validation(5)
```
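
To make the two evaluation settings above concrete: `training_percent=0.8` presumably holds out 20% of the rows for testing, and `cross_validation(5)` evaluates on five folds. The sketch below shows the index bookkeeping behind both ideas using only the standard library. It illustrates the general technique, not DST's internals: `holdout_split` and `k_fold_indices` are hypothetical names, and whether DST shuffles or stratifies is not stated here.

```python
import random


def holdout_split(n_rows, training_percent=0.8, seed=0):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * training_percent)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_rows, k=5):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    indices = list(range(n_rows))
    # Distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


train, test = holdout_split(150)  # the iris dataset has 150 rows
print(len(train), len(test))
for train_idx, val_idx in k_fold_indices(150, k=5):
    print(len(train_idx), len(val_idx))
```

In the k-fold case every row is used for validation exactly once, which is why the averaged score tends to be less sensitive to one lucky or unlucky split than a single holdout.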

### Work with Parquet (large data)

```python
from data_science_toolkit.dataframe import DataFrame

# Stream a Parquet dataset efficiently
df = DataFrame(data_path="path/to/parquet/dir", data_type="parquet", n_workers="auto")
summary = df.describe()  # computes per-column stats without loading entire data into RAM
print(summary)
```
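
Computing summary statistics without holding the full dataset in memory usually relies on single-pass accumulators that are updated chunk by chunk. The sketch below shows one such accumulator (Welford's algorithm for mean and variance) in plain Python. It illustrates the general idea only: how DST's `describe()` is implemented is not documented here, and `RunningStats` is a hypothetical name.

```python
import math


class RunningStats:
    """Single-pass count, mean, sample standard deviation, min and max for one column."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def update(self, chunk):
        """Fold one chunk of values into the running statistics."""
        for value in chunk:
            self.count += 1
            delta = value - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (value - self.mean)
            self.min = min(self.min, value)
            self.max = max(self.max, value)

    @property
    def std(self):
        """Sample standard deviation (n - 1 in the denominator)."""
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else float("nan")


stats = RunningStats()
# Each list stands in for one chunk read from disk
for chunk in ([1.0, 2.0, 3.0], [4.0, 5.0]):
    stats.update(chunk)
print(stats.count, stats.mean, stats.std, stats.min, stats.max)
```

Only a handful of numbers per column are kept between chunks, so memory use stays constant regardless of how many rows are processed.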

### Text Vectorization

```python
from data_science_toolkit.vectorizer import Vectorizer

documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]

vec = Vectorizer(documents_as_list=documents, vectorizer_type='tfidf', ngram_tuple=(1,2))
matrix = vec.get_matrix()
features = vec.get_features_names()
print(len(features), features[:10])
```
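
For intuition about what a TF-IDF matrix and cosine similarity contain, the sketch below computes both by hand with the standard library. It uses unigrams, raw term counts, and one common smoothed IDF variant, `ln((1 + n) / (1 + df)) + 1`. These choices are assumptions for illustration; the weighting and tokenization that DST's `Vectorizer` applies may differ, and `tfidf_vectors` and `cosine_similarity` are hypothetical helpers.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using raw term counts and a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = [
    "data science is fun",
    "toolkits help data workflows",
    "science advances with good tools",
]
vocabulary, vectors = tfidf_vectors(documents)
print(len(vocabulary))
print(round(cosine_similarity(vectors[0], vectors[1]), 3))
```

Documents that share no terms, such as the second and third here, have a similarity of exactly zero, while rarer shared terms contribute more than common ones.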

### Geospatial Utilities

```python
from data_science_toolkit.gis import GIS

gis = GIS()
gis.add_data_layer("parcels", "data/parcels.geojson", data_type="sf")
gis.add_area_column("parcels", unit="ha")
gis.to_crs("parcels", epsg="3857")
gis.export("parcels", "out/parcels_3857", file_format="geojson")
```
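
EPSG:3857 is the Web Mercator projection used by most web maps. As a rough illustration of what reprojecting to it involves, the sketch below implements the forward transform from longitude/latitude in degrees (EPSG:4326) with the standard library, along with the latitude-dependent scale factor. It is not DST's code, and `lonlat_to_web_mercator` and `linear_scale_factor` are hypothetical helpers. The scale factor shows why areas measured in EPSG:3857 are inflated away from the equator; which CRS `add_area_column` uses internally is not stated here.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 semi-major axis, used as the sphere radius by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project WGS 84 longitude/latitude in degrees to EPSG:3857 x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def linear_scale_factor(lat_deg):
    """Factor by which Web Mercator stretches distances at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


# An arbitrary point, chosen only for illustration
x, y = lonlat_to_web_mercator(-7.99, 31.63)
print(round(x), round(y))
# Areas are inflated by the square of the linear scale factor
print(round(linear_scale_factor(31.63) ** 2, 3))
```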

### Computer Vision Helpers

```python
from data_science_toolkit.imagefactory import ImageFactory

img = ImageFactory("data/sample.jpg")
img.to_gray_scale()
img.gaussian_blur((5,5))
img.save("out/processed.jpg")
```
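
A Gaussian blur replaces each pixel with a weighted average of its neighbours, with weights taken from a 2-D Gaussian. The sketch below builds such a kernel for the `(5, 5)` size used above in plain Python. The `sigma` value is an assumption for illustration; the value DST or its image backend actually uses is not specified here, and `gaussian_kernel` is a hypothetical helper.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a size x size list of Gaussian weights normalised to sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a centre pixel")
    half = size // 2
    weights = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


kernel = gaussian_kernel(5, sigma=1.0)
for row in kernel:
    print(" ".join(f"{w:.4f}" for w in row))
```

Normalising the weights to sum to 1 keeps the overall image brightness unchanged after blurring.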

## Documentation

Full API docs and tutorials live at: https://data-science-toolkit.readthedocs.io

## Contributing

Contributions and suggestions are welcome via GitHub pull requests.

Typical workflow:
- Fork the repo and create a feature branch.
- Install the package in editable mode: `pip install -e .`.
- Add tests or notebook snippets where relevant.
- Open a PR with a clear description and examples.

## Maintainership

We’re actively enhancing the repo with new algorithms and utilities. Feedback on priorities is appreciated.

## License

MIT License. See the LICENSE file for details.

## Citation

If you use DST in academic work, please cite the repository and (optionally) reference the Code Ocean capsule for reproducibility: https://codeocean.com/capsule/1309232/tree

