Metadata-Version: 2.4
Name: dalla-data-processing
Version: 0.0.3
Summary: data processing pipeline with deduplication, stemming, quality checking, and readability scoring, used for the DALLA Models
Author-email: Hadi Hamoud <hhamoud@dohainstitute.edu.qa>, Digital Research Unit - Arab Center <dru@dohainstitute.edu.qa>
Project-URL: Homepage, https://github.com/U4RASD/dalla-data-processing
Project-URL: Documentation, https://github.com/U4RASD/dalla-data-processing#readme
Project-URL: Repository, https://github.com/U4RASD/dalla-data-processing
Project-URL: Bug Tracker, https://github.com/U4RASD/dalla-data-processing/issues
Keywords: arabic,nlp,data-processing,deduplication,stemming,readability,quality
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.13,>=3.12
Description-Content-Type: text/markdown
Requires-Dist: datasets>=2.14.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: click>=8.0.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pyarrow>=12.0.0
Requires-Dist: structlog>=24.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: dedup
Requires-Dist: camel-tools==1.5.7; extra == "dedup"
Provides-Extra: dedup-native
Requires-Dist: cffi>=1.15.0; extra == "dedup-native"
Provides-Extra: stem
Requires-Dist: camel-tools==1.5.7; extra == "stem"
Provides-Extra: quality
Requires-Dist: camel-tools==1.5.7; extra == "quality"
Provides-Extra: readability
Requires-Dist: textstat>=0.7.0; extra == "readability"
Provides-Extra: pack
Requires-Dist: sentencepiece>=0.2.0; extra == "pack"
Requires-Dist: pyyaml; extra == "pack"
Provides-Extra: all
Requires-Dist: dalla-data-processing[dedup,dedup-native,dev,pack,quality,readability,stem]; extra == "all"

# Dalla Data Processing (dalla-dp)

A comprehensive Arabic data processing pipeline with deduplication, stemming, quality checking, and readability scoring, used for the DALLA Models.

## Compatibility

- **Linux**: Fully supported
- **macOS**: Fully supported (Intel or through Rosetta)
- **Windows**: Supported through WSL (Windows Subsystem for Linux) only. On native Windows, deduplication can be built manually from source.

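As a rough pre-install check, the compatibility notes above can be encoded in a few lines of Python. This is a minimal sketch and not part of the package: `check_environment` is a hypothetical helper, the Python 3.12 requirement is taken from the package metadata (`Requires-Python: <3.13,>=3.12`), and treating a native arm64 interpreter on macOS as unsupported is an assumption based on the "Intel or through Rosetta" note.

```python
import platform
import sys


def check_environment(system=None, machine=None, version=None):
    """Return a list of compatibility problems (empty if none were found).

    Hypothetical helper: arguments default to the running interpreter and
    can be overridden to inspect another configuration.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    version = sys.version_info[:2] if version is None else tuple(version)[:2]

    problems = []

    # Package metadata declares Requires-Python: <3.13,>=3.12
    if version != (3, 12):
        problems.append(f"Python 3.12 is required, found {version[0]}.{version[1]}")

    if system == "Windows":
        # WSL reports "Linux", so this branch only matches native Windows.
        problems.append("native Windows: use WSL, or build deduplication from source")
    elif system == "Darwin" and machine == "arm64":
        # Assumption: a native Apple Silicon interpreter is not covered.
        problems.append("macOS arm64: run an Intel (x86_64) Python through Rosetta")

    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("OK" if not issues else "\n".join(issues))
```

An empty list means none of the documented restrictions were detected; it does not guarantee that every optional dependency will install.
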
## Installation

### Quick Start (All Features)

For most users, install with all features enabled:

<b>Using uv</b>

```bash
uv pip install "dalla-data-processing[all]"
```

<b>Using pip</b>

```bash
pip install "dalla-data-processing[all]"
```

### Modular Installation (Advanced)

Install only the components you need to keep dependencies minimal:

```bash
# Base installation (no processing features, only core dependencies)
pip install dalla-data-processing

# Install specific features
pip install "dalla-data-processing[dedup]"        # Deduplication only
pip install "dalla-data-processing[stem]"         # Stemming only
pip install "dalla-data-processing[quality]"      # Quality checking only
pip install "dalla-data-processing[readability]"  # Readability scoring only
pip install "dalla-data-processing[pack]"         # Dataset packing only

# Combine multiple features
pip install "dalla-data-processing[dedup,stem,quality]"
```
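
After a modular install, it can be useful to confirm which extras are actually usable in the current environment. The sketch below is not part of the package: `EXTRA_MODULES` and `available_extras` are hypothetical names, and the mapping is derived from the dependencies declared in the package metadata (for example, `pyyaml` is imported as `yaml`). It only checks that the third-party modules can be found, not that their versions match.

```python
from importlib.util import find_spec

# Extra name -> top-level import names of the dependencies it declares.
# Assumption: derived from the Requires-Dist entries in the package metadata.
EXTRA_MODULES = {
    "dedup": ["camel_tools"],
    "dedup-native": ["cffi"],
    "stem": ["camel_tools"],
    "quality": ["camel_tools"],
    "readability": ["textstat"],
    "pack": ["sentencepiece", "yaml"],
}


def available_extras(extra_modules=None):
    """Map each extra to True if all of its modules can be located."""
    if extra_modules is None:
        extra_modules = EXTRA_MODULES
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        status = "installed" if ok else "missing"
        print(f"[{extra}] {status}")
```

Any extra reported as missing can be added with `pip install "dalla-data-processing[<extra>]"`.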

### Development Installation

<b>From Source (with uv - recommended)</b>

```bash
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Install all features and dev dependencies
uv sync --all-extras

# Or install with specific extras only
uv sync --extra dedup --extra stem
```

<b>From Source (with pip)</b>

```bash
git clone https://github.com/U4RASD/dalla-data-processing.git
cd dalla-data-processing

# Install with all features for development
pip install -e ".[all,dev]"
```

## Components

> **Note:** Each component requires its corresponding extra to be installed. Install with `[all]` to enable all features, or see [Modular Installation](#modular-installation-advanced) to install only what you need.

### 1. [Deduplication](dalla_data_processing/deduplication/README.md)
Detect and remove duplicate or near-duplicate documents from your datasets using the Onion algorithm.
- **Requires:** `[dedup]` extra

### 2. [Stemming](dalla_data_processing/stemming/README.md)
Apply morphological analysis and stemming using CAMeL Tools.
- **Requires:** `[stem]` extra

### 3. [Quality Checking](dalla_data_processing/quality/README.md)
Check text quality using morphological analysis to detect errors and foreign words.
- **Requires:** `[quality]` extra

### 4. [Readability Scoring](dalla_data_processing/readability/README.md)
Calculate readability scores using the Flesch Reading Ease and Osman methods.
Also includes ranking according to both scores.
- **Requires:** `[readability]` extra

### 5. [Dataset Packing](dalla_data_processing/packing/README.md)
Pack and prepare datasets for training.
- **Requires:** `[pack]` extra

## Links

- Homepage: https://github.com/U4RASD/dalla-data-processing
- Issues: https://github.com/U4RASD/dalla-data-processing/issues
- Documentation: https://github.com/U4RASD/dalla-data-processing#readme
