Metadata-Version: 2.3
Name: endofactory
Version: 0.1.2
Summary: Revolutionary EndoVQA dataset construction tool for rapid dataset mixing and configuration
License: MIT
Keywords: endoscopy,medical,VQA,dataset,polars,yaml,cli
Author: TiramisuQiao
Author-email: tlmsq@outlook.com
Requires-Python: >=3.9
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Provides-Extra: tests
Requires-Dist: pillow (>=10.0.0)
Requires-Dist: polars (>=0.20.0)
Requires-Dist: pydantic (>=2.0.0)
Requires-Dist: pytest (>=7.0) ; extra == "tests"
Requires-Dist: pytest (>=8.4.1,<9.0.0)
Requires-Dist: pytest-benchmark (>=4.0) ; extra == "tests"
Requires-Dist: pyyaml (>=6.0)
Requires-Dist: rich (>=13.0.0)
Requires-Dist: typer (>=0.9.0)
Requires-Dist: uuid (>=1.30)
Project-URL: Homepage, https://github.com/your-org/EndoFactory
Project-URL: Issues, https://github.com/your-org/EndoFactory/issues
Project-URL: Repository, https://github.com/your-org/EndoFactory
Description-Content-Type: text/markdown

# EndoFactory 🏭

<p align="center">
  <img src="assets/logo.png" alt="EndoFactory Logo" width="220" />
</p>

Revolutionary tool for constructing EndoVQA datasets through YAML configuration.

## Quick Start

### 1. Installation
If you just want the cli, pip it!
```bash
pip install endofactory -i https://pypi.org/simple
```
Or if you want to be contributor,
```bash
git clone <repository-url>
cd EndoFactory
poetry install
```

### 2. Generate Test Data
```bash
poetry run python tests/test_data_generator.py
```

### 3. Create Configuration
```bash
poetry run python -m endofactory.cli create-config --output config.yaml
```

### 4. Edit Configuration
```yaml
datasets:
  - name: endoscopy_vqa_v1
    image_path: /path/to/images
    parquet_path: /path/to/metadata.parquet
    weight: 0.6
  - name: medical_vqa_v2
    image_path: /path/to/images2
    parquet_path: /path/to/metadata2.parquet
    weight: 0.4

columns:
  - uuid
  - question
  - answer
  - options
  - task
  - category

task_proportions:
  task_proportions:
    classification: 0.5
    detection: 0.3
    segmentation: 0.2

export:
  output_path: /path/to/output
  format: parquet
```

### 5. Build Dataset
```bash
poetry run python -m endofactory.cli build config.yaml --verbose
```

### 6. View Results
```bash
poetry run python -m endofactory.cli view output/endovqa_dataset.parquet
```

## CLI Commands

### `create-config`
Generate example configuration file
```bash
endofactory create-config [--output CONFIG_PATH]
```

### `build`
Build mixed dataset from configuration
```bash
endofactory build CONFIG_PATH [--verbose]
```

### `stats`
Show dataset statistics
```bash
endofactory stats CONFIG_PATH
```

### `view`
Visualize parquet file structure and data
```bash
endofactory view PARQUET_FILE [--rows N] [--columns]
```

## Configuration Options

### Dataset Weights
Control proportion of each dataset in final mix:
```yaml
datasets:
  - name: dataset_a
    weight: 0.7  # 70% from dataset_a
  - name: dataset_b  
    weight: 0.3  # 30% from dataset_b
```

### Task Proportions
Control distribution of different task types:
```yaml
task_proportions:
  task_proportions:
    classification: 0.4
    detection: 0.4
    segmentation: 0.2
  subtask_proportions:
    classification:
      organ_classification: 0.6
      disease_classification: 0.4
```

### Global Columns
Specify columns to extract (missing columns filled with null):
```yaml
columns:
  - uuid
  - question
  - answer
  - task
```

## Features

- **🚀 Fast Dataset Mixing**: YAML-based configuration for dataset blending
- **📊 Task Proportion Control**: Precise control over task/subtask distribution
- **💾 Multiple Export Formats**: Support for Parquet and JSONL output
- **🔍 Data Visualization**: One-command parquet file inspection
- **🛡️ Schema Flexibility**: Automatic handling of different column structures
- **⚡ High Performance**: Polars-powered data processing
- **🎯 Reproducible**: Configurable random seeds

## Project Structure

```
EndoFactory/
├── src/endofactory/
│   ├── __init__.py
│   ├── cli.py
│   ├── config.py
│   ├── core.py
│   └── yaml_loader.py
├── tests/
├── assets/
├── example_config.yaml
└── pyproject.toml
```

## Requirements

- Python >= 3.9
- Poetry (recommended) or pip

## License

MIT License


