Metadata-Version: 2.4
Name: scipreprocess
Version: 0.1.0
Summary: A modular pipeline for preprocessing scientific documents (PDF, DOCX, TEX, XML, TXT)
Author-email: Tarikul Islam Anik <contact@oxyzen.dev>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Tarikul-Islam-Anik/scipreprocess
Project-URL: Repository, https://github.com/Tarikul-Islam-Anik/scipreprocess
Project-URL: Documentation, https://github.com/Tarikul-Islam-Anik/scipreprocess#readme
Project-URL: Issues, https://github.com/Tarikul-Islam-Anik/scipreprocess/issues
Project-URL: Changelog, https://github.com/Tarikul-Islam-Anik/scipreprocess/blob/main/CHANGELOG.md
Keywords: scientific,preprocessing,nlp,pdf,documents
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: unidecode>=1.3.0
Provides-Extra: pdf
Requires-Dist: PyMuPDF>=1.23.0; extra == "pdf"
Provides-Extra: docx
Requires-Dist: python-docx>=0.8.11; extra == "docx"
Provides-Extra: xml
Requires-Dist: lxml>=4.9.0; extra == "xml"
Provides-Extra: ocr
Requires-Dist: opencv-python>=4.8.0; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: nlp
Requires-Dist: nltk>=3.8.0; extra == "nlp"
Requires-Dist: spacy>=3.7.0; extra == "nlp"
Requires-Dist: scispacy>=0.5.0; extra == "nlp"
Requires-Dist: pysbd>=0.3.4; extra == "nlp"
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.3.0; extra == "ml"
Requires-Dist: sentence-transformers>=2.2.0; extra == "ml"
Requires-Dist: faiss-cpu>=1.7.4; extra == "ml"
Provides-Extra: all
Requires-Dist: PyMuPDF>=1.23.0; extra == "all"
Requires-Dist: python-docx>=0.8.11; extra == "all"
Requires-Dist: lxml>=4.9.0; extra == "all"
Requires-Dist: opencv-python>=4.8.0; extra == "all"
Requires-Dist: pytesseract>=0.3.10; extra == "all"
Requires-Dist: nltk>=3.8.0; extra == "all"
Requires-Dist: spacy>=3.7.0; extra == "all"
Requires-Dist: scispacy>=0.5.0; extra == "all"
Requires-Dist: pysbd>=0.3.4; extra == "all"
Requires-Dist: scikit-learn>=1.3.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Requires-Dist: faiss-cpu>=1.7.4; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: license-file

# SciPreprocess

[![CI](https://github.com/Tarikul-Islam-Anik/scipreprocess/actions/workflows/ci.yml/badge.svg)](https://github.com/Tarikul-Islam-Anik/scipreprocess/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.

## Features

- 📄 **Multi-format support**: PDF, DOCX, TEX, JATS XML, and plain text
- 🔍 **OCR support**: Extract text from scanned documents with Tesseract
- 🧹 **Text cleaning**: Remove citations, normalize unicode, clean special characters
- 🔤 **NLP processing**: Tokenization, lemmatization, stopword removal using spaCy or NLTK
- 📑 **Section detection**: Automatically identify paper sections (Abstract, Introduction, etc.)
- 🔗 **Acronym handling**: Detect and expand acronyms using scispacy
- 📊 **Feature extraction**: TF-IDF and semantic embeddings with sentence-transformers
- 🔎 **Semantic search**: FAISS indexing for efficient similarity search
- 🧩 **Modular design**: Use only the components you need
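
To give a feel for the text-cleaning step, here is a minimal sketch using only the standard library. It is not the package's implementation: the regex patterns and the `clean_text` helper are hypothetical, and the package itself depends on `unidecode` rather than `unicodedata`.

```python
import re
import unicodedata

# Hypothetical citation patterns for illustration; the package's rules may differ.
NUMERIC_CITATION = re.compile(r"\s*\[\d+(?:\s*[,\-\u2013]\s*\d+)*\]")
AUTHOR_YEAR_CITATION = re.compile(
    r"\s*\([A-Z][A-Za-z\-]+(?: et al\.)?,? \d{4}[a-z]?\)"
)


def clean_text(text: str) -> str:
    """Normalize unicode, drop inline citations, and collapse whitespace."""
    # NFKC folds compatibility characters such as ligatures into plain letters.
    text = unicodedata.normalize("NFKC", text)
    text = NUMERIC_CITATION.sub("", text)
    text = AUTHOR_YEAR_CITATION.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Deep learning [1, 2] works (Smith et al., 2020)."))
```

Patterns like these only cover the most common citation styles; real documents usually need a longer list.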

## Installation

### From PyPI (Recommended)

```bash
pip install scipreprocess
```

### With Optional Dependencies

Install specific feature sets:

```bash
# PDF support
pip install "scipreprocess[pdf]"

# NLP features
pip install "scipreprocess[nlp]"

# Machine learning features
pip install "scipreprocess[ml]"

# OCR support
pip install "scipreprocess[ocr]"

# Everything
pip install "scipreprocess[all]"
```

### Development Installation

To work on the package or install it from source:

```bash
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"
```

### Post-Installation Setup

For NLP features, download the required models:

```bash
# Download spaCy model
python -m spacy download en_core_web_sm

# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
```

## Quick Start

### Basic Usage

```python
from scipreprocess import preprocess_file

# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")

# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])
```

### Process Multiple Documents

```python
from scipreprocess import preprocess_documents

# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)

# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings']  # if enabled
```

### Custom Configuration

```python
from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline

# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)

# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")
```
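
The `chunk_target_sentences=(3, 8)` setting presumably bounds the number of sentences per chunk; that reading is an assumption, not something documented here. Under that assumption, sentence-based chunking could look roughly like the sketch below. The `split_sentences` and `chunk_sentences` helpers are hypothetical and are not part of the package.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter for illustration; pysbd or spaCy handle abbreviations better."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(
    sentences: list[str], target: tuple[int, int] = (3, 8)
) -> list[str]:
    """Group sentences into chunks of at most max_len sentences.

    If the final chunk would be shorter than min_len, sentences are moved
    over from the previous chunk. This assumes max_len >= 2 * min_len - 1,
    which holds for the (3, 8) setting.
    """
    min_len, max_len = target
    chunks = [sentences[i:i + max_len] for i in range(0, len(sentences), max_len)]
    if len(chunks) > 1 and len(chunks[-1]) < min_len:
        need = min_len - len(chunks[-1])
        chunks[-1] = chunks[-2][-need:] + chunks[-1]
        chunks[-2] = chunks[-2][:-need]
    return [" ".join(chunk) for chunk in chunks]


sentences = split_sentences("First point. Second point! Third point? Fourth point.")
print(chunk_sentences(sentences, target=(2, 3)))
```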

## Pipeline Components

The pipeline is organized into modular components:

- **`parsers`**: Document ingestion (PDF, DOCX, TEX, XML, TXT)
- **`preprocessing`**: Text cleaning, tokenization, lemmatization
- **`acronyms`**: Acronym detection and expansion
- **`sectioning`**: Section splitting and chunking
- **`features`**: TF-IDF and semantic embeddings
- **`pipeline`**: Main orchestration
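
As an illustration of the kind of work the `acronyms` component does, the sketch below finds "long form (SHORT)" definitions with a simple initial-letter heuristic. This is a swapped-in technique for demonstration only: the package relies on scispacy for abbreviation detection, and `find_acronyms` is a hypothetical helper.

```python
import re

# A parenthesized candidate such as "(NLP)": 2 to 10 letters, starting uppercase.
CANDIDATE = re.compile(r"\(([A-Z][A-Za-z]{1,9})\)")


def find_acronyms(text: str) -> dict[str, str]:
    """Map each acronym to the words right before it whose initials spell it."""
    found: dict[str, str] = {}
    for match in CANDIDATE.finditer(text):
        short = match.group(1)
        # Take as many preceding words as the acronym has letters.
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(short) and initials == short.upper():
            found.setdefault(short, " ".join(candidate))
    return found


print(find_acronyms("We apply natural language processing (NLP) to papers."))
```

The heuristic misses long forms that contain stopwords or hyphens, which is one reason a dedicated detector is preferable in practice.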

## Architecture

```
scipreprocess/
├── config.py          # Configuration dataclasses
├── models.py          # Data models (ParsedDocument)
├── utils.py           # Dependency management and helpers
├── parsers.py         # Document parsers for each format
├── preprocessing.py   # Text cleaning and NLP
├── acronyms.py        # Acronym detection/expansion
├── sectioning.py      # Section splitting and chunking
├── features.py        # Feature extraction (TF-IDF, embeddings)
└── pipeline.py        # Main pipeline orchestration
```
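
To make the role of a module like `sectioning.py` more concrete, here is a minimal sketch of heading-based section splitting. It assumes headings sit on their own line, optionally numbered; the heading list and the `split_sections` helper are hypothetical and do not reflect the package's actual rules.

```python
import re

# Hypothetical heading list; real papers use many more variants.
HEADINGS = (
    "Abstract", "Introduction", "Methods", "Results",
    "Discussion", "Conclusion", "References",
)
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(" + "|".join(HEADINGS) + r")[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> list[dict[str, str]]:
    """Split text at known headings into {"heading", "text"} records."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # A section runs until the next heading or the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append(
            {"heading": match.group(1).title(), "text": text[match.end():end].strip()}
        )
    return sections


paper = "Abstract\nWe study X.\n\n1. Introduction\nStuff here.\n"
for section in split_sections(paper):
    print(section["heading"], "->", section["text"])
```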

## Output Format

The pipeline produces structured JSON for each document:

```python
{
    "metadata": {
        "title": "Paper Title",
        "source_file": "path/to/file.pdf",
        "pages": 12
    },
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        ...
    ],
    "acronyms": {
        "NLP": "Natural Language Processing",
        "ML": "Machine Learning"
    },
    "figures": [],
    "tables": [],
    "equations": [],
    "references": []
}
```
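
Because the output is a plain dictionary, it is easy to reshape for downstream use. The sketch below flattens a document with the structure shown above into a single Markdown-style string, for example as LLM input. The `to_prompt_text` helper is hypothetical and not part of the package.

```python
def to_prompt_text(doc: dict) -> str:
    """Flatten title, abstract, and sections into one Markdown-style string."""
    parts = []
    title = doc.get("metadata", {}).get("title")
    if title:
        parts.append(f"# {title}")
    if doc.get("abstract"):
        parts.append(f"## Abstract\n{doc['abstract']}")
    for section in doc.get("sections", []):
        parts.append(f"## {section['heading']}\n{section['text']}")
    return "\n\n".join(parts)


example = {
    "metadata": {"title": "Paper Title"},
    "abstract": "Abstract text...",
    "sections": [{"heading": "Introduction", "text": "..."}],
}
print(to_prompt_text(example))
```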

## Dependencies

### Required
- `unidecode`: Unicode normalization

### Optional
- `PyMuPDF`: PDF parsing
- `python-docx`: DOCX parsing
- `lxml`: XML parsing
- `opencv-python` + `pytesseract`: OCR support
- `nltk`: Basic NLP (tokenization, stopwords, lemmatization)
- `spacy` + `scispacy`: Advanced NLP and abbreviation detection
- `pysbd`: Sentence boundary detection
- `scikit-learn`: TF-IDF vectorization
- `sentence-transformers`: Semantic embeddings
- `faiss-cpu`: Similarity search
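
Since most dependencies are optional, it can help to check which ones are importable before enabling a feature. The sketch below uses `importlib.util.find_spec` from the standard library. The `OPTIONAL_MODULES` table and `missing_modules` helper are hypothetical; the package's own dependency handling in `utils.py` may work differently. Note that several import names differ from the distribution names listed above.

```python
from importlib.util import find_spec

# Import names for each extra (hypothetical table built from the lists above).
OPTIONAL_MODULES = {
    "pdf": ["fitz"],  # PyMuPDF
    "docx": ["docx"],  # python-docx
    "xml": ["lxml"],
    "ocr": ["cv2", "pytesseract"],  # opencv-python
    "nlp": ["nltk", "spacy", "scispacy", "pysbd"],
    "ml": ["sklearn", "sentence_transformers", "faiss"],  # scikit-learn, faiss-cpu
}


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


for extra, modules in OPTIONAL_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra}: {status}")
```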

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess

# Install in development mode with dev dependencies
pip install -e ".[all,dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=scipreprocess --cov-report=html
```

## Documentation

- Examples: [examples/basic_usage.py](examples/basic_usage.py)

## Contributing

Contributions are welcome. To submit a pull request:

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Citation

If you use this pipeline in your research, please cite:

```bibtex
@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}
```

## Acknowledgments

- Built with [spaCy](https://spacy.io/), [scispacy](https://allenai.github.io/scispacy/), and [sentence-transformers](https://www.sbert.net/)
- Inspired by the needs of scientific text processing and NLP research

