Metadata-Version: 2.4
Name: ragchunker
Version: 0.1.4
Summary: A lightweight RAG chunking and preprocessing library
Author-email: Muhammad Usama <muhammadusama7207@gmail.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pdfminer.six==20250506
Requires-Dist: python-docx==1.2.0
Requires-Dist: python-pptx==1.0.2
Requires-Dist: pillow==11.3.0
Requires-Dist: pytesseract==0.3.13
Requires-Dist: openai-whisper==20250625
Requires-Dist: pandas==2.3.3
Requires-Dist: pyarrow==21.0.0
Requires-Dist: beautifulsoup4==4.14.2
Requires-Dist: trafilatura==2.0.0
Requires-Dist: nltk==3.9.2
Requires-Dist: scikit-learn==1.7.2
Requires-Dist: sentence-transformers==5.1.1
Requires-Dist: numpy==2.3.3
Requires-Dist: hf_xet==1.1.10
Requires-Dist: openai==2.3.0
Requires-Dist: faiss-cpu==1.12.0
Requires-Dist: setuptools==80.9.0
Requires-Dist: wheel==0.45.1
Requires-Dist: twine==6.2.0
Requires-Dist: pytest==8.4.2
Requires-Dist: chromadb==1.1.1
Provides-Extra: test
Requires-Dist: pytest==8.4.2; extra == "test"
Requires-Dist: pytest-cov==7.0.0; extra == "test"
Dynamic: license-file

# ragchunker

A modular Python package for Retrieval-Augmented Generation (RAG) preprocessing. It supports:

- **File Ingestion**: Load documents from various formats (TXT, PDF, DOCX, PPTX, images with OCR, audio with Whisper, CSV/Excel/Parquet, HTML, JSON, ZIP/TAR).
- **Chunking**: Split documents into chunks using fixed-length, semantic, or recursive strategies.
- **Provenance**: Track metadata, checksums, and token counts for chunks.
- **Embeddings**: Generate embeddings using Sentence-Transformers or OpenAI.
- **Storage**: Save chunks and embeddings to JSONL, Parquet, SQLite, NumPy, or FAISS; optional integration with Pinecone or ChromaDB.
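As background on the chunking strategies listed above, fixed-length chunking with a character overlap can be sketched in a few lines of plain Python. This is an illustrative sketch only, not ragchunker's internal implementation, and the function name `fixed_chunks` is made up here:

```python
def fixed_chunks(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters,
    where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_chunks("x" * 2000, chunk_size=800, overlap=100)
# 2000 chars, step 700 -> chunk lengths 800, 800, 600
```

Semantic and recursive strategies refine this idea by choosing split points at sentence/topic boundaries or by recursively splitting on progressively smaller separators, rather than at fixed character offsets.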

## Installation

Install the core package:

```bash
pip install ragchunker
```

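The metadata above also declares a `test` extra (`pytest`, `pytest-cov`), so the test dependencies can presumably be installed with:

```bash
pip install "ragchunker[test]"
```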
## Usage

After installation, the end-to-end pipeline can be run from Python:

```python
from ragchunker.rag_pipeline import run_rag_pipeline
from ragchunker.storage import store_from_result

VECTOR_DB = "chroma"  # "faiss" or "chroma"
```

### Open-source embeddings (Sentence-Transformers)

```python
result = run_rag_pipeline(
    data_dir="data",                # folder with your PDF, DOCX, or TXT files
    output_dir="output",            # where JSONL chunks and embeddings are saved
    chunk_strategy="semantic",      # or "fixed", "recursive"
    chunk_size=800,
    overlap=100,
    embed_model="all-MiniLM-L6-v2", # any Sentence-Transformers model name
    embed_provider="sentence-transformers",
    openai_api_key=None,            # not needed for local embeddings
)

print("\n✅ Pipeline executed successfully!")
```

### OpenAI embeddings

```python
result = run_rag_pipeline(
    data_dir="data",
    output_dir="output",
    chunk_strategy="semantic",
    chunk_size=800,
    overlap=100,
    embed_model="text-embedding-3-small", # any OpenAI embedding model
    embed_provider="openai",
    openai_api_key="sk-your-openai-key",
)

print("\n✅ Pipeline executed successfully!")
```

### Storing chunks and embeddings

```python
print(f"\nStoring embeddings + chunks to: {VECTOR_DB}")
store_info = store_from_result(result, VECTOR_DB)
print("Store info:", store_info)
print("Done.")
```
