Metadata-Version: 2.4
Name: atomic-rag-lib
Version: 0.1.0
Summary: A modular, research-backed RAG building block library
Author-email: Rohin Patel <rohin.patel@outlook.com>
License-Expression: MIT
Project-URL: Source Code, https://github.com/rohinp/atomic-rag
Keywords: rag,retrieval-augmented-generation,llm,nlp,vector-search,bm25
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Provides-Extra: ollama
Requires-Dist: ollama>=0.4.0; extra == "ollama"
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == "openai"
Provides-Extra: markitdown
Requires-Dist: markitdown[pdf]>=0.1.0; extra == "markitdown"
Provides-Extra: retrieval
Requires-Dist: chromadb>=0.5; extra == "retrieval"
Requires-Dist: rank-bm25>=0.2; extra == "retrieval"
Provides-Extra: reranker
Requires-Dist: sentence-transformers>=3.0; extra == "reranker"
Provides-Extra: ragas
Requires-Dist: ragas>=0.2; extra == "ragas"
Requires-Dist: datasets>=2.0; extra == "ragas"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: pytest-mock>=3.14; extra == "dev"
Provides-Extra: all
Requires-Dist: ollama>=0.4.0; extra == "all"
Requires-Dist: markitdown[pdf]>=0.1.0; extra == "all"
Requires-Dist: chromadb>=0.5; extra == "all"
Requires-Dist: rank-bm25>=0.2; extra == "all"
Requires-Dist: sentence-transformers>=3.0; extra == "all"
Dynamic: license-file

# atomic-rag

> Note: work in progress. I'm still reviewing the generated code, though the examples work :-) .

A modular Python library of research-backed RAG building blocks. Each component solves one specific failure mode of retrieval-augmented generation and can be used independently or composed into a full pipeline.

The design goal is the opposite of LangChain: no magic, no hidden abstractions. Every module has a clear input/output contract (`DataPacket`), is independently testable, and can be swapped without touching anything else.

---

## Install

**Quickstart — full local stack with Ollama:**

```bash
pip install -e ".[all,dev]"
```

This installs every dependency needed to run the pipeline end-to-end (Ollama, ChromaDB, BM25, sentence-transformers, MarkItDown) plus the test suite. Then pull the required models:

```bash
ollama pull nomic-embed-text   # embeddings
ollama pull llama3.2:3b        # chat / reasoning
```

**Alternative — requirements.txt:**

```bash
pip install -r requirements.txt
pip install -e .
```

**Pick only what you need:**

```bash
pip install -e ".[dev]"          # tests only — no runtime deps
pip install -e ".[retrieval]"    # ChromaDB + BM25
pip install -e ".[reranker]"     # cross-encoder reranking (optional)
pip install -e ".[ollama]"       # local models via Ollama
pip install -e ".[openai]"       # OpenAI API models
pip install -e ".[markitdown]"   # PDF/PPTX/XLSX ingestion
pip install -e ".[ragas]"        # Ragas evaluation metrics
```

## Quick Start

**Ingest a PDF or Office document:**

```python
from atomic_rag.ingestion import MarkItDownIngestor

ingestor = MarkItDownIngestor()
docs = ingestor.ingest("reports/q4-2024.pdf")

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.content[:80]}...")
```

**Ingest a Python codebase (AST-based chunking):**

```python
from atomic_rag.ingestion import CodeIngestor

ingestor = CodeIngestor()
docs = ingestor.ingest_directory("src/")  # walks recursively, ignores __pycache__ etc.

for doc in docs:
    print(f"[{doc.chunk_index}] {doc.metadata['type']:<8} {doc.metadata.get('name', '')}  ({doc.source})")
```

## Development

```bash
pytest                          # run all tests (integration tests excluded)
pytest -m integration           # run integration tests (requires real dependencies)
pytest tests/test_ingestion.py  # run a single test file
pytest tests/test_ingestion.py::TestMarkdownChunker::test_splits_on_h2_headers  # single test
pytest --cov=atomic_rag --cov-report=term-missing  # with coverage
```

---

## Architecture

All modules communicate through a single `DataPacket` object that accumulates state as it moves through the pipeline. Modules never mutate their input — they return a copy with their output fields populated.

```
DataPacket(query="...")
  -> [Phase 2] expanded_queries populated
  -> [Phase 3] documents populated (retrieved + reranked, with scores)
  -> [Phase 4] context populated (compressed string for the LLM)
  -> [Phase 5] answer populated
  -> [Eval]    eval_scores populated (faithfulness, answer_relevance, context_precision)
```

Each phase also appends a `TraceEntry` to `packet.trace` for observability.
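As a minimal illustration of the copy-don't-mutate contract described above, here is a toy sketch using stdlib frozen dataclasses. The field and method names below are hypothetical stand-ins for the pipeline stages in the diagram; the library's actual `DataPacket` is pydantic-based (pydantic is the one required dependency).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataPacket:
    # Hypothetical fields mirroring the pipeline diagram above.
    query: str
    expanded_queries: tuple = ()
    documents: tuple = ()
    context: str = ""
    answer: str = ""
    trace: tuple = ()

    def with_trace(self, phase: str, detail: str) -> "DataPacket":
        # Return a copy with a new trace entry appended; the input
        # packet is never mutated (the dataclass is frozen).
        return replace(self, trace=self.trace + ((phase, detail),))


packet = DataPacket(query="What does the reranker do?")
packet2 = replace(packet, context="compressed context").with_trace("context", "compressed")
```

Because every module returns a fresh copy, `packet` still has an empty `context` after the call, which is what makes each phase independently testable.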

### Phases

| Phase | Problem solved | Key technique | Status |
|---|---|---|---|
| 1 — Ingestion | Messy PDFs destroy table/header structure | Markdown-native parsing (MarkItDown) + AST-based code chunking | **done** |
| 3 — Retrieval | Vector search misses keywords and acronyms | Hybrid search (vector + BM25) + RRF + cross-encoder reranking | **done** |
| 4 — Context | LLMs ignore information buried mid-context | Sentence-level cosine filtering (SentenceCompressor) | **done** |
| 2 — Query | Vague queries miss the relevant documents | HyDE + multi-query expansion | **done** |
| 5 — Agent | Hallucinations when retrieved context is insufficient | Corrective RAG (C-RAG) with evaluator + fallback | **done** |
| Eval | No visibility into where the pipeline fails | Faithfulness + answer relevance + Ragas integration | **done** |

Building Phase 3 before Phase 2 was intentional: hybrid retrieval delivers the largest quality improvement per unit of work, and query intelligence (Phase 2) has diminishing returns until retrieval is solid.
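The RRF step named in the Phase 3 row is Reciprocal Rank Fusion: each result list contributes `1 / (k + rank)` to a document's fused score, so documents ranked highly by either the vector or the BM25 side rise to the top. A self-contained sketch (not the library's internal implementation; `k=60` is the conventional default):

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists with Reciprocal Rank Fusion.

    Each ranking is an ordered list of document IDs, best first.
    A document's fused score is the sum of 1 / (k + rank) over
    every ranking it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.__getitem__, reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # from the vector store
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # from keyword search
fused = rrf_fuse([vector_hits, bm25_hits])
# doc_a appears near the top of both lists, so it wins the fusion.
```

Rank-based fusion like this sidesteps the need to normalize incomparable score scales (cosine distance vs. BM25 scores), which is why it pairs well with hybrid search.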

### Tech Stack

| Layer | Library |
|---|---|
| Parsing | Microsoft MarkItDown (swap: Docling) |
| Vector store | ChromaDB (swap: Qdrant) |
| Keyword search | rank-bm25 |
| Reranking | sentence-transformers cross-encoders |
| LLM / Embedder | Ollama (swap: OpenAI, or any ChatModelBase) |
| Evaluation | Built-in scorers + optional Ragas integration |
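The Phase 4 technique in the phases table, sentence-level cosine filtering, reduces to: embed each candidate sentence, embed the query, keep only sentences above a similarity threshold. A toy sketch with hand-written 2-d vectors standing in for real embeddings (the actual `SentenceCompressor` would use an embedder such as `nomic-embed-text`; the threshold value here is illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress(query_vec: list[float], sentences: list[tuple[str, list[float]]],
             threshold: float = 0.5) -> str:
    """Keep only sentences whose embedding is cosine-similar to the query."""
    kept = [text for text, vec in sentences if cosine(query_vec, vec) >= threshold]
    return " ".join(kept)


query_vec = [1.0, 0.0]
sentences = [
    ("Relevant sentence.", [0.9, 0.1]),   # points the same way as the query
    ("Off-topic sentence.", [0.1, 0.9]),  # nearly orthogonal, gets dropped
]
context = compress(query_vec, sentences)
```

Dropping low-similarity sentences before generation shortens the context, which directly targets the "information buried mid-context" failure mode from the phases table.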

---

## Docs

Start at [`docs/index.md`](docs/index.md) — it has a guided reading order, a full table of contents, and a pipeline diagram.

Quick links:
- [DataPacket contract](docs/concepts/data-packet.md)
- [Ingestion module](docs/modules/ingestion.md)
- [Retrieval module](docs/modules/retrieval.md)
- [Hybrid search technique](docs/techniques/hybrid-search.md)
- [Cross-encoder reranking](docs/techniques/cross-encoder-reranking.md)
- [Markdown-native parsing](docs/techniques/markdown-native-parsing.md)
- [Context module](docs/modules/context.md)
- [Context compression technique](docs/techniques/context-compression.md)
- [Query module](docs/modules/query.md)
- [HyDE technique](docs/techniques/hyde.md)
- [Multi-query expansion technique](docs/techniques/multi-query-expansion.md)
- [Agent module](docs/modules/agent.md)
- [Corrective RAG technique](docs/techniques/corrective-rag.md)
- [Evaluation module](docs/modules/evaluation.md)
- [Swapping backends guide](docs/guides/swapping-backends.md)

## Examples

- [`examples/code_qa/`](examples/code_qa/) — full pipeline demo: indexes a Python codebase and answers questions via retrieval + compression + C-RAG
