Metadata-Version: 2.4
Name: topic-rag
Version: 1.0.1
Summary: Topic-Enhanced Retrieval-Augmented Generation Library
License-Expression: AGPL-3.0-or-later
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: <3.12,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: requests
Requires-Dist: scikit-learn
Requires-Dist: scipy
Requires-Dist: seaborn
Requires-Dist: streamlit>=1.45.1
Dynamic: license-file

# topic-rag — Topic-Enhanced Retrieval-Augmented Generation

Install from PyPI:

```bash
pip install topic-rag
```

## What it does

Standard RAG systems retrieve documents purely by text similarity (how closely the query's words overlap the document's). `topic-rag` adds a second layer: it discovers latent **topics** across your document collection and uses those topics to boost retrieval accuracy. A query about "neural networks" will score higher against documents in the same topic cluster, even if the exact words differ.
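
Conceptually, the final relevance score blends the lexical signal with topic overlap. A minimal sketch of that idea (illustrative only, not the library's actual scoring code; the function and parameter names here are hypothetical):

```python
import numpy as np

def combined_score(tfidf_sim: float,
                   query_topics: np.ndarray,
                   doc_topics: np.ndarray,
                   alpha: float = 0.7) -> float:
    """Blend lexical similarity with topic-distribution similarity."""
    # Cosine similarity between the query's and document's topic vectors
    topic_sim = float(np.dot(query_topics, doc_topics) /
                      (np.linalg.norm(query_topics) * np.linalg.norm(doc_topics) + 1e-12))
    # alpha weights the lexical signal; (1 - alpha) weights the topic signal
    return alpha * tfidf_sim + (1 - alpha) * topic_sim
```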

## Core usage

### 1. Basic retrieval

```python
from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever, RAGEvaluator

# Build a topic-aware corpus from your documents
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)   # documents: list of {id, text, title} dicts

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

# Evaluate retrieval quality
evaluator = RAGEvaluator()
metrics = evaluator.evaluate_retrieval(results, relevant_doc_ids)
# → recall@5, precision@5, MRR, NDCG, hit_rate
```
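
The retrieval metrics themselves are standard. For orientation, recall@k and reciprocal rank reduce to a few lines each (a self-contained sketch, not the `RAGEvaluator` internals; MRR is the mean of the reciprocal rank over all queries):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result; 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```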

### 2. Benchmarking against standard datasets

```python
from topic_rag import EvaluationPipeline, ExperimentConfig

pipeline = EvaluationPipeline()
config = ExperimentConfig(
    dataset_name="squad_v2",   # ms_marco | natural_questions | hotpot_qa | trivia_qa
    max_documents=500,
    max_queries=100,
    n_topics=10
)
results = pipeline.run_single_experiment(config)
```
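
To benchmark across every built-in dataset, the same call can be swept in a loop (a sketch built from the API above; how you aggregate the results is up to you):

```python
# Sweep the built-in benchmark datasets with identical settings
results_by_dataset = {}
for name in ["ms_marco", "natural_questions", "squad_v2", "hotpot_qa", "trivia_qa"]:
    cfg = ExperimentConfig(dataset_name=name, max_documents=500,
                           max_queries=100, n_topics=10)
    results_by_dataset[name] = pipeline.run_single_experiment(cfg)
```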

### 3. Statistical validation

```python
from topic_rag import StatisticalAnalyzer

analyzer = StatisticalAnalyzer()
stats = analyzer.calculate_comparison_statistics(standard_results, enhanced_results)
# → paired t-test, Wilcoxon signed-rank, Cohen's d effect size, confidence intervals
```
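
These tests come straight from `scipy.stats`; for orientation, a paired comparison can be reproduced by hand (a sketch, not the `StatisticalAnalyzer` internals, using made-up per-query scores):

```python
import numpy as np
from scipy import stats

# Hypothetical per-query recall@5 for each system
standard_scores = np.array([0.40, 0.55, 0.20, 0.75, 0.50])
enhanced_scores = np.array([0.62, 0.61, 0.45, 0.83, 0.70])

t_stat, p_value = stats.ttest_rel(enhanced_scores, standard_scores)  # paired t-test
w_stat, w_p = stats.wilcoxon(enhanced_scores, standard_scores)       # Wilcoxon signed-rank

diff = enhanced_scores - standard_scores
cohens_d = diff.mean() / diff.std(ddof=1)  # paired Cohen's d effect size
```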

### 4. Paper generation

```python
from topic_rag import PaperGenerator

gen = PaperGenerator()
files = gen.generate_complete_paper({
    "title": "My RAG Study",
    "authors": ["Your Name"],
    "institution": "Your University",
    "results": experiment_results,
    "output_format": "LaTeX + PDF",   # or "Markdown"
    "sections": { "abstract": True, "methodology": True, ... }
})
```

## Advantages over plain RAG

| Feature | Standard RAG | topic-rag |
|---|---|---|
| Retrieval signal | TF-IDF similarity only | TF-IDF + latent topic overlap |
| Semantic grouping | None | Automatic topic discovery |
| Evaluation | Manual | Built-in (Recall, MRR, NDCG) |
| Statistical testing | None | t-test, Wilcoxon, effect sizes |
| Paper output | None | LaTeX + Markdown auto-generated |
| Datasets | Bring your own | MS MARCO, NQ, SQuAD, HotpotQA, TriviaQA built-in |
| Dependencies | Heavy (PyTorch, transformers) | Lightweight (numpy, scikit-learn, scipy) |

## Key design decisions

- **No GPU required** — uses TF-IDF and a lightweight topic model (no PyTorch, no sentence-transformers); see the sketch after this list
- **Self-contained** — all benchmark datasets have built-in fallback data, so experiments run offline
- **Research-ready** — statistical tests and paper generation make it suitable for academic submission
- **AGPL-3.0** — open source; network services built on modified versions must also release their source
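
As a rough illustration of the CPU-only approach (a sketch with plain scikit-learn, not necessarily the library's exact internals), TF-IDF plus NMF is enough to discover topics offline:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "neural networks learn layered representations",
    "transfer learning reuses pretrained models",
    "tf-idf weights rare terms more heavily",
]

# TF-IDF turns text into sparse term weights; NMF factors that matrix into
# document-topic and topic-term components. Everything runs on CPU.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)
doc_topics = NMF(n_components=2, random_state=0).fit_transform(X)
# doc_topics: rows are documents, columns are topic weights
```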

## Who it's for

- Researchers benchmarking retrieval systems
- Engineers who want a lightweight RAG baseline without heavy ML infrastructure
- Anyone who needs reproducible, statistically validated RAG experiments with automatic paper output

## License

This project is licensed under the **GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later)**.

See the [LICENSE](LICENSE) file for the full license text.

### What AGPL-3.0 means

- Anyone can view, use, and modify the code
- Any modified version used to provide a network service **must** release its source code
- Companies cannot embed this in proprietary software without open-sourcing their product

For commercial licensing enquiries, please contact the project maintainers.
