Metadata-Version: 2.4
Name: topic-rag
Version: 1.0.3
Summary: Lightweight topic-aware document processing and retrieval
License-Expression: AGPL-3.0-or-later
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: <3.12,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scikit-learn
Dynamic: license-file

# topic-rag — Topic-Enhanced Retrieval-Augmented Generation

Install with pip (Python 3.11 is required):

```bash
pip install topic-rag
```

## What it does

A standard TF-IDF retriever ranks documents purely by text similarity (cosine similarity between TF-IDF vectors). `topic-rag` adds a second signal: it discovers latent **topics** across your document collection and uses topic overlap to adjust the ranking. A query about "neural networks" can score higher against documents in the same topic cluster, even if the exact words differ.
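
To make the idea of a second signal concrete, here is a minimal sketch of blending a text-similarity score with topic overlap. It is not the package's actual scoring code: the `cosine` and `combined_score` helpers, the `alpha` weight, and the toy topic vectors are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(text_sim, query_topics, doc_topics, alpha=0.7):
    """Blend text similarity with topic overlap.

    `alpha` is a hypothetical weight: 1.0 means text similarity only,
    0.0 means topic overlap only.
    """
    topic_sim = cosine(query_topics, doc_topics)
    return alpha * text_sim + (1.0 - alpha) * topic_sim


# Toy values (assumed): both documents match the query text equally well,
# but only doc_a shares the query's dominant topic.
query_topics = [0.9, 0.1]
doc_a_topics = [0.8, 0.2]
doc_b_topics = [0.1, 0.9]

score_a = combined_score(0.30, query_topics, doc_a_topics)
score_b = combined_score(0.30, query_topics, doc_b_topics)
print(f"doc_a: {score_a:.3f}  doc_b: {score_b:.3f}")
```

With identical text similarity, the document whose topic distribution is closer to the query's ends up ranked first, which is the behaviour the paragraph above describes.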

## Quick Start

```python
from topic_rag import DocumentProcessor, TopicEnhancedRAGRetriever

# `documents` is your own list of dicts with "id", "text", and "title" keys.
# Build a topic-aware corpus from it.
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

# Retrieve with topic enhancement
retriever = TopicEnhancedRAGRetriever(corpus)
results = retriever.retrieve("What is transfer learning?", k=5)

for r in results:
    print(f"[{r['score']:.4f}] {r['sentence']}")
```

## Two Retrieval Modes

| Retriever | How it works |
|---|---|
| `StandardRAGRetriever` | Ranks by TF-IDF cosine similarity only |
| `TopicEnhancedRAGRetriever` | Combines TF-IDF similarity **+** latent topic overlap for better context-aware ranking |

Use both side by side to compare standard vs topic-enhanced retrieval on your own data:

```python
from topic_rag import DocumentProcessor, StandardRAGRetriever, TopicEnhancedRAGRetriever

# `documents` is your own list of dicts with "id", "text", and "title" keys.
processor = DocumentProcessor(n_topics=10)
corpus = processor.process_corpus(documents)

std_retriever = StandardRAGRetriever(corpus)
enh_retriever = TopicEnhancedRAGRetriever(corpus)

std_results = std_retriever.retrieve("my query", k=5)
enh_results = enh_retriever.retrieve("my query", k=5)
```
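
One simple way to inspect the difference between the two rankings is to list what they share and what is unique to each. The helper below is a hypothetical sketch, not part of `topic_rag`; it assumes both retrievers return dicts with a `sentence` key, as the Quick Start output suggests for the topic-enhanced retriever, and it uses stand-in data so it runs without the package.

```python
def compare_rankings(std_results, enh_results, key="sentence"):
    """Return items shared by both rankings and items unique to each, in rank order."""
    std_items = [r[key] for r in std_results]
    enh_items = [r[key] for r in enh_results]
    std_set, enh_set = set(std_items), set(enh_items)
    return {
        "shared": [s for s in std_items if s in enh_set],
        "only_standard": [s for s in std_items if s not in enh_set],
        "only_enhanced": [s for s in enh_items if s not in std_set],
    }


# Stand-in results (assumed shape); real ones would come from the two retrieve() calls.
fake_std = [{"sentence": "A", "score": 0.9}, {"sentence": "B", "score": 0.5}]
fake_enh = [{"sentence": "A", "score": 0.8}, {"sentence": "C", "score": 0.6}]

diff = compare_rankings(fake_std, fake_enh)
print(diff)
```

Items under `only_enhanced` are the ones the topic signal pulled into the top `k`, so they are a good place to start when judging whether the enhancement helps on your data.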

## Key Design Decisions

- **No GPU required** — uses TF-IDF and a lightweight NMF-based topic model (no PyTorch, no sentence-transformers)
- **Minimal dependencies** — only `numpy` and `scikit-learn` at the core
- **Self-contained** — no external API calls or downloads needed

## License

This project is licensed under the **GNU Affero General Public License v3.0 or later (AGPL-3.0-or-later)**.

See the [LICENSE](LICENSE) file for the full license text.

### What AGPL-3.0 means

- Anyone can view, use, and modify the code
- If you run a modified version as a network service, you **must** make its source code available to the users of that service
- Companies cannot embed this in proprietary software without releasing the combined product under the AGPL

For commercial licensing enquiries, please contact the project maintainers.
