Metadata-Version: 2.4
Name: datamaestro-ir
Version: 0.3.0
Summary: Datamaestro module for Information Retrieval datasets
Project-URL: Homepage, https://github.com/xpmir/datamaestro_ir
Project-URL: Documentation, https://datamaestro-ir.readthedocs.io/en/latest/
Project-URL: Repository, https://github.com/xpmir/datamaestro_ir
Project-URL: Bug Tracker, https://github.com/xpmir/datamaestro_ir/issues
Author-email: Benjamin Piwowarski <benjamin@piwowarski.fr>
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: dataset manager,experiments,information retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: attrs
Requires-Dist: datamaestro>=1.9.9
Requires-Dist: experimaestro>=2.0.0
Requires-Dist: impact-index>=1.3.1
Requires-Dist: lxml
Requires-Dist: typing-extensions
Description-Content-Type: text/markdown

[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![PyPI version](https://badge.fury.io/py/datamaestro-ir.svg)](https://badge.fury.io/py/datamaestro-ir)

# Information Retrieval Datasets

This [datamaestro](https://github.com/bpiwowar/datasets) plugin provides easy and systematic access to information retrieval datasets. It handles automated downloading and preparation of standard IR collections, exposes them through a typed Python API, and includes efficient document stores for fast text access (file, mmap, or in-memory).

Full documentation: [datamaestro-ir.readthedocs.io](https://datamaestro-ir.readthedocs.io/)
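
To give a rough idea of what memory-mapped text access looks like, here is a minimal standard-library sketch. It is an illustration under stated assumptions and not this plugin's implementation: the class `MmapDocumentStore`, its methods `build`, `text`, and `close`, and the helper `demo` are hypothetical names invented for this example. See the documentation linked above for the actual API.

```python
import mmap
import os
import tempfile


class MmapDocumentStore:
    """Hypothetical illustration of mmap-backed text lookup.

    This is NOT the datamaestro-ir implementation; all names are assumptions.
    """

    def __init__(self, path, offsets):
        self._file = open(path, "rb")
        # Map the whole file read-only; the OS pages bytes in on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._offsets = offsets  # doc_id -> (start byte, length in bytes)

    @classmethod
    def build(cls, path, documents):
        """Write documents back to back and record where each one starts."""
        offsets = {}
        with open(path, "wb") as out:
            for doc_id, text in documents:
                data = text.encode("utf-8")
                offsets[doc_id] = (out.tell(), len(data))
                out.write(data)
        return cls(path, offsets)

    def text(self, doc_id):
        """Return the text of one document by slicing the mapped bytes."""
        start, length = self._offsets[doc_id]
        return self._mm[start:start + length].decode("utf-8")

    def close(self):
        self._mm.close()
        self._file.close()


def demo():
    """Build a tiny store in a temporary directory and read two documents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docs.bin")
        store = MmapDocumentStore.build(
            path, [("d1", "first passage"), ("d2", "second passage é")]
        )
        try:
            return store.text("d2"), store.text("d1")
        finally:
            store.close()
```

In this sketch the offset table lives in memory while the text stays on disk. A file-based variant would seek and read on each lookup, and an in-memory variant would hold every document in a dictionary.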

## Available Datasets

### Ad-hoc Retrieval

- **TREC Ad-hoc (1–8), Robust 2004/2005** — classic TREC test collections over TIPSTER/AQUAINT corpora
- **BEIR Benchmark** — 15+ datasets: TrecCovid, NQ, ArguAna, Touché, ClimateFever, SciDocs, NFCorpus, HotpotQA, FiQA, Quora, DBpedia-Entity, FEVER, SciFact, CQADupStack (12 sub-forums)
- **LoTTE** — domain-specific retrieval across 5 topical domains (lifestyle, recreation, science, technology, writing) plus a pooled set, each × dev/test × search/forum queries
- **MS MARCO Passage & Document** — passage ranking (8.8M passages) and document ranking (v1: 3.2M, v2: 12M documents)
- **CORD-19 / TREC-COVID** — COVID-19 research article retrieval (192K documents)

### Conversational Search

- **TREC CaST 2019–2022** — conversational passage retrieval with decontextualized queries, tree-structured conversations (2022), and segmented passages
- **iKAT 2023–2025** — interactive knowledge-seeking over ClueWeb22

### Query Rewriting

- **CANARD** — context-aware query rewriting (train/dev/test)
- **QReCC** — question rewriting in conversational context (14K conversations, 81K QA pairs)
- **OrConvQA** — open-retrieval conversational QA over 11M Wikipedia passages

### Knowledge Distillation & Training Data

- **MS MARCO Ensemble/BERT Teacher** — 40M triples with teacher scores
- **rank-distillm** — BM25/ColBERTv2/RankZephyr annotated passages
- **MS MARCO Hard Negatives** — hard negatives mined from multiple retrieval models
- **Neural Ranking KD** — knowledge distillation teacher scores

### Base Document Collections

- **TIPSTER** (AP, FT, WSJ, ZIFF, …), **AQUAINT**, **TREC CAR** (29.8M paragraphs), **WAPO** v2/v4, **KILT** (42M Wikipedia articles)

