Metadata-Version: 2.4
Name: adminlineage
Version: 0.2.2
Summary: Build administrative evolution keys across time with exact-match constrained Gemini adjudication
Author: AdminLineage Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/TahaIbrahimSiddiqui/AdminLineageAI
Project-URL: Repository, https://github.com/TahaIbrahimSiddiqui/AdminLineageAI
Project-URL: Issues, https://github.com/TahaIbrahimSiddiqui/AdminLineageAI/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic>=2.7
Requires-Dist: PyYAML>=6.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: google-genai>=0.7
Provides-Extra: io
Requires-Dist: pyarrow>=15.0; extra == "io"
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: pandas-stubs>=2.2; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: twine>=5.1; extra == "dev"
Requires-Dist: types-PyYAML>=6.0; extra == "dev"
Requires-Dist: vulture>=2.16; extra == "dev"
Dynamic: license-file

# AdminLineageAI

AdminLineageAI builds crosswalks between administrative units such as districts (ADM2), subdistricts (ADM3), states (ADM1), and countries (ADM0) across datasets that may come from completely different sources and periods. It uses AI to compare likely matches; reason over spelling variants, language-specific forms, and administrative splits, merges, and renames; and produce a usable crosswalk plus review artifacts.

Matching administrative units by hand is labour-intensive work. This package aims to reduce that manual effort while still keeping a clear review trail and reproducibility.

The package generates candidate matches between two datasets, asks Gemini to choose among them, and writes a final evolution key plus review files as CSV and Parquet.

Published package: [adminlineage on PyPI](https://pypi.org/project/adminlineage/)

<p align="center">
  <img alt="This is an experimental utility. Treat these crosswalks as assistive outputs and cross-verify them, especially in important cases." src="https://img.shields.io/static/v1?label=This%20is%20an%20experimental%20utility.&message=Treat%20these%20crosswalks%20as%20assistive%20outputs%20and%20cross-verify%20them%2C%20especially%20in%20important%20cases.&color=red">
</p>

## Possible use cases

Below are a few scenarios where this package can help. We would also love to hear about other user experiences and use cases.

- Matching a scheme dataset from a government program against a standard administrative list such as a census table. The scheme source may write `Paschimi Singhbhum` while the census uses `West Singhbhum`. Plain fuzzy matching misses cases like this unless you manually standardize prefixes and suffixes first, but an AI model can match them because it has the context that `paschim` means `west` in Hindi. The same kind of issue shows up across many widely spoken languages.
- Handling administrative churn. Districts and other units are regularly split, merged, renamed, or grouped differently, and there is often no up-to-date public evolution list for newly created units. The package runs a wide Google search to find a possible predecessor or successor for each administrative unit in the primary dataset.
- Creating entirely new evolution crosswalks between two time periods at a given administrative level where none exist.

## Important Features

- The package defaults are tuned for good results at minimal token cost. Feel free to change them to suit your needs.
- To keep token costs low, exact string matching plus pruning of matching candidates on the primary side runs before the first AI stage.
- Hierarchical matching with `exact_match`. If your data are nested, you can match names within exact scopes such as `country`, `state`, or `district`. For example, you can match district names only within a state, or subdistricts only within a district. This works well, but the exact-match column values need to line up exactly across both datasets.
- Replay and reproducibility. Academic pipelines often need to be rerun many times. With replay enabled, repeated semantic requests can reuse prior completed LLM work instead of calling the API again. The `seed` parameter helps keep request identity deterministic and makes reruns easier to reproduce.
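
The package's internal replay key derivation is not part of the public API, but the idea behind deterministic request identity can be sketched as follows. This is a hypothetical illustration, not the package's actual code: the identity of a semantic request stays stable across reruns when its payload and seed are serialized canonically before hashing.

```python
import hashlib
import json

def request_identity(payload: dict, seed: int) -> str:
    """Hypothetical sketch: derive a stable key for a semantic request.

    Sorting keys and fixing separators makes the JSON serialization
    deterministic, so identical payloads with the same seed always
    hash to the same identity across reruns.
    """
    canonical = json.dumps({"payload": payload, "seed": seed},
                           sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

key_a = request_identity({"from": "Faizabad", "candidates": ["Ayodhya"]}, seed=42)
key_b = request_identity({"candidates": ["Ayodhya"], "from": "Faizabad"}, seed=42)
assert key_a == key_b  # key order does not change identity
```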

The supported live workflow in AdminLineageAI is:

- Compatible with any `gemini-3+` model
- Google Search grounding enabled
- strict JSON output from the model
- user-controlled batching with automatic split fallback on failed multi-row requests
- an optional bounded second-stage rescue pass for unmatched rows when `string_exact_match_prune`
  is set to `from` or `to`

The bounded second stage works like this:

- first pass still does the normal grounded shortlist adjudication
- if `string_exact_match_prune="from"`, the rescue pass revisits rows with `merge="only_in_from"`
- if `string_exact_match_prune="to"`, it revisits rows with `merge="only_in_to"`
- it runs one grounded research call to look for a predecessor or successor name
- if that research comes back as `unknown` with no lineage hint, the row is left alone and the
  rescue pass stops there
- otherwise it searches the full opposite table, rebuilds a short global shortlist, and runs one
  final strict JSON decision call without additional search grounding
- the second stage is sequential, one-pass, resumable, and writes `second_stage_results.jsonl`
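
The real implementation is internal to the package, but the bounded control flow described above can be sketched roughly like this. All function names here are hypothetical stand-ins for the grounded research call and the final strict-JSON decision call:

```python
def second_stage_rescue(rows, research, decide, prune_mode):
    """Hypothetical sketch of the bounded second-stage rescue pass.

    rows: dicts carrying a 'merge' field from the first stage
    research: callable returning a lineage hint string, or 'unknown'
    decide: callable running the final shortlist decision
    prune_mode: 'from' or 'to', selecting which unmatched side to revisit
    """
    target = "only_in_from" if prune_mode == "from" else "only_in_to"
    results = []
    for row in rows:                       # sequential, one pass
        if row["merge"] != target:
            continue
        hint = research(row)               # one grounded research call
        if hint == "unknown":
            continue                       # no lineage hint: row left alone
        results.append(decide(row, hint))  # one final decision call
    return results

# Toy usage with stub callables standing in for the real Gemini calls:
rows = [
    {"name": "Agra", "merge": "both"},
    {"name": "Faizabad", "merge": "only_in_from"},
    {"name": "Ghost", "merge": "only_in_from"},
]
research = lambda row: "Ayodhya" if row["name"] == "Faizabad" else "unknown"
decide = lambda row, hint: {"from_name": row["name"], "to_name": hint}
rescued = second_stage_rescue(rows, research, decide, prune_mode="from")
print(rescued)  # [{'from_name': 'Faizabad', 'to_name': 'Ayodhya'}]
```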

## How To Use

You do not need the CLI to use AdminLineageAI. The simplest path is the Python API.

1. Install the published package.

```bash
pip install adminlineage
```

Install the optional parquet dependency if you want parquet output support:

```bash
pip install "adminlineage[io]"
```

2. Set a Gemini API key in `GEMINI_API_KEY`, or use another environment variable name and pass it explicitly.

```bash
export GEMINI_API_KEY="your_api_key_here"
```

The package can load a nearby `.env` file when it looks for the key.
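
For example, to use a non-default variable name (here `MY_GEMINI_KEY`, an arbitrary name chosen for illustration), set it in the environment and pass the variable name through `gemini_api_key_env`:

```python
import os

# Set the key under a custom variable name (normally done in your shell
# or a .env file rather than in code).
os.environ["MY_GEMINI_KEY"] = "your_api_key_here"

# Later, pass the variable *name*, not the key itself, to the matcher:
# adminlineage.build_evolution_key(..., gemini_api_key_env="MY_GEMINI_KEY")
print(os.environ["MY_GEMINI_KEY"])
```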

3. Choose the name column on each side, and add optional exact-match columns, IDs, or extra context columns if you have them.

4. Run the matcher.

```python
import pandas as pd
import adminlineage

df_from = pd.read_csv("from_units.csv")
df_to = pd.read_csv("to_units.csv")

crosswalk_df, metadata = adminlineage.build_evolution_key(
    df_from,
    df_to,
    country="India",
    year_from=1951,
    year_to=2001,
    map_col_from="district",
    map_col_to="district",
    exact_match=["state"],
    id_col_from="unit_id",
    id_col_to="unit_id",
    relationship="auto",
    string_exact_match_prune="from",
    evidence=False,
    reason=False,
    model="gemini-3.1-flash-lite-preview",
    gemini_api_key_env="GEMINI_API_KEY",
    replay_enabled=True,
    seed=42,
)

print(crosswalk_df[["from_name", "to_name", "merge", "score"]].head())
print(metadata["artifacts"])
```

5. Review the outputs. By default, AdminLineageAI writes artifacts under `outputs/<country>_<year_from>_<year_to>_<map_col_from>`. The main ones are `evolution_key.csv`, `review_queue.csv`, and `run_metadata.json`.

## Common Options

- `exact_match`: Restricts matching to rows that agree exactly on one or more scope columns such as `country`, `state`, or `district`.
- `string_exact_match_prune`: Controls how aggressively exact string hits are removed from later AI work. Use this to control token spend.
- `relationship`: Declares the kind of relationship you expect, or leave it as `auto`.
- `max_candidates`: Limits how many candidate rows are shown to the model for each source row. The default is 6.
- `evidence`: Adds a short factual summary column.
- `reason`: Adds a longer explanation column.
- `replay_enabled`: Reuses prior completed LLM work when the semantic request matches.
- `seed`: Keeps request identity deterministic for more reproducible reruns.
- `output_dir`: Changes where run artifacts are written.
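
To illustrate how `exact_match` scoping narrows the comparison space, here is a sketch with toy data (this is plain pandas for intuition, not the package's internal code):

```python
import pandas as pd

df_from = pd.DataFrame({
    "state": ["Uttar Pradesh", "Uttar Pradesh", "Bihar"],
    "district": ["Faizabad", "Allahabad", "Purnea"],
})
df_to = pd.DataFrame({
    "state": ["Uttar Pradesh", "Uttar Pradesh", "Bihar"],
    "district": ["Ayodhya", "Prayagraj", "Purnia"],
})

# With exact_match=["state"], a source row only sees candidates from
# the same state, so Purnea is never compared against Uttar Pradesh rows.
pairs = df_from.merge(df_to, on="state", suffixes=("_from", "_to"))
print(len(pairs))  # 5 scoped pairs instead of 9 global ones
```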

## Matching Flow Example

This example follows a nested district-level match inside `India > Uttar Pradesh` from `2011` to `2025`. Here `string_exact_match_prune='to'` (this sets `to` as the primary side and `from` as the secondary side, where all candidates stay global).

```mermaid
flowchart TD
    A["From table (2011)<br/>India / Uttar Pradesh / Agra<br/>India / Uttar Pradesh / Kanpur Dehat<br/>India / Uttar Pradesh / Faizabad<br/>India / Uttar Pradesh / Allahabad"]
    B["To table (2025)<br/>India / Uttar Pradesh / Agra<br/>India / Uttar Pradesh / Kanpur Rural<br/>India / Uttar Pradesh / Ayodhya<br/>India / Uttar Pradesh / Prayagraj"]
    C["Nested settings<br/>map_col='district'<br/>exact_match=['state']<br/>string_exact_match_prune='to'<br/>this sets 'to' as primary side<br/>and 'from' as secondary side<br/>where all candidates stay global"]
    D["Validate inputs and normalize names"]
    E["Exact string match pruning before LLM"]
    F["Agra -> Agra<br/>no LLM used here<br/>just exact string match"]
    H["AI matches remaining rows on primary side<br/>(Kanpur Rural, Ayodhya, Prayagraj)<br/>using grounded Gemini search<br/>"]
    I["AI matches Kanpur Dehat -> Kanpur Rural<br/>because it has context that 'dehat' means 'rural' in Hindi"]
    J{"Do Ayodhya or Prayagraj stay unmatched<br/>after first stage?"}
    L["Run an intensive Gemini search for a potential predecessor / successor of Ayodhya / Prayagraj<br/>in case they were renamed, merged, split, or transferred"]
    M["If Gemini finds a potential predecessor / successor for that district<br/>match it with the global district list from the secondary side"]
    N["Write final evolution key<br/>Agra -> Agra<br/>Kanpur Dehat -> Kanpur Rural<br/>Faizabad -> Ayodhya<br/>Allahabad -> Prayagraj"]
    O["Write artifacts<br/>evolution_key.csv<br/>review_queue.csv<br/>run_metadata.json<br/>replay bundle"]

    subgraph G["First stage"]
        H
        I
    end

    subgraph P["Second stage"]
        L
        M
    end

    A --> C
    B --> C
    C --> D
    D --> E
    E --> F
    E --> H
    H --> I
    I --> J
    J -- "No" --> N
    J -- "Yes" --> L
    L --> M
    A --> N
    B --> N
    M --> N
    N --> O
```

## Hand Check Against Scheme Ground Truth

This is a quick hand check against a human-made evolution key for a government scheme implemented nationally in India. The scheme side is `2025` districts, mapped back to their predecessor `2011` districts.

The comparison is oriented from the scheme side: for each `district_2025` in the hand key, does the evolution key recover the expected `district_2011` predecessor? Names were normalized before comparison, and spelling or transliteration-only differences were counted as alignments. A row counts as a match only when the evolution key has a non-blank `from_name`.

- `aligns` means the evolution key points to the same 2011 district name
- `disagrees` means the evolution key points to a different 2011 district
- `no match` means the evolution key does not provide any non-blank `from_name`

| Outcome | Count | Share of 612 hand-coded district pairs |
|---|---:|---:|
| Aligns with scheme hand mapping | 595 | 97.22% |
| Disagrees with scheme hand mapping | 11 | 1.80% |
| Evolution key provides no 2011 match | 6 | 0.98% |

Takeaway: most scheme districts map back to the same 2011 predecessor as the hand key, a few disagree, and a small number have no match. Treat this as a sanity check, not a full audit.
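
A comparison like this can also be scripted. Here is a sketch assuming both keys are loaded as DataFrames with normalized names; the column names and toy rows are illustrative, not the actual hand-key data:

```python
import pandas as pd

hand = pd.DataFrame({
    "district_2025": ["Ayodhya", "Prayagraj", "Kanpur Rural", "New District"],
    "district_2011": ["Faizabad", "Allahabad", "Kanpur Dehat", "Old District"],
})
evo = pd.DataFrame({
    "to_name": ["Ayodhya", "Prayagraj", "Kanpur Rural", "New District"],
    "from_name": ["Faizabad", "Allahabad", "Kanpur City", None],
})

# Orient from the scheme side: one row per hand-coded 2025 district.
merged = hand.merge(evo, left_on="district_2025", right_on="to_name", how="left")
no_match = merged["from_name"].isna()                               # blank from_name
aligns = ~no_match & (merged["from_name"] == merged["district_2011"])
disagrees = ~no_match & ~aligns
print(aligns.sum(), disagrees.sum(), no_match.sum())  # 2 1 1
```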

## Optional CLI Workflow

The CLI is useful when you want a saved YAML config for repeatable runs, but it is optional.

```bash
adminlineage preview --config examples/config/example.yml
adminlineage validate --config examples/config/example.yml
adminlineage run --config examples/config/example.yml
adminlineage export --input outputs/india_1951_2001_subdistrict/evolution_key.csv --format jsonl
```

The package includes these example assets:

- `examples/config/example.yml`
- `examples/loaders/sample_loader.py`
- `examples/adminlineage_gemini_3_1_flash_lite.ipynb`

## Python API

Public objects available from `import adminlineage`:

- `build_evolution_key`
- `preview_plan`
- `validate_inputs`
- `export_crosswalk`
- `get_output_schema_definition`
- `OUTPUT_SCHEMA_VERSION`
- `__version__`

### `build_evolution_key`

Build the evolution key and write run artifacts.

Required arguments:

| Argument | Type | Meaning |
|---|---|---|
| `df_from` | `pd.DataFrame` | Earlier-period table |
| `df_to` | `pd.DataFrame` | Later-period table |
| `country` | `str` | Country label used in prompts and metadata |
| `year_from` | `int \| str` | Earlier-period label |
| `year_to` | `int \| str` | Later-period label |
| `map_col_from` | `str` | Source name column |

Optional arguments:

| Argument | Type | Default | Meaning |
|---|---|---|---|
| `map_col_to` | `str \| None` | `None` | Target name column. Falls back to `map_col_from` when omitted. |
| `exact_match` | `list[str] \| None` | `None` | Columns that must agree before comparison. |
| `id_col_from` | `str \| None` | `None` | Source ID column. |
| `id_col_to` | `str \| None` | `None` | Target ID column. |
| `extra_context_cols` | `list[str] \| None` | `None` | Extra columns added to the model payload. |
| `relationship` | `str` | `auto` | One of `auto`, `father_to_father`, `father_to_child`, `child_to_father`, `child_to_child`. |
| `string_exact_match_prune` | `str` | `none` | `none` keeps exact-string hits in later AI work, `from` removes matched source rows from AI work, `to` removes matched source and target rows from later AI work. |
| `evidence` | `bool` | `False` | Adds a short evidence summary and includes the `evidence` column. |
| `reason` | `bool` | `False` | Adds a longer explanation in the `reason` column. |
| `model` | `str` | `gemini-3.1-flash-lite-preview` | Gemini model name. |
| `gemini_api_key_env` | `str` | `GEMINI_API_KEY` | Environment variable name used for the API key. |
| `batch_size` | `int` | `5` | Maximum number of source rows per Gemini request. When a multi-row request fails, the pipeline retries in smaller batches. |
| `max_candidates` | `int` | `6` | Candidate shortlist size per source row. |
| `output_dir` | `str \| Path` | `outputs` | Base output directory for run artifacts. |
| `seed` | `int` | `42` | Deterministic seed for repeatable request identity. |
| `temperature` | `float` | `0.75` | Gemini temperature. |
| `enable_google_search` | `bool` | `True` | Enables grounded Gemini adjudication. |
| `request_timeout_seconds` | `int \| None` | `90` | Per-request timeout. |
| `env_search_dir` | `str \| Path \| None` | `None` | Starting directory used when searching for `.env`. |
| `replay_enabled` | `bool` | `False` | Reuses prior completed LLM work when the semantic request matches. |
| `replay_store_dir` | `str \| Path \| None` | `None` | Replay store path. Falls back to `.adminlineage_replay` internally when replay is enabled. |

Return value:

- `tuple[pd.DataFrame, dict]`
- first item: the crosswalk DataFrame
- second item: run metadata with counts, warnings, request details, and artifact paths

### `preview_plan`

Preview grouping and candidate-generation behavior without calling Gemini.

```python
adminlineage.preview_plan(
    df_from,
    df_to,
    *,
    country,
    year_from,
    year_to,
    map_col_from,
    map_col_to=None,
    exact_match=None,
    id_col_from=None,
    id_col_to=None,
    extra_context_cols=None,
    string_exact_match_prune="none",
    max_candidates=6,
)
```

Return value: a diagnostics dict describing validity, group sizes, exact-string hits, and candidate budgets.

### `validate_inputs`

Validate the two input tables without running the pipeline.

```python
adminlineage.validate_inputs(
    df_from,
    df_to,
    *,
    country,
    map_col_from,
    map_col_to=None,
    exact_match=None,
    id_col_from=None,
    id_col_to=None,
)
```

Return value: a diagnostics dict that reports whether the inputs are valid and what is missing or duplicated.

### `export_crosswalk`

Convert a materialized crosswalk file into another format.

```python
adminlineage.export_crosswalk(
    input_path="outputs/india_1951_2001_subdistrict/evolution_key.csv",
    output_format="jsonl",
    output_path=None,
)
```

Return value: the written output path.

Supported output formats:

- `csv`
- `parquet`
- `jsonl`
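
`export_crosswalk` handles the conversion internally; for intuition, the CSV-to-JSONL direction is roughly equivalent to the following pandas sketch (not the package's own code):

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for an evolution_key.csv file.
csv_text = "from_name,to_name,merge\nFaizabad,Ayodhya,both\n"
df = pd.read_csv(StringIO(csv_text))

# JSON Lines: one JSON object per crosswalk row.
jsonl = df.to_json(orient="records", lines=True)
print(jsonl)
```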

### `get_output_schema_definition`

Return a machine-readable description of the materialized output schema.

```python
schema = adminlineage.get_output_schema_definition(include_evidence=False)
```

Arguments:

| Argument | Type | Default | Meaning |
|---|---|---|---|
| `include_evidence` | `bool` | `False` | Includes the `evidence` column in the returned schema definition. |

Return value: a dict containing the schema version, ordered output columns, required columns, and enum values, including the `merge` indicator enum.

### `OUTPUT_SCHEMA_VERSION`

String constant for the current materialized output schema version.

### `__version__`

String constant for the package version.

## Optional CLI Reference

Commands:

```bash
adminlineage run --config path/to/config.yml
adminlineage preview --config path/to/config.yml
adminlineage validate --config path/to/config.yml
adminlineage export --input path/to/evolution_key.csv --format {csv|parquet|jsonl} [--output path]
```

`preview` and `validate` do not call Gemini. `run` writes the full artifact set. `export` converts an existing materialized crosswalk file. If you are using the Python API directly, you can ignore this section.

## CLI YAML Config Reference

Top-level sections:

- `request`
- `data`
- `llm`
- `pipeline`
- `cache`
- `retry`
- `replay`
- `output`

### `request`

| Key | Default | Meaning |
|---|---|---|
| `country` | required | Country label used in prompts and metadata. |
| `year_from` | required | Earlier-period label. |
| `year_to` | required | Later-period label. |
| `map_col_from` | required | Source name column. |
| `map_col_to` | `null` | Target name column. Falls back to `map_col_from`. |
| `exact_match` | `[]` | Columns that must agree before comparison. |
| `id_col_from` | `null` | Source ID column. |
| `id_col_to` | `null` | Target ID column. |
| `extra_context_cols` | `[]` | Extra columns added to the model payload. |
| `relationship` | `auto` | Relationship mode. |
| `string_exact_match_prune` | `none` | Exact-string pruning mode. |
| `evidence` | `false` | Adds the `evidence` column. |
| `reason` | `false` | Adds the `reason` column. |

### `data`

| Key | Default | Meaning |
|---|---|---|
| `mode` | `files` | One of `files` or `python_hook`. |
| `from_path` | `null` | Required when `mode: files`. |
| `to_path` | `null` | Required when `mode: files`. |
| `callable` | `null` | Required when `mode: python_hook`. Uses `module:function` syntax. |
| `params` | `{}` | Arbitrary config payload passed to the loader hook. |

Loader contract for `python_hook` mode:

```python
def load_data(config: dict) -> tuple[pd.DataFrame, pd.DataFrame]:
    ...
```

The included example hook is `examples/loaders/sample_loader.py`.

For file mode, `data.from_path` and `data.to_path` are resolved relative to the config file location, not your shell location.
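
A minimal loader satisfying this contract might look like the following. The `params` keys used here (`from_records`, `to_records`) are hypothetical; the included `examples/loaders/sample_loader.py` shows the package's own version:

```python
import pandas as pd

def load_data(config: dict) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Build the from/to tables from the config payload.

    In python_hook mode the loader receives the config payload, so any
    keys referenced here must be defined under data.params in the YAML.
    """
    params = config.get("params", config)
    df_from = pd.DataFrame(params["from_records"])
    df_to = pd.DataFrame(params["to_records"])
    return df_from, df_to

# Usage with inline records (in practice these might come from files
# or a database referenced in params):
f, t = load_data({"params": {
    "from_records": [{"district": "Faizabad", "state": "Uttar Pradesh"}],
    "to_records": [{"district": "Ayodhya", "state": "Uttar Pradesh"}],
}})
```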

### `llm`

| Key | Default | Meaning |
|---|---|---|
| `provider` | `gemini` | Use `gemini` for live runs or `mock` for dry runs and testing. |
| `model` | `gemini-3.1-flash-lite-preview` | Gemini model name. |
| `gemini_api_key_env` | `GEMINI_API_KEY` | Environment variable name for the API key. |
| `temperature` | `0.75` | Gemini temperature. |
| `seed` | `42` | Deterministic seed. |
| `enable_google_search` | `true` | Enables grounded adjudication. |
| `request_timeout_seconds` | `90` | Per-request timeout. |

### `pipeline`

| Key | Default | Meaning |
|---|---|---|
| `batch_size` | `5` | Maximum number of source rows per Gemini request. Failed multi-row requests are retried in smaller batches. |
| `max_candidates` | `6` | Candidate shortlist size per source row. You can raise this if you want a wider shortlist. |
| `review_score_threshold` | `0.6` | Rows below this score are flagged for review. |

### `cache`

| Key | Default | Meaning |
|---|---|---|
| `enabled` | `true` | Enables the SQLite LLM cache. |
| `backend` | `sqlite` | Current cache backend. |
| `path` | `llm_cache.sqlite` | Cache database path. |

### `retry`

| Key | Default | Meaning |
|---|---|---|
| `max_attempts` | `6` | Maximum retry attempts for transient LLM failures. |
| `base_delay_seconds` | `1.0` | Initial retry delay. |
| `max_delay_seconds` | `20.0` | Maximum retry delay. |
| `jitter_seconds` | `0.2` | Random jitter added to retry timing. |

### `replay`

| Key | Default | Meaning |
|---|---|---|
| `enabled` | `false` | Enables exact replay for fully completed runs. |
| `store_dir` | `.adminlineage_replay` | Replay bundle directory. |

Relative replay store paths are resolved from the config file location. This section only matters if you are using the CLI workflow.

### `output`

| Key | Default | Meaning |
|---|---|---|
| `write_csv` | `true` | Writes `evolution_key.csv`. |
| `write_parquet` | `true` | Writes `evolution_key.parquet`. |

Minimal config shape:

```yaml
request:
  country: India
  year_from: 1951
  year_to: 2001
  map_col_from: subdistrict
  map_col_to: subdistrict
  exact_match: [state, district]
  id_col_from: unit_id
  id_col_to: unit_id
  relationship: auto
  string_exact_match_prune: none
  evidence: false
  reason: false

data:
  mode: files
  from_path: ../data/from_units.csv
  to_path: ../data/to_units.csv

llm:
  provider: gemini
  model: gemini-3.1-flash-lite-preview
  gemini_api_key_env: GEMINI_API_KEY
  temperature: 0.75
  seed: 42
  enable_google_search: true
  request_timeout_seconds: 90

pipeline:
  batch_size: 5
  max_candidates: 6
  review_score_threshold: 0.6

cache:
  enabled: true
  backend: sqlite
  path: llm_cache.sqlite

retry:
  max_attempts: 6
  base_delay_seconds: 1.0
  max_delay_seconds: 20.0
  jitter_seconds: 0.2

replay:
  enabled: false
  store_dir: .adminlineage_replay

output:
  write_csv: true
  write_parquet: true
```
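
Before a run, a quick sanity check of a config like the one above can catch missing required keys. This is a sketch using PyYAML (already a package dependency); the required-key list follows the `request` table above, and the inline config text is a truncated stand-in for a real file:

```python
import yaml

config_text = """
request:
  country: India
  year_from: 1951
  year_to: 2001
  map_col_from: subdistrict
data:
  mode: files
  from_path: ../data/from_units.csv
  to_path: ../data/to_units.csv
"""

config = yaml.safe_load(config_text)
# Required request keys per the reference table above.
required = ["country", "year_from", "year_to", "map_col_from"]
missing = [k for k in required if k not in config.get("request", {})]
print(missing)  # [] when all required request keys are present
```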

## Outputs And Utilities

### Main Artifacts

| Artifact | Meaning |
|---|---|
| `evolution_key.csv` | Main crosswalk output. |
| `evolution_key.parquet` | Parquet version of the crosswalk output. |
| `review_queue.csv` | Rows that need manual review. |
| `run_metadata.json` | Run counts, warnings, request details, and artifact paths. |
| `links_raw.jsonl` | Incremental per-row decision log used for resumability and replay publishing. |

### Crosswalk Columns

| Column | Meaning |
|---|---|
| `from_name`, `to_name` | Raw source and target names. |
| `from_canonical_name`, `to_canonical_name` | Normalized names used during matching. |
| `from_id`, `to_id` | User IDs when supplied, otherwise fallback internal IDs. |
| `score` | Confidence in the chosen link, in `[0, 1]`. |
| `link_type` | One of `rename`, `split`, `merge`, `transfer`, `no_match`, `unknown`. |
| `relationship` | One of `father_to_father`, `father_to_child`, `child_to_father`, `child_to_child`, `unknown`. |
| `merge` | `both` for matched rows, `only_in_from` for source-only rows, `only_in_to` for target-only rows appended after the source pass. |
| `evidence` | Short grounded summary. Included only when `evidence=True`. |
| `reason` | Longer explanation. Present as a column, but empty unless `reason=True`. |
| exact-match columns | Copied context columns from the request, such as `state` or `district`. |
| `country`, `year_from`, `year_to` | Request metadata. |
| `run_id` | Deterministic run identifier. |
| `from_key`, `to_key` | Internal stable keys used by the pipeline. |
| `constraints_passed` | Constraint checks recorded for that row. |
| `review_flags`, `review_reason` | QA flags and their comma-joined summary. |

`review_queue.csv` is a filtered subset of the crosswalk for rows that were flagged for manual review. Target-only rows remain in the final evolution key with `merge="only_in_to"`.
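
The `merge` and review columns make post-run filtering straightforward. A sketch with a toy crosswalk frame (real outputs carry many more columns, and these flag values are illustrative):

```python
import pandas as pd

crosswalk = pd.DataFrame({
    "from_name": ["Faizabad", "Kanpur Dehat", None],
    "to_name": ["Ayodhya", None, "New District"],
    "merge": ["both", "only_in_from", "only_in_to"],
    "score": [0.95, 0.40, 0.0],
    "review_flags": ["", "low_score", "unmatched_target"],
})

# Matched links vs rows that would land in the review queue.
matched = crosswalk[crosswalk["merge"] == "both"]
needs_review = crosswalk[crosswalk["review_flags"] != ""]
print(len(matched), len(needs_review))  # 1 2
```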

## Operational Notes

- `exact_match` scopes the candidate search. If you set `exact_match=["state", "district"]`, a row only compares against rows from the same `(state, district)` group. This is the main hierarchical matching mechanism in the package.
- Candidate generation happens before Gemini. `max_candidates` controls how many shortlist entries the model sees for each source row. The default is 6, but you can still raise it explicitly.
- Exact string handling happens before the model call. `string_exact_match_prune` controls whether already matched rows remain in later AI work.
- Live Gemini work is grounded with Google Search and returns strict JSON. The pipeline then materializes CSV and Parquet outputs itself.
- When `string_exact_match_prune` is `from` or `to`, the package can run one bounded second-stage rescue pass on unmatched primary-side rows. That pass does one grounded research call, and only does a second shortlist decision call if the research returned a usable `lineage_hint`.
- Replay is opt-in. When `replay_enabled=True`, rerunning the same semantic request reuses the prior completed LLM output instead of calling Gemini again.
- `seed` helps keep request identity deterministic and makes runs easier to reproduce.
- Cache is configured in CLI config. When enabled, the package uses a SQLite cache at `cache.path`.
- Retry behavior is configurable in CLI config. Transient Gemini failures are retried according to the `retry` section before a row is marked unresolved.
- `export_crosswalk` and `adminlineage export` convert an existing materialized crosswalk into `csv`, `parquet`, or `jsonl`.

## A Few Practical Defaults

- `model="gemini-3.1-flash-lite-preview"`
- `temperature=0.75`
- `enable_google_search=True`
- `evidence=False`
- `reason=False`
- `relationship="auto"`
- `string_exact_match_prune="none"`

Those are the current defaults. Change them when you need replay, evidence, stricter scoping, or different review thresholds.

## Reporting Issues

If you run into a bug, a broken match, or a confusing output, please open an issue on [GitHub](https://github.com/TahaIbrahimSiddiqui/AdminLineageAI/issues).

The most useful issue reports include:

- the package version, for example `adminlineage.__version__`
- whether you used the Python API, CLI, or one of the notebooks
- the model name and the main matching settings, especially `exact_match`, `string_exact_match_prune`, `batch_size`, `max_candidates`, and `enable_google_search`
- whether the run was fresh, resumed from an existing output directory, or reused replay artifacts
- a small sanitized input example that reproduces the problem
- the relevant rows from `evolution_key.csv` or `review_queue.csv`
- `run_metadata.json`, and when relevant `links_raw.jsonl`, `grounding_notes.jsonl`, and `second_stage_results.jsonl`
- the traceback or log excerpt if the run failed


## Citation

If you use AdminLineageAI in published work, please cite the package and briefly report the workflow you used.

Suggested software citation:

Siddiqui, T. I., and Vetharenian Hari. (2026). *AdminLineageAI* (Version 0.2.2) [Python package]. [https://pypi.org/project/adminlineage/](https://pypi.org/project/adminlineage/)

If the workflow matters for interpretation, report the key settings in your methods or appendix:

- country and time span
- administrative level and exact-match scope
- `string_exact_match_prune` mode
- Gemini model name
- whether Google Search grounding was enabled
- whether the bounded second-stage rescue pass was active
- whether outputs were manually reviewed or corrected
