Metadata-Version: 2.4
Name: hitlist
Version: 1.21.0
Summary: Curated mass spectrometry evidence for MHC ligand data from IEDB and CEDAR
Author-email: Alex Rubinsteyn <alex.rubinsteyn@unc.edu>
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [yyyy] [name of copyright owner]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: Homepage, https://github.com/pirl-unc/hitlist
Project-URL: Repository, https://github.com/pirl-unc/hitlist
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: pyarrow
Requires-Dist: PyYAML
Requires-Dist: tqdm
Provides-Extra: alleles
Requires-Dist: mhcgnomes>=3.31.0; extra == "alleles"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# hitlist

[![Tests](https://github.com/pirl-unc/hitlist/actions/workflows/tests.yml/badge.svg)](https://github.com/pirl-unc/hitlist/actions/workflows/tests.yml)
[![PyPI](https://img.shields.io/pypi/v/hitlist.svg)](https://pypi.org/project/hitlist/)

A curated, harmonized, **ML-training-ready** MHC ligand mass-spectrometry dataset.

hitlist ingests immunopeptidome data from [IEDB](https://www.iedb.org/), [CEDAR](https://cedar.iedb.org/), and paper supplementary tables (PRIDE/jPOSTrepo). It separates MS-eluted observations from in-vitro binding-assay measurements into two parquet files, so downstream consumers never silently conflate them. It joins every MS observation to expert-curated sample metadata (HLA genotype, tissue, disease, perturbation, instrument) and ships both indexes as parquet files alongside a pandas-friendly Python API.

## What's curated

| | Count |
|---|---|
| Curated PMIDs (`pmid_overrides.yaml`) | **155** — covers 89.5% of observations |
| `ms_samples` entries with per-sample metadata | 359 |
| `ms_samples` entries with 4-digit HLA typing | 237 |
| Supplementary CSVs ingested (PRIDE/jPOSTrepo) | 5 (JY, HeLa, SK-MEL-37, Raji, plasma — Gomez-Zepeda 2024) |
| Species reference proteomes (registry) | 19 (Ensembl: 4, UniProt: 15) |
| Viral reference proteomes (registry) | 30 distinct viruses, 54 name aliases |

## What's in the two indexes

After `hitlist data build` (snapshot of the shipping 1.10.x default build):

### `observations.parquet` — MS-eluted immunopeptidome

| | |
|---|---|
| **Total observations** (MS-eluted, all species) | **4,053,693** |
| **Unique peptides** | **1,285,987** |
| Unique MHC alleles | 691 |
| MHC species covered | 21 |
| IEDB rows | 3,986,991 |
| CEDAR rows | 595 |
| Supplementary rows | 66,107 |

### `binding.parquet` — in-vitro binding-assay measurements

| | |
|---|---|
| **Total binding rows** (peptide microarray, refolding, MEDi, qualitative-tier) | **895,785** |
| **Unique peptides** | **258,199** |

The two indexes share the schema (including gene annotations from the peptide-mappings sidecar), but supplementary curation is MS-only — binding is pure IEDB/CEDAR.

### Human MHC-I breakdown

| | |
|---|---|
| Observations | 2,672,046 |
| Unique peptides | 748,386 |
| Mono-allelic (exact allele) | 579,096 obs / 300K peptides / 119 alleles |
| Multi-allelic with allele match | 450,399 obs |
| Multi-allelic with class-pool (`N of M alleles`) | 784,370 obs |
| Allele-resolved `sample_mhc` coverage | **74.8%** |

## Install

```bash
pip install hitlist
```

## Quick start for ML training

```bash
# One-time: register IEDB + CEDAR downloads and build
hitlist data register iedb /path/to/mhc_ligand_full.csv
hitlist data register cedar /path/to/cedar-mhc-ligand-full.csv
hitlist data build                           # a few minutes end-to-end;
                                             # writes observations.parquet +
                                             # binding.parquet + peptide_mappings.parquet

# Export training-ready CSVs
hitlist export training --include-evidence ms --class I --species "Homo sapiens" --mono-allelic \
    --min-allele-resolution four_digit -o mono_allelic_classI.csv

hitlist export training --include-evidence ms --class II --species "Homo sapiens" \
    -o classII_training.csv

# Presto-style flank-aware export: one row per (evidence row, peptide mapping)
hitlist export training --include-evidence both --class I --species "Homo sapiens" \
    --explode-mappings -o presto_training.parquet
```

`hitlist export training` does **not** create a new canonical store. It composes the existing `observations.parquet`, `binding.parquet`, and `peptide_mappings.parquet` indexes into one training-facing export surface. The low-level indexes keep their semantic boundaries; the training export gives downstream consumers one obvious API/CLI path when they want model-ready tables.

## Python API

### Training-data export

```python
from hitlist.export import generate_ms_observations_table
from hitlist.export import generate_training_table

# Mono-allelic human class I MS observations with ground-truth allele
mono_ms = generate_ms_observations_table(
    mhc_class="I",
    species="Homo sapiens",
    is_mono_allelic=True,
    min_allele_resolution="four_digit",
)

# Presto-style mapping-aware export: one row per (evidence row, peptide mapping)
presto = generate_training_table(
    include_evidence="both",
    mhc_class="I",
    species="Homo sapiens",
    explode_mappings=True,
)
# columns now include: evidence_kind, evidence_row_id, protein_id, position,
# n_flank, c_flank, proteome, proteome_source
```

`generate_observations_table()` remains available as a backward-compatible alias for `generate_ms_observations_table()`.

Species filters accept any variant — `"Homo sapiens"`, `"human"`, `"homo_sapiens"`, `"Homo sapiens (human)"` all work.

### Raw observations loading

```python
from hitlist.observations import (
    load_ms_observations,     # MS-eluted immunopeptidome
    load_binding,             # in-vitro binding-assay measurements
    load_all_evidence,        # union, tagged with an evidence_kind column
    is_built, is_binding_built,
    observations_path, binding_path,
)

# MS-elution (the default training-data path)
df = load_ms_observations()                       # everything (MS-eluted only)
df = load_ms_observations(mhc_class="I")          # class I only
df = load_ms_observations(species="Homo sapiens") # human only
df = load_ms_observations(source="iedb")          # filter by source
df = load_ms_observations(columns=["peptide", "mhc_restriction", "src_cancer"])

# Binding assays — same filter API, reads binding.parquet
bd = load_binding(mhc_class="I", mhc_restriction="HLA-A*02:01")

# Union — for affinity-predictor training, or UI flags that want both.
# Rows are tagged with evidence_kind ∈ {"ms", "binding"}.
both = load_all_evidence(gene_name="PRAME", mhc_class="I")
both["evidence_kind"].value_counts()
```

`load_observations()` remains available as a backward-compatible alias for `load_ms_observations()`.

### Building / curation

```python
from hitlist.builder import build_observations
from hitlist.curation import (
    classify_ms_row,
    normalize_species,
    normalize_allele,
    load_pmid_overrides,
)
from hitlist.supplement import scan_supplementary, load_supplementary_manifest

build_observations(with_flanking=True, use_uniprot_search=True, force=False)
normalize_species("human")           # → "Homo sapiens"
normalize_allele("H-2Kb")            # → "H2-K*b"
scan_supplementary()                 # DataFrame of curated paper-supplement peptides
```

### Peptide → protein attribution and flanking context

By default, `hitlist data build` produces three parquet files (pass `--no-mappings` to skip `peptide_mappings.parquet`):

- `~/.hitlist/observations.parquet` — one row per assay observation
- `~/.hitlist/binding.parquet` — one row per binding-assay observation
- `~/.hitlist/peptide_mappings.parquet` — one row per (peptide, protein, position)

The mappings sidecar **preserves multi-mapping** so a peptide shared by MAGEA1/A4/A10/A12 keeps every paralog. Observations additionally carry semicolon-joined identity columns:

| column | example |
|---|---|
| `gene_names` | `MAGEA4;MAGEA10` |
| `gene_ids` | `ENSG00000147381;ENSG00000124260` |
| `protein_ids` | `P43359;P43363` |
| `n_source_proteins` | `2` |
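
The semicolon-joined columns above can be unpacked with plain string handling. The following is a minimal sketch, not hitlist code: `split_joined` is a hypothetical helper, and it assumes an empty or missing value means the peptide has no mapped source protein.

```python
def split_joined(value):
    """Split a semicolon-joined identity column into a list.

    Hypothetical helper for illustration; assumes empty or None means "no mapping".
    """
    if not value:
        return []
    return [item for item in value.split(";") if item]


# One observation row, using the example values from the table above.
row = {
    "gene_names": "MAGEA4;MAGEA10",
    "gene_ids": "ENSG00000147381;ENSG00000124260",
    "n_source_proteins": 2,
}

genes = split_joined(row["gene_names"])
gene_ids = split_joined(row["gene_ids"])
print(genes)     # ['MAGEA4', 'MAGEA10']
print(gene_ids)  # ['ENSG00000147381', 'ENSG00000124260']
```

For per-protein positions and flanks, the long-form `peptide_mappings.parquet` is the better source than re-splitting these columns.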

```python
from hitlist.observations import load_ms_observations
from hitlist.mappings import load_peptide_mappings

# Central columns — fast for everyday filters (uses mappings sidecar for pushdown)
df = load_ms_observations(gene_name="PRAME")

# Long form for paralog / position / flank analysis
mappings = load_peptide_mappings(gene_name="MAGEA4")
# columns: peptide, protein_id, gene_name, gene_id, position, n_flank, c_flank, proteome
```

For ad-hoc queries without building the full table:

```python
from hitlist.proteome import ProteomeIndex

idx = ProteomeIndex.from_ensembl(release=112, species="human")
flanking = idx.map_peptides(["SLLMWITQC", "GILGFVFTL"], flank=10)
```

### Proteome registry / UniProt resolution

```python
from hitlist.downloads import (
    lookup_proteome,           # org string → registry entry (dict)
    fetch_species_proteome,    # download FASTA and cache to ~/.hitlist/proteomes/
    resolve_proteome_via_uniprot,  # direct UniProt REST lookup
    list_proteomes,            # manifest section
)

lookup_proteome("Mycobacterium tuberculosis", use_uniprot=True)
# → {'kind': 'uniprot', 'proteome_id': 'UP000001020', ...}
```

## Output schema — `generate_ms_observations_table()`

| Column | Meaning |
|---|---|
| `peptide` | Amino acid sequence |
| `mhc_restriction` | Allele from IEDB (may be `"HLA class I"` for multi-allelic studies) |
| `sample_mhc` | Allele(s) known for the source sample — the **useful** field for training |
| `mhc_class` | `I`, `II`, or `non classical` |
| `mhc_species` | Canonical species (normalized via mhcgnomes) |
| `is_monoallelic` | True if sample has a single transfected allele (721.221, C1R, K562, MAPTAC…) |
| `has_peptide_level_allele` | True if `mhc_restriction` is a specific allele (not `"HLA class I"`) |
| `is_potential_contaminant` | True for MS-eluted peptides that failed NetMHCpan binding prediction |
| `sample_match_type` | How `sample_mhc` was populated (see below) |
| `matched_sample_count` | Number of curated samples for this PMID |
| `src_cancer`, `src_healthy_tissue`, `src_ebv_lcl`, ... | Mutually-exclusive biological source categories |
| `source` | `iedb`, `cedar`, or `supplement` |
| `source_organism`, `reference_title`, `cell_name`, `source_tissue`, `disease` | IEDB sample context |
| `instrument`, `instrument_type`, `acquisition_mode`, `fragmentation`, `labeling`, `ip_antibody` | MS acquisition from `ms_samples` curation |
| `gene_names`, `gene_ids`, `protein_ids`, `n_source_proteins` | Multi-mapping peptide → source-protein attribution (always populated; use `peptide_mappings.parquet` for long-form positions + flanks) |

### `sample_match_type` — join provenance

| Value | Meaning | Training-grade? |
|---|---|---|
| `allele_match` | IEDB recorded a specific allele and it matched a curated sample genotype | **Yes** — high confidence |
| `single_sample_fallback` | IEDB class-only but study has exactly 1 sample, so `sample_mhc` = that sample's full genotype | Yes (for deconvolution) |
| `pmid_class_pool` | IEDB class-only + multiple samples — `sample_mhc` = union of all class-matching alleles across samples | Yes (for deconvolution), lower precision |
| `unmatched` | No curated sample for this PMID, or all samples have `mhc: unknown` | No — `sample_mhc` empty |
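
The decision order in the table can be sketched as a small function. This is an illustration under stated assumptions, not hitlist's implementation: `classify_sample_match` is a hypothetical name, each curated sample is reduced to a plain list of alleles for the relevant class, and a specific IEDB allele that matches no curated sample is treated as `unmatched`.

```python
def classify_sample_match(restriction, samples):
    """Return (sample_match_type, sample_mhc) for one row.

    restriction: a specific allele string, or None when IEDB is class-only.
    samples: one allele list per curated sample; an empty list stands in
             for `mhc: unknown` (assumed shape, for illustration only).
    """
    typed = [alleles for alleles in samples if alleles]
    if not typed:
        return "unmatched", []
    if restriction is not None:
        for alleles in typed:
            if restriction in alleles:
                return "allele_match", [restriction]
        # Assumption: a specific allele absent from every sample stays unmatched.
        return "unmatched", []
    if len(samples) == 1:
        # Class-only row, exactly one sample: use that sample's genotype.
        return "single_sample_fallback", sorted(typed[0])
    # Class-only row, several samples: pool every allele seen across samples.
    pooled = sorted({allele for alleles in typed for allele in alleles})
    return "pmid_class_pool", pooled


one_sample = [["HLA-B*07:02", "HLA-A*02:01"]]
two_samples = [["HLA-A*02:01"], ["HLA-A*02:01", "HLA-B*07:02"]]

print(classify_sample_match("HLA-A*02:01", one_sample))
print(classify_sample_match(None, one_sample))
print(classify_sample_match(None, two_samples))
print(classify_sample_match(None, []))
```

The pooled case is why `pmid_class_pool` is lower precision: the returned list is a superset of the alleles any single sample actually carries.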

## Biological source classification

Every observation is classified by mutually-exclusive biological source category:

| Category | Flag | Rule |
|---|---|---|
| Cancer | `src_cancer` | Tumor tissue, cancer patient biofluids, or non-EBV cell lines |
| Adjacent to tumor | `src_adjacent_to_tumor` | Surgically resected "normal" tissue (per-PMID override) |
| Activated APC | `src_activated_apc` | Monocyte-derived DCs/macrophages with pharmacological activation |
| Healthy somatic | `src_healthy_tissue` | Direct ex vivo, healthy donor, non-reproductive, non-thymic |
| Healthy thymus | `src_healthy_thymus` | Direct ex vivo thymus (expected for CTAs, AIRE-mediated) |
| Healthy reproductive | `src_healthy_reproductive` | Direct ex vivo testis, ovary (expected for CTAs) |
| EBV-LCL | `src_ebv_lcl` | EBV-transformed B-cell lines |
| Cell line | `src_cell_line` | Any cultured cell line |

**Cancer-specific** = `src_cancer AND NOT src_healthy_tissue`. Thymus, reproductive tissue, adjacent tissue, EBV-LCLs, and activated APCs do NOT disqualify a peptide from being cancer-specific.
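
A minimal sketch of the cancer-specific rule, assuming it is evaluated per peptide across all of that peptide's observations (the README does not spell out the aggregation level). `cancer_specific_peptides` is a hypothetical helper and the peptide names are placeholders, not real data.

```python
from collections import defaultdict


def cancer_specific_peptides(observations):
    """Peptides with at least one cancer observation and no healthy-somatic one."""
    seen = defaultdict(lambda: {"cancer": False, "healthy": False})
    for obs in observations:
        entry = seen[obs["peptide"]]
        entry["cancer"] |= bool(obs.get("src_cancer"))
        entry["healthy"] |= bool(obs.get("src_healthy_tissue"))
    return sorted(p for p, f in seen.items() if f["cancer"] and not f["healthy"])


# Placeholder rows: only the flags relevant to the rule are shown.
observations = [
    {"peptide": "PEP_A", "src_cancer": True},
    {"peptide": "PEP_A", "src_healthy_thymus": True},   # thymus does not disqualify
    {"peptide": "PEP_B", "src_cancer": True},
    {"peptide": "PEP_B", "src_healthy_tissue": True},   # healthy somatic disqualifies
    {"peptide": "PEP_C", "src_healthy_tissue": True},   # never seen in cancer
]

specific = cancer_specific_peptides(observations)
print(specific)  # ['PEP_A']
```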

## CLI reference

### Data management

```bash
hitlist data register <name> <path> [-d DESCRIPTION]    # register a local file
hitlist data fetch <name> [--force]                     # download a known dataset (IEDB/CEDAR/viral FASTAs)
hitlist data refresh <name>                             # re-download
hitlist data info <name>                                # detailed metadata (JSON)
hitlist data path <name>                                # print the registered path
hitlist data remove <name> [--delete]                   # unregister (optionally delete file)
hitlist data list                                       # show registered datasets
hitlist data available                                  # show all known datasets
```

### Build the observations table

```bash
hitlist data build [--force]                            # ~90s full scan with tqdm progress
hitlist data build                                      # builds peptide_mappings.parquet by default
hitlist data build --use-uniprot                        # broader proteome coverage via UniProt REST
hitlist data build --no-mappings                        # skip mapping step (faster, no gene attribution)
hitlist data build --no-fetch-proteomes                 # don't auto-download missing proteomes
hitlist data build --proteome-release 112               # Ensembl release for human/mouse/rat
```

### Proteome management

```bash
hitlist data fetch-proteomes [--min-observations N] [--use-uniprot] [--force]
hitlist data list-proteomes
```

### Index (for raw peptide counts)

```bash
hitlist data index [--source iedb|cedar|merged|all] [--force]
```

### Export

```bash
hitlist export observations [filters...] -o train.csv   # MS immunopeptidome + sample metadata
hitlist export observations -o train.parquet            # parquet output supported
hitlist export binding [filters...] -o binding.csv      # binding-assay index (separate from MS)
hitlist export training [filters...] -o training.csv    # unified training export from canonical indexes
hitlist export samples [--class I|II]                   # per-sample conditions (YAML curation only)
hitlist export summary                                  # species x class summary
hitlist export counts [--source iedb|cedar|merged|all]  # peptide counts per PMID
hitlist export alleles                                  # validate YAML alleles with mhcgnomes
hitlist export data-alleles                             # validate all IEDB/CEDAR alleles
```

### Canonical indexes and the training export

Each `hitlist data build` writes **three** parquet files to `~/.hitlist/`:

- `observations.parquet` — MS-eluted immunopeptidome (IEDB + CEDAR + curated supplementary).
- `binding.parquet` — binding-assay rows (peptide microarray, refolding, MEDi, and
  qualitative-tier measurements like `Positive-High/Intermediate/Low`).
- `peptide_mappings.parquet` — long-form peptide → protein/position/flank mappings.

The canonical indexes are never silently mixed. Supplementary data is MS-only.
Use `hitlist export observations` and `hitlist export binding` when you want
the raw evidence families separately. Use `hitlist export training` or
`generate_training_table(...)` when you want a composed model-facing export
with `evidence_kind` tagging and optional mapping explosion for flank-aware
training pipelines.

### Filters on `hitlist export observations`

| Flag | Values |
|---|---|
| `--class` | `I`, `II`, `non classical` |
| `--species` | Any species variant (normalized via mhcgnomes) |
| `--mono-allelic` / `--multi-allelic` | Filter on `is_monoallelic` |
| `--instrument-type` | `Orbitrap`, `timsTOF`, `TOF`, `QqQ`, ... |
| `--acquisition-mode` | `DDA`, `DIA`, `PRM` |
| `--min-allele-resolution` | `four_digit`, `two_digit`, `serological`, `class_only` |
| `--mhc-allele` | Exact match on `mhc_restriction` after allele normalization. Repeatable / comma-separated. |
| `--gene` | Symbol, Ensembl ID, or old alias (HGNC synonym lookup). Repeatable / comma-separated. Requires the mappings sidecar (default-on at build). |
| `--gene-name` | Exact match on `gene_name` column (no HGNC lookup) |
| `--gene-id` | Exact match on `gene_id` column (ENSG) |
| `--serotype` | HLA serotype: locus-specific (`A24`, `B57`, `DR15`) or public epitope (`Bw4`, `Bw6`). Matches any serotype the allele belongs to, so `--serotype Bw4` returns A\*24:02, B\*27:05, B\*57:01, etc. Repeatable / comma-separated. |
| `--output` / `-o` | `.csv` or `.parquet` |

All filters are pushed down to the parquet reader (pyarrow), so `--gene PRAME` reads
only the matching row groups — typically milliseconds rather than a full table scan.
Examples:
- `hitlist export observations --gene PRAME --class I -o prame_classI.csv`
- `hitlist export observations --gene "MART-1"` (HGNC resolves to `MLANA`)
- `hitlist export observations --mhc-allele HLA-A*02:01 --mono-allelic`
- `hitlist export observations --serotype A24` (locus-specific)
- `hitlist export observations --serotype Bw4` (public epitope: A\*23/24/25/32, B\*13/27/44/51/52/53/57/58)
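
To make the public-epitope match concrete, here is a rough sketch that checks an allele against only the Bw4 allele groups listed above. It is not how hitlist resolves serotypes: `is_bw4`, `BW4_GROUPS`, and the regular expression are illustrative assumptions, and a group-level lookup is a simplification of real serotype assignment.

```python
import re

# Allele groups taken from the Bw4 example above (group-level approximation).
BW4_GROUPS = {
    "A": {"23", "24", "25", "32"},
    "B": {"13", "27", "44", "51", "52", "53", "57", "58"},
}

# Matches "HLA-A*24:02" or "A*24:02" and captures the locus and allele group.
ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+):")


def is_bw4(allele):
    """True if the allele's group is in the Bw4 list (hypothetical helper)."""
    match = ALLELE_RE.match(allele)
    if match is None:
        return False  # class-only strings such as "HLA class I" never match
    locus, group = match.groups()
    return group in BW4_GROUPS.get(locus, set())


for allele in ["HLA-A*24:02", "HLA-B*27:05", "HLA-B*57:01", "HLA-B*07:02", "HLA class I"]:
    print(allele, is_bw4(allele))
```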

### Filters on `hitlist export binding`

Same shape as the observations filters minus the MS-specific ones
(`--mono-allelic`, `--instrument-type`, `--acquisition-mode`). `--source`
accepts only `iedb` or `cedar` — supplementary data is MS-only and never
appears in the binding index.

```bash
hitlist export binding --gene PRAME --class I -o prame_binding.csv
hitlist export binding --mhc-allele HLA-A*02:01 --serotype Bw4
```

### Filters on `hitlist export training`

`hitlist export training` exposes the shared pMHC filters plus two export-shape controls:

- `--include-evidence ms|binding|both` chooses which canonical evidence families to compose.
- `--explode-mappings` expands the output to one row per `(evidence row, peptide mapping)` with `protein_id`, `position`, `n_flank`, `c_flank`, `proteome`, and `proteome_source`.

MS-specific filters (`--mono-allelic`, `--instrument-type`, `--acquisition-mode`) apply only to the MS slice. Binding rows never gain fake sample context; they remain tagged as `evidence_kind="binding"` with `sample_match_type="not_applicable"`.

```bash
hitlist export training --include-evidence both --gene PRAME --class I -o prame_training.csv
hitlist export training --include-evidence ms --mono-allelic --class I -o mono_ms.csv
hitlist export training --include-evidence both --explode-mappings -o presto_training.parquet
```

### Sample-level expression anchors (issue #140)

For line-like `ms_samples` (C1R, 721.221, JY and sibling EBV-LCLs, HAP1,
HeLa, HEK293, THP-1, SaOS-2, A375, K562, GM12878, and common engineered
derivatives) `hitlist.line_expression` resolves every sample to a stable
expression backend + key via a 6-tier fallback hierarchy:

1. **exact-line RNA / transcript quant** (registry hit with shipped data)
2. **parent-line / engineered-derivative RNA** (e.g. HeLa.ABC-KO → HeLa)
3. **line-family class anchor** (EBV-LCL → GM12878; mono-allelic host → K562)
4. **cancer-type surrogate** (caller-supplied `pirlygenes` backend)
5. **broad tissue / lineage surrogate** (HPA)
6. **`no_expression_anchor`**

Four provenance columns — `expression_backend`, `expression_key`,
`expression_match_tier`, `expression_parent_key` — are written on every
row so downstream tooling can distinguish "exact JY RNA" from "generic
EBV-LCL stand-in" from "melanoma cohort surrogate" instead of treating
them as equally trustworthy.
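
The fallback order can be pictured as a chain of checks that stops at the first hit. The sketch below is illustrative only and does not reproduce `hitlist.line_expression`: the function names, the input dictionaries, the backend labels (`"line_rna"`, `"pirlygenes"`, `"hpa"`), and the tier strings other than `no_expression_anchor` are all assumptions.

```python
def make_anchor(backend, key, tier, parent_key=None):
    """Bundle the four provenance columns into one dict."""
    return {
        "expression_backend": backend,
        "expression_key": key,
        "expression_match_tier": tier,
        "expression_parent_key": parent_key,
    }


def resolve_anchor(sample, lines_with_rna, parent_of, family_anchor):
    """Walk the tiers in order and return the first one that applies."""
    line = sample.get("line")
    if line in lines_with_rna:                       # tier 1: exact line
        return make_anchor("line_rna", line, "exact_line")
    parent = parent_of.get(line)
    if parent in lines_with_rna:                     # tier 2: parent line
        return make_anchor("line_rna", parent, "parent_line", parent_key=parent)
    family = family_anchor.get(sample.get("family"))
    if family in lines_with_rna:                     # tier 3: line-family anchor
        return make_anchor("line_rna", family, "line_family")
    if sample.get("cancer_type"):                    # tier 4: cancer-type surrogate
        return make_anchor("pirlygenes", sample["cancer_type"], "cancer_type")
    if sample.get("tissue"):                         # tier 5: tissue surrogate
        return make_anchor("hpa", sample["tissue"], "tissue")
    return make_anchor(None, None, "no_expression_anchor")  # tier 6


# Toy registry contents (assumed), mirroring the examples in the list above.
lines_with_rna = {"HeLa", "GM12878", "K562"}
parent_of = {"HeLa.ABC-KO": "HeLa"}
family_anchor = {"EBV-LCL": "GM12878", "mono-allelic host": "K562"}

derived = resolve_anchor({"line": "HeLa.ABC-KO"}, lines_with_rna, parent_of, family_anchor)
lcl = resolve_anchor({"line": "LCL-X", "family": "EBV-LCL"}, lines_with_rna, parent_of, family_anchor)
nothing = resolve_anchor({}, lines_with_rna, parent_of, family_anchor)
print(derived["expression_match_tier"], derived["expression_key"])
print(lcl["expression_match_tier"], lcl["expression_key"])
print(nothing["expression_match_tier"])
```

Recording the tier next to the key is what lets downstream code down-weight or drop surrogate matches instead of trusting every anchor equally.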

CLI:

```bash
hitlist export samples --with-expression-anchors -o sample_anchors.csv

hitlist export training \
    --include-evidence ms --class I --mono-allelic \
    --with-peptide-origin \
    -o training_with_origin.parquet

hitlist export line-expression --line-key GM12878 --gene-name TP53
```

`--with-peptide-origin` attaches, per row, the argmax-TPM candidate gene
(`peptide_origin_gene`, `peptide_origin_tpm`) for the peptide in the
sample's resolved line. When transcript-level TPM is available the
score is the sum across **only those transcripts whose translation
actually contains the peptide** — isoforms that splice out the peptide
contribute zero, and a transcript is counted once regardless of how many
times the peptide appears in its protein.
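
The scoring rule above can be sketched in a few lines. This is an illustration, not hitlist's code: `peptide_origin` is a hypothetical function, the transcript records are toy data with made-up gene names and TPM values, and a plain substring test stands in for the real peptide-to-transcript mapping.

```python
def peptide_origin(peptide, transcripts):
    """Return (gene, summed_tpm) for the gene with the highest peptide-bearing TPM.

    transcripts: dicts with "gene", "protein" (translated sequence), and "tpm".
    """
    gene_tpm = {}
    for tx in transcripts:
        # A membership test counts each transcript once, however many times
        # the peptide occurs in its protein.
        if peptide in tx["protein"]:
            gene_tpm[tx["gene"]] = gene_tpm.get(tx["gene"], 0.0) + tx["tpm"]
    if not gene_tpm:
        return None, 0.0
    best = max(gene_tpm, key=gene_tpm.get)
    return best, gene_tpm[best]


pep = "SLLMWITQC"
transcripts = [
    {"gene": "GENE_A", "protein": "MK" + pep + "RR", "tpm": 5.0},
    {"gene": "GENE_A", "protein": "MKAAAARR", "tpm": 100.0},           # peptide spliced out
    {"gene": "GENE_A", "protein": pep + "GG" + pep, "tpm": 2.0},       # peptide occurs twice
    {"gene": "GENE_B", "protein": "MT" + pep + "KK", "tpm": 6.0},
]

gene, tpm = peptide_origin(pep, transcripts)
print(gene, tpm)  # GENE_A 7.0
```

Note that the 100 TPM isoform contributes nothing, so GENE_A wins on 7.0 rather than 107.0.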

Register DepMap matrices to broaden exact-line coverage:

```bash
hitlist data register depmap_rna /path/to/OmicsExpressionProteinCodingGenesTPMLogp1.csv
hitlist data register depmap_rna_transcript /path/to/OmicsExpressionTranscriptsTPMLogp1.csv
hitlist data build
```

### A note on mono-allelic curation

Mono-allelic is a **PMID-level** flag, not a per-sample property. Curation
lives in `pmid_overrides.yaml` (see `mono_allelic_host` and `ms_samples`).
Rows can legitimately have `is_monoallelic=True` with an empty
`mhc_restriction` — for example, supplementary contaminant peptides under
a mono-allelic PMID override carry the flag but not a per-row allele. If
your downstream pipeline needs a strict "mono-allelic AND has allele"
subset, post-filter on
`is_monoallelic & mhc_restriction.str.startswith("HLA-")`.
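
A minimal pandas sketch of that post-filter on a toy frame. The rows are placeholders, and the sketch assumes `mhc_restriction` may be an empty string or missing; passing `na=False` keeps missing values from propagating into the boolean mask.

```python
import pandas as pd

# Toy rows standing in for an exported observations table.
df = pd.DataFrame(
    {
        "peptide": ["PEP_A", "PEP_B", "PEP_C", "PEP_D"],
        "is_monoallelic": [True, True, False, True],
        "mhc_restriction": ["HLA-A*02:01", "", "HLA-B*07:02", None],
    }
)

# na=False makes missing restrictions evaluate to False instead of NaN.
has_allele = df["mhc_restriction"].str.startswith("HLA-", na=False)
strict = df[df["is_monoallelic"] & has_allele]
print(strict["peptide"].tolist())  # ['PEP_A']
```

The `"HLA-"` prefix only covers human alleles; other species would need a different test.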

### Reports

```bash
hitlist report [--class I|II] [--output report.txt]
```

## Development

```bash
./develop.sh    # install in dev mode
./format.sh     # ruff format
./lint.sh       # ruff check + format check
./test.sh       # pytest with coverage (~3 min)
./deploy.sh     # lint + test + build + upload to PyPI
```

See [docs/pmid-curation.md](docs/pmid-curation.md) for the curation YAML format and per-study overrides.
