Metadata-Version: 2.4
Name: pdbminebuilder
Version: 0.2.2
Summary: PDBj data synchronization and database loading tool
Project-URL: Homepage, https://github.com/N283T/pdb-mine-builder
Project-URL: Documentation, https://n283t.github.io/pdb-mine-builder/
Project-URL: Repository, https://github.com/N283T/pdb-mine-builder
Project-URL: Changelog, https://github.com/N283T/pdb-mine-builder/blob/main/CHANGELOG.md
Author: nagaet
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.12
Requires-Dist: alembic>=1.13.0
Requires-Dist: ccd2rdmol>=0.2.0
Requires-Dist: defusedxml>=0.7.0
Requires-Dist: gemmi>=0.7.0
Requires-Dist: polars>=1.0.0
Requires-Dist: psycopg[binary,pool]>=3.2.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: typer>=0.15.0
Description-Content-Type: text/markdown

# pdb-mine-builder

[![CI](https://github.com/N283T/pdb-mine-builder/actions/workflows/ci.yml/badge.svg)](https://github.com/N283T/pdb-mine-builder/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/pdbminebuilder)](https://pypi.org/project/pdbminebuilder/)
[![Python](https://img.shields.io/pypi/pyversions/pdbminebuilder)](https://pypi.org/project/pdbminebuilder/)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Pixi Badge](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/prefix-dev/pixi/main/assets/badge/v0.json)](https://pixi.sh)

Build a Mine-schema database from PDB data. Synchronizes structural biology data from wwPDB mirrors (PDBj by default) via rsync and loads it into PostgreSQL.

This project is inspired by PDBj's [mine2updater](https://gitlab.com/pdbjapan/mine2updater). Thanks to the PDBj team for the original implementation and the [Mine](https://doi.org/10.1093/database/baq021) relational database design.

**Documentation**: [https://n283t.github.io/pdb-mine-builder/](https://n283t.github.io/pdb-mine-builder/)

## Features

- Multi-process parallel data loading with configurable workers
- Support for multiple data formats (CIF default, mmJSON optional)
- Configurable sync sources with regional wwPDB mirror support (PDBj, RCSB, PDBe)
- RDKit chemical search integration (substructure, similarity)
- SQL query interface with multi-format output (table, CSV, JSON, Parquet)
- [Interactive SQL examples](https://n283t.github.io/pdb-mine-builder/sql-examples) with 75+ queries across 10 categories
- 9 database schemas covering PDB structures, chemical components, validation reports, and more

## Installation

### Pixi (recommended)

[Pixi](https://pixi.sh/) manages all dependencies including Python, PostgreSQL, and RDKit in a single environment.

```bash
git clone https://github.com/N283T/pdb-mine-builder.git
cd pdb-mine-builder
pixi install
cp config.example.yml config.yml  # Edit with your data paths
```

```bash
pixi run db-init       # Initialize PostgreSQL
pixi run db-start      # Start PostgreSQL
pixi run pmb sync      # Sync data from wwPDB (PDBj by default)
pixi run pmb load pdbj --force  # Load data
pixi run pmb stats     # Check database statistics
```

### pip (alternative)

> **Note**: pip installs the Python package only. You must provide PostgreSQL (17+) and the [RDKit PostgreSQL cartridge](https://github.com/rdkit/rdkit-postgresql) separately. Database management commands (`pixi run db-*`) are not available.

```bash
pip install pdbminebuilder
cp config.example.yml config.yml  # Edit with your data paths and connection string
pmb --help
```

### conda + pip (alternative)

> **Note**: Database management commands (`pixi run db-*`) are not available. Use your own PostgreSQL instance.

```bash
conda create -n pmb python=3.12 rdkit-postgresql -c conda-forge
conda activate pmb
pip install pdbminebuilder
cp config.example.yml config.yml
pmb --help
```

### Docker / Podman (alternative)

> **Note**: Requires [Docker](https://docs.docker.com/get-docker/) or [Podman](https://podman.io/). Data files must be mounted as volumes.

```bash
git clone https://github.com/N283T/pdb-mine-builder.git
cd pdb-mine-builder
cp config.docker.yml config.yml  # Edit data paths
docker compose up -d             # Start PostgreSQL+RDKit and pmb
docker compose run --rm pmb update pdbj --limit 10
```

See the [Getting Started guide](https://n283t.github.io/pdb-mine-builder/docs/getting-started/installation) for detailed setup instructions.

## Pipelines

| Pipeline | Description | Entries | Tables | Size | Format |
|----------|-------------|---------|--------|------|--------|
| pdbj | Main structure data | ~250k | 250 | 183 GB | CIF / mmJSON |
| vrpt | Validation reports | ~250k | 69 | 152 GB | CIF |
| contacts | Protein-protein contacts | ~250k | 2 | 13 GB | JSON |
| cc | Chemical components (with RDKit) | ~50k | 12 | 811 MB | CIF / mmJSON |
| ccmodel | Chemical component models | ~23k | 8 | 174 MB | CIF / mmJSON |
| prd | BIRD reference dictionary | ~1.2k | 17 | 50 MB | CIF / mmJSON |

**Total: 368 tables, ~349 GB** with all PDB entries loaded (as of 2026-03-08).
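
If you plan to load only some pipelines, the sizes in the table can be added up to get a rough disk budget. The following is a minimal sketch, not part of `pmb`: the dictionary and the `estimate_size_gb` helper are hypothetical names, the figures are copied from the table above, and 1 GB is taken as 1000 MB for simplicity.

```python
# Approximate loaded sizes in MB, copied from the pipeline table above.
# Assumption: 1 GB is treated as 1000 MB; these are snapshot figures and will grow.
PIPELINE_SIZE_MB = {
    "pdbj": 183_000,
    "vrpt": 152_000,
    "contacts": 13_000,
    "cc": 811,
    "ccmodel": 174,
    "prd": 50,
}


def estimate_size_gb(pipelines):
    """Return the summed size in GB for the given pipeline names."""
    names = list(pipelines)
    unknown = set(names) - set(PIPELINE_SIZE_MB)
    if unknown:
        raise KeyError(f"unknown pipelines: {sorted(unknown)}")
    return sum(PIPELINE_SIZE_MB[name] for name in names) / 1000


if __name__ == "__main__":
    print(estimate_size_gb(PIPELINE_SIZE_MB))          # all six pipelines
    print(estimate_size_gb(["cc", "ccmodel", "prd"]))  # small reference pipelines only
```

By these figures, the three small reference pipelines (`cc`, `ccmodel`, `prd`) need about 1 GB, while the six pipelines together come to roughly 349 GB.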

See the [Database Reference](https://n283t.github.io/pdb-mine-builder/docs/database/overview) for schema details and SQL examples.

## Query

Execute SQL queries directly from the CLI with multiple output formats:

```bash
pmb query "SELECT * FROM cc.brief_summary LIMIT 5"                    # Rich table
pmb query "SELECT * FROM cc.brief_summary" -F csv > out.csv            # CSV
pmb query "SELECT * FROM cc.brief_summary LIMIT 10" -F json            # JSON
pmb query "SELECT * FROM cc.brief_summary" -F parquet -o out.parquet   # Parquet
pmb query -f query.sql                                                 # SQL from file
```
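
For scripted use, the same commands can be driven from Python. The sketch below is illustrative only: `build_pmb_query_args` and `run_pmb_query` are hypothetical helpers, not part of the package, and they use only the `-F`, `-o`, and `-f` flags shown above. It assumes that `pmb` is on `PATH`, that these flags can be combined, and that `pmb` exits with a non-zero status on failure.

```python
import subprocess


def build_pmb_query_args(sql=None, sql_file=None, fmt=None, output=None):
    """Build an argv list for `pmb query` from the flags shown in this README."""
    if (sql is None) == (sql_file is None):
        raise ValueError("pass exactly one of sql or sql_file")
    args = ["pmb", "query"]
    if sql is not None:
        args.append(sql)                  # inline SQL as a single argument
    else:
        args += ["-f", str(sql_file)]     # read SQL from a file
    if fmt is not None:
        args += ["-F", fmt]               # output format, e.g. "csv" or "json"
    if output is not None:
        args += ["-o", str(output)]       # write the result to a file
    return args


def run_pmb_query(**kwargs):
    """Run `pmb query` and return its stdout as text.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        build_pmb_query_args(**kwargs),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Passing the arguments as a list, rather than as one shell string, means the SQL text needs no shell quoting. For example, `run_pmb_query(sql="SELECT * FROM cc.brief_summary LIMIT 5", fmt="csv")` would return the CSV text of the first query above.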

## Development

```bash
pixi run lint      # Ruff check
pixi run format    # Ruff format
pixi run test      # Run tests (pytest)
pixi run check     # All checks
```

## Requirements

- Python 3.12+
- PostgreSQL 17+ (provided by the rdkit-postgresql package from conda-forge)
- [Pixi](https://pixi.sh/) (recommended), which manages all dependencies (conda + PyPI)
- rsync

> **Note**: In the Pixi environment, most dependencies are installed from conda-forge. Only `ccd2rdmol` (PyPI only) and `psycopg[binary,pool]` (extras required) remain as PyPI dependencies. The PostgreSQL version is determined by rdkit-postgresql.
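
A quick preflight check for the requirements listed above might look like the following. This is a hedged sketch rather than a command shipped with the package: `check_requirements` and `parse_postgres_major` are hypothetical helpers, and the parser assumes the usual `psql --version` output format (for example `psql (PostgreSQL) 17.2`).

```python
import re
import shutil
import sys


def parse_postgres_major(version_output):
    """Extract the major version from `psql --version` style output, or None."""
    match = re.search(r"\(PostgreSQL\)\s+(\d+)", version_output)
    return int(match.group(1)) if match else None


def check_requirements():
    """Return a mapping of requirement label to whether it appears satisfied."""
    return {
        "python>=3.12": sys.version_info >= (3, 12),
        "rsync on PATH": shutil.which("rsync") is not None,
        "pixi on PATH": shutil.which("pixi") is not None,
    }


if __name__ == "__main__":
    for label, ok in check_requirements().items():
        print(f"{'ok     ' if ok else 'MISSING'} {label}")
```

A missing `pixi` is not fatal if you use one of the alternative installation methods. In that case, compare the result of `parse_postgres_major` on your own `psql --version` output against the 17+ requirement.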

## License

MIT - See [LICENSE](LICENSE) for details.

### Relationship to mine2updater

This project is inspired by [mine2updater](https://gitlab.com/pdbjapan/mine2updater) (LGPLv3) by PDBj, which loads PDB data into PostgreSQL using Node.js. pdb-mine-builder is an independent rewrite in Python with a completely different tech stack (gemmi, SQLAlchemy, psycopg3, RDKit), architecture, and data model. No code was copied or translated from the original project. Shared concepts (pipeline names, schema structures, PDB ID encoding) derive from PDB data specifications, not from the original codebase.

## References

- Kinjo AR, Yamashita R, Nakamura H. PDBj Mine: design and implementation of relational database interface for Protein Data Bank Japan. *Database (Oxford)*. 2010;2010:baq021. doi: [10.1093/database/baq021](https://doi.org/10.1093/database/baq021)
