Metadata-Version: 2.4
Name: babelcode
Version: 0.1.0
Summary: Normalize any language identifier to canonical ISO 639-3 + ISO 15924 form
Project-URL: Homepage, https://github.com/omneity-labs/babelcode
Project-URL: Repository, https://github.com/omneity-labs/babelcode
Project-URL: Issues, https://github.com/omneity-labs/babelcode/issues
Author-email: Omar Kamali <babelcode@omarkama.li>
License: MIT
License-File: LICENSE
Keywords: bcp-47,iso-15924,iso-639,language-codes,linguistics,nlp,normalization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: prepress; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == 'test'
Description-Content-Type: text/markdown

# babelcode

Normalize **any** language identifier — ISO 639-1 (`en`), ISO 639-3 (`eng`), BCP-47 (`zh-Hans`), NLLB-style (`eng_Latn`), WikiPron filenames (`wp_eng_latn_us`), CHILDES corpus names (`EnglishNA`), or plain English (`German`) — into a single canonical form:

```
{iso_639_3}_{iso_15924}
```

For example: `eng_Latn`, `arb_Arab`, `cmn_Hans`, `jpn_Jpan`.

## Installation

```bash
pip install babelcode
```

## Quick start

```python
from babelcode import BabelCode

bc = BabelCode()

bc.normalize("en")          # → "eng_Latn"
bc.normalize("zh-Hans")     # → "cmn_Hans"
bc.normalize("ar")          # → "arb_Arab"  (macrolanguage → preferred individual)
bc.normalize("wp_deu_latn") # → "deu_Latn"  (WikiPron format)
bc.normalize("Farsi")       # → "pes_Arab"  (English name)
bc.normalize("EnglishNA")   # → "eng_Latn"  (CHILDES corpus name)
```

### Singleton shortcut

```python
from babelcode import get_instance

bc = get_instance()  # cached singleton — same instance every call
```

### Accessors

```python
bc.iso639_3("eng_Latn")        # → "eng"
bc.script("eng_Latn")          # → "Latn"
bc.bcp47("eng_Latn")           # → "en"
bc.name("arb")                 # → "Standard Arabic"
bc.scripts("srp")              # → ["Cyrl", "Latn"]
bc.is_macrolanguage("ara")     # → True
bc.macro_members("ara")        # → ["arb", "arz", ...]
```

### Script detection from text

```python
from babelcode import detect_script

detect_script("مرحبا")     # → "Arab"
detect_script("Привет")    # → "Cyrl"
detect_script("こんにちは")  # → "Jpan"
```

### Batch operations

```python
bc.normalize_list(["en", "de", "unknown", "fr"])
# → ["eng_Latn", "deu_Latn", None, "fra_Latn"]

bc.build_mapping(
    source_codes=["en", "de"],
    target_codes=["eng_Latn", "deu_Latn", "fra_Latn"],
)
# → {"en": "eng_Latn", "de": "deu_Latn"}
```

## Why `iso3_Script` instead of BCP-47?

BCP-47 tags (`en`, `zh-Hans`) are familiar, but they conflate
macrolanguages with individual languages, vary in length, and omit
the script whenever it is considered "obvious" — an assumption that
breaks down for multi-script languages like Serbian.

The `{iso_639_3}_{iso_15924}` canonical form:

- Always has exactly **two components** — easy to split and compare.
- Uses **ISO 639-3** — one code per individual language, no
  macrolanguage ambiguity.
- Always includes the **ISO 15924 script** — no guessing for
  Serbian (`srp_Cyrl` vs `srp_Latn`) or Chinese (`cmn_Hans` vs
  `cmn_Hant`).
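
The two-component guarantee makes downstream handling trivial. A minimal sketch of splitting and comparing canonical tags (plain Python, no library calls):

```python
def split_tag(tag: str) -> tuple[str, str]:
    """Split a canonical `{iso_639_3}_{iso_15924}` tag into its two parts."""
    lang, script = tag.split("_")
    return lang, script

# Same language, different scripts — trivially comparable once split.
assert split_tag("srp_Cyrl")[0] == split_tag("srp_Latn")[0]
```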

## Data sources & methodology

babelcode is pre-built from [**LinguaMeta**](https://github.com/google-research/url-nlp/tree/main/linguameta), a comprehensive open dataset from Google Research that aggregates:

| Source | What it provides |
|--------|-----------------|
| [ISO 639-3](https://iso639-3.sil.org/) (SIL) | Three-letter codes for 7,800+ languages |
| [ISO 15924](https://unicode.org/iso15924/) (Unicode) | Four-letter script codes (Latn, Arab, …) |
| [IETF BCP 47](https://www.rfc-editor.org/info/bcp47) | Standard language tags (en, zh-Hans, …) |
| LinguaMeta JSON files (~7,500) | Canonical script, macrolanguage membership, English names |

### Build process

1. **Raw data**: ~7,500 per-language JSON files from LinguaMeta, each containing script associations, name data, and macrolanguage membership.
2. **Cache compilation** (`babelcode-build-cache`): Reads all JSON files and produces a single `linguameta_cache.json` (~550 KB) with pre-resolved BCP-47 → ISO 639-3 mappings, canonical scripts, English names, and macrolanguage membership tables.
3. **Runtime**: `BabelCode` loads the cache once and resolves any input format through a cascade of regex matchers and lookup tables.

The cache is distributed inside the package — no network calls at runtime, zero dependencies.

### Macrolanguage resolution

By default, macrolanguages resolve to their preferred individual language (e.g. `ar` → `arb` Standard Arabic, `zh` → `cmn` Mandarin). This can be disabled with `resolve_macro=False`.
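
Conceptually, the default behaviour is a table lookup. A hand-picked illustrative subset (the real table is derived from LinguaMeta and covers far more codes):

```python
# Illustrative subset of macrolanguage → preferred-individual mappings.
MACRO_PREFERRED = {"ara": "arb", "zho": "cmn", "fas": "pes", "msa": "zsm"}

def resolve_language(iso3: str, resolve_macro: bool = True) -> str:
    """Return the preferred individual language for a macrolanguage code,
    or the code unchanged when resolution is disabled or not applicable."""
    if resolve_macro and iso3 in MACRO_PREFERRED:
        return MACRO_PREFERRED[iso3]
    return iso3
```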

### Script inference

When a script is not explicit in the input, babelcode uses this cascade:

1. Script from the input tag itself (e.g. `zh-Hans` → `Hans`)
2. `text_hint` parameter — runs Unicode-range script detection on sample text
3. Canonical script from the LinguaMeta cache
4. Fallback to `Latn`

## Development

```bash
git clone https://github.com/omneity-labs/babelcode
cd babelcode
pip install -e ".[dev]"
pytest
```

### Rebuilding the cache

If you update the LinguaMeta data files under `src/babelcode/data/_url_nlp_repo/`, rebuild the cache:

```bash
babelcode-build-cache
```

## License

[MIT](LICENSE) — Omar Kamali / [Omneity Labs](https://omneitylabs.com)
