Metadata-Version: 2.4
Name: labenv-embedding-cache
Version: 0.1.3
Summary: Shared embedding cache core for cross-project reuse
Author: labenv
License: Proprietary
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: omegaconf>=2.3
Provides-Extra: embed
Requires-Dist: torch>=2.1; extra == "embed"
Requires-Dist: accelerate>=0.33; extra == "embed"
Requires-Dist: transformers>=4.40; extra == "embed"
Requires-Dist: sentence-transformers>=2.6; extra == "embed"
Provides-Extra: debug
Requires-Dist: debugpy>=1.8; extra == "debug"

# labenv_embedding_cache

A shared cache policy and reusable Python cache-core library for projects that reuse text embedding caches.

## Files
- `embedding_rulebook.yaml`: machine-readable cache policy (path/layout/consistency)
- `embedding_registry.yaml`: machine-readable model registry shared across projects
- `embedding_cache_spec.yaml`: machine-readable spec for dataset key / metadata / variant tag
- `src/labenv_embedding_cache/`: reusable Python library
- `CODEX_PROMPT.md`: reusable prompt template for Codex sessions

## Install library
The preferred distribution is the wheel (`v0.1.3`).

```bash
# Internal index (recommended)
pip install "labenv-embedding-cache==0.1.3"

# Optional: explicitly point pip to internal index
# PIP_EXTRA_INDEX_URL="https://<internal-index>/simple" pip install "labenv-embedding-cache==0.1.3"

# GitHub Release wheel fallback
pip install "labenv-embedding-cache @ https://github.com/ryuuua/labenv_embedding_cache/releases/download/v0.1.3/labenv_embedding_cache-0.1.3-py3-none-any.whl"

# Local editable (maintainer workflow)
pip install -e /path/to/labenv_embedding_cache
```

Rollback-only legacy install (VCS pin):

```bash
pip install "git+ssh://git@github.com/ryuuua/labenv_embedding_cache.git@c0154c06ee6e41852c58ac76d6504f5b38d20168#egg=labenv-embedding-cache"
```

## Standalone run (smoke tests)

Install with optional embedding/debug extras:

```bash
pip install -e ".[embed,debug]"
```

Embedding generation only (no cache):

```bash
python tools/embedding_smoketest.py
```

Embedding + cache read/write:

```bash
python tools/embedding_cache_smoketest.py --backend dummy
# or (downloads model)
python tools/embedding_cache_smoketest.py --backend sentence-transformers --model sentence-transformers/all-MiniLM-L6-v2
```

Debugpy (wait for attach):

```bash
python tools/embedding_cache_smoketest.py --backend dummy --debugpy --wait-for-client
```

Embedding generation from `conf/embedding` presets (auto DDP/pipeline policy via `embedding_model.md`):

```bash
python tools/generate_embeddings_from_conf.py --preset conf/embedding/qwen3_embedding.yaml --strategy auto
```

The default is non-normalized embeddings. Use `--normalize` to generate L2-normalized caches.
`normalize_embeddings=true` is treated as a separate model variant (`registry_key=...__l2`) and a separate cache variant (`norm=l2`).
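The `norm=l2` variant simply stores row-wise L2-normalized vectors. A minimal numpy sketch of the transformation (illustrative only, not the library's internal code):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization, guarding against division by zero-norm rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

vecs = np.array([[3.0, 4.0], [0.0, 0.0]])
unit = l2_normalize(vecs)
# First row becomes [0.6, 0.8]; zero rows stay zero.
```

Because the normalized and raw caches differ numerically, keeping them as distinct variants avoids silently mixing the two in downstream projects.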

Best practices: `docs/EMBEDDING_BEST_PRACTICES.md`

## Policy path setup
No environment variable is required for normal use. The package resolves its
bundled `embedding_rulebook.yaml` automatically.

If you want to pin `EMBEDDING_RULEBOOK_PATH` explicitly, add this to your shell profile:

```bash
export EMBEDDING_RULEBOOK_PATH="$(labenv-embedding-cache-path rulebook)"
```

Then reload your shell:

```bash
source ~/.zshrc
```

## Legacy cache compatibility index (temporary)

Build a read-only index for existing `lm/**` cache files:

```bash
labenv-embedding-cache index-build --cache-dir /work/$USER/data/embedding_cache
```

Show index stats:

```bash
labenv-embedding-cache index-stats --cache-dir /work/$USER/data/embedding_cache
```

Verify whether request manifests can be served without regeneration:

```bash
labenv-embedding-cache verify-requests --requests /path/to/request_manifest.jsonl --min-selected-models 2
```

Build request manifest rows from canonical metadata in Python:

```python
import labenv_embedding_cache as lec

# `metadata` is the canonical metadata mapping produced by your pipeline
# (dataset key / variant fields per `embedding_cache_spec.yaml`).
record = lec.build_request_manifest_entry(
    dataset_name="ag_news",
    model_id="bert",
    model_name="bert-base-uncased",
    expected_cache_path="/work/$USER/data/embedding_cache/lm/bert-base-uncased/ag_news__x.npz",
    metadata=metadata,
)
```

Export a lock payload for CI/DVC (policy digest + index + optional verify report):

```bash
labenv-embedding-cache lock-export \
  --cache-dir /work/$USER/data/embedding_cache \
  --requests /path/to/request_manifest.jsonl \
  --output /work/$USER/data/embedding_cache/.labenv/lock.json \
  --min-selected-models 2
```

Index file location:
- `${EMBEDDING_CACHE_DIR}/.labenv/index_v1.jsonl`
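The index is plain JSONL, so it can also be inspected without the CLI. A minimal reader sketch (the record fields are not a documented schema, so treat the contents as opaque dicts):

```python
import json
from pathlib import Path

def load_index(index_path: Path) -> list[dict]:
    """Read a JSONL index file into a list of dicts, skipping blank lines."""
    records = []
    with index_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Example (path per the location above):
# entries = load_index(Path("/work/you/data/embedding_cache/.labenv/index_v1.jsonl"))
# print(len(entries))
```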

Compatibility is controlled by the rulebook:
- `compatibility.legacy_index.enabled`
- `compatibility.legacy_index.sunset_date`
- `compatibility.legacy_index.require_ids_sha256_match`

Identity expansion is controlled by the spec:
- `identity.profiles.default.hard_fields`
- `identity.profiles.default.soft_fields`
- `identity.profiles.default.defaults`
- `identity.profiles.default.legacy_match_policy`

## Quick usage

```python
from labenv_embedding_cache.api import get_or_compute_embeddings

# cfg: your resolved configuration object; texts/ids/labels: dataset columns.
# compute_embeddings is only invoked on a cache miss.
vectors, resolution = get_or_compute_embeddings(
    cfg,
    texts,
    ids,
    labels=labels,
    compute_embeddings=my_backend_compute_fn,
)
print(resolution.cache_path, resolution.was_cache_hit)
```
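A compute callback for cache misses could look like the dummy sketch below. The signature (a list of texts in, a 2-D array out) is an assumption for illustration; check the library's API for the exact contract:

```python
import zlib

import numpy as np

def my_backend_compute_fn(texts: list[str]) -> np.ndarray:
    """Dummy backend: a stable pseudo-random vector per text (smoke tests only)."""
    dim = 8
    out = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        # crc32 gives a process-independent seed, unlike Python's salted hash().
        seed = zlib.crc32(text.encode("utf-8"))
        rng = np.random.default_rng(seed)
        out[i] = rng.standard_normal(dim)
    return out
```

A deterministic dummy backend like this is useful for exercising the cache read/write path without downloading a model, mirroring the `--backend dummy` smoke tests above.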

## Docker / Docker Compose (env1-env4)

Compose files match the `labenv_config/envkit-templates` profiles (env1_a6000/env2_3090/env3_cc21_a100/env4_cc21_cpu):
- env1: `nvcr.io/nvidia/pytorch:25.02-py3`
- env2: `nvcr.io/nvidia/pytorch:25.09-py3`
- env3: `nvcr.io/nvidia/pytorch:23.10-py3`
- env4: `python:3.11-slim`

Examples:

```bash
docker compose -f docker/compose.env1.yaml run --rm app python tools/embedding_cache_smoketest.py --backend dummy
docker compose -f docker/compose.env4.yaml run --rm app python tools/embedding_smoketest.py --backend transformers
# debugpy (port publish requires --service-ports)
docker compose -f docker/compose.env1.yaml run --rm --service-ports app python tools/embedding_cache_smoketest.py --backend dummy --debugpy --wait-for-client
```

## Sweep utility

Create a stable, line-based sweep list file:

```bash
python tools/make_sweep_list.py --glob "configs/sweep/*.yaml" --out sweep.txt --root .
```
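Conceptually, a stable sweep list is just a sorted glob written one relative path per line. A minimal sketch of that behavior (illustrative; `tools/make_sweep_list.py` is the actual implementation):

```python
from pathlib import Path

def make_sweep_list(pattern: str, out: Path, root: Path = Path(".")) -> int:
    """Write matched paths (relative to root, POSIX-style, sorted) one per line."""
    paths = sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
    out.write_text("\n".join(paths) + ("\n" if paths else ""), encoding="utf-8")
    return len(paths)
```

Sorting and POSIX-style separators keep the file byte-stable across runs and platforms, which is what makes it safe to diff or commit.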

## Per-project Codex usage
In each project, paste `CODEX_PROMPT.md` (or add equivalent guidance in `AGENTS.md`) so Codex always aligns with this policy.
For cross-project rollout requests, use `PROJECT_MIGRATION_PROMPT.md`.

## Automation runbook
- Runtime-aware automation prompt: `docs/AUTOMATION_PROMPT_RUNTIME_AWARE.md`
- Copy/paste execution checklist: `docs/AUTOMATION_EXECUTION_CHECKLIST.md`
- Pre-filled ready-to-run checklist: `docs/AUTOMATION_EXECUTION_CHECKLIST_READY.md`
- What to place in central/downstream/runtime repos: `docs/REPO_AUTOMATION_ASSETS.md`

## Updating shared standards
1. Edit `embedding_rulebook.yaml`, `embedding_registry.yaml`, and/or `embedding_cache_spec.yaml`.
2. In each project, run its embedding-cache validation/tests.
3. If schema behavior changes, bump `identity.version` in `embedding_rulebook.yaml`.
