Metadata-Version: 2.4
Name: extractforms
Version: 0.2.0
Summary: A Python project to turn scanned forms into a list of key-value pairs.
Project-URL: Homepage, https://github.com/Guillaume-Lombardo/extractforms
Project-URL: Repository, https://github.com/Guillaume-Lombardo/extractforms
Project-URL: Issues, https://github.com/Guillaume-Lombardo/extractforms/issues
Author-email: Guillaume Lombardo <g1lom@later.day>
License: MIT
License-File: LICENSE
Keywords: automation,cli,package,python
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.13
Requires-Dist: certifi>=2026.1.4
Requires-Dist: httpx>=0.28.1
Requires-Dist: openai>=2.1.0
Requires-Dist: pydantic-settings>=2.13.0
Requires-Dist: pydantic>=2.12.0
Requires-Dist: pymupdf>=1.26.5
Requires-Dist: python-dotenv>=1.1.0
Requires-Dist: structlog>=25.5.0
Description-Content-Type: text/markdown

# ExtractForms

`extractforms` is a Python package and CLI to extract key/value fields from PDF forms.

## Quickstart

```bash
uv sync --group dev
uv run pre-commit install
uv run ruff format .
uv run ruff check .
uv run ty check src tests
uv run pytest
uv run pre-commit run --all-files
```

## CLI

```bash
extractforms extract --input form.pdf --output results/result.json --passes 2
```

Supported options include:
- `--no-cache`
- `--dpi`, `--image-format`, `--page-start`, `--page-end`, `--max-pages`
- `--chunk-pages`
- `--backend` (`multimodal` or `ocr`)
- `--drop-blank-pages`, `--blank-page-ink-threshold`, `--blank-page-near-white-level`
- `--extra-instructions`
- `--schema-id`, `--schema-path` (expects `*.schema.json`), `--match-schema`
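
For scripted runs, a command using these options can be assembled in Python. This is a minimal sketch, assuming the `extractforms` executable is on `PATH`. `build_extract_command` is a hypothetical helper, not part of the package, and it only uses flags shown in this README.

```python
from __future__ import annotations

from pathlib import Path


def build_extract_command(
    input_pdf: Path,
    output_json: Path,
    *,
    passes: int = 2,
    backend: str | None = None,
    no_cache: bool = False,
) -> list[str]:
    """Assemble an argv list for `extractforms extract` (hypothetical helper)."""
    # The README documents exactly two backend values.
    if backend is not None and backend not in {"multimodal", "ocr"}:
        raise ValueError(f"unsupported backend: {backend!r}")
    command = [
        "extractforms", "extract",
        "--input", str(input_pdf),
        "--output", str(output_json),
        "--passes", str(passes),
    ]
    if backend is not None:
        command += ["--backend", backend]
    if no_cache:
        command.append("--no-cache")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which raises if the CLI exits with a non-zero status.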

Schema fields support a `kind` and an optional `semantic_type` for richer typing
(for example: phone, address, amount, iban, postal code).
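
To make the distinction concrete, here is a minimal sketch of how one field might carry both pieces of metadata. The `FieldSpec` dataclass, the `key` attribute, and the `"text"` value for `kind` are assumptions for illustration. The actual schema format is defined by the package.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for one schema field; not the package's model."""

    key: str                          # assumed field identifier
    kind: str                         # structural type, e.g. "text" (assumed value)
    semantic_type: str | None = None  # optional refinement, e.g. "phone"


fields = [
    FieldSpec(key="contact_phone", kind="text", semantic_type="phone"),
    FieldSpec(key="comments", kind="text"),  # no semantic_type set
]

# Drop unset optional metadata when serializing to plain dictionaries.
serialized = [
    {name: value for name, value in asdict(field).items() if value is not None}
    for field in fields
]
```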

## Environment

Copy `.env.template` to `.env` and configure:
- logging (`LOG_LEVEL`, `LOG_JSON`, `LOG_FILE`)
- enterprise network/TLS (`HTTP_PROXY`, `HTTPS_PROXY`, `ALL_PROXY`, `NO_PROXY`, `CERT_PATH`)
- model endpoint (`OPENAI_BASE_URL`, `OPENAI_API_KEY`, `OPENAI_MODEL`)
- backend selection (`EXTRACTION_BACKEND`, `OCR_PROVIDER_FACTORY`, `OCR_ENABLE_TEXT_NORMALIZATION`)
- extraction behavior (`DROP_BLANK_PAGES`, `BLANK_PAGE_INK_THRESHOLD`, `BLANK_PAGE_NEAR_WHITE_LEVEL`)
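
A quick way to see which of these variables are still unset is sketched below. The variable names are copied from the list above; the group labels and the `unset_variables` helper are illustrative assumptions. The README does not say which variables are required, so the helper only reports and never fails. It reads the process environment and does not parse `.env` itself.

```python
import os

# Names come from the README list; the group keys are illustrative labels.
DOCUMENTED_VARIABLES = {
    "logging": ["LOG_LEVEL", "LOG_JSON", "LOG_FILE"],
    "network": ["HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY", "NO_PROXY", "CERT_PATH"],
    "model": ["OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL"],
    "backend": [
        "EXTRACTION_BACKEND",
        "OCR_PROVIDER_FACTORY",
        "OCR_ENABLE_TEXT_NORMALIZATION",
    ],
    "extraction": [
        "DROP_BLANK_PAGES",
        "BLANK_PAGE_INK_THRESHOLD",
        "BLANK_PAGE_NEAR_WHITE_LEVEL",
    ],
}


def unset_variables(environ=None):
    """Return {group: [names]} for documented variables that are missing or empty."""
    env = os.environ if environ is None else environ
    missing = {}
    for group, names in DOCUMENTED_VARIABLES.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[group] = absent
    return missing
```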

Security notes:
- `OPENAI_BASE_URL` must use `https://` in non-local environments (`http://` is accepted for localhost/loopback only).
- `--schema-path` accepts only schema cache files with a `.schema.json` suffix.
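
The two rules above can be sketched with the standard library alone. This illustrates the stated policy and is not the package's actual validation code; the function names are hypothetical.

```python
from __future__ import annotations

import ipaddress
from pathlib import Path
from urllib.parse import urlsplit


def is_loopback_host(host: str) -> bool:
    """True for `localhost` or a loopback IP literal such as 127.0.0.1 or ::1."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a hostname that is neither localhost nor an IP literal


def check_base_url(url: str) -> None:
    """Raise ValueError unless the URL is https, or http on a loopback host."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme == "https" and host:
        return
    if parts.scheme == "http" and is_loopback_host(host):
        return
    raise ValueError(f"insecure or malformed base URL: {url!r}")


def check_schema_path(path: Path) -> None:
    """Raise ValueError unless the file name ends with `.schema.json`."""
    if not path.name.endswith(".schema.json"):
        raise ValueError(f"expected a *.schema.json file, got {path.name!r}")
```

Checking the parsed hostname, rather than searching the URL string for `localhost`, avoids accepting addresses such as `http://localhost.example.com`.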

## Project Layout

- `src/extractforms`: package code
- `tests/unit`: fast default tests
- `tests/integration`: component-level tests
- `tests/end2end`: user-facing behavior tests
- `skills`: AI helper skills for coding workflows

## Release

1. Bump `version` in `pyproject.toml`.
2. Create and push a git tag: `vX.Y.Z`.
3. The GitHub Action publishes the package to PyPI.

For manual validation, use workflow dispatch with `publish_target=testpypi`.
