Metadata-Version: 2.4
Name: doc2mark
Version: 0.5.0
Summary: Unified document processing with AI-powered OCR
Home-page: https://github.com/luisleo526/doc2mark
Author: HaoLiangWen
Author-email: doc2mark Team <luisleo52655@gmail.com>
Maintainer-email: doc2mark Team <luisleo52655@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/luisleo526/doc2mark
Project-URL: Documentation, https://doc2mark.readthedocs.io
Project-URL: Repository, https://github.com/luisleo526/doc2mark
Project-URL: Issues, https://github.com/luisleo526/doc2mark/issues
Project-URL: Changelog, https://github.com/luisleo526/doc2mark/blob/main/CHANGELOG.md
Keywords: document-processing,ocr,pdf,docx,xlsx,pptx,ai,gpt-4,openai,langchain,document-extraction,text-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.0.10
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: chardet>=5.0.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: pydantic>=2.0.0
Provides-Extra: ocr
Requires-Dist: openai>=2.0.0; extra == "ocr"
Requires-Dist: langchain>=1.2.0; extra == "ocr"
Requires-Dist: langchain-openai>=1.1.0; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: heif
Requires-Dist: pillow-heif>=0.16.0; extra == "heif"
Provides-Extra: mime
Requires-Dist: python-magic>=0.4.27; extra == "mime"
Requires-Dist: python-magic-bin>=0.4.14; sys_platform == "win32" and extra == "mime"
Provides-Extra: vertex-ai
Requires-Dist: langchain-google-genai>=2.0.0; extra == "vertex-ai"
Requires-Dist: langchain>=1.2.0; extra == "vertex-ai"
Provides-Extra: all
Requires-Dist: doc2mark[heif,mime,ocr,vertex_ai]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.23.0; extra == "docs"
Requires-Dist: myst-parser>=2.0.0; extra == "docs"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# doc2mark

[![PyPI version](https://img.shields.io/pypi/v/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![Python](https://img.shields.io/pypi/pyversions/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Turn any document into clean Markdown -- in one line.

## Features

- Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
- AI-powered OCR via **OpenAI**, **Google Gemini (Vertex AI)**, or **Tesseract**
- Preserves complex tables (merged cells, rowspan/colspan)
- One unified API + CLI for single files or entire directories
- Batch processing with parallel execution

## Install

```bash
# Core (no OCR)
pip install doc2mark

# With OpenAI or Tesseract OCR
pip install doc2mark[ocr]

# With Google Gemini / Vertex AI OCR
pip install doc2mark[vertex_ai]

# With HEIC/HEIF image support
pip install doc2mark[heif]

# Everything
pip install doc2mark[all]
```

## Quick start

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader()
result = loader.load("document.pdf")
print(result.content)
```
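
To keep the converted Markdown, write `result.content` to disk. The following is a minimal sketch that assumes only the `load()` method and `.content` attribute shown above; `save_markdown` is a hypothetical helper, not part of doc2mark.

```python
from pathlib import Path


def save_markdown(loader, source, dest=None):
    """Convert `source` with `loader` and write the Markdown to `dest`.

    `loader` is any object exposing `load(path)` whose result has a
    `.content` string, e.g. a UnifiedDocumentLoader instance.
    (Hypothetical helper, not a doc2mark API.)
    """
    source = Path(source)
    # Default destination: same path with a .md suffix.
    dest = Path(dest) if dest is not None else source.with_suffix(".md")
    result = loader.load(str(source))
    dest.write_text(result.content, encoding="utf-8")
    return dest
```

For example, `save_markdown(UnifiedDocumentLoader(), "document.pdf")` would write `document.md` next to the input.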

## OCR providers

doc2mark supports three OCR providers. Pass `ocr_provider` to `UnifiedDocumentLoader` to choose one.

### OpenAI (default)

Uses GPT-4.1 vision. Requires an API key.

```bash
export OPENAI_API_KEY=sk-...
```

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")

result = loader.load(
    "scanned_doc.pdf",
    extract_images=True,
    ocr_images=True,
)
```

Customize the model or use an OpenAI-compatible endpoint:

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    model="gpt-4o-mini",                     # cheaper model
    base_url="http://localhost:11434/v1",     # self-hosted / Ollama
    api_key="any-string",
)
```

### Google Gemini / Vertex AI

Uses Gemini models via Google Cloud. Authenticates with [Application Default Credentials](https://cloud.google.com/docs/authentication/application-default-credentials).

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",          # or set GOOGLE_CLOUD_PROJECT
)

result = loader.load("scan.pdf", extract_images=True, ocr_images=True)
```

Override model and region:

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(
    ocr_provider="vertex_ai",
    project="my-gcp-project",
    model="gemini-2.0-flash",          # default: gemini-3.1-flash-lite-preview
    location="us-central1",            # default: global
)
```

### Tesseract (offline)

Local OCR, no API key needed. Requires [Tesseract](https://github.com/tesseract-ocr/tesseract) installed on your system, plus the `ocr` extra, which pulls in `pytesseract`.

```python
from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.base import OCRConfig

loader = UnifiedDocumentLoader(
    ocr_provider="tesseract",
    ocr_config=OCRConfig(language="chinese"),   # optional language hint
)

result = loader.load("scan.png", extract_images=True, ocr_images=True)
```

### Provider comparison

| Provider | Requires | Best for | Install extra |
|----------|----------|----------|---------------|
| `openai` | `OPENAI_API_KEY` | Highest accuracy, complex layouts | `pip install doc2mark[ocr]` |
| `vertex_ai` | GCP credentials (Application Default Credentials) | Google Cloud workflows, Gemini models | `pip install doc2mark[vertex_ai]` |
| `tesseract` | Tesseract binary | Offline / air-gapped environments | `pip install doc2mark[ocr]` |
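
The table's requirements can be checked at startup to choose a provider automatically. This is a rough sketch, not doc2mark behavior: `pick_ocr_provider` is a hypothetical helper, the priority order is an assumption, and Application Default Credentials can also come from sources other than `GOOGLE_APPLICATION_CREDENTIALS`, which this check would miss.

```python
import os
import shutil


def pick_ocr_provider(env=None, has_tesseract=None):
    """Return the first usable provider name from the table, or None.

    Hypothetical helper; the openai > vertex_ai > tesseract order is
    an assumption, not something doc2mark defines.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "vertex_ai"
    if has_tesseract is None:
        # Look for the Tesseract binary on PATH.
        has_tesseract = shutil.which("tesseract") is not None
    return "tesseract" if has_tesseract else None
```

The returned string can be passed as `ocr_provider` to `UnifiedDocumentLoader`; a `None` result means none of the three checks passed.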

## Supported formats

| Category | Formats |
|----------|---------|
| Office | DOCX, XLSX, PPTX |
| PDF | PDF (text + scanned) |
| Images | PNG, JPG, WEBP, TIFF, BMP, GIF, HEIC, HEIF, AVIF |
| Text / Data | TXT, CSV, TSV, JSON, JSONL |
| Markup | HTML, XML, Markdown |
| Legacy | DOC, XLS, PPT, RTF, PPS (requires LibreOffice) |
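
When scripting around doc2mark, it can help to classify inputs before loading them, for instance to warn early when a legacy file needs LibreOffice. The sketch below mirrors the table; the extension lists (such as `.jpeg`, `.tif`, `.htm`, `.md`) are assumptions derived from the format names, not values read from doc2mark, and both function names are hypothetical.

```python
from pathlib import Path

# Extension lists are assumptions derived from the table above,
# not read from doc2mark itself.
FORMAT_CATEGORIES = {
    "Office": {".docx", ".xlsx", ".pptx"},
    "PDF": {".pdf"},
    "Images": {".png", ".jpg", ".jpeg", ".webp", ".tiff", ".tif",
               ".bmp", ".gif", ".heic", ".heif", ".avif"},
    "Text / Data": {".txt", ".csv", ".tsv", ".json", ".jsonl"},
    "Markup": {".html", ".htm", ".xml", ".md"},
    "Legacy": {".doc", ".xls", ".ppt", ".rtf", ".pps"},
}


def category_for(path):
    """Return the table category for `path`, or None if unlisted."""
    suffix = Path(path).suffix.lower()
    for category, suffixes in FORMAT_CATEGORIES.items():
        if suffix in suffixes:
            return category
    return None


def needs_libreoffice(path):
    """True for the formats the table marks as requiring LibreOffice."""
    return category_for(path) == "Legacy"
```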

## Common recipes

### Single file

```python
from doc2mark import load

# Text-only extraction (no OCR)
md = load("report.pdf").content

# With OCR for embedded images
md = load("report.pdf", extract_images=True, ocr_images=True).content
```

### Batch processing

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider="openai")

loader.batch_process(
    input_dir="documents/",
    output_dir="converted/",
    extract_images=True,
    ocr_images=True,
    save_files=True,
    show_progress=True,
)
```

### Process specific files

```python
from doc2mark import batch_process_files

results = batch_process_files(
    ["invoice.pdf", "contract.docx", "receipt.png"],
    output_dir="output/",
    extract_images=True,
    ocr_images=True,
)
```
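
The file list does not have to be written by hand. As a sketch, a hypothetical `collect_files` helper (plain `pathlib`, not a doc2mark function) could gather matching files from a directory tree; the default suffixes are only an example.

```python
from pathlib import Path


def collect_files(root, suffixes=(".pdf", ".docx", ".png")):
    """Recursively list files under `root` whose suffix is in `suffixes`.

    Hypothetical helper; matching is case-insensitive and the result
    is a sorted list of string paths.
    """
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

The result can then be passed as the first argument, for example `batch_process_files(collect_files("inbox/"), output_dir="output/")`.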

### OCR prompt templates

doc2mark includes specialized prompts for different content types:

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(
    ocr_provider="openai",
    prompt_template="table_focused",    # optimized for tables
)
```

Available templates: `default`, `table_focused`, `document_focused`, `multilingual`, `form_focused`, `receipt_focused`, `handwriting_focused`, `code_focused`.
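
If template names come from user input or a config file, it may be worth validating them before building a loader. A small sketch, assuming the list above is complete; `check_template` is a hypothetical helper, and how doc2mark itself reacts to an unknown name is not documented here.

```python
import difflib

# Names copied from the list above.
PROMPT_TEMPLATES = (
    "default", "table_focused", "document_focused", "multilingual",
    "form_focused", "receipt_focused", "handwriting_focused", "code_focused",
)


def check_template(name):
    """Return `name` if it is a listed template, else raise ValueError.

    Hypothetical helper; suggests the closest listed name on a typo.
    """
    if name in PROMPT_TEMPLATES:
        return name
    hint = difflib.get_close_matches(name, PROMPT_TEMPLATES, n=1)
    suffix = f" Did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"Unknown prompt template {name!r}.{suffix}")
```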

### Table output styles

Control how complex tables (with merged cells) are rendered:

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(
    table_style="minimal_html",     # clean HTML with rowspan/colspan (default)
    # table_style="markdown_grid",  # markdown with merge annotations
    # table_style="styled_html",    # full HTML with inline styles
)
```

## CLI

```bash
# Single file to stdout
doc2mark report.pdf

# Save to file
doc2mark report.pdf -o report.md

# Batch convert a directory
doc2mark documents/ -o converted/ -r

# With OpenAI OCR
doc2mark scan.pdf --ocr openai --ocr-images

# With Tesseract OCR
doc2mark scan.pdf --ocr tesseract --ocr-images

# Disable OCR entirely
doc2mark report.pdf --ocr none --no-ocr-images

# JSON output
doc2mark report.pdf --format json
```
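
The CLI can also be driven from Python scripts. The sketch below builds an argument list from only the flags shown above and runs it with `subprocess`; both function names are hypothetical, and it assumes the `doc2mark` command is on `PATH` and exits non-zero on failure.

```python
import subprocess


def doc2mark_command(path, output=None, ocr=None, ocr_images=False,
                     recursive=False, fmt=None):
    """Build an argument list using only the flags shown above."""
    cmd = ["doc2mark", str(path)]
    if output is not None:
        cmd += ["-o", str(output)]
    if recursive:
        cmd.append("-r")
    if ocr is not None:
        cmd += ["--ocr", ocr]
    if ocr_images:
        cmd.append("--ocr-images")
    if fmt is not None:
        cmd += ["--format", fmt]
    return cmd


def run_doc2mark(*args, **kwargs):
    """Run the CLI and return its stdout (assumes a non-zero exit on error)."""
    done = subprocess.run(doc2mark_command(*args, **kwargs),
                          check=True, capture_output=True, text=True)
    return done.stdout
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.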

## License

MIT -- see `LICENSE`.
