Metadata-Version: 2.2
Name: bocr
Version: 0.1.2
Summary: A Python package for OCR using Vision LLMs
Author-email: Adrian Phoulady <adrian.phoulady@gmail.com>
License: MIT
Project-URL: Repository, https://github.com/adrianphoulady/bocr
Keywords: ocr,vision,llm,vllm,bocr,text extraction,qwen-vl,llama-vision,phi-vision,ollama
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: accelerate>=0.34.0
Requires-Dist: torch>=2.4.0
Requires-Dist: torchvision>=0.19.0
Requires-Dist: transformers>=4.46.1
Requires-Dist: qwen_vl_utils>=0.0.10
Requires-Dist: ollama>=0.1
Requires-Dist: opencv-python-headless>=4.10.0.84
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: pypandoc>=1.11

# bOCR: Advanced OCR Framework with Vision LLMs

**bOCR** is a high-performance Optical Character Recognition (OCR) framework that uses Vision Large Language Models (VLLMs) for accurate text extraction and document processing. Designed for efficiency and modularity, it simplifies OCR workflows with minimal setup.

## Features

- **Minimal Setup**: Requires just a single backbone file (e.g., `qwen.py` or `ollamas.py`) for OCR execution, making it lightweight and easy to use.
- **Broad Vision LLM Support**: Integrates with vision LLMs such as `Qwen`, `Llama`, and `Phi`, as well as vision models served through `Ollama`.
- **Customizable Prompts**: Fine-tune OCR output using either a custom or default prompt.
- **Automated Preprocessing**: Image denoising, resizing, and PDF-to-image conversion.
- **Postprocessing & Export**: Supports merging pages and multiple export formats (`plain`, `markdown`, `docx`, `pdf`).
- **Configurable Pipeline**: A single `Config` object centralizes OCR settings.
- **Detailed Logging**: Integrated verbose logging for insights and debugging.

---

## Installation

### Install from PyPI (Recommended)

```bash
pip install bocr
```

### Install from Source (Development Version)

```bash
git clone https://github.com/adrianphoulady/bocr.git
cd bocr
pip install .
```

### Required Dependencies

For **PDF and document processing**, `poppler`, `pandoc`, and a LaTeX distribution are required. Install them as follows:

#### Linux (Debian/Ubuntu)
```bash
sudo apt install poppler-utils pandoc texlive-xetex texlive-fonts-recommended lmodern
```

#### macOS (using Homebrew)
```bash
brew install poppler pandoc
brew install --cask mactex-no-gui
```

#### Windows (using Chocolatey)
```powershell
choco install poppler pandoc miktex
```

---

## Quick Start

### Simple Example (Single File OCR)

A single backbone module from `bocr.backbones`, such as `qwen.py`, is all you need to run OCR on an image:

```python
from bocr.backbones.qwen import extract_text

result = extract_text("sample1.png")
print(result)
```

---

### Advanced Usage

```python
from bocr import Config, ocr

config = Config(model_id="Qwen/Qwen2-VL-7B-Instruct", export_results=True, export_format="pdf", verbose=True)
files = ["sample2.pdf"]
results = ocr(files, config)
print(results)
```

### Command Line Example

```bash
bocr sample1.jpg --export-results --export-format docx --verbose
```

---

## Configuration

The `Config` class centralizes OCR settings. Key parameters:

| Parameter        | Type         | Description                                            | Default                       |
|------------------|--------------|--------------------------------------------------------|-------------------------------|
| `prompt`         | `str`/`None` | Custom OCR prompt or `None` for default.               | `None`                        |
| `model_id`       | `str`        | Vision LLM model identifier.                           | `Qwen/Qwen2.5-VL-3B-Instruct` |
| `max_new_tokens` | `int`        | Max tokens generated by model.                         | `1024`                        |
| `preprocess`     | `bool`       | Enable preprocessing of input files.                   | `False`                       |
| `resolution`     | `int`        | DPI for PDF-to-image conversion.                       | `150`                         |
| `max_image_size` | `int`/`None` | Resize images to a max size. No resizing if `None`.    | `1920`                        |
| `result_format`  | `str`        | Output format (`plain`, `markdown`).                   | `md`                          |
| `merge_text`     | `bool`       | Merge extracted text.                                  | `False`                       |
| `export_results` | `bool`       | Save results to files.                                 | `False`                       |
| `export_format`  | `str`        | File output format (`txt`, `md`, `docx`, `pdf`).       | `md`                          |
| `export_dir`     | `str`/`None` | Directory for output files. `./ocr_exports` if `None`. | `None`                        |
| `verbose`        | `bool`       | Enables detailed logging for debugging.                | `False`                       |
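As a sketch of how these options combine, a `Config` that preprocesses a PDF and exports merged Markdown might look like the following (every keyword is a parameter from the table above; the file and directory names are illustrative):

```python
from bocr import Config, ocr

# Assemble a pipeline configuration from the parameters in the table.
config = Config(
    model_id="Qwen/Qwen2.5-VL-3B-Instruct",  # default model
    preprocess=True,       # denoise and resize inputs first
    resolution=200,        # DPI for PDF-to-image conversion
    max_image_size=1920,   # cap the longer image side
    merge_text=True,       # join per-page results into one text
    export_results=True,
    export_format="md",
    export_dir="out",      # defaults to ./ocr_exports when None
    verbose=True,
)

results = ocr(["report.pdf"], config)
```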

---

## OCR Pipeline

### 1. Preprocessing

- **URL Handling**: Downloads remote files if input is a URL.
- **PDF Conversion**: Converts PDFs into image format (requires `poppler` installed and in `PATH`).
- **Image Enhancement**: Applies denoising and contrast adjustment.
- **Resizing**: Optimizes images for Vision LLMs.
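The resizing step presumably preserves aspect ratio while capping the longer side at `max_image_size`. A minimal stand-alone sketch of that computation (the helper name is ours, not part of the bocr API):

```python
def fit_within(width: int, height: int, max_size: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_size.

    Images already within the limit are returned unchanged; aspect
    ratio is preserved in either case.
    """
    scale = min(1.0, max_size / max(width, height))
    return round(width * scale), round(height * scale)

# A 4K page scaled for a Vision LLM with the default max_image_size=1920:
print(fit_within(3840, 2160, 1920))  # (1920, 1080)
```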

### 2. Text Extraction

- Extracts text using Vision LLMs, with support for custom prompts for tailored OCR instructions.

### 3. Postprocessing

- Formats the extracted text and, if `merge_text` is set, merges per-page results.
- Converts results to the configured export format (e.g., Markdown, PDF).
- Saves results to the export directory if `export_results` is enabled.
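The merge step can be pictured as joining per-page results with a page separator before export. A stand-alone sketch (the helper name and separator are illustrative assumptions, not bocr's actual internals):

```python
def merge_pages(pages: list[str], separator: str = "\n\n---\n\n") -> str:
    """Join per-page OCR results into one document string.

    Blank pages are dropped and surrounding whitespace is trimmed so
    the separator never produces empty sections.
    """
    return separator.join(p.strip() for p in pages if p.strip())

pages = ["# Page 1\nIntro text.", "", "Page 2 body."]
print(merge_pages(pages))
```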

---

## Logging

Enable logging by setting `verbose=True` in the `Config` object. Logs provide insights into preprocessing, extraction, and postprocessing steps.

---

## Supported Models

bOCR supports Vision LLMs such as:

- `Qwen/Qwen2-VL-2B-Instruct`
- `Qwen/Qwen2-VL-7B-Instruct`
- `Qwen/Qwen2-VL-72B-Instruct`
- `Qwen/QVQ-72B-Preview`
- `meta-llama/Llama-3.2-11B-Vision-Instruct`
- `meta-llama/Llama-3.2-90B-Vision-Instruct`
- `microsoft/Phi-3.5-vision-instruct`
- `llama3.2-vision:11b` from Ollama
- `llama3.2-vision:90b` from Ollama

Additional models can be supported by implementing a new backbone in `bocr/backbones/` and updating `mappings.yaml`.

---

## License

This project is licensed under the MIT License.
