Metadata-Version: 2.4
Name: mistocr
Version: 0.2.5
Summary: Batch OCR for PDFs with heading restoration and visual content integration
Home-page: https://github.com/franckalbinet/mistocr
Author: Solveit
Author-email: nobody@fast.ai
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: mistralai
Requires-Dist: pillow
Requires-Dist: dotenv
Requires-Dist: lisette
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Mistocr


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

**PDF OCR is a critical bottleneck in AI pipelines.** It’s often
mentioned in passing, as if it were a trivial step. In practice, it is
far from trivial. Poorly converted PDFs mean garbage-in-garbage-out for
downstream AI systems (RAG, …).

When [Mistral AI](https://mistral.ai) released their [state-of-the-art
OCR model](https://mistral.ai/fr/news/mistral-ocr) in March 2025, it
opened new possibilities for large-scale document processing. While
alternatives like [datalab.to](https://www.datalab.to) and
[docling.ai](https://www.docling.ai) offer viable solutions, Mistral OCR
delivers exceptional accuracy at a compelling price point.

**mistocr** emerged from months of real-world usage across projects
requiring large-scale processing of niche-domain PDFs. It addresses three
fundamental challenges that raw OCR output leaves unsolved:

- **Heading hierarchy restoration**: Even state-of-the-art OCR sometimes
  produces inconsistent heading levels in large documents—a complex task
  to get right. mistocr uses LLM-based analysis to restore proper
  document structure, essential for downstream AI tasks.

- **Visual content integration**: Charts, figures and diagrams are
  automatically classified and described, then integrated into the
  markdown. This makes visual information searchable and accessible for
  downstream applications.

- **Cost-efficient batch processing**: The OCR step exclusively uses
  Mistral’s batch API, cutting costs by 50% (\$0.50 vs \$1.00 per 1000
  pages) while eliminating the boilerplate code typically required.
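
As a rough illustration of the pricing above, the arithmetic can be
sketched as follows. The helper name `ocr_cost` and the 20,000-page
corpus are illustrative assumptions; only the two per-1000-page prices
come from the text.

```python
# Illustrative arithmetic only; `ocr_cost` is a hypothetical helper,
# not part of mistocr.
def ocr_cost(pages: int, price_per_1000: float) -> float:
    """Return the OCR cost in dollars for `pages` pages."""
    return pages / 1000 * price_per_1000

pages = 20_000                    # assumed corpus size
standard = ocr_cost(pages, 1.00)  # standard API price quoted above
batch = ocr_cost(pages, 0.50)     # batch API price quoted above
print(f"standard: {standard:.2f} USD, batch: {batch:.2f} USD, "
      f"saved: {standard - batch:.2f} USD")
```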

**In short**: Complete PDF OCR with heading hierarchy fixes and image
descriptions for RAG and LLM pipelines.

> [!NOTE]
>
> **Want to see mistocr in action?** This
> [tutorial](https://share.solve.it.com/d/97f75412ca949af76a5945b4dfc443c7)
> demonstrates real-world PDF processing and shows how clean markdown
> enables structure-aware navigation through long documents—letting you
> find exactly what you need, fast.

## Get Started

Install the latest release from [PyPI](https://pypi.org/project/mistocr):

``` sh
$ pip install mistocr
```

Set your API keys:

``` python
import os
os.environ['MISTRAL_API_KEY'] = 'your-key-here'
os.environ['ANTHROPIC_API_KEY'] = 'your-key-here'  # for refine features (see Advanced Usage for other LLMs)
```
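
Before launching a long batch job, it can help to confirm that the keys
are actually set. Here is a minimal sketch using only the standard
library. `missing_keys` is a hypothetical helper, not part of mistocr,
and requiring `ANTHROPIC_API_KEY` assumes you use the default refine
setup shown above.

```python
import os

# Keys used in the setup snippet above.
REQUIRED_KEYS = ("MISTRAL_API_KEY", "ANTHROPIC_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in required if not env.get(k)]

missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
```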

### Complete Pipeline

#### Single File Processing

Process a single PDF with OCR (using Mistral’s batch API for cost
efficiency), heading fixes, and image descriptions:

``` python
from mistocr.pipeline import pdf_to_md
await pdf_to_md('files/test/resnet.pdf', 'files/test/md_test')
```

    Step 1/3: Running OCR on files/test/resnet.pdf...
    Mistral batch job status: QUEUED
    Mistral batch job status: RUNNING
    Mistral batch job status: RUNNING
    Step 2/3: Fixing heading hierarchy...
    Step 3/3: Adding image descriptions...
    Describing 7 images...
    Saved descriptions to ocr_temp/resnet/img_descriptions.json
    Adding descriptions to 12 pages...
    Done! Enriched pages saved to files/test/md_test
    Done!

This will (as indicated by the output):

1.  OCR the PDF using Mistral’s batch API
2.  Fix heading hierarchy inconsistencies
3.  Describe images (charts, diagrams) and add those descriptions into
    the markdown
4.  Save everything to `files/test/md_test`
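
The bare `await` above works as-is in a notebook. A plain Python script
has no running event loop, so one option is to wrap the call with
`asyncio.run`. This is a minimal sketch, assuming `pdf_to_md` is a
coroutine function (as the `await` suggests); `main` is an illustrative
name.

```python
import asyncio

async def main(src: str, dst: str) -> None:
    # Imported lazily so this module can be loaded even when
    # mistocr is not installed yet.
    from mistocr.pipeline import pdf_to_md
    await pdf_to_md(src, dst)

# In a script, start the event loop explicitly:
# asyncio.run(main('files/test/resnet.pdf', 'files/test/md_test'))
```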

The output structure will be:

    files/test/md_test/
    ├── img/
    │   ├── img-0.jpeg
    │   ├── img-1.jpeg
    │   └── ...
    ├── page_1.md
    ├── page_2.md
    └── ...
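
Because the pages are plain files, they can be inspected with the
standard library. The sketch below lists page files in numeric order (so
that `page_10.md` sorts after `page_2.md`), assuming the `page_N.md`
naming shown above. `list_pages` is a hypothetical helper, not a mistocr
function.

```python
import re
from pathlib import Path

PAGE_RE = re.compile(r"page_(\d+)\.md")

def list_pages(out_dir):
    """Return the page files in `out_dir`, sorted by page number."""
    numbered = []
    for p in Path(out_dir).iterdir():
        m = PAGE_RE.fullmatch(p.name)
        if m:  # skips img/ and any other non-page entries
            numbered.append((int(m.group(1)), p))
    return [p for _, p in sorted(numbered)]
```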

Each page’s markdown will include inline image descriptions:

```` markdown
```markdown
![Figure 1](img/img-0.jpeg)
AI-generated image description:
___
A residual learning block...
___
```
````
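
If you need the descriptions separately from the page text, they can be
pulled back out with a regular expression. This is a sketch that assumes
the exact layout shown above (image link, marker line, description
between two `___` lines). `extract_img_descs` is a hypothetical helper,
not part of mistocr.

```python
import re

# Matches: ![alt](path) / marker line / ___ / description / ___
DESC_RE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)]+)\)[ \t]*\n"
    r"AI-generated image description:[ \t]*\n"
    r"___[ \t]*\n"
    r"(?P<desc>.*?)\n___",
    re.DOTALL,
)

def extract_img_descs(md: str):
    """Return (image_path, description) pairs found in a page's markdown."""
    return [(m["path"], m["desc"].strip()) for m in DESC_RE.finditer(md)]
```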

To read the fully processed document, use the
[`read_pgs`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
function:

``` python
from mistocr.pipeline import read_pgs
md = read_pgs('files/test/md_test')
print(md[:500])
```

    # Deep Residual Learning for Image Recognition  ... page 1

    Kaiming He Xiangyu Zhang Shaoqing Ren Jian Sun<br>Microsoft Research<br>\{kahe, v-xiangz, v-shren, jiansun\}@microsoft.com


    ## Abstract ... page 1

    Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, ins

By default,
[`read_pgs()`](https://franckalbinet.github.io/mistocr/core.html#read_pgs)
joins all pages. Pass `join=False` to get a list of individual pages
instead.
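
The `... page N` suffix on each heading in the output above makes it easy
to build a quick outline of a document. This is a sketch that assumes
every heading follows that exact pattern. `outline` is a hypothetical
helper that operates on the joined string returned by `read_pgs`.

```python
import re

# Heading line of the form: "## Title ... page 3"
HDG_RE = re.compile(
    r"^(#{1,6})[ \t]+(.*?)[ \t]*\.\.\.[ \t]*page[ \t]+(\d+)[ \t]*$",
    re.MULTILINE,
)

def outline(md: str):
    """Return (level, title, page) tuples for each annotated heading."""
    return [(len(m[1]), m[2], int(m[3])) for m in HDG_RE.finditer(md)]

def print_outline(md: str) -> None:
    """Print the headings indented by level, with their page numbers."""
    for level, title, page in outline(md):
        print(f"{'  ' * (level - 1)}{title} (p. {page})")
```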

### Advanced Usage

**Batch OCR for entire folders:**

``` python
from mistocr.core import ocr_pdf

# OCR all PDFs in a folder using Mistral's batch API
output_dirs = ocr_pdf('path/to/pdf_folder', dst='output_folder')
```

**Custom models and prompts for heading fixes:**

``` python
from mistocr.refine import fix_hdgs

# Use a different model or custom prompt
fix_hdgs('ocr_output/doc1', 
         model='gpt-4o',
         prompt=your_custom_prompt)
```

**Custom image description with rate limiting:**

``` python
from mistocr.refine import add_img_descs

# Control API usage and customize descriptions
await add_img_descs('ocr_output/doc1',
                    model='claude-opus-4',
                    semaphore=5,  # More concurrent requests
                    delay=0.5)    # Shorter delay between calls
```
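
For intuition, a semaphore-plus-delay pattern is commonly written in
asyncio along the following lines. This is a generic sketch, not
mistocr's actual implementation: `limited_gather` and `worker` are
hypothetical names, and how `add_img_descs` uses `semaphore` and `delay`
internally is not shown in this document.

```python
import asyncio

async def limited_gather(items, worker, semaphore=2, delay=1.0):
    """Run `worker(item)` for each item with bounded concurrency.

    At most `semaphore` workers run at once; each slot is held for an
    extra `delay` seconds after its call to space out requests.
    """
    sem = asyncio.Semaphore(semaphore)

    async def run(item):
        async with sem:
            result = await worker(item)
            await asyncio.sleep(delay)  # pause before releasing the slot
            return result

    # gather preserves the order of `items` in the returned list
    return await asyncio.gather(*(run(i) for i in items))
```

Raising `semaphore` or lowering `delay` in such a scheme increases
throughput at the cost of hitting provider rate limits sooner.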

For complete control over each pipeline step, see the
[core](https://fr.anckalbi.net/mistocr/core.html),
[refine](https://fr.anckalbi.net/mistocr/refine.html), and
[pipeline](https://fr.anckalbi.net/mistocr/pipeline.html) module
documentation.

## Known Limitations & Future Work

`mistocr` is under active development. Current limitations include:

- **No timeout on batch jobs**: Jobs poll indefinitely until completion.
  If a job stalls, manual intervention is required.
- **Limited error handling**: When batch jobs fail, error reporting and
  recovery options are minimal.
- **Progress monitoring**: Currently limited to periodic status prints.
  Future versions will support callbacks or streaming updates for better
  real-time monitoring.
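
Until a timeout is built in, a deadline can be enforced around any
polling loop you control yourself. The following is a generic sketch:
`poll_until_done` and `is_done` are hypothetical names, and the code does
not hook into mistocr's internal polling.

```python
import time

def poll_until_done(is_done, timeout=3600.0, interval=5.0):
    """Call `is_done()` every `interval` seconds until it returns True.

    Raises TimeoutError if the job has not finished within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_done():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        time.sleep(interval)
```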

Contributions are welcome! If you encounter issues or have ideas for
improvements, please open an issue or discussion on
[GitHub](https://github.com/franckalbinet/mistocr).

## Developer Guide

If you are new to `nbdev`, here are some useful pointers to get you
started.

### Install mistocr in Development mode

``` sh
# make sure mistocr package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to mistocr
$ nbdev_prepare
```

### Documentation

Documentation is hosted on this GitHub
[repository](https://github.com/franckalbinet/mistocr)’s
[pages](https://franckalbinet.github.io/mistocr/). Package-manager-specific
guidelines are available on
[conda](https://anaconda.org/franckalbinet/mistocr) and
[pypi](https://pypi.org/project/mistocr/).
