Metadata-Version: 2.2
Name: mdify
Version: 0.1.6
Summary: A powerful tool to extract text, tables, charts, and formulas from documents and convert them into Markdown format, ideal to improve LLM's accuracy and for versatile document processing.
Home-page: https://github.com/stefanodangelo/mdify
Author: Stefano D'Angelo
License: CC BY-NC 4.0
Project-URL: GitHub Repository, https://github.com/stefanodangelo/mdify
Project-URL: Documentation, https://stefanodangelo.github.io/mdify/
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: GPU :: NVIDIA CUDA :: 11.7
Classifier: Framework :: MkDocs
Classifier: License :: Free for non-commercial use
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdfium2==4.30.0
Requires-Dist: ultralytics==8.3.51
Requires-Dist: ultralyticsplus
Requires-Dist: ipykernel==6.29.5
Requires-Dist: paddlepaddle==2.6.2
Requires-Dist: paddleocr
Requires-Dist: pandas==2.2.0
Requires-Dist: tabulate==0.9.0
Requires-Dist: doclayout-yolo==0.0.2
Requires-Dist: unimernet==0.2.1
Requires-Dist: struct-eqtable==0.3.3
Requires-Dist: lmdeploy==0.6.4
Requires-Dist: omegaconf==2.3.0
Requires-Dist: supervision==0.25.1
Requires-Dist: torch==2.3.1
Requires-Dist: optimum==1.23.3
Requires-Dist: onnxruntime==1.15.1
Requires-Dist: onnx==1.16.1
Requires-Dist: surya-ocr<=0.8.0
Requires-Dist: protobuf==3.20.2
Requires-Dist: easyocr==1.7.2
Requires-Dist: pymupdf==1.25.1
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: project-url
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![PyPI](https://img.shields.io/pypi/v/mdify?color=blue)
![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY%20NC%204.0-lightgrey)

# MDify: Convert any PDF to Markdown  

**MDify** is a powerful Python library for converting PDFs (or image-based documents) into clean, structured Markdown.

Unlike other tools, **MDify** can accurately extract **tables, charts, and images**, even offering the option to save them separately for further use. \
This is particularly useful when working with documents like financial statements, spreadsheets, and data-rich reports, which usually have lots of tables and images. \
MDify categorizes images into general pictures and charts and extracts tables of any kind, even complex ones with merged cells and sparse data.

Whether you're working with *research papers*, *reports*, or *general documents*, MDify ensures the data is extracted in a structured, clean, and machine-readable format, making it ideal for tasks like fine-tuning, question answering, and document analysis in the context of Large Language Models (**LLMs**). \
By converting complex PDFs into well-structured Markdown, this tool helps streamline the input process for LLM applications, reducing the time spent on manual cleaning and formatting. With features like table extraction, image preservation, and high-quality OCR, MDify is a perfect fit for preparing large volumes of data for AI models.

**Note**: Currently this tool only supports English documents.


## 🚀 Installation  
First, install **MDify** via PyPI:  

```sh
pip install mdify
```


## ⚡ Quickstart  
Convert a document to Markdown with just a few lines of code:
```python
from mdify import DocumentParser

parser = DocumentParser()
parser.parse('PATH_TO_YOUR_DOCUMENT')
```

Or parse multiple documents from one folder at once simply by changing the last line to:
```python
parser.parse_multiple('PATH_TO_YOUR_FOLDER')
```

You can then choose the outputs to save using `DocumentParser(save_artifacts=...)`, or you can set the write mode to embedded, placeholder or described by passing the `write_mode` parameter to the `parse()` function.


## 🔹 Key Features  
✔️ **Handles complex layouts** - Extracts text, tables, and visual elements with precision
🖼️ **Preserves images & charts** - Gives the option to save and reuse extracted visuals for Computer Vision tasks
🎯 **Optimized for accuracy** - Combines layout detection and OCR to extract text from documents
🤖 **Preprocessing for LLM applications** - Converts documents to Markdown, which is popular for LLM training and fine-tuning tasks
🛠️ **Debug mode** - Save intermediate document elements as images for analysis

**Notes**:
- The first run will take ~2 minutes to download the necessary models.
- Currently, MDify can only handle PDFs and images (such as text extracts, document scans, etc.). Therefore, please make sure that your documents meet the requirements in order for the tool to work best.
- Diagrams are not supported yet, therefore if you use the `DESCRIBED` write mode they may be analyzed incorrectly.


## 📄 Documentation
For more information, please refer to the [official documentation](https://stefanodangelo.github.io/mdify/).


## 🤝 Contributing
MDify is an independent, open-source project developed and maintained by passionate developers. Your support is highly valued, and any contributions — whether through issues, bug reports, feature requests, or pull requests — are more than welcome!

If you are interested in improving this library or adding new features, please don't hesitate to get involved!


## 💖 Support
Being an independent developer, I would much appreciate it if you could\
[![Buy me a coffee](https://img.buymeacoffee.com/button-api/?text=buy%20me%20a%20coffee&emoji="☕"&slug=stefanodangelo&button_colour=FF5F5F&font_colour=ffffff&font_family=Lato&outline_colour=000000&coffee_colour=FFDD00)](https://www.buymeacoffee.com/stefanodangelo)


Thank you!

## ⚖️ License
This project is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](LICENSE).

You can find the full text of the license here: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode)
