Metadata-Version: 2.4
Name: doc2mark
Version: 0.3.2
Summary: Unified document processing with AI-powered OCR
Home-page: https://github.com/luisleo526/doc2mark
Author: HaoLiangWen
Author-email: doc2mark Team <luisleo52655@gmail.com>
Maintainer-email: doc2mark Team <luisleo52655@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/luisleo526/doc2mark
Project-URL: Documentation, https://doc2mark.readthedocs.io
Project-URL: Repository, https://github.com/luisleo526/doc2mark
Project-URL: Issues, https://github.com/luisleo526/doc2mark/issues
Project-URL: Changelog, https://github.com/luisleo526/doc2mark/blob/main/CHANGELOG.md
Keywords: document-processing,ocr,pdf,docx,xlsx,pptx,ai,gpt-4,openai,langchain,document-extraction,text-extraction
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.0.10
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: chardet>=5.0.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: pandas>=1.3.0
Provides-Extra: ocr
Requires-Dist: openai>=1.0.0; extra == "ocr"
Requires-Dist: langchain>=0.1.0; extra == "ocr"
Requires-Dist: langchain-openai>=0.0.2; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: all
Requires-Dist: doc2mark[ocr]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-xdist>=3.0.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.23.0; extra == "docs"
Requires-Dist: myst-parser>=2.0.0; extra == "docs"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# doc2mark

[![PyPI version](https://img.shields.io/pypi/v/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![Python](https://img.shields.io/pypi/pyversions/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**doc2mark** converts documents in many formats to Markdown while preserving complex structures such as tables, using AI-powered OCR when needed. Its unified API handles everything from plain text files to mixed-format document sets, and it supports batch processing of whole directories.

## ✨ Key Features

- **Universal Format Support**: PDF, DOCX, XLSX, PPTX, HTML, JSON, CSV, and more
- **Advanced Batch Processing**: Process entire directories with progress tracking and error handling
- **AI-Powered OCR**: Multiple providers (OpenAI GPT-4o, Tesseract) with specialized prompt templates
- **Dynamic Configuration**: Update OCR settings on-the-fly without reinitializing
- **Table Structure Preservation**: Maintains merged cells, multi-level headers, and complex layouts
- **Multiple Output Formats**: Markdown (default), JSON, or plain text
- **Comprehensive Error Handling**: Robust processing with detailed error reporting
- **Caching Support**: Optional caching for improved performance on repeated processing

## 🚀 Quick Start

### Installation

```bash
# Basic installation
pip install doc2mark

# With OCR support (quotes stop shells such as zsh from expanding the brackets)
pip install "doc2mark[ocr]"

# With all dependencies
pip install "doc2mark[all]"
```

### Basic Usage

```python
from doc2mark import UnifiedDocumentLoader

# Initialize loader (defaults to OpenAI OCR)
loader = UnifiedDocumentLoader()

# Convert any document to markdown
result = loader.load('document.pdf')
print(result.content)
```

### With Enhanced OCR Configuration

```python
from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.prompts import PromptTemplate

# Configure OCR with advanced settings
loader = UnifiedDocumentLoader(
    ocr_provider='openai',
    api_key='your-openai-api-key',  # or set OPENAI_API_KEY env var
    model='gpt-4.1',  # OpenAI model to use
    temperature=0.1,
    max_tokens=4096,
    max_workers=5,
    prompt_template=PromptTemplate.DOCUMENT_FOCUSED,
    timeout=60,
    max_retries=3
)

# Process with image extraction and OCR
result = loader.load(
    'scanned_document.pdf',
    extract_images=True,
    ocr_images=True,
    show_progress=True
)
```
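
When the key is taken from the `OPENAI_API_KEY` environment variable, a small guard can fail early with a clear message instead of at the first OCR request. The following is a minimal sketch using only the standard library; `resolve_api_key` is a hypothetical helper for illustration and is not part of doc2mark.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit key if given, else OPENAI_API_KEY from the environment.

    Hypothetical helper: doc2mark itself may resolve the key differently.
    """
    env = os.environ if env is None else env
    key = explicit or env.get("OPENAI_API_KEY", "")
    if not key.strip():
        raise RuntimeError(
            "No OpenAI API key found: pass api_key=... or set OPENAI_API_KEY"
        )
    return key.strip()
```

The returned value can then be passed as `api_key=resolve_api_key()` when constructing `UnifiedDocumentLoader`.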

## 🔧 OCR Providers

### OpenAI (Recommended)

```python
from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.prompts import PromptTemplate

# Full OpenAI configuration
loader = UnifiedDocumentLoader(
    ocr_provider='openai',
    api_key='your-openai-api-key',  # or set OPENAI_API_KEY env var
    model='gpt-4o',
    temperature=0,
    max_tokens=4096,
    max_workers=5,
    prompt_template=PromptTemplate.TABLE_FOCUSED,
    # Additional OpenAI parameters
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)
```

### Tesseract (Offline)

```python
# Use Tesseract for offline processing
loader = UnifiedDocumentLoader(
    ocr_provider='tesseract'
)
```

## 📊 Advanced Batch Processing

### Process Entire Directories

```python
# Batch process with full configuration
results = loader.batch_process(
    input_dir='./documents',
    output_dir='./processed',
    output_format='markdown',
    extract_images=True,
    ocr_images=True,
    recursive=True,
    show_progress=True,
    save_files=True
)

# Check results
for file_path, result in results.items():
    if result['status'] == 'success':
        print(f"✅ {file_path}: {result['content_length']} chars")
    else:
        print(f"❌ {file_path}: {result['error']}")
```
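
For larger runs it can help to reduce the results dictionary to a few numbers and a list of failures. The sketch below assumes only the result shape used in the loop above (a `status` key, plus `error` on failure); `summarize_results` is a hypothetical helper, and the `'failed'` status value in the sample is an assumption.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def summarize_results(
    results: Dict[str, Dict[str, Any]],
) -> Tuple[Counter, List[Tuple[str, str]]]:
    """Count outcomes by status and collect (path, error) pairs for non-successes."""
    counts = Counter(r.get("status", "unknown") for r in results.values())
    failures = [
        (path, str(r.get("error", "")))
        for path, r in results.items()
        if r.get("status") != "success"
    ]
    return counts, failures


# Hand-written sample in the shape shown above, not real doc2mark output.
sample = {
    "a.pdf": {"status": "success", "content_length": 1200},
    "b.docx": {"status": "failed", "error": "unsupported encoding"},
}
counts, failures = summarize_results(sample)
print(counts["success"], failures)
```

Anything whose status is not `'success'` is treated as a failure, so the helper does not depend on the exact failure label.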

### Process Specific Files

```python
# Process a list of specific files
files = ['report.pdf', 'data.xlsx', 'presentation.pptx']
results = loader.batch_process_files(
    file_paths=files,
    output_dir='./output',
    extract_images=True,
    ocr_images=True,
    show_progress=True
)
```

### Using Convenience Functions

```python
from doc2mark import batch_process_documents, batch_process_files

# High-level batch processing
results = batch_process_documents(
    input_dir='./docs',
    output_format='json',
    ocr_provider='openai',
    extract_images=True,
    ocr_images=True
)
```

## 🎯 Specialized Prompt Templates

doc2mark includes 8 specialized prompt templates optimized for different content types:

```python
from doc2mark.ocr.prompts import PromptTemplate

# Available templates
templates = {
    PromptTemplate.DEFAULT: "General purpose text extraction",
    PromptTemplate.TABLE_FOCUSED: "Optimized for tabular data",
    PromptTemplate.DOCUMENT_FOCUSED: "Preserves document structure", 
    PromptTemplate.FORM_FOCUSED: "Extract form fields and values",
    PromptTemplate.RECEIPT_FOCUSED: "Invoices and receipts",
    PromptTemplate.HANDWRITING_FOCUSED: "Handwritten text",
    PromptTemplate.CODE_FOCUSED: "Source code and technical docs",
    PromptTemplate.MULTILINGUAL: "Non-English documents"
}

# Use specific template
loader = UnifiedDocumentLoader(
    prompt_template=PromptTemplate.TABLE_FOCUSED
)
```

## ⚙️ Dynamic Configuration

Update OCR settings without reinitializing:

```python
# Initial setup
loader = UnifiedDocumentLoader(ocr_provider='openai')

# Update configuration dynamically
loader.update_ocr_configuration(
    model='gpt-4o-mini',
    temperature=0.3,
    prompt_template='table_focused',
    max_workers=10
)

# Validate setup
validation = loader.validate_ocr_setup()
print(f"OCR Status: {'✅ Valid' if not validation['errors'] else '❌ Issues found'}")

# Get available templates
templates = loader.get_available_prompt_templates()
for name, description in templates.items():
    print(f"  {name}: {description}")
```

## 📖 Supported Formats

| Category | Formats | Notes |
|----------|---------|-------|
| **PDF** | `.pdf` | Text extraction + OCR for scanned content |
| **Microsoft Office** | `.docx`, `.xlsx`, `.pptx` | Full support with image extraction |
| **Legacy Office** | `.doc`, `.xls`, `.ppt`, `.rtf`, `.pps` | Requires LibreOffice |
| **Text/Data** | `.txt`, `.csv`, `.tsv`, `.json`, `.jsonl` | Direct processing |
| **Web/Markup** | `.html`, `.xml`, `.md`, `.markdown` | Structure preservation |
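
Before a batch run, it can be useful to sort input files into these categories, for example to spot legacy files that need LibreOffice. This is a minimal sketch: the extension sets are copied from the table above, while `FORMAT_CATEGORIES` and `group_by_category` are hypothetical names and not part of doc2mark.

```python
from pathlib import Path
from typing import Dict, Iterable, List

# Extension groups copied from the "Supported Formats" table.
FORMAT_CATEGORIES = {
    "pdf": {".pdf"},
    "office": {".docx", ".xlsx", ".pptx"},
    "legacy_office": {".doc", ".xls", ".ppt", ".rtf", ".pps"},
    "text_data": {".txt", ".csv", ".tsv", ".json", ".jsonl"},
    "web_markup": {".html", ".xml", ".md", ".markdown"},
}


def group_by_category(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by format category, matching extensions case-insensitively."""
    groups: Dict[str, List[str]] = {}
    for p in paths:
        suffix = Path(p).suffix.lower()
        category = next(
            (name for name, exts in FORMAT_CATEGORIES.items() if suffix in exts),
            "unsupported",  # anything not listed in the table
        )
        groups.setdefault(category, []).append(p)
    return groups
```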

## 🔍 Output Formats

### Markdown (Default)

```python
result = loader.load('document.pdf')
# Returns clean Markdown with preserved formatting
```

### JSON with Metadata

```python
import json

from doc2mark import OutputFormat

result = loader.load('document.pdf', output_format=OutputFormat.JSON)
data = json.loads(result.content)
# Structured data with metadata
```

### Plain Text

```python
result = loader.load('document.pdf', output_format=OutputFormat.TEXT)
# Clean text without formatting
```

## 🌍 Language Support

Automatic language detection and preservation:

```python
# Multilingual documents
result = loader.load(
    'chinese_document.pdf',
    prompt_template=PromptTemplate.MULTILINGUAL
)

# The output preserves the original language
```

## 🛠️ Advanced Features

### Image Extraction and OCR

```python
# Extract images without OCR
result = loader.load(
    'document.pdf',
    extract_images=True,
    ocr_images=False  # Keep as base64 data
)

# Extract images with OCR processing
result = loader.load(
    'document.pdf', 
    extract_images=True,
    ocr_images=True  # Convert images to text descriptions
)

# Access extracted images
if result.images:
    print(f"Extracted {len(result.images)} images")
```

### Progress Tracking

```python
# Show detailed progress during processing
result = loader.load(
    'large_document.pdf',
    show_progress=True
)

# Batch processing with progress
results = loader.batch_process(
    'documents/',
    show_progress=True
)
```

### Caching

```python
# Enable caching for repeated processing
loader = UnifiedDocumentLoader(
    cache_dir='./cache'
)

# Subsequent calls to the same file will use cached results
```

### Error Handling

```python
from doc2mark.core.base import ProcessingError, UnsupportedFormatError

try:
    result = loader.load('document.pdf')
except UnsupportedFormatError as e:
    print(f"Format not supported: {e}")
except ProcessingError as e:
    print(f"Processing failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```
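
When looping over many files by hand, the same pattern can collect failures instead of stopping at the first one. The sketch below is deliberately generic: `load_all` is a hypothetical helper that accepts any callable, so it makes no assumptions about doc2mark beyond `loader.load(path)` raising an exception on failure.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def load_all(
    load: Callable[[str], Any],
    paths: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call load(path) for each path; return (successes, errors keyed by path)."""
    ok: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            ok[path] = load(path)
        except Exception as exc:  # record the failure and keep going
            failed[path] = f"{type(exc).__name__}: {exc}"
    return ok, failed
```

With a configured loader this would be called as `load_all(loader.load, ['report.pdf', 'data.xlsx'])`.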

## 📊 Integration Examples

### RAG Pipeline Integration

```python
from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Process documents for RAG
loader = UnifiedDocumentLoader(
    prompt_template=PromptTemplate.DOCUMENT_FOCUSED
)

documents = ['report.pdf', 'data.xlsx', 'analysis.docx']
texts = []

for doc in documents:
    result = loader.load(doc, extract_images=True, ocr_images=True)
    texts.append(result.content)

# Split for vector database
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = text_splitter.create_documents(texts)
```

### Automated Document Processing Pipeline

```python
from doc2mark import UnifiedDocumentLoader
from doc2mark.ocr.prompts import PromptTemplate

def process_document_pipeline(input_dir, output_dir):
    """Complete document processing pipeline."""
    
    loader = UnifiedDocumentLoader(
        ocr_provider='openai',
        model='gpt-4o',
        prompt_template=PromptTemplate.DOCUMENT_FOCUSED
    )
    
    # Validate OCR setup
    validation = loader.validate_ocr_setup()
    if validation['errors']:
        raise RuntimeError(f"OCR setup issues: {validation['errors']}")
    
    # Process all documents
    results = loader.batch_process(
        input_dir=input_dir,
        output_dir=output_dir,
        extract_images=True,
        ocr_images=True,
        show_progress=True,
        save_files=True
    )
    
    # Generate summary report
    successful = sum(1 for r in results.values() if r['status'] == 'success')
    failed = len(results) - successful
    
    print("📊 Processing Complete:")
    print(f"   ✅ Successful: {successful}")
    print(f"   ❌ Failed: {failed}")
    
    return results

# Usage
results = process_document_pipeline('./input_docs', './processed_docs')
```

## 🔧 Configuration Reference

### UnifiedDocumentLoader Parameters

```python
loader = UnifiedDocumentLoader(
    # OCR Provider
    ocr_provider='openai',  # 'openai' or 'tesseract'
    api_key=None,  # Auto-detects from OPENAI_API_KEY env var
    
    # OpenAI Model Configuration
    model='gpt-4o',  # OpenAI model to use
    temperature=0.0,  # Response randomness (0.0-2.0)
    max_tokens=4096,  # Maximum response length
    max_workers=5,  # Concurrent processing workers
    timeout=30,  # Request timeout in seconds
    max_retries=3,  # Retry attempts for failed requests
    
    # Advanced OpenAI Parameters
    top_p=1.0,  # Nucleus sampling parameter
    frequency_penalty=0.0,  # Reduce repetition (-2.0 to 2.0)
    presence_penalty=0.0,  # Encourage new topics (-2.0 to 2.0)
    
    # Prompt Configuration
    prompt_template=PromptTemplate.DEFAULT,  # Specialized prompt
    default_prompt=None,  # Custom prompt override
    
    # System Configuration
    cache_dir=None,  # Enable caching
    ocr_config=None  # Additional OCR configuration
)
```

### Processing Parameters

```python
result = loader.load(
    file_path='document.pdf',
    output_format=OutputFormat.MARKDOWN,  # Output format
    extract_images=False,  # Extract images from document
    ocr_images=False,  # Perform OCR on extracted images
    show_progress=False,  # Show processing progress
    encoding='utf-8',  # Text file encoding
    delimiter=None  # CSV delimiter (auto-detect if None)
)
```

## 📝 Requirements

- **Python**: 3.8+
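- **Core** (installed automatically): `beautifulsoup4`, `lxml`, `markdown`, `chardet`, `Pillow`, `pandas`
- **OCR (OpenAI)**: `openai`, `langchain`, `langchain-openai`
- **OCR (Tesseract)**: `pytesseract`, `Pillow`, plus the Tesseract binary (system dependency)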
- **Office Formats**: `python-docx`, `openpyxl`, `python-pptx`
- **PDF**: `PyMuPDF`
- **Legacy Formats**: LibreOffice (system dependency)

## 🚀 Performance Tips

1. **Use appropriate prompt templates** for your content type
2. **Enable caching** for repeated processing of the same files
3. **Adjust max_workers** based on your system and API limits
4. **Use batch processing** for multiple files to leverage parallel processing
5. **Set appropriate timeouts** for large documents

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙋‍♂️ Support

- **Issues**: [GitHub Issues](https://github.com/luisleo526/doc2mark/issues)
- **Email**: luisleo52655@gmail.com
- **Documentation**: See inline docstrings and examples above

## 🔄 Recent Updates

- ✅ Enhanced OCR configuration with 8 specialized prompt templates
- ✅ Advanced batch processing with progress tracking and error handling
- ✅ Dynamic configuration updates without reinitialization
- ✅ Comprehensive validation and setup checking
- ✅ Support for both OpenAI GPT-4o and Tesseract OCR
- ✅ Improved caching and performance optimizations
- ✅ Better error handling and logging

## ⚠️ Current Limitations

- Legacy formats (DOC, XLS, PPT) require LibreOffice installation
- Large files may require adjusted timeout settings
- OpenAI OCR requires API key and internet connection
- Batch processing performance depends on OCR provider rate limits
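
A quick check for the LibreOffice requirement can be run before processing legacy files. This is a minimal sketch using only the standard library; `find_libreoffice` is a hypothetical helper, and the executable names `soffice` and `libreoffice` are the common ones but are an assumption about the local install. How doc2mark itself locates LibreOffice is not covered here.

```python
import shutil
from typing import Optional


def find_libreoffice() -> Optional[str]:
    """Return the path of a LibreOffice executable found on PATH, or None."""
    for name in ("soffice", "libreoffice"):  # assumed executable names
        path = shutil.which(name)
        if path:
            return path
    return None


if find_libreoffice() is None:
    print("LibreOffice not found on PATH; .doc/.xls/.ppt files may fail to convert")
```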
