Metadata-Version: 2.3
Name: invocr
Version: 1.0.1
Summary: Invoice OCR System - Convert invoices between PDF, JSON, XML, HTML formats using OCR
License: MIT
Author: InvOCR Team
Author-email: team@invocr.com
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: aiofiles (>=23.2.1,<24.0.0)
Requires-Dist: click (>=8.1.7,<9.0.0)
Requires-Dist: easyocr (>=1.7.0,<2.0.0)
Requires-Dist: fastapi (>=0.104.1,<0.105.0)
Requires-Dist: jinja2 (>=3.1.2,<4.0.0)
Requires-Dist: loguru (>=0.7.2,<0.8.0)
Requires-Dist: lxml (>=4.9.3,<5.0.0)
Requires-Dist: numpy (>=1.24.3,<2.0.0)
Requires-Dist: opencv-python (>=4.8.1.78,<5.0.0.0)
Requires-Dist: pdf2image (>=1.16.3,<2.0.0)
Requires-Dist: pdfplumber (>=0.9.0,<0.10.0)
Requires-Dist: pillow (>=10.1.0,<11.0.0)
Requires-Dist: pydantic (>=2.5.0,<3.0.0)
Requires-Dist: pydantic-settings (>=2.1.0,<3.0.0)
Requires-Dist: pytesseract (>=0.3.10,<0.4.0)
Requires-Dist: python-multipart (>=0.0.6,<0.0.7)
Requires-Dist: uvicorn[standard] (>=0.24.0,<0.25.0)
Requires-Dist: weasyprint (>=60.2,<61.0)
Project-URL: Documentation, https://invocr.readthedocs.io
Project-URL: Homepage, https://github.com/invocr/invocr
Project-URL: Repository, https://github.com/invocr/invocr
Description-Content-Type: text/markdown

# InvOCR - Invoice OCR & Conversion System

> 🔍 Universal document processing system with OCR capabilities for invoices, receipts, and financial documents

[![Python 3.9+](https://img.shields.io/badge/Python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.104%2B-green.svg)](https://fastapi.tiangolo.com/)
[![Docker](https://img.shields.io/badge/Docker-Ready-blue.svg)](https://www.docker.com/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## 🚀 Features

### 📄 Document Processing
- **PDF → Images** (PNG/JPG) with configurable DPI
- **Image → JSON** using advanced OCR (Tesseract + EasyOCR)
- **PDF → JSON** (direct text + OCR fallback)
- **JSON → XML** (EU Invoice standard format)
- **JSON → HTML** (responsive templates)
- **HTML → PDF** (professional output)

### 🌍 Multi-language Support
- English, Polish, German, French, Spanish, Italian
- Auto-detection of document language
- Custom language combinations

### 📋 Document Types
- ✅ **Invoices** (commercial invoices)
- ✅ **Receipts** (retail receipts)
- ✅ **Payment confirmations**
- ✅ **Financial documents**
- ✅ **Custom business documents**

### 🔧 Interfaces
- **CLI** - Command line interface
- **REST API** - Web API with OpenAPI docs
- **Docker** - Containerized deployment
- **Batch processing** - Multiple files

## 🏗️ Project Structure

```
invocr/
├── 📁 invocr/                 # Main package
│   ├── 📁 core/               # Core processing modules
│   │   ├── ocr.py            # OCR engine (Tesseract + EasyOCR)
│   │   ├── converter.py      # Universal format converter
│   │   ├── extractor.py      # Data extraction logic
│   │   └── validator.py      # Data validation
│   │
│   ├── 📁 formats/            # Format-specific handlers
│   │   ├── pdf.py           # PDF operations
│   │   ├── image.py         # Image processing
│   │   ├── json_handler.py  # JSON operations
│   │   ├── xml_handler.py   # EU XML format
│   │   └── html_handler.py  # HTML generation
│   │
│   ├── 📁 api/               # REST API
│   │   ├── main.py          # FastAPI application
│   │   ├── routes.py        # API endpoints
│   │   └── models.py        # Pydantic models
│   │
│   ├── 📁 cli/               # Command line interface
│   │   └── commands.py      # CLI commands
│   │
│   └── 📁 utils/             # Utilities
│       ├── config.py        # Configuration
│       ├── logger.py        # Logging setup
│       └── helpers.py       # Helper functions
│
├── 📁 tests/                 # Test suite
├── 📁 scripts/               # Installation scripts
├── 📁 docs/                  # Documentation
├── 🐳 Dockerfile             # Docker configuration
├── 🐳 docker-compose.yml     # Docker Compose
├── 📋 pyproject.toml         # Poetry configuration
└── 📖 README.md              # This file
```

## ⚡ Quick Start

### Option 1: Auto Installation (Recommended)

```bash
# Clone repository
git clone https://github.com/invocr/invocr.git
cd invocr

# Run installation script
chmod +x scripts/install.sh
./scripts/install.sh
```

### Option 2: Manual Installation

```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt update
sudo apt install -y tesseract-ocr tesseract-ocr-pol poppler-utils \
    libpango-1.0-0 libharfbuzz0b python3-dev build-essential

# Install Poetry
curl -sSL https://install.python-poetry.org | python3 -

# Install Python dependencies
poetry install

# Setup environment
cp .env.example .env
```

### Option 3: Docker

```bash
# Using Docker Compose (easiest)
docker-compose up

# Or build manually
docker build -t invocr .
docker run -p 8000:8000 invocr
```

## 📚 Usage Examples

### CLI Commands

```bash
# Convert PDF to JSON
invocr convert invoice.pdf invoice.json

# Convert with specific languages
invocr convert -l en,pl,de document.pdf output.json

# PDF to images
invocr pdf2img document.pdf ./images/ --format png --dpi 300

# Image to JSON (OCR)
invocr img2json scan.png data.json --doc-type invoice

# JSON to EU XML format
invocr json2xml data.json invoice.xml

# Batch processing
invocr batch ./input_files/ ./output/ --format json --parallel 4

# Full pipeline: PDF → IMG → JSON → XML → HTML → PDF
invocr pipeline document.pdf ./results/

# Start API server
invocr serve --host 0.0.0.0 --port 8000
```

### REST API

```bash
# Start server
invocr serve

# Convert file
curl -X POST "http://localhost:8000/convert" \
  -F "file=@invoice.pdf" \
  -F "target_format=json" \
  -F "languages=en,pl"

# Check job status
curl "http://localhost:8000/status/{job_id}"

# Download result
curl "http://localhost:8000/download/{job_id}" -o result.json
```
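
The same submit, poll, download flow can be scripted. The sketch below covers only the polling step and makes assumptions the README does not confirm: that `GET /status/{job_id}` returns a JSON object with a `status` field, and that `"completed"` and `"failed"` are its terminal values. `wait_for_job` and `http_status` are hypothetical helper names, not part of InvOCR.

```python
import json
import time
from typing import Callable, Dict
from urllib.request import urlopen


def http_status(job_id: str, base_url: str = "http://localhost:8000") -> Dict:
    """Fetch the job status document from the /status/{job_id} endpoint."""
    with urlopen(f"{base_url}/status/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    get_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll get_status until the job reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        # ASSUMPTION: the field name and state names are illustrative only.
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

Against a running server this would be called as `wait_for_job(http_status, job_id)`. Passing the fetch function in as an argument keeps the polling loop testable without a network connection.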

### Python API

```python
from invocr import create_converter

# Create converter instance
converter = create_converter(languages=['en', 'pl', 'de'])

# Convert PDF to JSON
result = converter.pdf_to_json('invoice.pdf')
print(result)

# Convert image to JSON with OCR
data = converter.image_to_json('scan.png', document_type='invoice')

# Convert JSON to EU XML
xml_content = converter.json_to_xml(data, format='eu_invoice')

# Full conversion pipeline
result = converter.convert('input.pdf', 'output.json', 'auto', 'json')
```

## 🌐 API Documentation

When running the API server, visit:
- **Interactive docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **OpenAPI JSON**: http://localhost:8000/openapi.json

### Key Endpoints

- `POST /convert` - Convert single file
- `POST /convert/pdf2img` - PDF to images
- `POST /convert/img2json` - Image OCR to JSON
- `POST /batch/convert` - Batch processing
- `GET /status/{job_id}` - Job status
- `GET /download/{job_id}` - Download result
- `GET /health` - Health check
- `GET /info` - System information

## 🔧 Configuration

### Environment Variables

Key configuration options in `.env`:

```bash
# OCR Settings
DEFAULT_OCR_ENGINE=auto          # tesseract, easyocr, auto
DEFAULT_LANGUAGES=en,pl,de,fr,es # Supported languages
OCR_CONFIDENCE_THRESHOLD=0.3     # Minimum confidence

# Processing
MAX_FILE_SIZE=52428800          # 50MB limit
PARALLEL_WORKERS=4              # Concurrent processing
MAX_PAGES_PER_PDF=10           # Page limit

# Storage
UPLOAD_DIR=./uploads
OUTPUT_DIR=./output
TEMP_DIR=./temp
```
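
To illustrate how these values might be read and validated, here is a minimal standard-library sketch. InvOCR itself depends on `pydantic-settings`, so its real loader likely looks different. The `Settings` class and `load_settings` function are hypothetical names, the defaults are copied from the sample above, and the 0 to 1 range for the confidence threshold is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    default_ocr_engine: str
    default_languages: List[str]
    ocr_confidence_threshold: float
    max_file_size: int
    parallel_workers: int
    max_pages_per_pdf: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build a Settings object from environment-style key/value pairs."""
    engine = env.get("DEFAULT_OCR_ENGINE", "auto")
    if engine not in ("tesseract", "easyocr", "auto"):
        raise ValueError(f"unsupported OCR engine: {engine!r}")

    # Comma-separated list; tolerate spaces and drop empty entries.
    raw_langs = env.get("DEFAULT_LANGUAGES", "en,pl,de,fr,es")
    languages = [code.strip() for code in raw_langs.split(",") if code.strip()]

    # ASSUMPTION: the threshold is a fraction between 0 and 1.
    threshold = float(env.get("OCR_CONFIDENCE_THRESHOLD", "0.3"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("OCR_CONFIDENCE_THRESHOLD must be between 0 and 1")

    return Settings(
        default_ocr_engine=engine,
        default_languages=languages,
        ocr_confidence_threshold=threshold,
        max_file_size=int(env.get("MAX_FILE_SIZE", str(50 * 1024 * 1024))),
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),
        max_pages_per_pdf=int(env.get("MAX_PAGES_PER_PDF", "10")),
    )
```

Note that `MAX_FILE_SIZE=52428800` is exactly 50 × 1024 × 1024 bytes, which is where the 50MB figure in the sample comes from.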

### Supported Languages

| Code | Language | Tesseract | EasyOCR |
|------|----------|-----------|---------|
| `en` | English | ✅ | ✅ |
| `pl` | Polish | ✅ | ✅ |
| `de` | German | ✅ | ✅ |
| `fr` | French | ✅ | ✅ |
| `es` | Spanish | ✅ | ✅ |
| `it` | Italian | ✅ | ✅ |
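
The two engines name languages differently: Tesseract uses three-letter codes joined with `+` (for example `eng+pol`), while EasyOCR accepts a list of two-letter codes. The sketch below shows one way the codes in the table could be translated for Tesseract. `tesseract_lang` is a hypothetical helper, and this is not necessarily how InvOCR does the mapping internally.

```python
from typing import Iterable

# Two-letter codes from the table above mapped to Tesseract language names.
TESSERACT_CODES = {
    "en": "eng",
    "pl": "pol",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
}


def tesseract_lang(codes: Iterable[str]) -> str:
    """Translate codes such as ['en', 'pl'] into a Tesseract lang string."""
    codes = list(codes)
    unknown = [code for code in codes if code not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    return "+".join(TESSERACT_CODES[code] for code in codes)


print(tesseract_lang(["en", "pl", "de"]))  # eng+pol+deu
```

The resulting string is the form expected by the `lang` argument of `pytesseract.image_to_string`. Each language also needs its Tesseract data package installed on the system.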

## 📊 Supported Formats

### Input Formats
- **PDF** (.pdf)
- **Images** (.png, .jpg, .jpeg, .tiff, .bmp)
- **JSON** (.json)
- **XML** (.xml)
- **HTML** (.html)

### Output Formats
- **JSON** - Structured data
- **XML** - EU Invoice standard
- **HTML** - Responsive templates
- **PDF** - Professional documents

## 🧪 Testing

```bash
# Run all tests
poetry run pytest

# Run with coverage
poetry run pytest --cov=invocr

# Run specific test file
poetry run pytest tests/test_ocr.py

# Run API tests
poetry run pytest tests/test_api.py
```

## 🚀 Deployment

### Production with Docker

```yaml
# docker-compose.prod.yml
version: '3.8'
services:
  invocr:
    image: invocr:latest
    ports:
      - "80:8000"
    environment:
      - ENVIRONMENT=production
      - WORKERS=4
    volumes:
      - ./data:/app/data
```

### Kubernetes

```yaml
# k8s-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: invocr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: invocr
  template:
    metadata:
      labels:
        app: invocr
    spec:
      containers:
      - name: invocr
        image: invocr:latest
        ports:
        - containerPort: 8000
```

## 🤝 Contributing

1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Make changes
4. Add tests
5. Run tests (`poetry run pytest`)
6. Commit changes (`git commit -m 'Add amazing feature'`)
7. Push to branch (`git push origin feature/amazing-feature`)
8. Open Pull Request

### Development Setup

```bash
# Install development dependencies
poetry install --with dev

# Install pre-commit hooks
poetry run pre-commit install

# Run linting
poetry run black invocr/
poetry run isort invocr/
poetry run flake8 invocr/

# Run type checking
poetry run mypy invocr/
```

## 📈 Performance

### Benchmarks

| Operation | Time | Memory |
|-----------|------|--------|
| PDF → JSON (1 page) | ~2-3s | ~50MB |
| Image OCR → JSON | ~1-2s | ~30MB |
| JSON → XML | ~0.1s | ~10MB |
| JSON → HTML | ~0.2s | ~15MB |
| HTML → PDF | ~1-2s | ~40MB |

### Optimization Tips

- Pass `--parallel N` to `invocr batch` to process files concurrently
- Set `IMAGE_ENHANCEMENT=false` for faster OCR
- Set `DEFAULT_OCR_ENGINE=tesseract` for faster processing
- Lower `MAX_PAGES_PER_PDF` to cap the work done on large documents
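
As a rough planning aid, the benchmark figures can be turned into a batch estimate. The sketch below assumes ideal scaling, where every worker is kept busy and there is no startup or I/O overhead, so it gives a lower bound rather than a prediction. `estimate_batch_seconds` is a hypothetical helper.

```python
import math


def estimate_batch_seconds(pages: int, seconds_per_page: float, workers: int) -> float:
    """Lower-bound wall time: pages are split evenly across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Each worker handles at most ceil(pages / workers) pages in sequence.
    return math.ceil(pages / workers) * seconds_per_page


# 100 single-page PDFs at the upper ~3s figure with --parallel 4
print(estimate_batch_seconds(100, 3.0, 4))  # 75.0
```

Real throughput depends on CPU cores, the OCR engine chosen, and image size, so measure on representative files before sizing a deployment.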

## 🔒 Security

- File upload validation
- Size limits enforced
- Input sanitization
- No execution of uploaded content
- Rate limiting available
- CORS configuration
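
To make the first two points concrete, here is a minimal sketch of upload validation. It is not InvOCR's actual implementation: the extension allowlist is taken from the input formats listed in this README, the size limit from the sample `MAX_FILE_SIZE`, and `validate_upload` is a hypothetical name.

```python
from pathlib import Path

# Extensions from the "Input Formats" list in this README.
ALLOWED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".json", ".xml", ".html",
}
MAX_FILE_SIZE = 52_428_800  # 50MB, matching the sample .env


def validate_upload(filename: str, size_bytes: int) -> str:
    """Return a safe file name, or raise ValueError if the upload is rejected."""
    # Keep only the last path component so "../" segments cannot escape
    # the upload directory; backslashes are normalised first.
    safe_name = Path(filename.replace("\\", "/")).name
    if not safe_name:
        raise ValueError("empty file name")
    if Path(safe_name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {safe_name!r}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError(f"file exceeds {MAX_FILE_SIZE} bytes")
    return safe_name
```

An extension check alone does not prove a file's type. A stricter validator would also inspect the file's leading bytes.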

## 📋 Requirements

### System Requirements
- **Python**: 3.9+
- **Memory**: 1GB+ RAM
- **Storage**: 500MB+ free space
- **OS**: Linux, macOS, Windows (Docker)

### Dependencies
- **Tesseract OCR**: Text recognition
- **EasyOCR**: Neural OCR engine
- **WeasyPrint**: HTML to PDF conversion
- **FastAPI**: Web framework
- **Pydantic**: Data validation

## 🐛 Troubleshooting

### Common Issues

**OCR not working:**
```bash
# Check Tesseract installation
tesseract --version

# Install missing language packs (one package per language, e.g. Polish and German)
sudo apt install tesseract-ocr-pol tesseract-ocr-deu
```

**WeasyPrint errors:**
```bash
# Install system dependencies
sudo apt install libpango-1.0-0 libharfbuzz0b
```

**Import errors:**
```bash
# Recreate the virtual environment and reinstall dependencies
poetry env remove --all
poetry install
```

**Permission errors:**
```bash
# Fix file permissions
chmod -R 755 uploads/ output/
```

## 📞 Support

- 📧 **Email**: support@invocr.com
- 🐛 **Issues**: [GitHub Issues](https://github.com/invocr/invocr/issues)
- 💬 **Discussions**: [GitHub Discussions](https://github.com/invocr/invocr/discussions)
- 📚 **Wiki**: [Project Wiki](https://github.com/invocr/invocr/wiki)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) - OCR engine
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) - Neural OCR
- [FastAPI](https://fastapi.tiangolo.com/) - Web framework
- [WeasyPrint](https://weasyprint.org/) - HTML/CSS to PDF
- [Poetry](https://python-poetry.org/) - Dependency management

---

**Made with ❤️ for the open source community**

⭐ **Star this repository if you find it useful!**
