Metadata-Version: 2.3
Name: vhtml
Version: 0.2.4
Summary: vhtml (Visual HyperText Markup Language) - Optical character recognition and HTML layout analysis library
Author: Tom Sapletta
Requires-Python: >=3.11,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Provides-Extra: dev
Requires-Dist: Jinja2 (>=3.1.2,<4.0.0)
Requires-Dist: Pillow (>=10.0.1,<11.0.0)
Requires-Dist: PyMuPDF (>=1.23.5,<2.0.0)
Requires-Dist: beautifulsoup4 (>=4.13.4,<5.0.0)
Requires-Dist: easyocr (>=1.7.0,<2.0.0)
Requires-Dist: langdetect (>=1.0.9,<2.0.0)
Requires-Dist: matplotlib (>=3.7.2,<4.0.0)
Requires-Dist: numpy (>=1.24.3,<2.0.0)
Requires-Dist: opencv-contrib-python (>=4.8.1,<5.0.0)
Requires-Dist: opencv-python (>=4.8.1,<5.0.0)
Requires-Dist: pandas (>=2.0.3,<3.0.0)
Requires-Dist: pdf2image (>=1.16.3,<2.0.0)
Requires-Dist: polyglot (>=16.7.4,<17.0.0)
Requires-Dist: pytesseract (>=0.3.10,<0.4.0)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Requires-Dist: scikit-image (>=0.21.0,<0.22.0)
Requires-Dist: seaborn (>=0.12.2,<0.13.0)
Requires-Dist: spacy (>=3.6.1,<4.0.0)
Requires-Dist: textblob (>=0.17.1,<0.18.0)
Requires-Dist: tqdm (>=4.66.1,<5.0.0)
Description-Content-Type: text/markdown

# vHTML - Visual HTML Generator

A modular system for converting PDF documents to HTML with OCR and layout analysis.

## Features

- PDF to image conversion with preprocessing (denoise, deskew)
- Document layout analysis and segmentation
- OCR with multi-language support (Polish, English, German)
- Language detection and confidence scoring
- HTML generation with embedded images and metadata
- Batch processing capabilities
- Command-line interface

## Installation

### Prerequisites

- Python 3.11 or newer (the package metadata requires `>=3.11,<4.0`)
- Tesseract OCR
- Poppler utilities

### Using Poetry (Recommended)

```bash
# Clone the repository
git clone https://github.com/fin-officer/vhtml.git
cd vhtml

# Install with Poetry
make install
```

### Manual Installation

```bash
# Install system dependencies
sudo apt-get update
sudo apt-get install -y tesseract-ocr tesseract-ocr-pol tesseract-ocr-eng tesseract-ocr-deu poppler-utils

# Install Python dependencies
pip install poetry
poetry install
```
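The three language packs installed above correspond to the Tesseract language codes `pol`, `eng`, and `deu`. As a rough illustration of how those codes are combined for multi-language OCR, the sketch below calls `pytesseract` directly. It is not necessarily how vHTML's `OCREngine` works internally; the helper names `tesseract_lang_arg` and `ocr_image` and the two-letter code mapping are assumptions made for this example.

```python
# Hypothetical helpers for illustration; not part of the vHTML API.

# Two-letter language codes mapped to Tesseract language pack names.
TESSERACT_LANGS = {"pl": "pol", "en": "eng", "de": "deu"}


def tesseract_lang_arg(codes):
    """Join language codes into the 'pol+eng+deu' form Tesseract expects."""
    return "+".join(TESSERACT_LANGS[code] for code in codes)


def ocr_image(image_path, codes=("pl", "en", "de")):
    """Run Tesseract on one image using all requested language packs."""
    # Imported lazily so tesseract_lang_arg works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=tesseract_lang_arg(codes))
```

Listing several languages lets Tesseract consider all of them for each page, at the cost of slower recognition, so it may be worth passing only the languages a document is expected to contain.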

## Validate Installation

To verify that all dependencies are correctly installed:

```bash
make validate
```

Or run the validation script directly:

```bash
python scripts/validate_installation.py
```
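The contents of `scripts/validate_installation.py` are not reproduced here. As a minimal sketch of the kind of check such a script might perform, the following looks for the two system prerequisites on `PATH`. The names `REQUIRED_TOOLS`, `find_missing_tools`, and `main` are assumptions for this example; `pdftoppm` is used as the probe because it ships with Poppler utilities.

```python
import shutil

# Executable name mapped to the prerequisite it belongs to (illustrative list).
REQUIRED_TOOLS = {
    "tesseract": "Tesseract OCR",
    "pdftoppm": "Poppler utilities",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the labels of prerequisites whose executable is not on PATH."""
    return [label for binary, label in tools.items() if shutil.which(binary) is None]


def main():
    """Print a short report and return a process exit code (0 means all found)."""
    missing = find_missing_tools()
    for label in missing:
        print(f"Missing prerequisite: {label}")
    if not missing:
        print("All system prerequisites found.")
    return 1 if missing else 0
```

To use it as a script, end the file with `raise SystemExit(main())`. A check like this only confirms that the executables exist; it does not verify that the Polish, English, and German language packs are installed.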

## Usage

### Command Line Interface

```bash
# Process a single PDF file
poetry run python -m vhtml.main /path/to/document.pdf -o output_directory

# Process a directory of PDF files
poetry run python -m vhtml.main /path/to/pdf_directory -b -o output_directory

# Process and open in browser
poetry run python -m vhtml.main /path/to/document.pdf -v
```

### Integration Test

```bash
# Run the integration test with your PDF file
poetry run python scripts/test_integration.py /path/to/document.pdf -v
```

### Python API

```python
from vhtml.main import DocumentAnalyzer

# Initialize the analyzer
analyzer = DocumentAnalyzer()

# Process a document
html_path = analyzer.analyze_document("document.pdf", "output_dir")

# Print the path to the generated HTML
print(f"Generated HTML: {html_path}")
```
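The same call can be repeated over a folder of PDFs. The sketch below assumes only the `analyze_document(pdf_path, output_dir)` call shown above; the helper name `convert_directory` is hypothetical, and whether `analyze_document` raises an exception on failure is an assumption, so errors are caught broadly and recorded as `None`.

```python
from pathlib import Path


def convert_directory(analyzer, pdf_dir, output_dir):
    """Convert every *.pdf in pdf_dir and map each file name to its HTML path.

    `analyzer` is any object with an analyze_document(pdf_path, output_dir)
    method, such as the DocumentAnalyzer instance created above.
    """
    results = {}
    # Note: the "*.pdf" pattern is case-sensitive on most Linux file systems.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf_path.name] = analyzer.analyze_document(
                str(pdf_path), str(output_dir)
            )
        except Exception as exc:  # assumed failure mode; keep processing other files
            print(f"Failed to convert {pdf_path.name}: {exc}")
            results[pdf_path.name] = None
    return results


# Example usage (requires vhtml to be installed):
#   from vhtml.main import DocumentAnalyzer
#   results = convert_directory(DocumentAnalyzer(), "pdf_directory", "output_dir")
```

Passing the analyzer in as an argument keeps the loop independent of how the analyzer is configured, and one unreadable PDF does not stop the rest of the batch.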

## Examples

### Generate Standalone HTML

Generate a standalone HTML file with all images, JS, and JSON embedded:

```bash
poetry run python examples/pdf2html.py
```

- Input: Folder with HTML, images, JS, and JSON (e.g., `output/mhtml_example/Invoice-30392B3C-0001`)
- Output: Standalone HTML (e.g., `output/html_example/Invoice-30392B3C-0001_standalone.html`)

### Generate MHTML (Web Archive)

Generate a fully self-contained MHTML file for browser archiving:

```bash
poetry run python examples/pdf2mhtml.py
```

- Input: PDF(s) in `invoices/` (or other test files)
- Output: MHTML file (e.g., `output/mhtml_example/Invoice-30392B3C-0001.mhtml`)

---

- See `examples/pdf2html.py` and `examples/pdf2mhtml.py` for usage patterns and batch processing.
- Both scripts demonstrate how to use the vHTML API for document conversion and archiving.

## Core Components

- **PDFProcessor**: Handles PDF to image conversion and preprocessing
- **LayoutAnalyzer**: Analyzes document layout and segments content blocks
- **OCREngine**: Performs OCR with language detection and confidence scoring
- **HTMLGenerator**: Generates HTML with embedded images and styling
- **DocumentAnalyzer**: Integrates all components into a complete workflow
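
The way these components hand data to one another can be sketched as a simple four-stage pipeline. This is only an illustration of the data flow described above, not the actual class interfaces: the `Pipeline` class, its field names, and the stage signatures are all assumptions, with each stage standing in for the role of one component.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Pipeline:
    """Hypothetical stand-in for the DocumentAnalyzer workflow."""

    to_images: Callable[[str], list[Any]]          # PDFProcessor role: PDF -> page images
    find_blocks: Callable[[Any], list[Any]]        # LayoutAnalyzer role: image -> blocks
    read_text: Callable[[Any], str]                # OCREngine role: block -> text
    render_html: Callable[[list[list[str]]], str]  # HTMLGenerator role: pages -> HTML

    def run(self, pdf_path: str) -> str:
        """Push one document through all four stages, page by page."""
        pages = []
        for image in self.to_images(pdf_path):
            blocks = self.find_blocks(image)
            pages.append([self.read_text(block) for block in blocks])
        return self.render_html(pages)
```

Keeping each stage behind a plain callable makes it possible to swap in a stub during testing, for example a fake OCR stage that returns fixed text.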

## Project Structure

```
vhtml/
├── vhtml/
│   ├── core/
│   │   ├── pdf_processor.py
│   │   ├── layout_analyzer.py
│   │   ├── ocr_engine.py
│   │   └── html_generator.py
│   └── main.py
├── scripts/
│   ├── validate_installation.py
│   └── test_integration.py
├── docs/
│   ├── ARCHITECTURE.md
│   ├── IMPLEMENTATION.md
│   └── PROJECT_STRUCTURE.md
├── Makefile
├── pyproject.toml
└── README.md
```

## Development

```bash
# Setup development environment
make setup

# Run tests
make test

# Format code
make format

# Lint code
make lint

# Build package
make build
```

## Documentation

For more detailed information, see the documentation files:

- [Architecture](docs/ARCHITECTURE.md)
- [Implementation](docs/IMPLEMENTATION.md)
- [Project Structure](docs/PROJECT_STRUCTURE.md)

## License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

