Metadata-Version: 2.4
Name: docinfer
Version: 0.1.1
Summary: Extract and infer metadata from PDF documents using AI-powered analysis
Author-email: Tino Kanngiesser <tinokanngiesser@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/tidyeval/docinfer
Project-URL: Repository, https://github.com/tidyeval/docinfer
Project-URL: BugTracker, https://github.com/tidyeval/docinfer/issues
Keywords: pdf,metadata,extraction,ai,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: langchain-ollama>=0.2.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.0.0
Provides-Extra: dev
Requires-Dist: pytest<9.0.0,>=8.3.3; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.14.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# docinfer

A Python package for extracting and inferring metadata from PDF documents using AI-powered analysis.

## Features

- Extract metadata from PDF files
- AI-powered document analysis using LLMs
- CLI tool for easy batch processing
- Flexible configuration and output formatting
- Structured metadata models using Pydantic

## Requirements

- Python 3.12 or higher
- **Ollama** - Required for AI-powered analysis
  - [Install Ollama](https://ollama.ai)
  - Pull a model: `ollama pull gemma3:4b`
- See `pyproject.toml` for the full list of Python dependencies

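Before running an analysis, it can help to confirm that the Ollama server is up and the model has been pulled. The following is a minimal sketch, not part of docinfer: it assumes Ollama is listening on its default local address (`http://localhost:11434`) and uses Ollama's `/api/tags` endpoint, which lists locally available models. The helper names `fetch_tags` and `model_is_listed` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_is_listed(tags_payload: dict, model: str) -> bool:
    """Return True if `model` exactly matches a name in a parsed /api/tags response."""
    names = {entry.get("name") for entry in tags_payload.get("models", [])}
    return model in names


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally pulled models from the Ollama server."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `model_is_listed(fetch_tags(), "gemma3:4b")` should return `True` once `ollama pull gemma3:4b` has completed. The match is exact, so the name must include the tag. If the server is not reachable, `fetch_tags` raises an `OSError` subclass.
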
## Installation

### From GitHub Repository

```bash
pip install git+https://github.com/tidyeval/docinfer.git
```

### From Local Development

Clone the repository and install in editable mode:

```bash
git clone https://github.com/tidyeval/docinfer.git
cd docinfer
pip install -e .
```

## Quick Start

### Using uvx (Recommended)

Run directly without installation using [uvx](https://docs.astral.sh/uv/guides/tools/):

```bash
uvx --from git+https://github.com/tidyeval/docinfer.git docinfer <path-to-pdf>
```

> **Note:** Once published to PyPI, you'll be able to run `uvx docinfer <path-to-pdf>` directly.

### CLI Usage

If the package is installed locally, run the `docinfer` command directly:

```bash
docinfer <path-to-pdf>
```

#### Options

- `--model MODEL` - Specify the Ollama model (default: `gemma3:4b`)
  - Example: `docinfer document.pdf --model gemma2`
- `--json` - Output as JSON instead of formatted text
- `--no-ai` - Skip AI analysis and show embedded metadata only
- `--export FILE` - Export results to JSON file
- `--quiet` - Suppress progress output

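To drive the CLI from a script, the options above can be assembled into an argument list and passed to `subprocess`. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the flags can be combined freely and that results are written to standard output.

```python
import subprocess
from typing import Optional


def build_docinfer_command(
    pdf_path: str,
    model: Optional[str] = None,
    as_json: bool = False,
    no_ai: bool = False,
    export: Optional[str] = None,
    quiet: bool = False,
) -> list[str]:
    """Assemble a docinfer argument list from the documented options."""
    command = ["docinfer", pdf_path]
    if model is not None:
        command += ["--model", model]
    if as_json:
        command.append("--json")
    if no_ai:
        command.append("--no-ai")
    if export is not None:
        command += ["--export", export]
    if quiet:
        command.append("--quiet")
    return command


def run_docinfer(command: list[str]) -> str:
    """Run the command and return its standard output; raises on a non-zero exit."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `run_docinfer(build_docinfer_command("document.pdf", as_json=True, quiet=True))` runs `docinfer document.pdf --json --quiet`. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.
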
### Python API

```python
from docinfer.services.pdf_extractor import PDFExtractor
from docinfer.services.ai_analyzer import AIAnalyzer

# Extract PDF content
extractor = PDFExtractor()
content = extractor.extract("document.pdf")

# Analyze with AI
analyzer = AIAnalyzer()
metadata = analyzer.analyze(content)
```
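
The same two steps can be applied to a whole folder. The sketch below is one possible approach, not part of the package: `analyze_directory` is a hypothetical helper that takes the extract and analyze steps as callables, so it makes no assumptions about what `PDFExtractor.extract` or `AIAnalyzer.analyze` return.

```python
from pathlib import Path
from typing import Any, Callable


def analyze_directory(
    directory: str,
    extract: Callable[[str], Any],
    analyze: Callable[[Any], Any],
) -> dict[str, Any]:
    """Run extract-then-analyze on every *.pdf directly inside `directory`."""
    results: dict[str, Any] = {}
    # Sort for a stable processing order; subdirectories are not searched.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        content = extract(str(pdf_path))
        results[pdf_path.name] = analyze(content)
    return results
```

With the classes shown above, a call might look like `analyze_directory("papers", PDFExtractor().extract, AIAnalyzer().analyze)`, which returns a dictionary keyed by file name.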

## Project Structure

```
docinfer/
├── src/
│   ├── cli.py              # Command-line interface
│   ├── models/             # Pydantic data models
│   ├── services/           # Core services (PDF extraction, AI analysis)
│   └── prompts/            # AI prompt templates
├── tests/                  # Unit and integration tests
├── specs/                  # Project specifications
├── pyproject.toml          # Project configuration
└── README.md               # This file
```

## Development

### Setting up Development Environment

1. Clone the repository:
   ```bash
   git clone https://github.com/tidyeval/docinfer.git
   cd docinfer
   ```

2. Create and activate a virtual environment:
   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```

3. Install in development mode:
   ```bash
   pip install -e ".[dev]"
   ```

### Running Tests

```bash
pytest
```

### Code Quality

The project uses:
- **black** for code formatting
- **ruff** for linting
- **pytest** for testing

## Contributing

Contributions are welcome! Please ensure:
- Code passes linting and formatting checks
- Tests pass with good coverage
- Commit messages are descriptive

## License

Released under the MIT License. See the `LICENSE` file for details.

## Author

Tino Kanngiesser (tinokanngiesser@gmail.com)
