# Batchata

> Python SDK for AI batch processing with structured output and citation mapping. 50% cost savings via Anthropic's batch API with automatic cost tracking, structured output using Pydantic models, and field-level citations.

Batchata is a Python library that provides a simple interface for batch processing with AI models (currently supports Anthropic Claude, OpenAI support coming soon). **The preferred way to use Batchata is through `BatchManager`** as it abstracts most of the work of the lower-level `batch()` function and provides advanced features like parallel processing, state persistence, and cost management.

## Recommended Usage Pattern

**Use `BatchManager` for production workloads** - it handles job splitting, parallel execution, state persistence, cost limits, and retry logic automatically. Only use the lower-level `batch()` function for simple one-off tasks or when you need direct control over the batch processing.

## Supported Models

**Claude 4 Models (Latest & Best Performance):**
- `claude-opus-4-20250514` ⭐ **Best overall performance**
- `claude-sonnet-4-20250514` ⭐ **Best performance for most tasks**

**Claude 3.7 Models:**
- `claude-3-7-sonnet-20250219` (also available as `claude-3-7-sonnet-latest`)

**Claude 3.5 Models:**
- `claude-3-5-sonnet-20241022` (also available as `claude-3-5-sonnet-latest`)
- `claude-3-5-sonnet-20240620`
- `claude-3-5-haiku-20241022` (also available as `claude-3-5-haiku-latest`) - Fast, cost-effective

**Claude 3 Models:**
- `claude-3-haiku-20240307` - Most cost-effective option

**Legacy Models (Deprecated):**
- `claude-3-opus-20240229`
- `claude-3-sonnet-20240229`
- `claude-3-5-haiku-20240307`

**⭐ For best performance, use Claude Sonnet 4 or Opus 4 models for complex tasks, PDF processing, and structured output. These models offer the highest accuracy and capability.**

**Two-Step Process:**
1. **Create and run the batch job** - Use `.run()` method or wait for completion with `.is_complete()`
2. **Get results** - Use `.results()` method to get unified format: `List[{"result": ..., "citations": ...}]`

**Unified Result Format:**
- `BatchJob.results()` returns `List[{"result": ..., "citations": ...}]`
- `BatchManager.results()` returns `List[{"result": ..., "citations": ...}]`
- `load_results_from_disk()` returns `List[{"result": ..., "citations": ...}]`
- `result` field: Pydantic model instance, dict, or string
- `citations` field: Dict (field citations), List (text citations), or None

All models support batch processing with 50% cost savings. PDF/file processing requires file-capable models (all models except claude-3-haiku-20240307 support file input).

## Core API Documentation

- [Main README](README.md): Complete documentation with installation, usage examples, and API reference
- [Core Implementation](src/core.py): Lower-level batch() function implementation and PDF processing utilities  
- [Batch Manager](src/batch_manager.py): **Recommended approach** - Large-scale batch processing with parallel execution, state persistence, and cost management
- [Batch Job](src/batch_job.py): Individual batch job handling and status management (used by both batch() and BatchManager)
- [Utilities](src/utils.py): Helper functions including `load_results_from_disk()` for loading saved results
- [Citations](src/citations.py): Citation data structures and field-level citation mapping
- [Types](src/types.py): Type definitions and data structures used throughout the library

## Examples and Usage Patterns

- [Batch Manager Example](examples/batch_manager_example.py): **Recommended** - Large-scale processing with parallel execution and state management
- [Spam Detection Example](examples/spam_detection.py): Email classification using structured output with confidence scores
- [PDF Extraction Example](examples/pdf_extraction.py): Extract structured data from PDF invoices with citations
- [Citation Example](examples/citation_example.py): Basic citation usage for text analysis
- [Citation with Pydantic](examples/citation_with_pydantic.py): Field-level citations with structured output models
- [Raw Text Example](examples/raw_text_example.py): Simple text processing without structured output

## Provider Architecture

- [Base Provider](src/providers/base.py): Abstract base class for AI providers with batch processing interface
- [Anthropic Provider](src/providers/anthropic.py): Anthropic Claude implementation with batch API support and model definitions
- [Provider Registry](src/providers/__init__.py): Provider selection and initialization utilities

## Key Features and Implementation Details

### BatchManager (Recommended Approach)
- **Automatic Job Splitting**: Breaks large batches into configurable chunks (items_per_job)
- **Parallel Processing**: Concurrent job execution with ThreadPoolExecutor (max_parallel_jobs)
- **State Persistence**: JSON-based state files for resume capability after interruptions
- **Cost Management**: Stop processing when budget limits are reached (max_cost parameter)
- **Progress Monitoring**: Real-time progress updates with statistics and cost tracking
- **Retry Mechanism**: Built-in retry for failed items with exponential backoff
- **Result Management**: Organized directory structure for saving and loading results
- **Results Access**: Direct access via `.results()` method returning unified format `{"result": ..., "citations": ...}`

### Batch Processing Features
- **Cost Optimization**: 50% cost savings through Anthropic's batch API pricing
- **Structured Output**: Full Pydantic model support with automatic validation
- **Citation Mapping**: Field-level citations that map results to source documents
- **Cost Tracking**: Automatic token usage and cost calculation using tokencost library
- **Type Safety**: Full TypeScript-style type annotations and validation

### Citation System
- **Text + Citations Mode**: Flat list of citations for unstructured text responses
- **Structured + Field Citations**: Citations mapped to specific Pydantic model fields
- **Robust JSON Parsing**: Handles complex JSON structures with escaped quotes, nested objects, and special characters
- **Page-Level Citations**: Precise document location tracking with page numbers and text spans

### Response Formats
- **Unified Format**: Consistent `{"result": ..., "citations": ...}` structure across all modes
- **BatchManager Summary**: Processing summary with `total_items`, `completed_items`, `failed_items`, `total_cost`, `jobs_completed`, `cost_limit_reached`
- **Results Loading**: Automatic saving to `results_dir/processed/` and `results_dir/raw/` directories
- **Directory Information**: `.stats()` method shows `results_dir`, `processed_results_dir`, and `raw_results_dir` paths

## Installation and Setup

**Installation**: `pip install batchata`

**Environment Setup**: Requires `ANTHROPIC_API_KEY` environment variable

**Python Version**: Requires Python 3.12+

**Dependencies**: 
- `anthropic>=0.57.1` for Claude API access
- `python-dotenv>=1.1.1` for environment management
- `tokencost>=0.1.24` for cost tracking

## Testing and Development

- [Test Suite](tests/): Comprehensive test coverage including unit, integration, and e2e tests
- [Test Fixtures](tests/fixtures.py): Reusable test utilities and mock data
- [PDF Test Utils](tests/utils/pdf_utils.py): PDF generation utilities for testing
- [E2E Tests](tests/e2e/): End-to-end integration tests with real API calls

## Configuration and Customization

### BatchManager Parameters (Recommended)
- `items_per_job`: Number of items to process per batch job (default: 50)
- `max_parallel_jobs`: Maximum concurrent jobs (default: 10)
- `max_cost`: Budget limit to stop processing (default: None)
- `max_wait_time`: Maximum wait time for job completion (default: 3600 seconds)
- `state_path`: Path to JSON state file for persistence
- `results_dir`: Directory to save results (processed + raw subdirectories created automatically)

### Batch Function Parameters (Lower-level)
- `messages`: List of message conversations for chat-based processing
- `files`: List of PDF file paths or bytes for document processing
- `prompt`: Processing instruction (required for file processing)
- `model`: AI model identifier (recommend: "claude-sonnet-4-20250514")
- `response_model`: Optional Pydantic model for structured output
- `enable_citations`: Boolean to enable citation extraction
- `raw_results_dir`: Directory to save raw API responses

## Error Handling and Limitations

### Comprehensive Error Handling (v0.2.7+)
- **File Size Validation**: Provider-specific limits (32MB for Anthropic) with early validation
- **Empty File Detection**: Clear error messages for empty files or bytes content
- **Image Citation Protection**: UnsupportedContentError when requesting citations on images (PNG, JPEG, GIF, WebP)
- **File Type Detection**: Automatic detection and validation of PDF, PNG, JPEG, GIF, WebP files
- **Early Validation**: Files validated before expensive API operations to save time and costs
- **Clear Error Messages**: Specific, actionable feedback with file names and limits

### Exception Hierarchy
- `BatchataError` (base exception)
- `FileTooLargeError` (file exceeds provider limits)
- `UnsupportedContentError` (e.g., citations on images)
- `BatchManagerError` (state and processing errors)
- `ValueError` (empty files, invalid parameters)

### Known Limitations
- **Citation Limitations**: Only works with flat Pydantic models (no nested models)
- **Model Requirements**: PDFs require file-capable models (use Sonnet 4/Opus 4 for best results)
- **Batch Timing**: Jobs can take up to 24 hours to process
- **Cost Limits**: Best effort enforcement - final costs may slightly exceed max_cost
- **Provider Support**: Currently Anthropic only, OpenAI support planned

## CLI Commands

- `batchata-example`: Run spam detection example
- `batchata-pdf-example`: Run PDF extraction example

## Project Structure

```
batchata/
├── src/                    # Source code
│   ├── core.py            # Lower-level batch() function
│   ├── batch_manager.py   # Recommended BatchManager class
│   ├── batch_job.py       # Individual job handling
│   ├── citations.py       # Citation data structures
│   └── providers/         # AI provider implementations
├── examples/              # Usage examples
├── tests/                 # Test suite
└── specs/                 # Feature specifications
```

## Recent Updates

### Version 0.2.7 (Enhanced Error Handling)
- **Centralized Exception System**: New hierarchical exception classes in `batchata/exceptions.py`
- **Provider-Specific Validation**: File size limits (32MB for Anthropic) with early validation
- **Content Type Detection**: Automatic detection of PDF, PNG, JPEG, GIF, WebP files
- **Image Citation Protection**: Clear errors when requesting citations on images
- **Empty File Handling**: Native ValueError with descriptive messages
- **Comprehensive Test Coverage**: 19 new tests for error scenarios

### Version 0.2.6 (File Validation Improvements)
- **Early File Validation**: Upfront validation in both `batch()` and `BatchManager`
- **Clear Error Messages**: Specific file names in error messages
- **Bytes Content Support**: Proper handling of bytes input without validation

### Version 0.2.5 (BatchManager Memory-First API)
- **Memory-First Results**: `BatchManager.results()` checks memory before disk
- **Unified Result Format**: Consistent `{"result": ..., "citations": ...}` across all methods
- **Example Standardization**: All examples use unified patterns and waiting loops

## Development Status

- **Version**: 0.2.7 (Beta)
- **License**: MIT
- **Repository**: https://github.com/agamm/batchata
- **PyPI**: https://pypi.org/project/batchata/
- **Status**: Active development with regular updates