Metadata-Version: 2.4
Name: bigdata-smart-batching
Version: 0.2.0
Summary: High-performance semantic search with intelligent company grouping and parallel execution
Project-URL: homepage, https://bigdata.com/api
Author-email: "Bigdata.com" <support@ravenpack.com>
License: MIT
License-File: LICENSE
Requires-Python: <4.0,>=3.11
Requires-Dist: matplotlib>=3.7
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: requests>=2.31.0
Description-Content-Type: text/markdown

# Smart Batching Search

A high-performance semantic search system that reduces API queries by **67-99%** (varies by topic specificity) through intelligent company grouping and parallel execution.

This module provides a two-step system for efficient semantic search:
1. **Planning**: Organize the search into optimized baskets and return a chunk upper-bound estimate
2. **Execution**: Run the searches with proportional sampling so the result distribution is preserved

## Key Benefits

- **67-99% Query Reduction**: Search 4,732 companies with only 17-3,699 queries (varies by topic)
- **Parallel Execution**: Rate-limited concurrent requests with semaphore control
- **Proportional Sampling**: Retrieve a configurable percentage of results while preserving their distribution
- **Production Ready**: Comprehensive error handling, retries, and logging
- **Scalable**: Efficiently handles universes with 10,000+ companies

## Installation

Install the package from PyPI (Python 3.11+):

```bash
pip install bigdata-smart-batching
```

With [uv](https://docs.astral.sh/uv/):

```bash
uv add bigdata-smart-batching
```

### Development

To work on this repository locally, from the project root:

```bash
uv sync
```

### Environment Setup

Set up environment variables:

```bash
export BIGDATA_API_KEY="your_api_key_here"
export BIGDATA_API_BASE_URL="https://api.bigdata.com"  # Optional, defaults to this
```

Or create a `.env` file:

```
BIGDATA_API_KEY=your_api_key_here
BIGDATA_API_BASE_URL=https://api.bigdata.com
```
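
If you load the `.env` file yourself rather than relying on the library, a minimal sketch using `python-dotenv` (a declared dependency of this package; whether the package calls `load_dotenv()` internally is not documented here, so loading it explicitly is a safe default):

```python
import os

from dotenv import load_dotenv  # provided by the python-dotenv dependency

load_dotenv()  # reads .env from the current working directory

api_key = os.getenv("BIGDATA_API_KEY")
base_url = os.getenv("BIGDATA_API_BASE_URL", "https://api.bigdata.com")
if not api_key:
    raise RuntimeError("BIGDATA_API_KEY is not set")
```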

## Universe CSV file

`plan_search()` loads company **entity IDs** from a UTF-8 CSV. IDs must match the identifiers used by the Bigdata API for your dataset. The expected layout is a header row with an `id` column; optional extra columns such as `name` are ignored:

```csv
id,name
B8EF97,Example Corp A
BB07E4,Example Corp B
3461CF,Example Corp C
```
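
For reference, a minimal sketch of reading such a file with the standard library. The package ships its own `load_universe_from_csv()` helper (see the API Reference), so this is illustrative only:

```python
import csv

def read_entity_ids(csv_path: str) -> list[str]:
    """Read entity IDs from a CSV whose header row contains an `id` column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row["id"].strip() for row in reader if row.get("id")]

ids = read_entity_ids("id_name_mapping_us_top_3000.csv")
print(f"Loaded {len(ids)} entity IDs")
```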

## Quick Start

```python
from bigdata_smart_batching import (
    plan_search,
    execute_search,
    deduplicate_documents,
    convert_to_dataframe,
)

# Step 1: Plan the search
plan = plan_search(
    text="earnings revenue profit",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
    api_key="your_api_key",  # or set BIGDATA_API_KEY env var
)

print(f"Chunk upper bound estimate: {plan['chunk_upper_bound_estimate']:,}")

# Step 2: Execute search with 10% of total chunks (preserves distribution)
results_raw = execute_search(
    search_plan=plan,
    chunk_percentage=0.1,
    requests_per_minute=100,
)

# Step 3: Deduplicate and convert to DataFrame
results = deduplicate_documents(results_raw)
print(f"Retrieved {len(results)} documents (deduplicated)")

df = convert_to_dataframe(results)  # one row per chunk
```

### Save and Load Plans

```python
from bigdata_smart_batching import plan_search, execute_search, save_plan, load_plan

# Create and save a plan
plan = plan_search(
    text="merger acquisition",
    universe_csv_path="id_name_mapping_us_top_3000.csv",
    start_date="2023-01-01",
    end_date="2023-12-31",
)
save_plan(plan, "my_search_plan.json")

# Later: reload and run with different sampling
plan = load_plan("my_search_plan.json")
raw_10 = execute_search(plan, chunk_percentage=0.1)
raw_50 = execute_search(plan, chunk_percentage=0.5)
```

## How It Works

### Architecture Overview

```
Step 1: PLANNING
  Universe CSV  -->  Co-mention API Query  -->  Basket Creation  -->  Search Plan

Step 2: EXECUTION
  Proportional Sampling  -->  Parallel Search (Rate Limited)  -->  Collect & Aggregate
```

### Planning (`plan_search()`)

1. Loads the universe of companies from the CSV
2. Queries the co-mention endpoint to get chunk volumes per company
3. Splits a company's date range by volume when it exceeds the chunk limit
4. Creates optimized baskets grouped by volume (sketched below)
5. Returns a plan with the chunk upper-bound estimate and basket configurations
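
The exact basket-building strategy is internal to the package, but here is a hedged sketch of step 4 under one plausible reading: a greedy fill against the documented `max_chunks_per_basket` default of 1000 (all names below are illustrative, not the package's actual internals):

```python
MAX_CHUNKS_PER_BASKET = 1000  # matches the documented default

def build_baskets(volumes: dict[str, int]) -> list[list[str]]:
    """Greedily group companies so each basket's total chunk volume stays
    under the limit. A company at or above the limit gets its own basket
    (the planner would also split its date range by volume, per step 3)."""
    baskets: list[list[str]] = []
    current: list[str] = []
    current_volume = 0
    # Place high-volume companies first so baskets pack tightly
    for entity_id, volume in sorted(volumes.items(), key=lambda kv: -kv[1]):
        if volume >= MAX_CHUNKS_PER_BASKET:
            baskets.append([entity_id])  # too big to share a basket
            continue
        if current and current_volume + volume > MAX_CHUNKS_PER_BASKET:
            baskets.append(current)
            current, current_volume = [], 0
        current.append(entity_id)
        current_volume += volume
    if current:
        baskets.append(current)
    return baskets
```

One basket then maps to one API query covering all of its companies, which is presumably where the query reduction comes from.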

### Execution (`execute_search()`)

1. Calculates proportional chunks per basket (see the sketch after this list)
2. Ensures a minimum of one chunk per basket when its expected count is above zero
3. Executes searches in parallel with rate limiting and a semaphore
4. Collects and returns the document results
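
A sketch of the proportional allocation in steps 1-2, assuming each basket carries an expected chunk count from the plan (field and function names are illustrative):

```python
def allocate_chunks(expected_per_basket: list[int], chunk_percentage: float) -> list[int]:
    """Give each basket a share of chunks proportional to its expected
    volume, with a floor of one chunk for any basket expected to match."""
    return [
        max(1, round(expected * chunk_percentage)) if expected > 0 else 0
        for expected in expected_per_basket
    ]

# 10% of [400, 1000, 5] -> [40, 100, 1]; the smallest basket keeps one chunk
print(allocate_chunks([400, 1000, 5], 0.1))
```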

## API Reference

### `plan_search()`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | required | Search query text |
| `universe_csv_path` | `str` | required | Path to CSV with entity IDs |
| `start_date` | `str` | required | Start date (YYYY-MM-DD) |
| `end_date` | `str` | required | End date (YYYY-MM-DD) |
| `api_key` | `str` | env var | API key |
| `api_base_url` | `str` | env var | API base URL |
| `volume_query_mode` | `str` | `"three_pass"` | `"three_pass"` or `"iterative"` |
| `apply_volume_splits` | `bool` | `True` | Use volume time series for period splitting |
| `min_period_days` | `int` | `30` | Minimum days per sub-period |

### `execute_search()`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `search_plan` | `Dict` | required | Plan from `plan_search()` |
| `chunk_percentage` | `float` | required | 0.0 to 1.0 sampling ratio |
| `requests_per_minute` | `int` | `100` | Rate limit |
| `api_key` | `str` | env var | API key |
| `max_workers` | `int` | `40` | Parallel workers |
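
The package's rate limiter is internal, but the pairing of `requests_per_minute` with `max_workers` suggests request pacing guarded by a semaphore. A hedged sketch of that pattern, not the package's actual implementation:

```python
import threading
import time

class RateLimiter:
    """Allow at most `requests_per_minute` request starts per minute,
    with at most `max_workers` requests in flight at once."""

    def __init__(self, requests_per_minute: int = 100, max_workers: int = 40):
        self._interval = 60.0 / requests_per_minute  # seconds between starts
        self._semaphore = threading.Semaphore(max_workers)
        self._lock = threading.Lock()
        self._next_start = time.monotonic()

    def __enter__(self):
        self._semaphore.acquire()
        with self._lock:  # reserve the next start slot atomically
            now = time.monotonic()
            wait = self._next_start - now
            self._next_start = max(now, self._next_start) + self._interval
        if wait > 0:
            time.sleep(wait)
        return self

    def __exit__(self, *exc):
        self._semaphore.release()

# Each worker thread wraps its API call:
#     with limiter:
#         response = session.post(...)
```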

### Helper Functions

- **`deduplicate_documents(documents)`** -- Merges duplicate documents by `id` (see the sketch after this list)
- **`load_universe_from_csv(csv_path)`** -- Loads entity IDs from CSV
- **`convert_to_dataframe(raw_results)`** -- Converts documents to DataFrame (one row per chunk)
- **`save_plan(plan, path)`** / **`load_plan(path)`** -- Persist plans as JSON
- **`portfolio_backtesting_pipeline(...)`** -- Long-short portfolio backtesting
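
A sketch of what `deduplicate_documents` plausibly amounts to: keeping one record per document `id` and merging the duplicates' chunks (the chunk-merging detail and the `{"id": ..., "chunks": [...]}` schema are assumptions; the real function may differ):

```python
def deduplicate_by_id(documents: list[dict]) -> list[dict]:
    """Keep one record per document id, merging the `chunks` lists of
    any duplicates into the first occurrence."""
    merged: dict[str, dict] = {}
    for doc in documents:
        existing = merged.get(doc["id"])
        if existing is None:
            merged[doc["id"]] = {**doc, "chunks": list(doc.get("chunks", []))}
        else:
            existing["chunks"].extend(doc.get("chunks", []))
    return list(merged.values())
```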

## Testing

```bash
# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=bigdata_smart_batching --cov-report=term-missing

# Specific test file
uv run pytest tests/test_validation.py -v
```

## Project Structure

```
bigdata-smart-batching/
├── pyproject.toml
├── README.md
├── .python-version
├── src/
│   └── bigdata_smart_batching/
│       ├── __init__.py
│       ├── smart_batching.py
│       ├── smart_batching_config.py
│       ├── search_function.py
│       ├── output_converter.py
│       └── portfolio_backtesting.py
└── tests/
    ├── __init__.py
    ├── test_config.py
    ├── test_output_converter.py
    ├── test_validation.py
    └── test_rate_limiter.py
```

## Configuration

### Environment Variables

- `BIGDATA_API_KEY`: Required -- Your Bigdata API key
- `BIGDATA_API_BASE_URL`: Optional -- API base URL (default: `https://api.bigdata.com`)

### Default Settings

- `requests_per_minute`: 100
- `max_workers`: 40
- `max_chunks_per_basket`: 1000
- `volume_query_mode`: `"three_pass"`
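
`requests_per_minute` and `max_workers` (and `volume_query_mode` on the planning side) can be overridden per call via the parameters documented in the API Reference. Continuing the Quick Start example:

```python
results_raw = execute_search(
    search_plan=plan,
    chunk_percentage=0.2,
    requests_per_minute=60,  # slow down for a shared API key
    max_workers=20,          # fewer parallel workers
)
```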

## License

This project is licensed under the MIT License (see the `LICENSE` file) and is part of the Bigdata.com and WorldQuant Challenge.

**Disclaimer**: This software is provided "as is" without warranty of any kind, express or implied. The authors and contributors assume no responsibility for the accuracy, completeness, or usefulness of any information, results, or processes provided. This software is for educational and research purposes only and is not intended to be used as financial advice.
