Metadata-Version: 2.3
Name: bplusplus
Version: 2.1.2
Summary: A simple method to create AI models for biodiversity, with collect and prepare pipeline
License: MIT
Author: Titus Venverloo
Author-email: tvenver@mit.edu
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: bugspot (>=0.2.0)
Requires-Dist: numpy (>=1.26.0,<1.26.5) ; sys_platform == "win32"
Requires-Dist: numpy (>=1.26.0,<1.27.0) ; sys_platform == "darwin" and platform_machine == "arm64"
Requires-Dist: numpy (>=1.26.0,<1.27.0) ; sys_platform == "darwin" and platform_machine == "x86_64"
Requires-Dist: numpy (>=1.26.0,<1.27.0) ; sys_platform == "linux" and platform_machine == "aarch64"
Requires-Dist: numpy (>=1.26.0,<1.27.0) ; sys_platform == "linux" and platform_machine == "x86_64"
Requires-Dist: pandas (==2.1.4)
Requires-Dist: pillow (>=10.0.0,<12.0.0) ; sys_platform == "darwin"
Requires-Dist: pillow (>=10.0.0,<12.0.0) ; sys_platform == "linux"
Requires-Dist: pillow (>=10.0.0,<12.0.0) ; sys_platform == "win32"
Requires-Dist: prettytable (==3.7.0)
Requires-Dist: pygbif (==0.6.5)
Requires-Dist: pyyaml (==6.0.1)
Requires-Dist: requests (==2.25.1)
Requires-Dist: scikit-learn (>=1.3.0,<1.7.0) ; sys_platform == "linux" and platform_machine == "aarch64"
Requires-Dist: scikit-learn (>=1.3.0,<1.7.0) ; sys_platform == "win32"
Requires-Dist: scikit-learn (>=1.4.0,<1.8.0) ; sys_platform == "darwin" and platform_machine == "arm64"
Requires-Dist: scikit-learn (>=1.4.0,<1.8.0) ; sys_platform == "darwin" and platform_machine == "x86_64"
Requires-Dist: scikit-learn (>=1.4.0,<1.8.0) ; sys_platform == "linux" and platform_machine == "x86_64"
Requires-Dist: tabulate (==0.9.0)
Requires-Dist: torch (>=2.0.0,<2.8.0) ; sys_platform == "darwin" and platform_machine == "arm64"
Requires-Dist: torch (>=2.0.0,<2.8.0) ; sys_platform == "linux"
Requires-Dist: torch (>=2.0.0,<2.8.0) ; sys_platform == "win32"
Requires-Dist: torch (>=2.2.0,<2.3.0) ; sys_platform == "darwin" and platform_machine == "x86_64"
Requires-Dist: tqdm (==4.66.4)
Requires-Dist: ultralytics (==8.3.173)
Requires-Dist: validators (==0.33.0)
Description-Content-Type: text/markdown

# B++ repository

[![DOI](https://zenodo.org/badge/765250194.svg)](https://zenodo.org/badge/latestdoi/765250194) 
[![PyPi version](https://img.shields.io/pypi/v/bplusplus.svg)](https://pypi.org/project/bplusplus/)
[![Python versions](https://img.shields.io/pypi/pyversions/bplusplus.svg)](https://pypi.org/project/bplusplus/)
[![License](https://img.shields.io/pypi/l/bplusplus.svg)](https://pypi.org/project/bplusplus/)
[![Downloads](https://static.pepy.tech/badge/bplusplus)](https://pepy.tech/project/bplusplus)
[![Downloads](https://static.pepy.tech/badge/bplusplus/month)](https://pepy.tech/project/bplusplus)
[![Downloads](https://static.pepy.tech/badge/bplusplus/week)](https://pepy.tech/project/bplusplus)

This project provides a complete, end-to-end pipeline for building a custom insect classification system. The framework is designed to be **domain-agnostic**, allowing you to train a detection and classification model for **any insect species** simply by providing a list of scientific names.

Built on the `bplusplus` library, the pipeline automates the entire machine learning workflow, from data collection to video inference.

## Key Features

- **Automated Data Collection**: Downloads hundreds of images for any species from the GBIF database.
- **Intelligent Data Preparation**: Uses a pre-trained model to automatically find, crop, and resize insects from raw images, ensuring high-quality training data.
- **Hierarchical Classification**: Trains a model to identify insects at three taxonomic levels: **family, genus, and species**.
- **Video Inference & Tracking**: Processes video files to detect, classify, and track individual insects over time, providing aggregated predictions.

## Pipeline Overview

The process is broken down into five main steps, all detailed in the `full_pipeline.ipynb` notebook:

1.  **Collect Data**: Select your target species and fetch raw insect images from the web.
2.  **Prepare Data**: Filter, clean, and prepare images for training.
3.  **Train Model**: Train the hierarchical classification model.
4.  **Validate Model**: Evaluate the performance of the trained model.
5.  **Run Inference**: Run the full pipeline on a video file for real-world application.

## How to Use

### Prerequisites

- Python 3.10+

### Setup

1.  **Create and activate a virtual environment:**
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    ```

2.  **Install the required packages:**
    ```bash
    pip install bplusplus
    ```

### Running the Pipeline

The pipeline can be run step-by-step using the functions from the `bplusplus` library. While the `full_pipeline.ipynb` notebook provides a complete, executable workflow, the core functions are described below.

#### Step 1: Collect Data
Download images for your target species from the GBIF database. You'll need to provide a list of scientific names.

```python
import bplusplus
from pathlib import Path

# Define species and directories
names = ["Vespa crabro", "Vespula vulgaris", "Dolichovespula media"]
GBIF_DATA_DIR = Path("./GBIF_data")

# Define search parameters
search = {"scientificName": names}

# Run collection
bplusplus.collect(
    group_by_key=bplusplus.Group.scientificName,
    search_parameters=search,
    images_per_group=200,  # Recommended to download more than needed
    output_directory=GBIF_DATA_DIR,
    num_threads=5
)
```

#### Step 2: Prepare Data
Process the raw images to extract, crop, and resize insects. This step uses a pre-trained model to ensure only high-quality images are used for training.

```python
PREPARED_DATA_DIR = Path("./prepared_data")

bplusplus.prepare(
    input_directory=GBIF_DATA_DIR,
    output_directory=PREPARED_DATA_DIR,
    img_size=640,        # Target image size for training
    conf=0.6,            # Detection confidence threshold (0-1)
    valid=0.1,           # Validation split ratio (0-1), set to 0 for no validation
    blur=None,           # Gaussian blur as fraction of image size (0-1), None = disabled
)
```

**Note:** The `blur` parameter applies Gaussian blur before resizing, which can help reduce noise. Values are relative to image size (e.g., `blur=0.01` means 1% of the smallest dimension). Supported image formats: JPG, JPEG, and PNG.
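
To make the relative scaling concrete, here is a small sketch of how a fractional `blur` value maps to a pixel radius. The helper `blur_radius_px` is illustrative only and is not part of the `bplusplus` API; the library's internal kernel construction may differ.

```python
def blur_radius_px(blur: float, width: int, height: int) -> int:
    """Convert a relative blur value (fraction of the smallest
    image dimension) into an approximate pixel radius."""
    return max(1, round(blur * min(width, height)))

# blur=0.01 on a 640x480 image: 1% of 480 px
print(blur_radius_px(0.01, 640, 480))  # → 5
```

The takeaway is that the same `blur` setting scales with image size, so small and large source images receive a proportionally similar amount of smoothing.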

#### Step 3: Train Model
Train the hierarchical classification model on your prepared data. The model learns to identify family, genus, and species.

```python
TRAINED_MODEL_DIR = Path("./trained_model")

bplusplus.train(
    batch_size=4,
    epochs=30,
    patience=3,
    img_size=640,
    data_dir=PREPARED_DATA_DIR,
    output_dir=TRAINED_MODEL_DIR,
    species_list=names,
    backbone="resnet50",  # Choose: "resnet18", "resnet50", or "resnet101"
    # num_workers=0,      # Optional: force single-process loading (most stable)
    # train_transforms=custom_transforms,  # Optional: custom torchvision transforms
)
```

**Note:** The `num_workers` parameter controls DataLoader multiprocessing (defaults to 0 for stability). The `backbone` parameter allows you to choose between different ResNet architectures—use `resnet18` for faster training or `resnet101` for potentially better accuracy.

#### Step 4: Validate Model
Evaluate the trained model on a held-out validation set. This calculates precision, recall, and F1-score at all taxonomic levels.

```python
HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"

results = bplusplus.validate(
    species_list=names,
    validation_dir=PREPARED_DATA_DIR / "valid",
    hierarchical_weights=HIERARCHICAL_MODEL_PATH,
    img_size=640,           # Must match training
    batch_size=32,
    backbone="resnet50",    # Must match training
)
```

#### Step 5: Run Inference on Video

Process a video through a multi-phase pipeline: motion-based detection (GMM), Hungarian tracking, path topology confirmation, and hierarchical classification. Detection and tracking are powered by [BugSpot](bugspot/), a lightweight core that runs on any platform, including edge devices.

The species list is automatically loaded from the model checkpoint.

```python
HIERARCHICAL_MODEL_PATH = TRAINED_MODEL_DIR / "best_multitask.pt"

results = bplusplus.inference(
    video_path="my_video.mp4",
    output_dir="./output",
    hierarchical_model_path=HIERARCHICAL_MODEL_PATH,
    backbone="resnet50",        # Must match training
    img_size=640,               # Must match training
    # --- Optional ---
    # species_list=names,       # Override species from checkpoint
    # fps=None,                 # None = all frames, or set target FPS
    # config="config.yaml",     # Custom detection parameters (YAML/JSON)
    # classify=False,           # Detection only, NaN for classification
    # save_video=True,          # Annotated + debug videos
    # crops=False,              # Save crop per detection per track
    # track_composites=False,   # Composite image per track (temporal trail)
)

print(f"Confirmed: {results['confirmed_tracks']} / {results['tracks']} tracks")
```

**Output files:**

| File | Description | Flag |
|------|-------------|------|
| `{video}_results.csv` | Aggregated results per confirmed track | Always |
| `{video}_detections.csv` | Frame-by-frame detections | Always |
| `{video}_annotated.mp4` | Video with detection boxes and paths | `save_video=True` |
| `{video}_debug.mp4` | Side-by-side with GMM motion mask | `save_video=True` |
| `{video}_crops/` | Crop images per track | `crops=True` |
| `{video}_composites/` | Composite images per track | `track_composites=True` |
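
Both CSVs are plain tables, so they can be post-processed with pandas. The column names in this sketch are illustrative stand-ins, not the actual schema written by `bplusplus`; inspect the header of your own output files first.

```python
import io
import pandas as pd

# Toy stand-in for {video}_detections.csv -- real column names
# are defined by bplusplus and may differ.
csv_text = """track_id,frame,x,y
1,0,120,340
1,1,124,338
2,0,500,200
"""
detections = pd.read_csv(io.StringIO(csv_text))

# Example analysis: number of detections per track
per_track = detections.groupby("track_id").size()
print(per_track.to_dict())
```

In practice you would replace the inline string with `pd.read_csv("output/my_video_detections.csv")`.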

**Detection configuration** can be customized via a YAML/JSON file passed as `config=`. Download a template from the [releases page](https://github.com/Tvenver/Bplusplus/releases).

<details>
<summary><b>Full Configuration Parameters</b> (click to expand)</summary>

| Parameter | Default | Description |
|-----------|---------|-------------|
| **GMM Background Subtractor** | | *Motion detection model* |
| `gmm_history` | 500 | Frames to build background model |
| `gmm_var_threshold` | 16 | Variance threshold for foreground detection |
| **Morphological Filtering** | | *Noise removal* |
| `morph_kernel_size` | 3 | Morphological kernel size (NxN) |
| **Cohesiveness** | | *Filters scattered motion (plants) vs compact motion (insects)* |
| `min_largest_blob_ratio` | 0.80 | Min ratio of largest blob to total motion |
| `max_num_blobs` | 5 | Max separate blobs allowed in detection |
| `min_motion_ratio` | 0.15 | Min ratio of motion pixels to bbox area |
| **Shape** | | *Filters by contour properties* |
| `min_area` | 200 | Min detection area (px²) |
| `max_area` | 40000 | Max detection area (px²) |
| `min_density` | 3.0 | Min area/perimeter ratio |
| `min_solidity` | 0.55 | Min convex hull fill ratio |
| **Tracking** | | *Controls track behavior* |
| `min_displacement` | 50 | Min net movement for confirmation (px) |
| `min_path_points` | 10 | Min points before path analysis |
| `max_frame_jump` | 100 | Max jump between frames (px) |
| `max_lost_frames` | 45 | Frames before lost track deleted (e.g., 45 @ 30fps = 1.5s) |
| `max_area_change_ratio` | 3.0 | Max area change ratio between frames |
| **Tracker Matching** | | *Hungarian algorithm cost function* |
| `tracker_w_dist` | 0.6 | Weight for distance cost (0-1) |
| `tracker_w_area` | 0.4 | Weight for area cost (0-1) |
| `tracker_cost_threshold` | 0.3 | Max cost for valid match (0-1) |
| **Path Topology** | | *Confirms insect-like movement patterns* |
| `max_revisit_ratio` | 0.30 | Max ratio of revisited positions |
| `min_progression_ratio` | 0.70 | Min forward progression |
| `max_directional_variance` | 0.90 | Max heading variance |
| `revisit_radius` | 50 | Radius (px) for revisit detection |

</details>
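
As an illustration, a small override file using parameter names from the table above might look like the following. This assumes a flat YAML mapping, so verify the key layout against the template from the releases page:

```yaml
# Hypothetical config.yaml -- parameter names taken from the table above.
min_area: 400          # ignore blobs smaller than 400 px²
max_area: 20000        # reject very large motion regions
min_displacement: 80   # require more net movement to confirm a track
max_lost_frames: 30    # delete lost tracks after 1 s at 30 fps
```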

### Customization

To train the model on your own set of insect species, you only need to change the `names` list in **Step 1**. The pipeline will automatically handle the rest.

```python
# To use your own species, change the names in this list
names = [
    "Vespa crabro",
    "Vespula vulgaris",
    "Dolichovespula media",
    # Add your species here
]
```

#### Handling an "Unknown" Class
To train a model that can recognize an "unknown" class for insects that don't belong to your target species, add `"unknown"` to your `species_list`. You must also provide a corresponding `unknown` folder containing images of various other insects in your data directories (e.g., `prepared_data/train/unknown`).

```python
# Example with an unknown class
names_with_unknown = [
    "Vespa crabro",
    "Vespula vulgaris",
    "unknown"
]
```

## Directory Structure

The pipeline will create the following directories to store artifacts:

- `GBIF_data/`: Stores the raw images downloaded from GBIF.
- `prepared_data/`: Contains the cleaned, cropped, and resized images ready for training (`train/` and optionally `valid/` subdirectories).
- `trained_model/`: Saves the trained model weights (`best_multitask.pt`).
- `output/`: Inference results including annotated videos and CSV files.

## Citation

All information in this repository is available under the MIT license, provided that credit is given to the authors.

**Venverloo, T., Duarte, F., B++: Towards Real-Time Monitoring of Insect Species. MIT Senseable City Laboratory, AMS Institute.**

