Metadata-Version: 2.4
Name: inferential
Version: 0.3.2
Summary: Robotics-aware inference orchestration on top of Ray Serve
Author: inferential.sh
License-Expression: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: numpy>=1.24
Requires-Dist: protobuf>=5.0
Requires-Dist: pyzmq>=26.0
Provides-Extra: dev
Requires-Dist: grpcio-tools>=1.60; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pydantic>=2.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ray[serve]>=2.9; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: server
Requires-Dist: pydantic>=2.0; extra == 'server'
Requires-Dist: ray[serve]>=2.9; extra == 'server'
Description-Content-Type: text/markdown

# Inferential

Multi-client inference orchestration on top of Ray Serve.

Inferential sits between your clients and your ML models. It receives observations over ZMQ, schedules inference requests using cadence-aware priority scoring, dispatches to Ray Serve, and streams results back — all with sub-millisecond transport overhead. Built for any scenario where multiple clients need concurrent access to shared models: robotics fleets, game agents, IoT devices, real-time ML pipelines.

![Inferential data flow](https://raw.githubusercontent.com/nalinraut/inferential/main/assets/InferentialFlow.gif)

## Features

- **ZMQ transport** — ROUTER/DEALER sockets with automatic reconnection and zero-copy tensor payloads
- **Pluggable schedulers** — Deadline-aware (default), batch-optimized, priority-tiered, round-robin
- **Cadence learning** — EMA-based tracking of each client's request pattern to predict urgency
- **Protobuf wire protocol** — Typed tensor metadata (dtype, shape, encoding) with binary payload
- **Queue management** — Request TTL, drop-oldest overflow policy, dispatch retry
- **In-memory metrics** — Ring-buffer storage with label filtering and percentile stats (p50/p95/p99)
- **Lightweight client SDK** — No Ray dependency; just `pyzmq`, `protobuf`, and `numpy`

## Install

```bash
# Client SDK only
pip install inferential

# Server with Ray Serve
pip install "inferential[server]"

# Development
pip install "inferential[dev]"
```

## Quick Start

### Server

```python
import asyncio
from inferential import Server

server = Server(bind="tcp://*:5555", models=["policy-v2"])

@server.on_metric
def log(name, value, labels):
    if name == "inference_latency_ms":
        print(f"Client {labels.get('client')}: {value:.1f}ms")

asyncio.run(server.run())
```

### Client

```python
import numpy as np
from inferential import Connection

conn = Connection(server="tcp://localhost:5555", client_id="agent-01", client_type="sensor")
model = conn.model("policy-v2", latency_budget_ms=30.0)

state = np.random.randn(7).astype(np.float32)
model.observe(urgency=0.8, state=state)

result = model.get_result(timeout_ms=50)
if result is not None:
    actions = result["actions"]  # np.ndarray
```

## Documentation

- **[Quick Start](https://github.com/nalinraut/inferential/blob/main/docs/quickstart.md)** — Install, run server + client, get your first result
- **[Architecture](https://github.com/nalinraut/inferential/blob/main/docs/architecture.md)** — System design, wire protocol, schedulers, queue management, metrics, configuration
- **[Examples](https://github.com/nalinraut/inferential/blob/main/docs/examples.md)** — Multi-client demos, metric callbacks, extending with custom models and schedulers
- **[Contributing](https://github.com/nalinraut/inferential/blob/main/docs/contributing.md)** — Commit conventions, branching, code style, pre-commit hooks, releases

## Development

```bash
# Generate protobuf code
make proto

# Run tests
make test

# Lint
make lint
```

## License

Apache-2.0
