import numpy as np
from numpy.typing import NDArray
import atdata

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    confidence: float

# Write sharded tar files (returns a typed Dataset)
samples = [
    ImageSample(
        image=np.random.rand(224, 224, 3).astype(np.float32),
        label="cat",
        confidence=0.95,
    )
    for _ in range(100)
]
ds = atdata.write_samples(samples, "data.tar", maxcount=50)

# Iterate with automatic batching
for batch in ds.shuffled(batch_size=32):
    images = batch.image  # (32, 224, 224, 3) numpy array
    labels = batch.label  # list of 32 strings
    break

atdata
A loose federation of distributed, typed datasets built on WebDataset
What is atdata?
atdata is a Python library for typed, serializable datasets. The same sample types and the same APIs carry over from local development to managed storage to federated sharing.
Typed Samples
Define dataclass-based sample types with @packable. Automatic msgpack serialization and NDArray handling.
One-Call Writes
write_samples() handles sharding, key generation, and optional manifest sidecar files.
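To make the sharding idea concrete, here is a minimal standard-library sketch of splitting samples across tar files with sequential keys. It is not atdata's implementation: `write_shards`, the `data-000000.tar` naming pattern, and the zero-padded key format are all assumptions for illustration, and JSON stands in for the msgpack encoding that atdata uses.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shards(records, out_dir, maxcount=50):
    """Hypothetical helper: split records into tar shards of at most maxcount entries."""
    out_dir = Path(out_dir)
    shard_paths = []
    tar = None
    for i, record in enumerate(records):
        if i % maxcount == 0:
            # Start a new shard every maxcount samples.
            if tar is not None:
                tar.close()
            path = out_dir / f"data-{len(shard_paths):06d}.tar"
            shard_paths.append(path)
            tar = tarfile.open(path, "w")
        # Assumed key scheme: a zero-padded running index used as the file basename.
        key = f"{i:08d}"
        payload = json.dumps(record).encode("utf-8")
        info = tarfile.TarInfo(name=f"{key}.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths


with tempfile.TemporaryDirectory() as tmp:
    records = [{"label": "cat", "confidence": 0.95} for _ in range(100)]
    shards = write_shards(records, tmp, maxcount=50)
    shard_names = [p.name for p in shards]
    with tarfile.open(shards[0]) as tf:
        first_shard_count = len(tf.getnames())
```

With 100 records and `maxcount=50`, the sketch produces two shards of 50 entries each.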
Batch Aggregation
NDArray fields are stacked automatically. Scalar fields become lists. No manual collation code.
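The collation rule described here can be sketched in a few lines of NumPy. This is an illustration of the behavior, not atdata's code: the `collate` function is hypothetical, and a plain dataclass stands in for a `@packable` sample type.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class ImageSample:
    image: np.ndarray
    label: str
    confidence: float


def collate(samples):
    """Hypothetical collation: stack array fields, gather other fields into lists."""
    batch = {}
    for f in fields(samples[0]):
        values = [getattr(s, f.name) for s in samples]
        if isinstance(values[0], np.ndarray):
            # np.stack adds a new leading batch axis.
            batch[f.name] = np.stack(values)
        else:
            batch[f.name] = values
    return batch


samples = [
    ImageSample(
        image=np.zeros((224, 224, 3), dtype=np.float32),
        label="cat",
        confidence=0.95,
    )
    for _ in range(32)
]
batch = collate(samples)
```

Here `batch["image"]` has shape `(32, 224, 224, 3)`, while `batch["label"]` and `batch["confidence"]` are plain Python lists of length 32.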
Lens Transformations
View datasets through different schemas without duplicating data. Bidirectional with putters.
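As a rough illustration of the lens idea, the sketch below pairs a getter (project a sample into a narrower view) with a putter (write an edited view back into the original sample). The `Lens` class and both sample types are hypothetical and are not atdata's lens API; they only show the general getter/putter pattern.

```python
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source schema
V = TypeVar("V")  # view schema


@dataclass(frozen=True)
class FullSample:
    label: str
    confidence: float
    annotator: str


@dataclass(frozen=True)
class LabelView:
    label: str


@dataclass(frozen=True)
class Lens(Generic[S, V]):
    get: Callable[[S], V]     # project a source sample into the view
    put: Callable[[V, S], S]  # write an edited view back into the source


label_lens = Lens(
    get=lambda s: LabelView(label=s.label),
    put=lambda v, s: replace(s, label=v.label),
)

source = FullSample(label="cat", confidence=0.95, annotator="alice")
view = label_lens.get(source)
updated = label_lens.put(LabelView(label="dog"), source)
```

The view is computed on demand from the source sample, so nothing is copied into a second dataset. Putting an unmodified view back returns a sample equal to the original, which is the usual well-behavedness condition for a lens.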
Managed Storage
Index with pluggable providers (SQLite, Redis, PostgreSQL) and data stores (local disk, S3). Schema auto-resolution on load.
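One way to picture a pluggable provider is as a small interface that any metadata backend can satisfy. The sketch below is an assumption-laden illustration using only `sqlite3` from the standard library: `IndexProvider`, `SQLiteProvider`, the table layout, and the method names are all hypothetical and do not reflect atdata's actual classes.

```python
import sqlite3
from typing import List, Optional, Protocol, Tuple


class IndexProvider(Protocol):
    """Hypothetical interface that any metadata backend would satisfy."""

    def register(self, name: str, schema: str, shards: List[str]) -> None: ...

    def lookup(self, name: str) -> Optional[Tuple[str, List[str]]]: ...


class SQLiteProvider:
    """Hypothetical SQLite-backed provider storing one row per dataset name."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS datasets "
            "(name TEXT PRIMARY KEY, schema TEXT, shards TEXT)"
        )

    def register(self, name: str, schema: str, shards: List[str]) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?)",
            (name, schema, "\n".join(shards)),
        )
        self.conn.commit()

    def lookup(self, name: str) -> Optional[Tuple[str, List[str]]]:
        row = self.conn.execute(
            "SELECT schema, shards FROM datasets WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            return None
        # Return the recorded schema name together with the shard list.
        return row[0], row[1].split("\n")


provider: IndexProvider = SQLiteProvider()
provider.register("training-v1", "ImageSample", ["data-000000.tar", "data-000001.tar"])
schema, shards = provider.lookup("training-v1")
```

Because the schema name is stored next to the shard list, a loader that looks a dataset up by name gets the sample type back without the caller passing it in. A Redis or PostgreSQL backend would implement the same two methods.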
ATProto Federation
Publish datasets to the decentralized AT Protocol network. Cross-organization discovery via your Bluesky identity.
Quick Example
Scaling Up
Managed storage with Index
index = atdata.Index(data_store=atdata.LocalDiskStore())
entry = index.write(samples, name="training-v1")
# Load by name; the schema is auto-resolved, so no sample_type is needed
ds = atdata.load_dataset("@local/training-v1", split="train")

Federation with ATProto
from atdata.atmosphere import Atmosphere
client = Atmosphere.login("handle.bsky.social", "app-password")
index = atdata.Index(atmosphere=client)
uri = index.promote_entry("training-v1")

The Architecture
┌─────────────────────────────────────────────────────────────┐
│  Federation: ATProto Atmosphere                             │
│  Decentralized discovery, cross-org sharing                 │
└─────────────────────────────────────────────────────────────┘
                              ↑ promote
┌─────────────────────────────────────────────────────────────┐
│  Managed Storage: Index + SQLite/Redis + Disk/S3            │
│  Schema registry, versioned datasets, persistent data       │
└─────────────────────────────────────────────────────────────┘
                              ↑ index.write
┌─────────────────────────────────────────────────────────────┐
│  Local Development                                          │
│  write_samples(), typed iteration, lenses                   │
└─────────────────────────────────────────────────────────────┘
Installation
pip install atdata
# With ATProto support
pip install atdata[atmosphere]

Next Steps
- Quick Start — Define samples, write datasets, iterate with batching
- Examples — Five end-to-end worked examples
- Local Workflow — Index-managed storage for teams
- Atmosphere Publishing — Publish to the ATProto network
- API Reference — Architecture overview and API docs
- Benchmarks — Performance benchmark results