atdata

A loose federation of distributed, typed datasets built on WebDataset

What is atdata?

atdata is a Python library for typed, serializable datasets. The same sample types and APIs carry a dataset from local development to managed storage to federated sharing.

Typed Samples

Define dataclass-based sample types with @packable. Automatic msgpack serialization and NDArray handling.

One-Call Writes

write_samples() handles sharding, key generation, and optional manifest sidecar files.
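
To make the sharding idea concrete, here is a minimal standard-library sketch of splitting records across tar files with generated keys. It is not atdata's implementation: the `write_shards` helper, the `data-000000.tar` naming, the zero-padded keys, and the JSON payloads are all assumptions chosen for illustration (atdata itself serializes with msgpack).

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shards(records, out_dir, prefix="data", maxcount=50):
    """Write dict records into tar shards holding at most `maxcount` entries.

    Hypothetical helper: shard names, keys, and JSON payloads are
    illustrative assumptions, not atdata's on-disk layout.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for i, record in enumerate(records):
        if i % maxcount == 0:
            # Close the previous shard (if any) and open the next one.
            if tar is not None:
                tar.close()
            path = out_dir / f"{prefix}-{i // maxcount:06d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        payload = json.dumps(record).encode("utf-8")
        info = tarfile.TarInfo(name=f"{i:08d}.json")  # generated sample key
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths


records = [{"label": "cat", "confidence": 0.95} for _ in range(120)]
with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(records, tmp, maxcount=50)
    names = [p.name for p in shards]
    counts = []
    for p in shards:
        with tarfile.open(p) as shard:
            counts.append(len(shard.getnames()))
# 120 records with maxcount=50 give three shards of 50, 50, and 20 entries.
```

Capping each shard at a fixed count keeps files small enough to stream and shuffle independently, which is the usual motivation for WebDataset-style layouts.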

Batch Aggregation

NDArray fields are stacked automatically. Scalar fields become lists. No manual collation code.

Lens Transformations

View datasets through different schemas without duplicating data. Lenses are bidirectional: a putter writes changes made in a view back to the source type.
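
The getter/putter pairing can be sketched in plain Python. This shows the general lens pattern only; the `Lens` class, its `get` and `put` fields, and the two sample types are hypothetical names for illustration and are not atdata's lens API.

```python
from dataclasses import dataclass, replace
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # source (stored) sample type
V = TypeVar("V")  # view sample type


@dataclass(frozen=True)
class Lens(Generic[S, V]):
    get: Callable[[S], V]     # project a source sample into the view
    put: Callable[[V, S], S]  # write an edited view back into the source


@dataclass(frozen=True)
class FullSample:
    label: str
    confidence: float
    annotator: str


@dataclass(frozen=True)
class LabelOnly:
    label: str


# The view is computed on demand, so no second copy of the data is stored.
label_lens = Lens(
    get=lambda s: LabelOnly(label=s.label),
    put=lambda v, s: replace(s, label=v.label),
)

src = FullSample(label="cat", confidence=0.95, annotator="alice")
view = label_lens.get(src)
updated = label_lens.put(LabelOnly(label="dog"), src)
```

A well-behaved lens round-trips: putting back an unchanged view returns the original source, and getting after a put returns the view that was put.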

Managed Storage

Index with pluggable providers (SQLite, Redis, PostgreSQL) and data stores (local disk, S3). Schema auto-resolution on load.

ATProto Federation

Publish datasets to the decentralized AT Protocol network. Cross-organization discovery via your Bluesky identity.

Quick Example

import numpy as np
from numpy.typing import NDArray
import atdata

@atdata.packable
class ImageSample:
    image: NDArray
    label: str
    confidence: float

# Write sharded tar files (returns a typed Dataset)
samples = [ImageSample(image=np.random.rand(224, 224, 3).astype(np.float32),
                        label="cat", confidence=0.95) for _ in range(100)]
ds = atdata.write_samples(samples, "data.tar", maxcount=50)

# Iterate with automatic batching
for batch in ds.shuffled(batch_size=32):
    images = batch.image          # (32, 224, 224, 3) numpy array
    labels = batch.label          # list of 32 strings
    break
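
The batch shapes in the loop above follow from a simple collation rule: array fields are stacked along a new leading axis, and every other field is gathered into a list. The sketch below reproduces that rule with NumPy and a plain dataclass. The `collate` function is a hypothetical stand-in, not atdata's code, and it returns a dict rather than the attribute-style batch object shown above.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class ImageSample:
    image: np.ndarray
    label: str
    confidence: float


def collate(samples):
    """Stack ndarray fields along a new first axis; gather other fields into lists."""
    batch = {}
    for f in fields(samples[0]):
        values = [getattr(s, f.name) for s in samples]
        if isinstance(values[0], np.ndarray):
            batch[f.name] = np.stack(values)  # requires identical per-sample shapes
        else:
            batch[f.name] = values
    return batch


samples = [
    ImageSample(
        image=np.zeros((224, 224, 3), dtype=np.float32),
        label="cat",
        confidence=0.95,
    )
    for _ in range(32)
]
batch = collate(samples)
# batch["image"].shape == (32, 224, 224, 3); batch["label"] is a list of 32 strings
```

Note that `np.stack` raises a `ValueError` if the arrays in a batch differ in shape, so variable-sized inputs would need padding or resizing first.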

Scaling Up

Managed storage with Index

index = atdata.Index(data_store=atdata.LocalDiskStore())
entry = index.write(samples, name="training-v1")

# Load by name; the schema is auto-resolved, so no sample_type is needed
ds = atdata.load_dataset("@local/training-v1", split="train")

Federation with ATProto

from atdata.atmosphere import Atmosphere

client = Atmosphere.login("handle.bsky.social", "app-password")
index = atdata.Index(atmosphere=client)
uri = index.promote_entry("training-v1")

The Architecture

┌─────────────────────────────────────────────────────────────┐
│  Federation: ATProto Atmosphere                             │
│  Decentralized discovery, cross-org sharing                 │
└─────────────────────────────────────────────────────────────┘
                              ↑ promote
┌─────────────────────────────────────────────────────────────┐
│  Managed Storage: Index + SQLite/Redis + Disk/S3            │
│  Schema registry, versioned datasets, persistent data       │
└─────────────────────────────────────────────────────────────┘
                              ↑ index.write
┌─────────────────────────────────────────────────────────────┐
│  Local Development                                          │
│  write_samples(), typed iteration, lenses                   │
└─────────────────────────────────────────────────────────────┘

Installation

pip install atdata

# With ATProto support
pip install "atdata[atmosphere]"

Next Steps