Promotion Workflow

Migrate datasets from local storage to ATProto

This tutorial demonstrates the workflow for migrating datasets from local Index-managed storage to the federated ATProto network (the "atmosphere"). Promotion is the bridge between Layer 2 (managed storage) and Layer 3 (federation).

Why Promotion?

A common pattern in data science:

  1. Start private: Develop and validate datasets within your team
  2. Go public: Share successful datasets with the broader community

Promotion handles this transition without re-processing your data. Instead of creating a new dataset from scratch, you’re lifting an existing local dataset entry into the federated atmosphere.

The workflow handles several complexities automatically:

  • Schema deduplication: If you’ve already published the same schema type and version, promotion reuses it
  • URL preservation: Data stays in place (unless you explicitly want to copy it)
  • CID consistency: Content identifiers remain valid across the transition
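
The CID invariant can be illustrated with a simplified sketch. Real ATProto CIDs are CIDv1 multihashes rather than raw SHA-256 hex digests, but the principle is the same: the identifier is derived from the content, not from where it is stored, so moving a record between layers cannot change it.

```python
import hashlib

def content_id(payload: bytes) -> str:
    # Simplified stand-in for a real CID: hash the *content* only.
    # The identifier depends on the bytes, never on the storage location.
    return hashlib.sha256(payload).hexdigest()

shard = b"packed sample bytes"
local_cid = content_id(shard)      # computed while in local storage
promoted_cid = content_id(shard)   # recomputed after promotion

assert local_cid == promoted_cid   # same bytes => same identifier
```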

Overview

The promotion workflow moves datasets from local storage to the atmosphere:

LOCAL                           ATMOSPHERE
-----                           ----------
Index (SQLite/Redis)            ATProto PDS
LocalDiskStore / S3   -->       (same storage or new location)
atdata://local/schema/...       at://did:plc:.../schema/...

Key features:

  • Schema deduplication: Won’t republish identical schemas
  • Flexible data handling: Keep existing URLs or copy to new storage
  • Metadata preservation: Local metadata carries over to atmosphere
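
For intuition, the URI transition in the diagram above follows the standard AT URI shape, at://<did>/<collection>/<rkey>. A hypothetical helper sketches the mapping; the collection name and record key below are placeholders, not atdata's actual values:

```python
def promoted_uri(did: str, collection: str, rkey: str) -> str:
    # Build the at:// URI a promoted record would get. The authority
    # becomes the publisher's DID instead of the local index.
    return f"at://{did}/{collection}/{rkey}"

# Placeholder collection/rkey, for illustration only:
uri = promoted_uri("did:plc:abc123", "example.schema", "experiment-2024-001")
print(uri)  # at://did:plc:abc123/example.schema/experiment-2024-001
```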

Setup

import numpy as np
from numpy.typing import NDArray
import atdata
from atdata.atmosphere import Atmosphere

Prepare a Local Dataset

First, set up a dataset in local storage using index.write():

# 1. Define sample type
@atdata.packable
class ExperimentSample:
    """A sample from a scientific experiment."""
    measurement: NDArray
    timestamp: float
    sensor_id: str

# 2. Create samples
samples = [
    ExperimentSample(
        measurement=np.random.randn(64).astype(np.float32),
        timestamp=float(i),
        sensor_id=f"sensor_{i % 4}",
    )
    for i in range(1000)
]

# 3. Write through index (handles sharding, schema, and storage)
index = atdata.Index(data_store=atdata.LocalDiskStore())
local_entry = index.write(samples, name="experiment-2024-001", maxcount=500)

print(f"Local entry name: {local_entry.name}")
print(f"Local entry CID: {local_entry.cid}")
print(f"Data URLs: {local_entry.data_urls}")

Basic Promotion

Promote the dataset to ATProto using index.promote_entry():

# Connect to atmosphere and attach to the index
client = Atmosphere.login("myhandle.bsky.social", "app-password")
index = atdata.Index(atmosphere=client, data_store=atdata.LocalDiskStore())

# Promote by entry name
at_uri = index.promote_entry("experiment-2024-001")
print(f"Published: {at_uri}")

Promotion with Metadata

Add description, tags, and license:

at_uri = index.promote_entry(
    "experiment-2024-001",
    name="experiment-2024-001-v2",   # Override name
    description="Sensor measurements from Lab 302",
    tags=["experiment", "physics", "2024"],
    license="CC-BY-4.0",
)
print(f"Published with metadata: {at_uri}")

Schema Deduplication

The promotion workflow automatically checks for existing schemas on the atmosphere. When you promote multiple datasets with the same sample type, the schema is only published once:

# First promotion: publishes schema to atmosphere
uri1 = index.promote_entry("experiment-batch-1")

# Second promotion with same schema type + version: reuses existing schema
uri2 = index.promote_entry("experiment-batch-2")
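
Conceptually, the deduplication logic reduces to a registry keyed by schema type and version. A minimal sketch in plain Python (the function names here are illustrative, not part of the atdata API):

```python
# Registry of already-published schemas, keyed by (type name, version).
published: dict[tuple[str, int], str] = {}

def publish_schema(name: str, version: int, make_uri) -> str:
    key = (name, version)
    if key not in published:
        # First promotion with this type + version: actually publish.
        published[key] = make_uri(name, version)
    # Later promotions reuse the existing URI instead of republishing.
    return published[key]

uri1 = publish_schema("ExperimentSample", 1,
                      lambda n, v: f"at://did:plc:me/schema/{n.lower()}-{v}")
uri2 = publish_schema("ExperimentSample", 1,
                      lambda n, v: f"at://did:plc:me/schema/{n.lower()}-{v}-NEW")
assert uri1 == uri2   # second call reused the existing schema record
```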

Data Migration Options

By default, promotion keeps the original data URLs:

# Data stays in original storage location
at_uri = index.promote_entry("experiment-2024-001")

Benefits:

  • Fastest option: no data is copied
  • Dataset record points to the existing URLs

Caveat: the original storage location must remain accessible to anyone loading the dataset.
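
Before relying on URL preservation, it can be worth verifying that the entry's URLs still resolve. A minimal sketch for file-backed shards only (this is not an atdata API; remote schemes such as s3:// or https:// would need their own probes):

```python
from pathlib import Path
from urllib.parse import urlparse

def urls_still_reachable(data_urls: list[str]) -> bool:
    # Illustrative pre-promotion check: confirm local/file URLs exist.
    # Non-file schemes are skipped here and assumed reachable.
    for url in data_urls:
        parsed = urlparse(url)
        if parsed.scheme in ("", "file"):
            if not Path(parsed.path).exists():
                return False
    return True
```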

To copy data to a different storage location, use promote_dataset() with a Dataset loaded from the entry’s URLs:

# Load the dataset and promote directly
entry = index.get_entry_by_name("experiment-2024-001")
ds = atdata.Dataset[ExperimentSample](entry.data_urls[0])

at_uri = index.promote_dataset(
    ds,
    name="experiment-2024-001",
    description="Sensor measurements from Lab 302",
)

Benefits:

  • Data is copied to the new storage location
  • Good for moving from private to public storage
  • Original storage can be retired

Verify on Atmosphere

After promotion, verify the dataset is accessible:

entry = index.get_dataset(at_uri)

print(f"Name: {entry.name}")
print(f"Schema: {entry.schema_ref}")
print(f"URLs: {entry.data_urls}")

# Load and iterate — schema auto-resolved
ds = atdata.load_dataset(at_uri, split="train")

for batch in ds.ordered(batch_size=32):
    print(f"Measurement shape: {batch.measurement.shape}")
    break

Error Handling

try:
    at_uri = index.promote_entry("experiment-2024-001")
except KeyError as e:
    # Entry or schema not found in index
    print(f"Not found: {e}")
except ValueError as e:
    # Entry has no data URLs or atmosphere not available
    print(f"Invalid state: {e}")

Requirements Checklist

Before promotion:

  • The entry exists in the local index (otherwise promote_entry() raises KeyError)
  • The entry has at least one data URL (otherwise a ValueError is raised)
  • The index was constructed with an Atmosphere client (otherwise a ValueError is raised)
  • If you keep the original URLs, the storage behind them remains accessible to consumers

Complete Workflow

# Complete local-to-atmosphere workflow
import numpy as np
from numpy.typing import NDArray
import atdata
from atdata.atmosphere import Atmosphere

# 1. Define sample type
@atdata.packable
class FeatureSample:
    features: NDArray
    label: int

# 2. Create samples
samples = [
    FeatureSample(
        features=np.random.randn(128).astype(np.float32),
        label=i % 10,
    )
    for i in range(1000)
]

# 3. Write through index (schema persisted automatically)
index = atdata.Index(data_store=atdata.LocalDiskStore())
entry = index.write(samples, name="feature-vectors-v1", maxcount=500)

# 4. Promote to atmosphere
client = Atmosphere.login("myhandle.bsky.social", "app-password")
index = atdata.Index(atmosphere=client, data_store=atdata.LocalDiskStore())

at_uri = index.promote_entry(
    "feature-vectors-v1",
    description="Feature vectors for classification",
    tags=["features", "embeddings"],
    license="MIT",
)

print(f"Dataset published: {at_uri}")

# 5. Others can now discover and load
# ds = atdata.load_dataset("@myhandle.bsky.social/feature-vectors-v1", split="train")

What You’ve Learned

You now understand the promotion workflow:

  Concept                   Purpose
  -------                   -------
  index.promote_entry()     Lift local entries to the federated network by name
  index.promote_dataset()   Promote a Dataset object directly
  Schema deduplication      Avoid publishing duplicate schemas
  Data URL preservation     Keep data in place or copy to new storage
  Metadata enrichment       Add description, tags, license during promotion

Promotion completes atdata’s three-layer story: you can now move seamlessly from local experimentation to team collaboration to public sharing, all with the same typed sample definitions.

The Complete Journey

┌──────────────────┐   index.write  ┌──────────────────┐   promote    ┌──────────────────┐
│  Local Files     │ ────────────→  │  Managed Storage │ ───────────→ │  Federation      │
│                  │                │                  │              │                  │
│  write_samples() │                │  Index (SQLite)  │              │  Index +         │
│  Dataset[T]      │                │  LocalDiskStore  │              │  Atmosphere      │
└──────────────────┘                └──────────────────┘              └──────────────────┘

Next Steps