local.Index

local.Index(
    provider=None,
    *,
    path=None,
    dsn=None,
    redis=None,
    data_store=None,
    repos=None,
    atmosphere=_ATMOSPHERE_DEFAULT,
    auto_stubs=False,
    stub_dir=None,
    **kwargs,
)

Unified index for tracking datasets across multiple repositories.

Implements the AbstractIndex protocol. Maintains a registry of dataset entries across named repositories (always including a built-in "local" repository) and an optional atmosphere (ATProto) backend.

The "local" repository is always present and uses the storage backend determined by the provider argument. When no provider is given, defaults to SQLite (zero external dependencies). Pass a redis connection or Redis **kwargs for backwards-compatible Redis behaviour.

Additional named repositories can be mounted via the repos parameter, each pairing an IndexProvider with an optional data store.

An atmosphere backend is available by default for anonymous read-only resolution of @handle/dataset paths. Pass an authenticated client to enable write operations, or atmosphere=None to disable it.
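Examples

A minimal sketch of common construction patterns, based only on the defaults described above:

>>> # Default: SQLite-backed local repository, anonymous atmosphere reads
>>> index = Index()
>>>
>>> # Disable the atmosphere backend entirely (fully offline use)
>>> offline = Index(atmosphere=None)
>>>
>>> # Enable stub generation for IDE autocomplete on decoded schemas
>>> index = Index(auto_stubs=True)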

Attributes

Name Type Description
_repos dict[str, _Repo] All repositories keyed by name. "local" is always present.
_atmosphere _AtmosphereBackend | None Optional atmosphere backend for ATProto operations.

Methods

Name Description
add_entry Add a dataset to the local repository index.
clear_stubs Remove all auto-generated stub files.
decode_schema Reconstruct a Python PackableSample type from a stored schema.
decode_schema_as Decode a schema with explicit type hint for IDE support.
get_dataset Get a dataset entry by name or prefixed reference.
get_entry Get an entry by its CID.
get_entry_by_name Get an entry by its human-readable name.
get_import_path Get the import path for a schema’s generated module.
get_schema Get a schema record by reference (AbstractIndex protocol).
get_schema_record Get a schema record as LocalSchemaRecord object.
insert_dataset Insert a dataset into the index (AbstractIndex protocol).
list_datasets Get dataset entries as a materialized list (AbstractIndex protocol).
list_entries Get all index entries as a materialized list.
list_schemas Get all schema records as a materialized list (AbstractIndex protocol).
load_schema Load a schema and make it available in the types namespace.
promote_dataset Publish a Dataset directly to the atmosphere.
promote_entry Promote a locally-indexed dataset to the atmosphere.
publish_schema Publish a schema for a sample type to the index's storage backend.
write Write samples and create an index entry in one step.

add_entry

local.Index.add_entry(ds, *, name, schema_ref=None, metadata=None)

Add a dataset to the local repository index.

Parameters

Name Type Description Default
ds Dataset The dataset to add to the index. required
name str Human-readable name for the dataset. required
schema_ref str | None Optional schema reference. If None, generates from sample type. None
metadata dict | None Optional metadata dictionary. If None, uses ds._metadata if available. None

Returns

Name Type Description
LocalDatasetEntry The created LocalDatasetEntry object.
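Examples

A hedged sketch of typical usage; ds stands in for an existing Dataset object:

>>> entry = index.add_entry(ds, name="mnist-train")
>>> # Attach explicit metadata instead of relying on ds._metadata
>>> entry = index.add_entry(ds, name="mnist-train", metadata={"split": "train"})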

clear_stubs

local.Index.clear_stubs()

Remove all auto-generated stub files.

Only works if auto_stubs was enabled when creating the Index.

Returns

Name Type Description
int Number of stub files removed, or 0 if auto_stubs is disabled.

decode_schema

local.Index.decode_schema(ref)

Reconstruct a Python PackableSample type from a stored schema.

This method enables loading datasets without knowing the sample type ahead of time. The index retrieves the schema record and dynamically generates a PackableSample subclass matching the schema definition.

If auto_stubs is enabled, a Python module will be generated and the class will be imported from it, providing full IDE autocomplete support. The returned class has proper type information that IDEs can understand.

Parameters

Name Type Description Default
ref str Schema reference string (atdata://local/schema/… or legacy local://schemas/…). required

Returns

Name Type Description
Type[Packable] A PackableSample subclass, either imported from a generated module (if auto_stubs is enabled) or dynamically created.

Raises

Name Type Description
KeyError If schema not found.
ValueError If schema cannot be decoded.
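Examples

An illustrative sketch; the reference string and field names (text, value) are assumptions borrowed from the decode_schema_as example below:

>>> MyType = index.decode_schema("atdata://local/schema/MySample@1.0.0")
>>> sample = MyType(text="hello", value=42)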

decode_schema_as

local.Index.decode_schema_as(ref, type_hint)

Decode a schema with explicit type hint for IDE support.

This is a typed wrapper around decode_schema() that preserves the type information for IDE autocomplete. Use this when you have a stub file for the schema and want full IDE support.

Parameters

Name Type Description Default
ref str Schema reference string. required
type_hint type[T] The stub type to use for type hints. Import this from the generated stub file. required

Returns

Name Type Description
type[T] The decoded type, cast to match the type_hint for IDE support.

Examples

>>> # After enabling auto_stubs and configuring IDE extraPaths:
>>> from local.MySample_1_0_0 import MySample
>>>
>>> # This gives full IDE autocomplete:
>>> DecodedType = index.decode_schema_as(ref, MySample)
>>> sample = DecodedType(text="hello", value=42)  # IDE knows signature!

Note

The type_hint is only used for static type checking - at runtime, the actual decoded type from the schema is returned. Ensure the stub matches the schema to avoid runtime surprises.

get_dataset

local.Index.get_dataset(ref)

Get a dataset entry by name or prefixed reference.

Supports repository-prefixed lookups (e.g. "lab/mnist"), atmosphere paths ("@handle/dataset"), AT URIs, and bare names (which default to the "local" repository).

Parameters

Name Type Description Default
ref str Dataset name, prefixed name, or AT URI. required

Returns

Name Type Description
'IndexEntry' IndexEntry for the dataset.

Raises

Name Type Description
KeyError If dataset not found.
ValueError If the atmosphere backend is required but unavailable.
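Examples

A sketch of the supported reference forms; the repository name "lab" and the handle are hypothetical:

>>> entry = index.get_dataset("mnist")        # bare name: "local" repository
>>> entry = index.get_dataset("lab/mnist")    # repository-prefixed lookup
>>> entry = index.get_dataset("@alice.example.com/mnist")  # atmosphere path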

get_entry

local.Index.get_entry(cid)

Get an entry by its CID.

Parameters

Name Type Description Default
cid str Content identifier of the entry. required

Returns

Name Type Description
LocalDatasetEntry LocalDatasetEntry for the given CID.

Raises

Name Type Description
KeyError If entry not found.

get_entry_by_name

local.Index.get_entry_by_name(name)

Get an entry by its human-readable name.

Parameters

Name Type Description Default
name str Human-readable name of the entry. required

Returns

Name Type Description
LocalDatasetEntry LocalDatasetEntry with the given name.

Raises

Name Type Description
KeyError If no entry with that name exists.

get_import_path

local.Index.get_import_path(ref)

Get the import path for a schema’s generated module.

When auto_stubs is enabled, this returns the import path that can be used to import the schema type with full IDE support.

Parameters

Name Type Description Default
ref str Schema reference string. required

Returns

Name Type Description
str | None Import path like "local.MySample_1_0_0", or None if auto_stubs is disabled.

Examples

>>> index = Index(auto_stubs=True)
>>> ref = index.publish_schema(MySample, version="1.0.0")
>>> index.load_schema(ref)
>>> print(index.get_import_path(ref))
local.MySample_1_0_0
>>> # Then in your code:
>>> # from local.MySample_1_0_0 import MySample

get_schema

local.Index.get_schema(ref)

Get a schema record by reference (AbstractIndex protocol).

Parameters

Name Type Description Default
ref str Schema reference string. Supports both new format (atdata://local/schema/{name}@version) and legacy format (local://schemas/{module.Class}@version). required

Returns

Name Type Description
dict Schema record as a dictionary with keys ‘name’, ‘version’, ‘fields’, ‘$ref’, etc.

Raises

Name Type Description
KeyError If schema not found.
ValueError If reference format is invalid.

get_schema_record

local.Index.get_schema_record(ref)

Get a schema record as LocalSchemaRecord object.

Use this when you need the full LocalSchemaRecord with typed properties. For Protocol-compliant dict access, use get_schema() instead.

Parameters

Name Type Description Default
ref str Schema reference string. required

Returns

Name Type Description
LocalSchemaRecord LocalSchemaRecord with schema details.

Raises

Name Type Description
KeyError If schema not found.
ValueError If reference format is invalid.

insert_dataset

local.Index.insert_dataset(ds, *, name, schema_ref=None, **kwargs)

Insert a dataset into the index (AbstractIndex protocol).

The target repository is determined by a prefix in the name argument (e.g. "lab/mnist"). If no prefix is given, or the prefix is "local", the built-in local repository is used.

If the target repository has a data_store, shards are written to storage first, then indexed. Otherwise, the dataset’s existing URL is indexed directly.

Parameters

Name Type Description Default
ds Dataset The Dataset to register. required
name str Human-readable name for the dataset, optionally prefixed with a repository name (e.g. "lab/mnist"). required
schema_ref str | None Optional schema reference. None
**kwargs Additional options: metadata (optional metadata dict), prefix (storage prefix; default: dataset name), cache_local (if True, cache writes locally first). {}

Returns

Name Type Description
'IndexEntry' IndexEntry for the inserted dataset.
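Examples

A hedged sketch; ds is an existing Dataset and the repository name "lab" is hypothetical:

>>> # No prefix: registers in the built-in local repository
>>> entry = index.insert_dataset(ds, name="mnist")
>>> # Prefixed name targets a mounted repository
>>> entry = index.insert_dataset(ds, name="lab/mnist", metadata={"split": "train"})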

list_datasets

local.Index.list_datasets(repo=None)

Get dataset entries as a materialized list (AbstractIndex protocol).

Parameters

Name Type Description Default
repo str | None Optional repository filter. If None, aggregates entries from "local" and all named repositories. Use "local" for only the built-in repository, a named repo key, or "_atmosphere" for atmosphere entries. None

Returns

Name Type Description
list[IndexEntry] List of IndexEntry objects, one per dataset.
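Examples

A sketch of the repo filter values described above:

>>> everything = index.list_datasets()              # all repositories
>>> local_only = index.list_datasets(repo="local")  # built-in repository
>>> atmos = index.list_datasets(repo="_atmosphere") # atmosphere entries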

list_entries

local.Index.list_entries()

Get all index entries as a materialized list.

Returns

Name Type Description
list[LocalDatasetEntry] List of all LocalDatasetEntry objects in the index.

list_schemas

local.Index.list_schemas()

Get all schema records as a materialized list (AbstractIndex protocol).

Returns

Name Type Description
list[dict] List of schema records as dictionaries.

load_schema

local.Index.load_schema(ref)

Load a schema and make it available in the types namespace.

This method decodes the schema, optionally generates a Python module for IDE support (if auto_stubs is enabled), and registers the type in the types namespace (index.types) for easy access.

Parameters

Name Type Description Default
ref str Schema reference string (atdata://local/schema/… or legacy local://schemas/…). required

Returns

Name Type Description
Type[Packable] The decoded PackableSample subclass. Also available via index.types.<ClassName> after this call.

Raises

Name Type Description
KeyError If schema not found.
ValueError If schema cannot be decoded.

Examples

>>> # Load and use immediately
>>> MyType = index.load_schema("atdata://local/schema/MySample@1.0.0")
>>> sample = MyType(field1="hello", field2=42)
>>>
>>> # Or access later via namespace
>>> index.load_schema("atdata://local/schema/OtherType@1.0.0")
>>> other = index.types.OtherType(data="test")

promote_dataset

local.Index.promote_dataset(
    dataset,
    *,
    name,
    sample_type=None,
    schema_version='1.0.0',
    description=None,
    tags=None,
    license=None,
)

Publish a Dataset directly to the atmosphere.

Publishes the schema (with deduplication) and creates a dataset record on ATProto. Uses the index’s atmosphere backend.

Parameters

Name Type Description Default
dataset Dataset The Dataset to publish. required
name str Name for the atmosphere dataset record. required
sample_type type | None Sample type for schema publishing. Inferred from dataset.sample_type if not provided. None
schema_version str Semantic version for the schema. Default: "1.0.0". '1.0.0'
description str | None Optional description for the dataset. None
tags list[str] | None Optional tags for discovery. None
license str | None Optional license identifier. None

Returns

Name Type Description
str AT URI of the created atmosphere dataset record.

Raises

Name Type Description
ValueError If atmosphere backend is not available.

Examples

>>> index = Index(atmosphere=client)
>>> ds = atdata.load_dataset("./data.tar", MySample, split="train")
>>> uri = index.promote_dataset(ds, name="my-dataset")

promote_entry

local.Index.promote_entry(
    entry_name,
    *,
    name=None,
    description=None,
    tags=None,
    license=None,
)

Promote a locally-indexed dataset to the atmosphere.

Looks up the entry by name in the local index, resolves its schema, and publishes both schema and dataset record to ATProto via the index’s atmosphere backend.

Parameters

Name Type Description Default
entry_name str Name of the local dataset entry to promote. required
name str | None Override name for the atmosphere record. Defaults to the local entry name. None
description str | None Optional description for the dataset. None
tags list[str] | None Optional tags for discovery. None
license str | None Optional license identifier. None

Returns

Name Type Description
str AT URI of the created atmosphere dataset record.

Raises

Name Type Description
ValueError If atmosphere backend is not available, or the local entry has no data URLs.
KeyError If the entry or its schema is not found.

Examples

>>> index = Index(atmosphere=client)
>>> uri = index.promote_entry("mnist-train")

publish_schema

local.Index.publish_schema(sample_type, *, version=None, description=None)

Publish a schema for a sample type to the index's storage backend.

Parameters

Name Type Description Default
sample_type type A Packable type (@packable-decorated or PackableSample subclass). required
version str | None Semantic version string (e.g., ‘1.0.0’). If None, auto-increments from the latest published version (patch bump), or starts at ‘1.0.0’ if no previous version exists. None
description str | None Optional human-readable description. If None, uses the class docstring. None

Returns

Name Type Description
str Schema reference string: ‘atdata://local/schema/{name}@version’.

Raises

Name Type Description
ValueError If sample_type is not a dataclass.
TypeError If sample_type doesn’t satisfy the Packable protocol, or if a field type is not supported.
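Examples

An illustrative sketch; MySample stands in for any @packable-decorated type:

>>> ref = index.publish_schema(MySample, version="1.0.0")
>>> # Omitting version auto-increments a patch bump from the latest release
>>> ref2 = index.publish_schema(MySample)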

write

local.Index.write(
    samples,
    *,
    name,
    schema_ref=None,
    description=None,
    tags=None,
    license=None,
    maxcount=10000,
    maxsize=None,
    metadata=None,
    manifest=False,
)

Write samples and create an index entry in one step.

This is the primary method for publishing data. It serializes samples to WebDataset tar files, stores them via the appropriate backend, and creates an index entry.

The target backend is determined by the name prefix:

  • Bare name (e.g., "mnist"): writes to the local repository.
  • "@handle/name": writes and publishes to the atmosphere.
  • "repo/name": writes to a named repository.

When the local backend has no data_store configured, a LocalDiskStore is created automatically at ~/.atdata/data/ so that samples have persistent storage.

Note

This method is synchronous. Samples are written to a temporary location first, then copied to permanent storage by the backend. Avoid passing lazily-evaluated iterators that depend on external state that may change during the call.

Parameters

Name Type Description Default
samples Iterable Iterable of Packable samples. Must be non-empty. required
name str Dataset name, optionally prefixed with target. required
schema_ref str | None Optional schema reference. Auto-generated if None. None
description str | None Optional dataset description (atmosphere only). None
tags list[str] | None Optional tags for discovery (atmosphere only). None
license str | None Optional license identifier (atmosphere only). None
maxcount int Max samples per shard. Default: 10,000. 10000
maxsize int | None Max bytes per shard. Default: None. None
metadata dict | None Optional metadata dict stored with the entry. None
manifest bool If True, write per-shard manifest sidecar files alongside each tar. Default: False. False

Returns

Name Type Description
IndexEntry IndexEntry for the created dataset.

Raises

Name Type Description
ValueError If samples is empty.

Examples

>>> index = Index()
>>> samples = [MySample(key="0", text="hello")]
>>> entry = index.write(samples, name="my-dataset")