DataSource

DataSource()

Protocol for data sources that stream shard data to Dataset.

Implementations (URLSource, S3Source, BlobSource) yield (identifier, stream) pairs that are fed directly to WebDataset’s tar expander, bypassing its URL resolution. This makes it possible to stream from private S3 buckets, custom endpoints, and ATProto blobs.

Examples

>>> source = S3Source(bucket="my-bucket", keys=["data-000.tar"])
>>> ds = Dataset[MySample](source)
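
The protocol surface described on this page (a lazy shards attribute plus list_shards and open_shard) can be sketched with a small in-memory source. This is only an illustration under stated assumptions: DataSourceLike and InMemorySource are hypothetical names, the type annotations are guesses, and nothing here is taken from the library's actual implementation.

```python
import io
from typing import IO, Iterator, Protocol, runtime_checkable


@runtime_checkable
class DataSourceLike(Protocol):
    """Hypothetical stand-in mirroring the members documented for DataSource."""

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]: ...

    def list_shards(self) -> list[str]: ...

    def open_shard(self, shard_id: str) -> IO[bytes]: ...


class InMemorySource:
    """Illustrative source that serves shard bytes from a dict (an assumption)."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        # Streams are created one at a time, only as the caller iterates.
        for shard_id in self._blobs:
            yield shard_id, self.open_shard(shard_id)

    def list_shards(self) -> list[str]:
        # Only identifiers are returned; no stream is opened here.
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # The dict lookup raises KeyError for an unknown shard_id.
        return io.BytesIO(self._blobs[shard_id])
```

Because the check is structural, InMemorySource does not need to inherit from anything; any object exposing these three members would pass the same isinstance test against the hypothetical protocol.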

Attributes

Name Description
shards Lazily yield (shard_id, stream) pairs for each shard.

Methods

Name Description
list_shards Return the shard identifiers without opening any streams.
open_shard Open a single shard for random access (e.g., DataLoader splitting).

list_shards

DataSource.list_shards()

Return the shard identifiers without opening any streams.
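
Since list_shards returns identifiers only, one plausible use is deciding which worker handles which shard before any data is read. The sketch below assumes the identifiers come back as a list of strings; shards_for_worker is a hypothetical helper, not part of the documented API.

```python
def shards_for_worker(shard_ids: list[str], worker_id: int, num_workers: int) -> list[str]:
    """Round-robin split of shard identifiers across workers (illustrative only)."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must satisfy 0 <= worker_id < num_workers")
    # Worker k takes shards k, k + num_workers, k + 2 * num_workers, ...
    return shard_ids[worker_id::num_workers]


# Example with made-up identifiers; in practice they would come from list_shards().
all_ids = ["data-000.tar", "data-001.tar", "data-002.tar", "data-003.tar", "data-004.tar"]
mine = shards_for_worker(all_ids, worker_id=1, num_workers=2)
```

Each worker could then pass its own identifiers to open_shard, so no worker opens a stream it does not need.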

open_shard

DataSource.open_shard(shard_id)

Open a single shard for random access (e.g., DataLoader splitting).

Raises

Name Type Description
KeyError If shard_id is not one of the identifiers returned by list_shards().
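
A caller that opens shards by identifier may want to handle the KeyError case explicitly. The following is a minimal sketch, assuming open_shard returns a binary file-like object containing a tar archive; DictSource, make_tar, and member_names are hypothetical names used only for this illustration.

```python
import io
import tarfile


class DictSource:
    """Hypothetical source backed by a dict of shard_id -> tar bytes."""

    def __init__(self, blobs: dict[str, bytes]) -> None:
        self._blobs = dict(blobs)

    def list_shards(self) -> list[str]:
        return list(self._blobs)

    def open_shard(self, shard_id: str) -> io.BytesIO:
        # Matches the documented contract: unknown identifiers raise KeyError.
        if shard_id not in self._blobs:
            raise KeyError(shard_id)
        return io.BytesIO(self._blobs[shard_id])


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_names(source: DictSource, shard_id: str) -> list[str]:
    """List the tar members of one shard, or [] if the shard is unknown."""
    try:
        stream = source.open_shard(shard_id)
    except KeyError:
        return []
    with stream, tarfile.open(fileobj=stream, mode="r:") as tar:
        return tar.getnames()


source = DictSource({"data-000.tar": make_tar({"sample.txt": b"hello"})})
```

Returning an empty list for an unknown shard is just one design choice; re-raising with a message that lists the valid identifiers would be equally reasonable.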