DataSource
DataSource()
Protocol for data sources that stream shard data to Dataset.
Implementations (URLSource, S3Source, BlobSource) yield (identifier, stream) pairs that are fed directly to WebDataset’s tar expander, bypassing URL resolution. This enables streaming from private S3 buckets, custom endpoints, and ATProto blobs.
Examples
>>> source = S3Source(bucket="my-bucket", keys=["data-000.tar"])
>>> ds = Dataset[MySample](source)
Attributes
| Name | Description |
|---|---|
| shards | Lazily yield (shard_id, stream) pairs for each shard. |
Methods
| Name | Description |
|---|---|
| list_shards | Shard identifiers without opening streams. |
| open_shard | Open a single shard for random access (e.g., DataLoader splitting). |
list_shards
DataSource.list_shards()
Shard identifiers without opening streams.
open_shard
DataSource.open_shard(shard_id)
Open a single shard for random access (e.g., DataLoader splitting).
Raises
| Name | Type | Description |
|---|---|---|
| | KeyError | If shard_id is not in list_shards(). |
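Custom sources
The three members documented above can be sketched as a minimal custom source that serves tar shards from memory. InMemorySource and make_tar_bytes are hypothetical names used only for illustration. The example also assumes that shards is a read-only property, that list_shards returns a list of strings, and that streams are binary file-like objects; the real protocol's signatures may differ.

```python
import io
import tarfile
from typing import IO, Iterator


def make_tar_bytes(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory from {member_name: payload}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


class InMemorySource:
    """Hypothetical source holding tar shards as bytes (illustration only)."""

    def __init__(self, shards: dict[str, bytes]) -> None:
        self._shards = dict(shards)

    @property
    def shards(self) -> Iterator[tuple[str, IO[bytes]]]:
        # Generator: each stream is opened only when the consumer reaches it.
        for shard_id in self._shards:
            yield shard_id, self.open_shard(shard_id)

    def list_shards(self) -> list[str]:
        # Identifiers only; no streams are opened here.
        return list(self._shards)

    def open_shard(self, shard_id: str) -> IO[bytes]:
        # Mirror the documented contract: unknown identifiers raise KeyError.
        if shard_id not in self._shards:
            raise KeyError(shard_id)
        return io.BytesIO(self._shards[shard_id])


source = InMemorySource(
    {
        "data-000.tar": make_tar_bytes({"sample0.txt": b"hello"}),
        "data-001.tar": make_tar_bytes({"sample1.txt": b"world"}),
    }
)

# Sequential streaming: iterate the lazy (shard_id, stream) pairs.
for shard_id, stream in source.shards:
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        member_names = [member.name for member in tar]

# Random access: a worker picks its own shards by identifier.
# Round-robin assignment is an assumption, not the library's splitting logic.
worker_id, num_workers = 1, 2
my_shard_ids = source.list_shards()[worker_id::num_workers]
my_streams = [source.open_shard(shard_id) for shard_id in my_shard_ids]
```

Separating list_shards from open_shard lets a caller divide shard identifiers among workers without opening any stream until a worker actually needs it.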