DatasetLoader
atmosphere.DatasetLoader(client)
Loads dataset records from ATProto.
This class fetches dataset index records and can create Dataset objects from them. Creating a Dataset with to_dataset requires the Python class that corresponds to the record's sample type.
Examples
>>> atmo = Atmosphere.login("handle", "password")
>>> loader = DatasetLoader(atmo)
>>>
>>> # List available datasets
>>> datasets = loader.list_all()
>>> for ds in datasets:
...     print(ds["name"], ds["schemaRef"])
>>>
>>> # Get a specific dataset record
>>> record = loader.get("at://did:plc:abc/ac.foundation.dataset.record/xyz")
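The AT URI in the example above follows the general at://authority/collection/rkey layout. As a minimal sketch of how such a URI breaks down, the helper below splits it into its three parts. `ParsedAtUri` and `parse_at_uri` are hypothetical names used only for illustration; they are not part of the atmosphere API, which accepts either a `str` or an `AtUri` directly.

```python
from typing import NamedTuple


class ParsedAtUri(NamedTuple):
    """Hypothetical helper type; not part of the atmosphere API."""

    authority: str   # DID (or handle) of the repository
    collection: str  # NSID of the record collection
    rkey: str        # record key within the collection


def parse_at_uri(uri: str) -> ParsedAtUri:
    """Split an AT URI of the form at://authority/collection/rkey."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an AT URI: {uri!r}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected at://authority/collection/rkey, got {uri!r}")
    return ParsedAtUri(*parts)


parsed = parse_at_uri("at://did:plc:abc/ac.foundation.dataset.record/xyz")
print(parsed.authority, parsed.collection, parsed.rkey)
```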
Methods
get
Fetch a dataset record by AT URI.
get_blob_urls
Get fetchable URLs for blob-stored dataset shards.
get_blobs
Get the blob references from a dataset record.
get_metadata
Get the metadata from a dataset record.
get_s3_info
Get S3 storage details from a dataset record.
get_storage_type
Get the storage type of a dataset record.
get_typed
Fetch a dataset record and return as a typed object.
get_urls
Get the WebDataset URLs from a dataset record.
list_all
List dataset records from a repository.
to_dataset
Create a Dataset object from an ATProto record.
get
atmosphere.DatasetLoader.get(uri)
Fetch a dataset record by AT URI.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
dict
The dataset record as a dictionary.
get_blob_urls
atmosphere.DatasetLoader.get_blob_urls(uri)
Get fetchable URLs for blob-stored dataset shards.
This resolves the PDS endpoint and constructs URLs that can be used to fetch the blob data directly.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
list[str]
List of URLs for fetching the blob data.
Raises
ValueError
If the storage type is not blobs or the PDS cannot be resolved.
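To illustrate what a fetchable blob URL can look like, here is a minimal sketch that builds one from a PDS endpoint, a repository DID, and a blob CID using the standard ATProto `com.atproto.sync.getBlob` XRPC method. Whether get_blob_urls builds its URLs exactly this way is an assumption; `build_blob_url`, the endpoint, and the CID below are hypothetical.

```python
from urllib.parse import urlencode


def build_blob_url(pds_endpoint: str, did: str, cid: str) -> str:
    """Build a com.atproto.sync.getBlob URL for a single blob.

    Assumption: this mirrors what get_blob_urls does internally;
    the source only states that the PDS endpoint is resolved first.
    """
    query = urlencode({"did": did, "cid": cid})
    return f"{pds_endpoint.rstrip('/')}/xrpc/com.atproto.sync.getBlob?{query}"


# Placeholder values for illustration only.
url = build_blob_url("https://pds.example.com/", "did:plc:abc", "bafkreiexample")
print(url)
```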
get_blobs
atmosphere.DatasetLoader.get_blobs(uri)
Get the blob references from a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
get_s3_info
atmosphere.DatasetLoader.get_s3_info(uri)
Get S3 storage details from a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
dict
Dict with keys: bucket, keys, region (optional), endpoint (optional).
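As a sketch of how the returned dict might be consumed, the helper below turns it into s3:// URLs, one per shard. It assumes that `keys` holds a list of object-key strings; the `s3_urls` helper and the sample values are hypothetical.

```python
def s3_urls(info: dict) -> list[str]:
    """Build one s3:// URL per object key.

    Assumption: info["keys"] is a list of object keys within info["bucket"].
    """
    bucket = info["bucket"]
    return [f"s3://{bucket}/{key.lstrip('/')}" for key in info["keys"]]


# Sample dict in the documented shape; region and endpoint are optional.
info = {
    "bucket": "my-datasets",
    "keys": ["train/shard-000000.tar", "train/shard-000001.tar"],
    "region": "us-east-1",
}
urls = s3_urls(info)
print(urls)
```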
get_storage_type
atmosphere.DatasetLoader.get_storage_type(uri)
Get the storage type of a dataset record.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
Returns
str
One of “http”, “s3”, “blobs”, or “external” (legacy).
get_typed
atmosphere.DatasetLoader.get_typed(uri)
Fetch a dataset record and return as a typed object.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
get_urls
atmosphere.DatasetLoader.get_urls(uri)
Get the WebDataset URLs from a dataset record.
Supports storageHttp, storageS3, and legacy storageExternal formats.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
list_all
atmosphere.DatasetLoader.list_all(repo=None, limit=100)
List dataset records from a repository.
Parameters
repo
Optional[str]
The DID of the repository. Defaults to the authenticated user's repository.
None
limit
int
Maximum number of records to return.
100
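A common follow-up to listing records is grouping them by schema. The sketch below assumes that list_all returns dict-like records with the "name" and "schemaRef" fields used in the class-level example; `group_by_schema` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_by_schema(records) -> dict:
    """Map each schemaRef to the names of the datasets that use it."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["schemaRef"]].append(record["name"])
    return dict(grouped)


# With a real loader this might be: group_by_schema(loader.list_all(limit=50))
sample_records = [
    {"name": "images-v1", "schemaRef": "at://did:plc:abc/schema/image"},
    {"name": "images-v2", "schemaRef": "at://did:plc:abc/schema/image"},
    {"name": "audio-v1", "schemaRef": "at://did:plc:abc/schema/audio"},
]
by_schema = group_by_schema(sample_records)
```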
to_dataset
atmosphere.DatasetLoader.to_dataset(uri, sample_type)
Create a Dataset object from an ATProto record.
This method creates a Dataset instance from a published record. You must provide the sample type class, which should match the schema referenced by the record.
Supports HTTP, S3, blob, and legacy external storage.
Parameters
uri
str | AtUri
The AT URI of the dataset record.
required
sample_type
Type[ST]
The Python class for the sample type.
required
Returns
Dataset[ST]
A Dataset instance configured from the record.
Examples
>>> loader = DatasetLoader(client)
>>> dataset = loader.to_dataset(uri, MySampleType)
>>> for batch in dataset.shuffled(batch_size=32):
...     process(batch)