DatasetPublisher
atmosphere.DatasetPublisher(client)
Publishes dataset index records to ATProto.
This class creates dataset records that reference a schema and point to HTTP storage, S3 storage, or ATProto blobs.
Examples
>>> dataset = atdata.Dataset[MySample]("https://example.com/data-000000.tar")
>>>
>>> atmo = Atmosphere.login("handle", "password")
>>>
>>> publisher = DatasetPublisher(atmo)
>>> uri = publisher.publish(
...     dataset,
...     name="My Training Data",
...     description="Training data for my model",
...     tags=["computer-vision", "training"],
... )
Methods
publish
atmosphere.DatasetPublisher.publish(
    dataset,
    *,
    name,
    schema_uri=None,
    description=None,
    tags=None,
    license=None,
    auto_publish_schema=True,
    schema_version='1.0.0',
    rkey=None,
)
Publish a dataset index record to ATProto.
Parameters
dataset : Dataset[ST], required
    The Dataset to publish.
name : str, required
    Human-readable dataset name.
schema_uri : Optional[str], default None
    AT URI of the schema record. If not provided and auto_publish_schema is True, the schema will be published.
description : Optional[str], default None
    Human-readable description.
tags : Optional[list[str]], default None
    Searchable tags for discovery.
license : Optional[str], default None
    SPDX license identifier (e.g., ‘MIT’, ‘Apache-2.0’).
auto_publish_schema : bool, default True
    If True and schema_uri not provided, automatically publish the schema first.
schema_version : str, default '1.0.0'
    Version for auto-published schema.
rkey : Optional[str], default None
    Optional explicit record key.
Returns
AtUri
The AT URI of the created dataset record.
Raises
ValueError
If schema_uri is not provided and auto_publish_schema is False.
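The interaction between schema_uri and auto_publish_schema described above can be sketched as a small decision function. This is only an illustration of the documented rules, not the library's implementation: `resolve_schema_uri` and its `publish_schema` callback are hypothetical names, and the assumption that an explicit schema_uri takes precedence is inferred from the parameter descriptions.

```python
from typing import Callable, Optional


def resolve_schema_uri(
    schema_uri: Optional[str],
    auto_publish_schema: bool,
    publish_schema: Callable[[], str],
) -> str:
    """Hypothetical helper mirroring the documented rules for publish()."""
    if schema_uri is not None:
        # Assumption: an explicit schema URI is used as-is and nothing is published.
        return schema_uri
    if not auto_publish_schema:
        # Documented failure mode: no URI given and auto-publishing disabled.
        raise ValueError(
            "schema_uri is required when auto_publish_schema is False"
        )
    # Otherwise the schema is published first and its URI is used.
    return publish_schema()
```

In practice this means a caller who sets auto_publish_schema=False should always pass schema_uri explicitly.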
publish_with_blobs
atmosphere.DatasetPublisher.publish_with_blobs(
    blobs,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    mime_type='application/x-tar',
    rkey=None,
)
Publish a dataset with data stored as ATProto blobs.
This method uploads the provided data as blobs to the PDS and creates a dataset record that references them. It is suitable for smaller datasets whose shards fit within the blob size limit (typically 50 MB per blob, configurable).
Parameters
blobs : list[bytes], required
    List of binary data (e.g., tar shards) to upload as blobs.
schema_uri : str, required
    AT URI of the schema record.
name : str, required
    Human-readable dataset name.
description : Optional[str], default None
    Human-readable description.
tags : Optional[list[str]], default None
    Searchable tags for discovery.
license : Optional[str], default None
    SPDX license identifier.
metadata : Optional[dict], default None
    Arbitrary metadata dictionary.
mime_type : str, default 'application/x-tar'
    MIME type for the blobs (default: application/x-tar).
rkey : Optional[str], default None
    Optional explicit record key.
Returns
AtUri
The AT URI of the created dataset record.
Note
Blobs are only retained by the PDS when referenced in a committed record. This method handles that automatically.
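Because blobs is a plain list of bytes, shards can be built in memory and checked against the size limit before uploading. The sketch below uses only the standard library. `make_tar_shard`, `oversized_blobs`, and `ASSUMED_BLOB_LIMIT` are hypothetical names, and the 50 MB figure is taken from the "typically 50MB per blob" remark above; the actual limit depends on the PDS configuration.

```python
import io
import tarfile

# Assumed limit, based on the "typically 50MB per blob" remark;
# the real value depends on how the PDS is configured.
ASSUMED_BLOB_LIMIT = 50 * 1024 * 1024


def make_tar_shard(files: dict[str, bytes]) -> bytes:
    """Pack named byte payloads into a single in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def oversized_blobs(blobs: list[bytes], limit: int = ASSUMED_BLOB_LIMIT) -> list[int]:
    """Return the indices of blobs whose size exceeds `limit` bytes."""
    return [i for i, blob in enumerate(blobs) if len(blob) > limit]


shard = make_tar_shard({"sample-0.txt": b"hello", "sample-1.txt": b"world"})
blobs = [shard]
too_big = oversized_blobs(blobs)
```

If too_big is empty, the list could then be passed as the blobs argument, for example publisher.publish_with_blobs(blobs, schema_uri, name="My Training Data").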
publish_with_s3
atmosphere.DatasetPublisher.publish_with_s3(
    bucket,
    keys,
    schema_uri,
    *,
    name,
    region=None,
    endpoint=None,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    checksums=None,
    rkey=None,
)
Publish a dataset record with S3 storage.
Parameters
bucket : str, required
    S3 bucket name.
keys : list[str], required
    List of S3 object keys for shard files.
schema_uri : str, required
    AT URI of the schema record.
name : str, required
    Human-readable dataset name.
region : Optional[str], default None
    AWS region (e.g., ‘us-east-1’).
endpoint : Optional[str], default None
    Custom S3-compatible endpoint URL.
description : Optional[str], default None
    Human-readable description.
tags : Optional[list[str]], default None
    Searchable tags for discovery.
license : Optional[str], default None
    SPDX license identifier.
metadata : Optional[dict], default None
    Arbitrary metadata dictionary.
checksums : Optional[list[ShardChecksum]], default None
    Per-shard checksums.
rkey : Optional[str], default None
    Optional explicit record key.
Returns
AtUri
The AT URI of the created dataset record.
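A possible way to assemble the keys argument and forward it to publish_with_s3 is sketched below. `shard_keys` and `publish_s3_dataset` are hypothetical helpers; the `data-000000.tar` naming scheme is borrowed from the example URL near the top of this page, and the empty-list check is an added assumption rather than documented behavior. The only library call used is publish_with_s3, with the signature shown above.

```python
from typing import Any, Optional


def shard_keys(prefix: str, count: int, width: int = 6) -> list[str]:
    """Build zero-padded shard keys such as 'train/data-000000.tar'."""
    return [f"{prefix}/data-{i:0{width}d}.tar" for i in range(count)]


def publish_s3_dataset(
    publisher: Any,
    bucket: str,
    keys: list[str],
    schema_uri: str,
    name: str,
    region: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> Any:
    """Forward to publish_with_s3 using the documented signature."""
    if not keys:
        # Assumption: a dataset record without shards is not useful.
        raise ValueError("at least one shard key is required")
    return publisher.publish_with_s3(
        bucket, keys, schema_uri, name=name, region=region, endpoint=endpoint
    )
```

For an S3-compatible store that is not AWS, endpoint would carry the custom URL and region could stay None.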
publish_with_urls
atmosphere.DatasetPublisher.publish_with_urls(
    urls,
    schema_uri,
    *,
    name,
    description=None,
    tags=None,
    license=None,
    metadata=None,
    checksums=None,
    rkey=None,
)
Publish a dataset record with explicit HTTP URLs.
This method publishes a dataset record without requiring a Dataset object, which is useful for registering existing WebDataset files. Each URL should refer to an individual shard (no brace notation).
Parameters
urls : list[str], required
    List of individual shard URLs.
schema_uri : str, required
    AT URI of the schema record.
name : str, required
    Human-readable dataset name.
description : Optional[str], default None
    Human-readable description.
tags : Optional[list[str]], default None
    Searchable tags for discovery.
license : Optional[str], default None
    SPDX license identifier.
metadata : Optional[dict], default None
    Arbitrary metadata dictionary.
checksums : Optional[list[ShardChecksum]], default None
    Per-shard checksums. If not provided, empty checksums are used.
rkey : Optional[str], default None
    Optional explicit record key.
Returns
AtUri
The AT URI of the created dataset record.
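Since publish_with_urls expects one URL per shard, a WebDataset-style brace pattern has to be expanded before it is passed in. The following is a minimal sketch of such an expansion using only the standard library. `expand_shard_urls` is a hypothetical helper, not part of the library, and it handles just a single numeric range of the form {start..end}.

```python
import re

# Matches a numeric range in brace notation, e.g. {000000..000009}.
_BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shard_urls(pattern: str) -> list[str]:
    """Expand the first numeric brace range in `pattern` into individual URLs."""
    match = _BRACE_RANGE.search(pattern)
    if match is None:
        # No brace range: the pattern is already a single shard URL.
        return [pattern]
    start_text, end_text = match.groups()
    start, end = int(start_text), int(end_text)
    if end < start:
        raise ValueError(f"descending range in {pattern!r}")
    # Preserve the zero padding used by the start of the range.
    width = len(start_text)
    return [
        pattern[: match.start()] + f"{i:0{width}d}" + pattern[match.end():]
        for i in range(start, end + 1)
    ]


urls = expand_shard_urls("https://example.com/data-{000000..000002}.tar")
```

The resulting list could then be supplied as the urls argument, for example publisher.publish_with_urls(urls, schema_uri, name="My Training Data").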