kedro.io.DataCatalog

class kedro.io.DataCatalog(data_sets=None, feed_dict=None, transformers=None, default_transformers=None, journal=None)

Bases: object

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. It then acts as a single point of reference for your calls, relaying load and save calls to the underlying data sets.

Methods
- DataCatalog.__init__([data_sets, feed_dict, …]) – DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program.
- DataCatalog.add(data_set_name, data_set[, …]) – Adds a new AbstractDataSet object to the DataCatalog.
- DataCatalog.add_all(data_sets[, replace]) – Adds a group of new data sets to the DataCatalog.
- DataCatalog.add_feed_dict(feed_dict[, replace]) – Adds instances of MemoryDataSet, containing the data provided through feed_dict.
- DataCatalog.add_transformer(transformer[, …]) – Add a DataSet Transformer to the DataCatalog.
- DataCatalog.exists(name) – Checks whether registered data set exists by calling its exists() method.
- DataCatalog.from_config(catalog[, …]) – Create a DataCatalog instance from configuration.
- DataCatalog.list() – List of DataSet names registered in the catalog.
- DataCatalog.load(name) – Loads a registered data set.
- DataCatalog.release(name) – Release any cached data associated with a data set.
- DataCatalog.save(name, data) – Save data to a registered data set.
- DataCatalog.shallow_copy() – Returns a shallow copy of the current object.

__init__(data_sets=None, feed_dict=None, transformers=None, default_transformers=None, journal=None)

DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. It then acts as a single point of reference for your calls, relaying load and save calls to the underlying data sets.

Parameters:
- data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
- feed_dict (Optional[Dict[str, Any]]) – A feed dict with data to be added in memory.
- transformers (Optional[Dict[str, List[AbstractTransformer]]]) – A dictionary of lists of transformers to be applied to the data sets.
- default_transformers (Optional[List[AbstractTransformer]]) – A list of transformers to be applied to any new data sets.
- journal (Optional[Journal]) – Instance of Journal.

Raises: DataSetNotFoundError – When transformers are passed for a non-existent data set.

Example:

    from kedro.io import CSVLocalDataSet, DataCatalog

    # Wrap a local CSV file in a data set, then register it under a name.
    cars = CSVLocalDataSet(filepath="cars.csv",
                           load_args=None,
                           save_args={"index": False})
    io = DataCatalog(data_sets={'cars': cars})

Return type: None
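
To make "single point of reference" concrete, here is a minimal sketch of the delegation pattern using only the standard library. ToyDataSet and ToyCatalog are hypothetical stand-ins invented for this illustration. They are not kedro classes and are not meant to reflect how DataCatalog is implemented internally.

```python
from typing import Any, Dict, Optional


class ToyDataSet:
    """Hypothetical stand-in for an AbstractDataSet implementation."""

    def __init__(self, data: Any = None) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data

    def save(self, data: Any) -> None:
        self._data = data


class ToyCatalog:
    """Hypothetical registry that relays load/save calls by name."""

    def __init__(self, data_sets: Optional[Dict[str, ToyDataSet]] = None) -> None:
        # Copy the mapping so later changes to the caller's dict
        # do not alter the registry.
        self._data_sets = dict(data_sets or {})

    def load(self, name: str) -> Any:
        # Look the data set up by name, then delegate to it.
        return self._data_sets[name].load()

    def save(self, name: str, data: Any) -> None:
        self._data_sets[name].save(data)


catalog = ToyCatalog({"cars": ToyDataSet([{"model": "a", "doors": 4}])})
rows = catalog.load("cars")
catalog.save("cars", rows + [{"model": "b", "doors": 2}])
```

Callers only ever name a data set; where and how the data is stored stays inside the data set object.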

add(data_set_name, data_set, replace=False)

Adds a new AbstractDataSet object to the DataCatalog.

Parameters:
- data_set_name (str) – A unique data set name which has not been registered yet.
- data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
- replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.

Raises: DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

    from kedro.io import CSVLocalDataSet, DataCatalog

    io = DataCatalog(data_sets={
        'cars': CSVLocalDataSet(filepath="cars.csv")
    })

    # Register a second data set under a name that is not yet in use.
    io.add("boats", CSVLocalDataSet(filepath="boats.csv"))

Return type: None
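
The interaction between the replace flag and the "already exists" error can be sketched as follows. This is an illustration only: ToyCatalog and ToyAlreadyExistsError are hypothetical names, and the logic is an assumption about the behaviour described above rather than kedro's code.

```python
from typing import Any, Dict


class ToyAlreadyExistsError(Exception):
    """Hypothetical counterpart of DataSetAlreadyExistsError."""


class ToyCatalog:
    """Hypothetical registry, used only to show the `replace` flag."""

    def __init__(self) -> None:
        self._data_sets: Dict[str, Any] = {}

    def add(self, name: str, data_set: Any, replace: bool = False) -> None:
        # A name clash is an error unless the caller opted in to replacing.
        if name in self._data_sets and not replace:
            raise ToyAlreadyExistsError(
                f"Data set '{name}' has already been registered"
            )
        self._data_sets[name] = data_set


catalog = ToyCatalog()
catalog.add("boats", "first")

try:
    catalog.add("boats", "second")  # same name, replace left at False
    rejected = False
except ToyAlreadyExistsError:
    rejected = True

catalog.add("boats", "third", replace=True)  # explicit overwrite
```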

add_all(data_sets, replace=False)

Adds a group of new data sets to the DataCatalog.

Parameters:
- data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
- replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.

Raises: DataSetAlreadyExistsError – When a data set with the same name has already been registered.

Example:

    from kedro.io import CSVLocalDataSet, DataCatalog, ParquetLocalDataSet

    io = DataCatalog(data_sets={
        "cars": CSVLocalDataSet(filepath="cars.csv")
    })
    additional = {
        "planes": ParquetLocalDataSet("planes.parq"),
        "boats": CSVLocalDataSet(filepath="boats.csv")
    }

    io.add_all(additional)

    assert io.list() == ["cars", "planes", "boats"]

Return type: None

add_feed_dict(feed_dict, replace=False)

Adds instances of MemoryDataSet, containing the data provided through feed_dict.

Parameters:
- feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
- replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.

Example:

    import pandas as pd

    from kedro.io import DataCatalog

    df = pd.DataFrame({'col1': [1, 2],
                       'col2': [4, 5],
                       'col3': [5, 6]})

    io = DataCatalog()
    io.add_feed_dict({
        'data': df
    }, replace=True)

    # The in-memory data set returns the same content that was fed in.
    assert io.load("data").equals(df)

Return type: None
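
Conceptually, a feed dict maps names to plain Python objects, and each object is wrapped in an in-memory data set so it can be loaded like any other catalog entry. A rough sketch of that wrapping step, with ToyMemoryDataSet and wrap_feed_dict as hypothetical names that are not part of kedro:

```python
from typing import Any, Dict


class ToyMemoryDataSet:
    """Hypothetical stand-in for MemoryDataSet: holds one object in memory."""

    def __init__(self, data: Any) -> None:
        self._data = data

    def load(self) -> Any:
        return self._data


def wrap_feed_dict(feed_dict: Dict[str, Any]) -> Dict[str, ToyMemoryDataSet]:
    """Wrap every value of the feed dict in its own in-memory data set."""
    return {name: ToyMemoryDataSet(value) for name, value in feed_dict.items()}


wrapped = wrap_feed_dict({"params": {"test_size": 0.2}, "seed": 42})
```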

add_transformer(transformer, data_set_names=None)

Add a DataSet Transformer to the DataCatalog. Transformers can modify the way Data Sets are loaded and saved.

Parameters:
- transformer (AbstractTransformer) – The transformer instance to add.
- data_set_names (Union[str, Iterable[str], None]) – The Data Sets to add the transformer to, or None to add the transformer to all Data Sets.

Raises:
- DataSetNotFoundError – When a transformer is being added to a non-existent data set.
- TypeError – When transformer isn't an instance of AbstractTransformer.
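
As an illustration of how a transformer might change loading, and of the three forms data_set_names can take, here is a standard-library sketch. The hook signatures on ToyTransformer (a load hook that receives the data set name and a zero-argument load callable, and a matching save hook) are an assumption about the general shape of such an interface; this page does not document AbstractTransformer itself. TimingTransformer and normalise_names are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Iterable, List, Union


class ToyTransformer:
    """Hypothetical base class: by default calls pass straight through."""

    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        return load()

    def save(self, data_set_name: str, save: Callable[[Any], None], data: Any) -> None:
        save(data)


class TimingTransformer(ToyTransformer):
    """Records how long each load call takes, keyed by data set name."""

    def __init__(self) -> None:
        self.timings: Dict[str, float] = {}

    def load(self, data_set_name: str, load: Callable[[], Any]) -> Any:
        start = time.perf_counter()
        data = load()
        self.timings[data_set_name] = time.perf_counter() - start
        return data


def normalise_names(
    data_set_names: Union[str, Iterable[str], None], registered: List[str]
) -> List[str]:
    """Expand the three accepted forms of data_set_names into a list."""
    if data_set_names is None:
        return list(registered)      # None means every registered data set
    if isinstance(data_set_names, str):
        return [data_set_names]      # a single name
    return list(data_set_names)      # any other iterable of names


timer = TimingTransformer()
result = timer.load("cars", lambda: [1, 2, 3])
```

Checking for str before treating the argument as an iterable matters, because a string is itself iterable and would otherwise be split into single characters.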

exists(name)

Checks whether a registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.

Parameters: name (str) – A data set to be checked.

Return type: bool

Returns: Whether the data set output exists.

Raises: DataSetNotFoundError – When a data set with the given name has not yet been registered.
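
The "warn and return False" fallback can be sketched as below. How kedro detects a missing exists() implementation is not stated on this page, so the use of NotImplementedError and warnings.warn here is an assumption, and all class and function names are hypothetical.

```python
import warnings
from typing import Any


class NoExistsDataSet:
    """Hypothetical data set that does not implement exists()."""

    def exists(self) -> bool:
        raise NotImplementedError("exists() is not implemented")


class FlagDataSet:
    """Hypothetical data set whose exists() simply reports a stored flag."""

    def __init__(self, present: bool) -> None:
        self._present = present

    def exists(self) -> bool:
        return self._present


def safe_exists(data_set: Any) -> bool:
    """Return data_set.exists(), or warn and return False if unsupported."""
    try:
        return data_set.exists()
    except NotImplementedError:
        warnings.warn(
            f"{type(data_set).__name__} does not implement exists(); "
            "assuming the output does not exist"
        )
        return False


# Capture warnings so the fallback path can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = safe_exists(NoExistsDataSet())

present = safe_exists(FlagDataSet(True))
```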

classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None, journal=None)

Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.

Parameters:
- catalog (Optional[Dict[str, Dict[str, Any]]]) – A dictionary whose keys are the data set names and the values are dictionaries with the constructor arguments for classes implementing AbstractDataSet. The data set class to be loaded is specified with the key type and its fully qualified class name. All kedro.io data sets can be specified by their class name only, i.e. their module name can be omitted.
- credentials (Optional[Dict[str, Dict[str, Any]]]) – A dictionary containing credentials for different data sets. Use the credentials key in an AbstractDataSet to refer to the appropriate credentials as shown in the example below.
- load_versions (Optional[Dict[str, str]]) – A mapping between dataset names and versions to load. Has no effect on data sets without enabled versioning.
- save_version (Optional[str]) – Version string to be used for save operations by all data sets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) always return the latest version when sorted in lexicographical order.
- journal (Optional[Journal]) – Instance of Journal.

Return type: DataCatalog

Returns: An instantiated DataCatalog containing all specified data sets, created and ready to use.

Raises: DataSetError – When the method fails to create any of the data sets from their config.

Example:

    from kedro.io import DataCatalog

    config = {
        "cars": {
            "type": "CSVLocalDataSet",
            "filepath": "cars.csv",
            "save_args": {
                "index": False
            }
        },
        "boats": {
            "type": "CSVS3DataSet",
            "filepath": "boats.csv",
            "bucket_name": "mck-147789798-bucket",
            # Refers to the entry of the same name in `credentials` below.
            "credentials": "boats_credentials",
            "save_args": {
                "index": False
            }
        }
    }

    credentials = {
        "boats_credentials": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>"
        }
    }

    catalog = DataCatalog.from_config(config, credentials)

    df = catalog.load("cars")
    catalog.save("boats", df)
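
Two of the rules above lend themselves to a short sketch: a string under the credentials key is looked up in the credentials dictionary, and a save version must sort lexicographically in the same order as it was created. The code below is an illustration under stated assumptions, not kedro's implementation: resolve_credentials and make_version are hypothetical helpers, and the timestamp format is just one choice that satisfies the stated requirements.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict


def resolve_credentials(
    config: Dict[str, Dict[str, Any]], credentials: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Replace each string `credentials` reference with the matching dict."""
    resolved = copy.deepcopy(config)  # leave the caller's config untouched
    for name, ds_config in resolved.items():
        ref = ds_config.get("credentials")
        if isinstance(ref, str):
            if ref not in credentials:
                raise KeyError(
                    f"No credentials named '{ref}' for data set '{name}'"
                )
            ds_config["credentials"] = credentials[ref]
    return resolved


def make_version(moment: datetime) -> str:
    """Format a timezone-aware timestamp so string order matches time order."""
    # Zero-padded fields, most significant first, so later moments sort later.
    # "." replaces ":" because colons are not valid in some file names.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


config = {"boats": {"type": "CSVS3DataSet", "credentials": "boats_credentials"}}
credentials = {"boats_credentials": {"aws_access_key_id": "<your key id>"}}
resolved = resolve_credentials(config, credentials)

earlier = make_version(datetime(2019, 9, 30, 23, 59, 59, tzinfo=timezone.utc))
later = make_version(datetime(2019, 10, 1, 0, 0, 0, tzinfo=timezone.utc))
```

A format such as day-month-year, or one without zero padding, would break requirement b), because the newest version would no longer come last when the strings are sorted.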

list()

List of DataSet names registered in the catalog.

Return type: List[str]

Returns: A List of DataSet names, corresponding to the entries that are registered in the current catalog object.

load(name)

Loads a registered data set.

Parameters: name (str) – A data set to be loaded.

Return type: Any

Returns: The loaded data as configured.

Raises: DataSetNotFoundError – When a data set with the given name has not yet been registered.

Example:

    from kedro.io import CSVLocalDataSet, DataCatalog

    cars = CSVLocalDataSet(filepath="cars.csv",
                           load_args=None,
                           save_args={"index": False})
    io = DataCatalog(data_sets={'cars': cars})

    df = io.load("cars")

release(name)

Release any cached data associated with a data set.

Parameters: name (str) – A data set whose cached data is to be released.

Raises: DataSetNotFoundError – When a data set with the given name has not yet been registered.

save(name, data)

Save data to a registered data set.

Parameters:
- name (str) – A data set to be saved to.
- data (Any) – A data object to be saved as configured in the registered data set.

Raises: DataSetNotFoundError – When a data set with the given name has not yet been registered.

Example:

    import pandas as pd

    from kedro.io import CSVLocalDataSet, DataCatalog

    cars = CSVLocalDataSet(filepath="cars.csv",
                           load_args=None,
                           save_args={"index": False})
    io = DataCatalog(data_sets={'cars': cars})

    df = pd.DataFrame({'col1': [1, 2],
                       'col2': [4, 5],
                       'col3': [5, 6]})
    # Written to cars.csv using the save_args configured above.
    io.save("cars", df)

Return type: None
-