API Reference

dataset

etcetera.dataset(name: str, auto_pull=False, config=None)

Returns an etcetera.Dataset object for the given dataset name.

Parameters:
  • name (str) – the name of the dataset
  • auto_pull (bool) – if set, automatically pulls the dataset from the cloud
  • config (etcetera.Config) – configuration to use
Returns:

An etcetera.Dataset representing this dataset
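
As a hedged illustration of the call above, the sketch below opens a dataset with auto_pull=True and counts the entries in each partition. It uses only members documented in this reference (etcetera.dataset, Dataset.partitions, Dataset.__getitem__). The helper name partition_sizes and its api parameter are hypothetical: api exists only so that a stand-in object can replace the etcetera module where the library is not installed.

```python
def partition_sizes(name, api=None):
    """Return {partition name: number of entries} for the named dataset.

    `api` is a hypothetical injection point, not part of etcetera; when it is
    omitted the real module is imported (assumed to be installed).
    """
    if api is None:
        import etcetera as api

    # auto_pull is documented as pulling the dataset from the cloud when set.
    dataset = api.dataset(name, auto_pull=True)

    # dataset[partition] is documented to return a pathlib.Path directory,
    # so iterdir() can be used to walk its contents.
    return {
        partition: sum(1 for _ in dataset[partition].iterdir())
        for partition in dataset.partitions()
    }
```

With the library installed and configured, partition_sizes('my-dataset') would be the typical call; the dataset name here is a placeholder.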

Dataset

class etcetera.Dataset(location: str, name: Optional[str] = None)

Represents a locally installed dataset

file(*av)

Convenience method to build a file path relative to the dataset root.

Example: dataset.file('README.md')

data

Path to the data directory within the dataset

partitions()

Returns a sorted list of partition names

__len__()

Dataset length is the number of partitions

__getitem__(partition)

Returns a pathlib.Path object for the partition directory

Parameters: partition (str) – name of the partition

Example:

    dataset = …
    partition = dataset['train']

    for filename in partition.iterdir():
        print(filename)
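
The container behaviour described above (sorted partition names, length equal to the number of partitions, indexing that yields a directory path) can be mimicked with a small stand-in class. This is an illustrative sketch built on pathlib, not the library's implementation: the class name SketchDataset is hypothetical, and it assumes that partitions are subdirectories of a directory named data under the dataset root, which this reference does not state.

```python
import tempfile
from pathlib import Path


class SketchDataset:
    """Illustrative stand-in that mirrors the documented Dataset members."""

    def __init__(self, location):
        self.root = Path(location)

    def file(self, *av):
        # Build a path relative to the dataset root.
        return self.root.joinpath(*av)

    @property
    def data(self):
        # Assumption: the data directory is literally named "data".
        return self.root / "data"

    def partitions(self):
        # Assumption: every subdirectory of data/ is one partition.
        return sorted(p.name for p in self.data.iterdir() if p.is_dir())

    def __len__(self):
        return len(self.partitions())

    def __getitem__(self, partition):
        return self.data / partition


# Build a throwaway dataset layout and query it.
with tempfile.TemporaryDirectory() as tmp:
    for part in ("train", "test"):
        (Path(tmp) / "data" / part).mkdir(parents=True)

    ds = SketchDataset(tmp)
    print(ds.partitions())   # ['test', 'train']
    print(len(ds))           # 2
    print(ds["train"].name)  # train
```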

Config

class etcetera.Config(url: str, home=None, **conf)

Holds the configuration for etcetera

__init__(url: str, home=None, **conf)

Creates a new configuration from a URL and (optionally) other values

Parameters:
  • url (str) – repository URL, for example “s3://my-datasets”
  • home (str) – directory to be used as the local dataset cache. If not specified, ~/.etc/ is used. If the directory does not exist, it is created.
  • conf (str) – other parameters, specific to the cloud provider.

classmethod load(filename=None)

Loads configuration from a TOML file

ls

etcetera.ls(remote=False, config=None)

Lists datasets.

By default, local datasets are listed.

Parameters:
  • remote (bool) – if True, list remote datasets
  • config (etcetera.Config) – configuration to use
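
To show how an explicit configuration might be combined with a listing call, the sketch below builds an etcetera.Config from a repository URL and passes it to etcetera.ls with remote=True. The helper name list_remote and its api parameter are hypothetical (api only allows a stand-in object to replace the module for testing). The reference does not say what ls returns, so the result is passed through unchanged.

```python
def list_remote(url, home=None, api=None):
    """List datasets in the remote repository at `url`.

    `api` is a hypothetical injection point, not part of etcetera; when it is
    omitted the real module is imported (assumed to be installed).
    """
    if api is None:
        import etcetera as api

    # Config takes the repository URL and an optional local cache directory.
    config = api.Config(url, home=home)

    # remote=True switches the listing from local to remote datasets.
    return api.ls(remote=True, config=config)
```

A call such as list_remote('s3://my-datasets') reuses the example URL from the Config parameters above.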

register

etcetera.register(dirname: str, name: Optional[str] = None, force=False, config=None)

Registers a local directory as a dataset.

Parameters:
  • dirname (str) – path to the local directory with data
  • name (str) – dataset name (if not specified, directory name is used)
  • force (bool) – if True, allows overriding an existing dataset
  • config (etcetera.Config) – configuration to use

pull

etcetera.pull(name: str, force=False, config=None)

Pulls a dataset from cloud storage.

Parameters:
  • name (str) – dataset name
  • force (bool) – if True, overrides the existing local dataset
  • config (etcetera.Config) – configuration to use

push

etcetera.push(name: str, force=False, config=None)

Pushes dataset to the cloud.

Parameters:
  • name (str) – dataset name
  • force (bool) – if True, overrides the remote dataset
  • config (etcetera.Config) – configuration to use
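
A common sequence is to register a local directory and then push the resulting dataset. The sketch below chains etcetera.register and etcetera.push using the signatures listed in this reference. The helper name publish and its api parameter are hypothetical, and the fallback to the directory name simply mirrors the documented default of register; the library itself may derive the name differently.

```python
from pathlib import Path


def publish(dirname, name=None, force=False, config=None, api=None):
    """Register `dirname` as a dataset, push it, and return the dataset name.

    `api` is a hypothetical injection point, not part of etcetera; when it is
    omitted the real module is imported (assumed to be installed).
    """
    if api is None:
        import etcetera as api

    api.register(dirname, name=name, force=force, config=config)

    # Assumption: when no name is given, the dataset is named after the
    # last component of the directory path, as the register docs describe.
    dataset_name = name or Path(dirname).name

    api.push(dataset_name, force=force, config=config)
    return dataset_name
```

On another machine, etcetera.pull with the returned name would be the matching step to fetch the dataset.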

purge

etcetera.purge(name: str, config=None)

Deletes a local dataset.

Parameters: