Etcetera
Dataset sharing via cloud storage (S3, Google Cloud).
Standard Dataset Structure
A dataset in etcetera is a collection of files organized into partitions. File content and type
can be anything. Optionally, one can attach meta information to this collection.
A directory sample/ is a valid dataset iff:
- there is a sub-directory data/
- data/ contains only directories (we call them “partitions”)
- there is at least one partition in data/
Optionally, one can have a file named meta.json in the root of directory sample/ with
arbitrary JSON content.
Other files and directories can be there as well. They will be stored and transported as-is.
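The rules above can be sketched as a small standalone check. This is only an illustration built on the Python standard library, not etcetera's own validation code; the function name is_valid_dataset is hypothetical, and treating an unparsable meta.json as invalid is an assumption.

```python
import json
import tempfile
from pathlib import Path


def is_valid_dataset(root):
    """Return True if `root` follows the layout rules described above."""
    data = Path(root) / "data"
    if not data.is_dir():
        return False  # there must be a data/ sub-directory
    entries = list(data.iterdir())
    if not entries:
        return False  # there must be at least one partition
    if not all(entry.is_dir() for entry in entries):
        return False  # data/ may contain only directories
    meta = Path(root) / "meta.json"
    if meta.exists():
        try:
            json.loads(meta.read_text(encoding="utf-8"))
        except ValueError:
            return False  # assumption: meta.json, when present, must parse as JSON
    return True


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample"
    (sample / "data" / "train").mkdir(parents=True)
    (sample / "data" / "train" / "data00001.xml").write_text("<doc/>")
    ok_before = is_valid_dataset(sample)

    # A plain file directly inside data/ breaks the "only directories" rule.
    (sample / "data" / "stray.txt").write_text("not a partition")
    ok_after = is_valid_dataset(sample)

    print(ok_before, ok_after)
```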
Here is a sample dataset:
sample/
    data/
        train/
            data00001.xml
            data00002.xml
            ...
        test/
            data00001.xml
            data00002.xml
            ...
    meta.json
    README.md
    assets/
    DataPreparation.ipynb
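To make the partition layout concrete, here is a minimal sketch that walks such a directory and lists the files in each partition. It uses only pathlib and does not touch the etcetera API; the helper name files_per_partition is hypothetical.

```python
import tempfile
from pathlib import Path


def files_per_partition(root):
    """Map each partition name under root/data/ to its sorted file names."""
    data = Path(root) / "data"
    return {
        part.name: sorted(f.name for f in part.iterdir() if f.is_file())
        for part in sorted(data.iterdir())
        if part.is_dir()
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample"
    for part in ("train", "test"):
        (sample / "data" / part).mkdir(parents=True)
        for i in (1, 2):
            (sample / "data" / part / f"data{i:05d}.xml").touch()
    (sample / "README.md").touch()  # files outside data/ are not partitions

    listing = files_per_partition(sample)
    print(listing)
```

Files outside data/ (README.md here) do not show up in the listing, since only the sub-directories of data/ are partitions.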
Installation
We recommend installing with pip into a virtual environment:
python3 -m venv .venv # create virtual environment
. .venv/bin/activate # activate it
pip3 install 'etcetera[s3]' # install etcetera from PyPI with s3 backend
The installed Python package provides:
- a command-line utility etc
- a Python package etcetera for programmatic access to datasets
Quick Start
First, let's create a directory with a minimal dataset:
$ mkdir -p sample/data/train
$ touch sample/data/train/data00001.txt
Now, let's register the dataset with etcetera:
$ etc register sample
We can list available local datasets:
$ etc list
>> sample
From your Python code, accessing a dataset is easy:
import etcetera as etc
dataset = etc.dataset('sample')
dataset.partitions()
>> train
for fname in dataset['train'].iterdir():
    print(fname)
>> sample/train/data00001.txt
dataset.meta
>> {}
Configure access to cloud storage
Configuration is stored in ~/.etc.toml and should specify at least the url key:
url = "s3://my-datasets"
Now the remote repository is set to s3://my-datasets. To be able to pull and push, you
need to set the authentication parameters. For example:
url = "s3://my-datasets"
aws_access_key_id = "AAasdsffDF12SDASD"
aws_secret_access_key = "fgT6Dfr8Bhfgt4fdr5asdffd7"
public = false
Note the public value here. When set to true, pushed datasets will be publicly readable.
Programmatic Configuration
The etcetera API can be used to access remote repositories without ever creating a
configuration file. This is convenient if you run code on disposable machines (like
cloud workers), as it avoids extra provisioning steps. Here is an example:
import etcetera as etc
config = etc.Config('s3://my-datasets')
# following will pull dataset from the repository if it is not found locally
dataset = etc.dataset('sample', auto_pull=True, config=config)
Using Config, one can configure the location of the local datasets, repository
authentication parameters, and so on. For details, check the Reference section below.
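On a disposable worker, one way to avoid a configuration file is to take the repository URL from the environment. The sketch below is an assumption-laden illustration: DATASETS_URL is a hypothetical variable name that etcetera itself does not read, and repository_url and load_dataset are hypothetical helpers. Only etc.Config(url) and etc.dataset(name, auto_pull=True, config=config) come from the example above.

```python
import os


def repository_url(env=None, default="s3://my-datasets"):
    """Pick the repository URL from a hypothetical DATASETS_URL variable."""
    env = os.environ if env is None else env
    return env.get("DATASETS_URL", default)


def load_dataset(name, env=None):
    """Fetch a dataset using only in-code configuration (needs etcetera installed)."""
    import etcetera as etc  # imported lazily so the helper above works without it

    config = etc.Config(repository_url(env))
    # Pulls the dataset from the repository if it is not found locally.
    return etc.dataset(name, auto_pull=True, config=config)


print(repository_url({"DATASETS_URL": "s3://team-datasets"}))
print(repository_url({}))
```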