Steps for data versioning

Data versioning is a essencial step in any Machine Learning projects. It enables developer’s teams to create multiple datasets and easily change between them when training. It is useful when the team have a lot of data and is trying to use only the samples that increase model performance. In this project dvc combined with git is used to implement this task. All datasets versions are stored at a S3 bucket.

see “Startup” before this one.

Create a new data enviroment

Sometimes, it is necessary to start everything all again. The following steps show how can you do that:

  1. Remove all tags already created (remote and local)

git push origin --delete $(git tag -l)

git tag -d $(git tag -l)
  • Ensure the tags were erased:

tags_erased

  1. Run data.sh to create the file data/data.zip with your preprocessed data. Drop value is the ratio of the dowloaded dataset that will be erased. E.g.: 0.5 will erase 50% of the downloaded dataset. This step is optinal, you can also do the preprocessing manually or by changing dataset.py.

./scripts/data.sh <drop_value>
  1. Run configure_dvc.sh and pass as argument the Bucket created for the dataset

./scrips/configure_dvc.sh bucket-dataset-name

After that, you will have a tag v0.0.0 with the first version of the dataset!

Create a new dataset version

Everytime you want to create a new dataset version, erase the data.zip file and follow the steps bellow:

  1. Run data.sh:

WARNING

Check if you are at main:

git checkout main
./scrips/data.sh <drop_value>
  1. Run new_dataset_version.sh script to create new data version:

./scripts/new_dataset_version.sh vA.B.C

Use specific data version

  1. Checkout to the tag and do a dvc checkout:

git checkout vA.B.C
dvc checkout