DVC - Data Version Control Cheatsheet

Setting up

The first thing to do in a brand new directory is to initialise git and dvc.

git init
dvc init

Next, we create a data directory and use dvc get to download data from a data registry onto our local machine. dvc get works much like wget or curl, except that it downloads files tracked in a dvc or git repository.

mkdir data
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml

Now that we have the file data/data.xml, we can start tracking it with dvc

dvc add data/data.xml

As soon as we run this, dvc instructs us to add the changes to git. Running dvc add generates two files: data/.gitignore and data/data.xml.dvc

git add data/.gitignore data/data.xml.dvc

We will then commit these two files using git

git commit -m "add raw data"

If we take a look at data/data.xml.dvc, we will see something like the following. This file contains the metadata required to track the data file and will go into your git repository.

outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
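
The md5 value is a hash of the file's contents, which is how dvc notices that a file has changed. As a rough sketch, assuming md5sum from GNU coreutils is available, we can compute such a hash ourselves and compare it with the one in the .dvc file. The sample file below is a made-up stand-in for data.xml, and some older dvc releases normalise line endings in text files before hashing, so the values may not always match.

```shell
# Stand-in for data/data.xml (hypothetical contents, not the real dataset)
workdir=$(mktemp -d)
printf '<posts>\n</posts>\n' > "$workdir/data.xml"

# md5sum prints "<hash>  <path>"; keep only the hash
file_hash=$(md5sum "$workdir/data.xml" | cut -d ' ' -f 1)
echo "md5: $file_hash"
```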

Next, we add a remote storage. In this case I'm adding an S3 bucket and using the path "test". The -d flag makes it the default remote.

dvc remote add -d storage s3://derekchia/test

Then we commit .dvc/config, which now contains the settings for our remote storage

git commit .dvc/config -m "Configure remote storage"

Next, we can push the data file into our remote storage

dvc push

Removing data and pulling it back

We can now try removing the data file and pulling it back again. We also need to remove .dvc/cache, since this is where our data files are actually stored locally; otherwise dvc would simply restore the file from the cache instead of the remote. See https://dvc.org/doc/command-reference/cache for more information

rm -f data/data.xml
rm -rf .dvc/cache

Using dvc pull, we then pull the data back from our remote storage

dvc pull

Making changes to your data and reverting to previous version

To mimic a change in the data, we double the size of the data file using the following commands

cp data/data.xml /tmp/data.xml
cat /tmp/data.xml >> data/data.xml
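
To check that these two steps really double the file, here is a small self-contained sketch that applies them to a throwaway file and compares byte counts with wc -c. The temporary directory, file names and sample contents are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'row 1\nrow 2\n' > "$workdir/data.xml"
size_before=$(wc -c < "$workdir/data.xml")

# Same two steps as above: copy the file aside, then append the copy
cp "$workdir/data.xml" "$workdir/data.copy.xml"
cat "$workdir/data.copy.xml" >> "$workdir/data.xml"

size_after=$(wc -c < "$workdir/data.xml")
echo "before: $size_before bytes, after: $size_after bytes"
```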

When we change our data file and run dvc add again, the .dvc file is updated with the new hash. This means that we need to commit it with git before pushing the changed data to our remote storage

dvc add data/data.xml
git add data/data.xml.dvc
git commit -m "Dataset update"
dvc push

We can confirm that the updated file has been pushed by checking that our remote storage now has two folders, one for each version of the file.
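
The folder names come from the hashes themselves. As far as I know, dvc stores each file under a directory named after the first two characters of its md5, with the remaining characters as the file name. Newer releases also add a files/md5 prefix, so treat the exact layout as an assumption and check your own bucket. A sketch using the hash from the data/data.xml.dvc example earlier:

```shell
# md5 taken from the data/data.xml.dvc example earlier in this cheatsheet
file_hash=a304afb96060aad90176268345e10355

prefix=$(printf '%s' "$file_hash" | cut -c 1-2)   # first two characters
rest=$(printf '%s' "$file_hash" | cut -c 3-)      # remaining characters

# Assumed layout inside the remote: <prefix>/<rest>
object_path="$prefix/$rest"
echo "$object_path"
```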

If we look at our git commit log, we will see that we have several commits.

$ git log --oneline
b3330e4 (HEAD -> master) Dataset update
b74143a Configure remote storage
b1ef2ae add raw data

To revert to the previous version of data/data.xml.dvc, we check it out from the previous commit

$ git checkout HEAD^1 data/data.xml.dvc
Updated 1 path from 522ae3f
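
git checkout with a path only rewrites that one file and leaves HEAD where it is, which is why the data file itself stays untouched until dvc checkout runs. A minimal sketch in a throwaway repository, assuming git is installed; pointer.dvc and its contents are hypothetical stand-ins for data/data.xml.dvc:

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Identity passed inline so the sketch does not depend on global git config
commit() {
  git -c user.name=demo -c user.email=demo@example.com \
      -c commit.gpgsign=false commit -q -m "$1"
}

echo "md5: version-1" > pointer.dvc
git add pointer.dvc
commit "add raw data"

echo "md5: version-2" > pointer.dvc
git add pointer.dvc
commit "Dataset update"

# Restore only pointer.dvc from the previous commit; HEAD does not move
git checkout -q HEAD^1 -- pointer.dvc
restored=$(cat pointer.dvc)
head_subject=$(git log -1 --format=%s)
echo "$restored (HEAD: $head_subject)"
```

The restored file is left staged, so a plain git commit afterwards records the revert.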

Next we run dvc checkout so that the matching data file appears. The output shows that data/data.xml has been modified

$ dvc checkout
M       data/data.xml

To keep this version of data/data.xml.dvc, we can do a git commit. Since this version of the dataset is already stored in dvc, we do not need to run dvc add again.

git commit data/data.xml.dvc -m "Revert data update"