DVC - Data Version Control Cheatsheet
Setting up
The first thing to do in a brand new directory is to initialise git and dvc.
git init
dvc init
Next, we create a data directory and use dvc get to download data from a data registry to our local machine. dvc get works like wget or curl, except that it downloads files tracked in a dvc repository.
mkdir data
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
Now that we have the file data/data.xml, we can add it to tracking
dvc add data/data.xml
As soon as we run this, dvc will instruct us to add the changes to git. Running dvc add generates two files, data/.gitignore and data/data.xml.dvc, which we stage with git
git add data/.gitignore data/data.xml.dvc
We will then commit these two files using git
git commit -m "add raw data"
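The data/.gitignore file is what keeps the raw data out of git while the small .dvc file goes in. A minimal sketch of that effect using git alone, in a throwaway repository: the temporary directory and the sample file content are made up for illustration, and the /data.xml entry is my assumption of what dvc add writes into data/.gitignore.

```shell
# Throwaway repository so nothing in the real project is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical stand-in for the downloaded data file.
mkdir data
echo "<posts/>" > data/data.xml

# Assumed content of the data/.gitignore written by dvc add.
echo "/data.xml" > data/.gitignore

# git check-ignore prints the path when an ignore rule matches it.
ignored=$(git check-ignore data/data.xml)
echo "ignored by git: $ignored"
```

With the ignore rule in place, git add data/ would pick up only .gitignore and the .dvc file, never the data itself.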
If we take a look at data/data.xml.dvc, we will see something like the following. This file contains the metadata required to track the data file and will go into your git repository.
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
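The md5 value is a hash of the file's content, which is how a change in the data can be detected without storing the data in git. A rough sketch of the idea with GNU md5sum on a throwaway file: the file name sample.xml and its content are hypothetical, and this is not dvc's own implementation (older dvc releases, as far as I know, normalise line endings in text files before hashing, so values may not always match a plain md5sum).

```shell
# Hypothetical stand-in for a tracked data file.
workdir=$(mktemp -d)
printf '<posts>\n</posts>\n' > "$workdir/sample.xml"

# Record the content hash, similar in spirit to the md5 field in a .dvc file.
recorded=$(md5sum "$workdir/sample.xml" | cut -d' ' -f1)

# Change the file and hash it again.
printf '<posts>\n<row/>\n</posts>\n' > "$workdir/sample.xml"
current=$(md5sum "$workdir/sample.xml" | cut -d' ' -f1)

# Differing hashes are the signal that the data has changed.
if [ "$recorded" != "$current" ]; then
    echo "sample.xml changed: $recorded -> $current"
fi
```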
Next, we add a remote storage. In this case I'm adding an S3 bucket and using the path "test". The -d flag makes it the default remote
dvc remote add -d storage s3://derekchia/test
Then we commit .dvc/config, which now holds the configuration for our remote storage
git commit .dvc/config -m "Configure remote storage"
Next, we can push the data file into our remote storage
dvc push
Removing data and pulling it back
We can now try to remove the data file and then pull it back again. We also need to remove .dvc/cache, since this is where our data files are actually stored. See https://dvc.org/doc/command-reference/cache for more information
rm -f data/data.xml
rm -rf .dvc/cache
Using dvc pull, we then pull the data back from our remote storage
dvc pull
Making changes to your data and reverting to previous version
To mimic a change in the data, we double its size using the following commands
cp data/data.xml /tmp/data.xml
cat /tmp/data.xml >> data/data.xml
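A quick way to check that the append really doubled the file is to compare byte counts before and after. The sketch below repeats the same two steps on a small throwaway file (the temporary directory and the row1/row2 content are made up for illustration) rather than on the real dataset.

```shell
# Hypothetical stand-in for data/data.xml.
workdir=$(mktemp -d)
printf 'row1\nrow2\n' > "$workdir/data.xml"

before=$(wc -c < "$workdir/data.xml")

# Same two steps as above: copy aside, then append the copy to the original.
cp "$workdir/data.xml" "$workdir/copy.xml"
cat "$workdir/copy.xml" >> "$workdir/data.xml"

after=$(wc -c < "$workdir/data.xml")
echo "before=$before bytes, after=$after bytes"
```

The copy is needed because appending a file to itself with cat is either refused or unsafe, depending on the cat implementation.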
When we change our data file and run dvc add again, the .dvc file is updated with the new hash. This means that we need to commit it with git before pushing the changed data to our remote storage
dvc add data/data.xml
git add data/data.xml.dvc
git commit -m "Dataset update"
dvc push
We can confirm that the updated file is pushed into the remote storage by verifying that our remote storage now has two folders, one for each version of the file.
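The folders in the remote come from how dvc names stored objects: as far as I understand, each file is stored under its content hash, with the first two characters of the hash used as the folder name. A small sketch that derives such a path from the md5 in data/data.xml.dvc; both layouts shown are assumptions that depend on the dvc version, so check your own bucket or .dvc/cache to see which applies.

```shell
# Hash copied from the data/data.xml.dvc example.
md5=a304afb96060aad90176268345e10355

# First two characters become the folder, the remainder the file name.
prefix=$(printf '%s' "$md5" | cut -c1-2)
rest=$(printf '%s' "$md5" | cut -c3-)

# Assumed layouts, relative to the remote path or to .dvc/cache.
echo "$prefix/$rest"            # older dvc releases
echo "files/md5/$prefix/$rest"  # dvc 3.x
```

Since the updated file has a different hash, it usually lands in a different two-character folder, which is why a second folder shows up after the second dvc push.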
If we look at our git commit log, we will see that we have several commits.
$ git log --oneline
b3330e4 (HEAD -> master) Dataset update
b74143a Configure remote storage
b1ef2ae add raw data
To revert data/data.xml.dvc to its previous version, we run git checkout on that file
$ git checkout HEAD^1 data/data.xml.dvc
Updated 1 path from 522ae3f
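Checking out a single path like this restores only that file from the earlier commit; HEAD stays on the latest commit and the restored file is staged as a change. A self-contained sketch in a throwaway repository, where pointer.dvc, its contents and the demo identity are hypothetical stand-ins for data/data.xml.dvc and your own git setup:

```shell
# Throwaway repository so the real project is untouched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# First version of the hypothetical pointer file.
echo "md5: version-one" > pointer.dvc
git add pointer.dvc
git -c user.name=demo -c user.email=demo@example.com commit -q -m "add raw data"

# Second version, committed on top.
echo "md5: version-two" > pointer.dvc
git -c user.name=demo -c user.email=demo@example.com commit -q -am "Dataset update"

# Restore only this file from the previous commit; HEAD does not move.
git checkout -q HEAD^ -- pointer.dvc

restored=$(cat pointer.dvc)
head_subject=$(git log -1 --format=%s)
echo "file content: $restored"
echo "HEAD commit:  $head_subject"

# The restored file shows up as a staged modification.
git status --short
```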
Next, we run dvc checkout so that the matching data file appears in our workspace. The output shows that data/data.xml has been modified
$ dvc checkout
M data/data.xml
To keep this version of data/data.xml.dvc, we can do a git commit. Since this version of the dataset is already stored in dvc, we do not need to run dvc add again.
git commit data/data.xml.dvc -m "Revert data update"