
Using DVC to create an efficient version control system for data projects

Basile Guerrapin
Jul 10 · 9 min read

Challenging our initial setup

1. Structure the project

2. Reproduce a previous state of the project

3. Keep tracking file metrics

Our use case

Using DVC to track the project’s data (and increase productivity)

pip install dvc
DVC flow for a file model.pkl and its associated pointer model.pkl.dvc: the pointer is versioned with git, while model.pkl is synced with remote storage. (Source: https://github.com/iterative/dvc)

Versioning data files

Data directory structure after tracking both dataset.csv and documents/. DVC created a pointer file for each resource, plus a .gitignore file.
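
To make the pointer idea concrete, here is a minimal Python sketch of content-addressed tracking: it hashes a data file, copies it into a cache keyed by that hash, and writes a small pointer file that could be committed to git in place of the data. This is an illustration only, not DVC's actual implementation or file format; `track_file`, `restore_file`, the `.pointer` suffix and the JSON layout are all hypothetical.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy data_path into a hash-keyed cache and write a pointer next to it.

    Hypothetical helper: the pointer name and JSON layout are made up for
    illustration and do not match DVC's real .dvc format.
    """
    checksum = file_md5(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_path, cache_dir / checksum)
    pointer_path = data_path.with_name(data_path.name + ".pointer")
    pointer_path.write_text(json.dumps({"md5": checksum, "path": data_path.name}))
    return pointer_path


def restore_file(pointer_path: Path, cache_dir: Path) -> Path:
    """Recreate the data file described by a pointer from the cache."""
    pointer = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(pointer["path"])
    shutil.copy2(cache_dir / pointer["md5"], target)
    return target
```

The pointer is tiny and changes whenever the data changes, which is why it can live in git while the heavy file stays out of the repository.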

Define project steps as stages

dvc run \
-d data/dataset.csv \
-d data/documents/ \
-o vat_detection/has_vat_amount/assets/model.pkl \
-M metrics/has_vat_amount.json \
-f train.dvc \
python train.py
deps: dependencies of the stage; outs: outputs, including metrics (cached outputs are not tracked by git); md5: checksum of the stage
git add train.dvc
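
For the stage above to work, the script only has to read its declared dependencies and write its declared outputs. The sketch below shows that contract for a script like train.py. It is not our actual training code: `run_train_stage` is a hypothetical name, the "model" is a placeholder dictionary, and the metric written is a stand-in chosen for the example.

```python
import csv
import json
import pickle
from pathlib import Path


def run_train_stage(dataset_csv, model_out, metrics_out):
    """Consume the declared dependency and write the declared outputs."""
    with open(dataset_csv, newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Placeholder for the real fitting step, which is not shown here.
    model = {"trained_on_rows": len(rows)}

    # Output declared with -o: the serialized model.
    model_out = Path(model_out)
    model_out.parent.mkdir(parents=True, exist_ok=True)
    with model_out.open("wb") as handle:
        pickle.dump(model, handle)

    # Output declared with -M: a small JSON metrics file.
    metrics_out = Path(metrics_out)
    metrics_out.parent.mkdir(parents=True, exist_ok=True)
    metrics_out.write_text(json.dumps({"trained_on_rows": len(rows)}))
    return model
```

In the project layout used in this article, the call would be `run_train_stage("data/dataset.csv", "vat_detection/has_vat_amount/assets/model.pkl", "metrics/has_vat_amount.json")`.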

Bundle stages into a pipeline

Pipeline example for the project. Each blue box represents a stage. extraction.dvc: pre-computes the HTML version of the documents; split_dataset.dvc: splits the data into train and test sets; train.dvc: produces a trained model; evaluate.dvc: assesses performance on the test set.
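
A pipeline like this one is a dependency graph, and the point of reproducing it is that only the stages affected by a change need to run again. The Python sketch below illustrates that idea with the standard library's `graphlib` (Python 3.9+). The stage names come from the figure, but the edges between them are an assumption made for the example, and `stages_to_rerun` is a hypothetical helper, not how DVC is implemented.

```python
from graphlib import TopologicalSorter

# Stage -> stages it depends on. The names come from the pipeline figure;
# the exact edges are an assumption made for this example.
PIPELINE = {
    "extraction.dvc": set(),
    "split_dataset.dvc": set(),
    "train.dvc": {"extraction.dvc", "split_dataset.dvc"},
    "evaluate.dvc": {"train.dvc", "split_dataset.dvc"},
}


def stages_to_rerun(pipeline, changed):
    """Return the changed stages and everything downstream, in run order."""
    stale = set()
    ordered = []
    # static_order() yields every stage after all of its dependencies.
    for stage in TopologicalSorter(pipeline).static_order():
        if stage in changed or pipeline[stage] & stale:
            stale.add(stage)
            ordered.append(stage)
    return ordered
```

Under these assumed edges, a change to the split stage marks train.dvc and evaluate.dvc as stale but leaves extraction.dvc alone, so the expensive HTML pre-computation is not repeated.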

Usage Limits

dvc repro evaluate.dvc
Example of Makefile for the VAT auto-detection project

An extra pinch of DVC features

git checkout old_state
dvc checkout

Packaging as a library

A word about our production system

git+https://{host}/vat_detection@{reference}#egg=vat_detection{version}
# host is the name of the git remote server
# reference can be a git hash or a tag or a branch name
# version is the version of the package as defined in setup.py
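
As a small illustration, the requirement line can be generated from its three parts instead of being edited by hand. The template below is copied verbatim from the snippet above; the helper name `build_requirement` is hypothetical, and because the snippet does not say how `{version}` is joined to the egg name, the caller passes that fragment exactly as it should appear.

```python
# Template copied verbatim from the requirement line shown above.
REQUIREMENT_TEMPLATE = (
    "git+https://{host}/vat_detection@{reference}#egg=vat_detection{version}"
)


def build_requirement(host: str, reference: str, version: str) -> str:
    """Fill the template; all three values are supplied by the caller.

    Hypothetical helper: `version` is inserted as-is, separator included,
    since the source snippet does not specify one.
    """
    return REQUIREMENT_TEMPLATE.format(
        host=host, reference=reference, version=version
    )
```

Pinning `reference` to a tag or a commit hash rather than a branch name keeps the installed model code reproducible.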

Takeaways


Qonto ∙ Blog