Large-Scale Data Version Control for HPC and HTC with Git and DataLad

DataLad is a domain-agnostic data management system based on the version control tools Git (git-scm.com) and git-annex (https://git-annex.branchable.com). Its core data structure, the DataLad Dataset, is a joint Git/git-annex repository that provides version control for data, code, and software containers. Unlike Git alone, this combination is suitable for large and binary files. In addition, DataLad supports computational reproducibility by capturing the outcome of process executions in a machine-actionable reproducibility record.

In high-performance and high-throughput computing, version control and reproducibility management conflict with efficient and highly concurrent processing. [3] developed a large-scale processing framework centered around DataLad and prototyped it on different HPC systems. In [2], this work has been extended to integrate directly with the SLURM job scheduler and to avoid further inefficient behavior patterns that may emerge on parallel file systems.

This tutorial aims to help participants understand the importance and difficulties of version control and reproducibility management in HPC. In a hands-on fashion, it introduces DataLad and the DataLad-SLURM extension so that participants can bring these valuable concepts to their own HPC systems and use cases.