Conquaire: Continuous quality control for research data to ensure reproducibility

Why we use Git (technical note no. 1)

One of the inspirations for this project came from the observation that GitHub had become popular not just among software developers, but also among other knowledge workers such as scientists.

GitHub, as the name suggests, is built around Git (although second-class support for SVN was added later). So of course we looked at Git first, but we avoided committing ourselves to Git in the project proposal because a fair evaluation of all options was to be part of the project.

It’s time to admit that we did not spend much time looking at alternatives. Git is the dominant versioning software today, and there is no foreseeable competitor. According to a 2015 survey by the popular question-and-answer website Stack Overflow, of the 16,694 participants who answered the question about versioning software, 69.3% used Git, 36.9% used SVN, 12.2% used TFS, 7.9% used Mercurial, 4.2% used CVS, 3.3% used Perforce, 5.8% used some other versioning software, and 9.3% used no versioning software at all. (The figures add up to more than 100%, so participants could evidently name more than one system.)

Other studies and trends point in the same direction.

Teaching researchers to use versioning software that is not widely used (such as Mercurial or Perforce), is limited to one operating system (such as TFS), or is obsolete (such as CVS) was out of the question, as we will not always be there to support them. Eventually, when they need help from colleagues or their system administrator, Git is the versioning software those people are most likely to know.

Of course, there are other criteria besides popularity that must be considered. We strongly prefer distributed versioning systems. These can be used offline because every working copy holds the full history, including branches, locally. This is great for a project that has to keep an eye on long-term availability because lots of copies keep stuff safe, as the saying goes. SVN is not a distributed versioning system, which rules it out.

It goes without saying that any software used for archiving should be open source and freely licensed (FOSS). At the very least, its storage format must be documented openly. Freely available source code is a very precise way of documenting a storage format.

The following table summarises the above:

Software name  Popularity  Actively maintained  Distributed  Cross-platform  FOSS
CVS            low         no                   no           yes             yes
Git            high        yes                  yes          yes             yes
Mercurial      low         yes                  yes          yes             yes
Perforce       low         yes                  no           yes             no
SVN            medium      yes                  no           yes             yes
TFS            low         yes                  no           no              no
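
The selection logic behind the table can be sketched in a few lines of Python. This is only an illustration of the reasoning in this note, not a tool we use: the `Tool` class, the `meets_hard_criteria` function and the numeric popularity ranking are names and simplifications introduced here, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """One row of the comparison table (hypothetical helper type)."""
    name: str
    popularity: str  # "low", "medium" or "high", as in the table
    maintained: bool
    distributed: bool
    cross_platform: bool
    foss: bool


TOOLS = [
    Tool("CVS", "low", False, False, True, True),
    Tool("Git", "high", True, True, True, True),
    Tool("Mercurial", "low", True, True, True, True),
    Tool("Perforce", "low", True, False, True, False),
    Tool("SVN", "medium", True, False, True, True),
    Tool("TFS", "low", True, False, False, False),
]

# Assumption: popularity levels are ordered low < medium < high.
POPULARITY_RANK = {"low": 0, "medium": 1, "high": 2}


def meets_hard_criteria(tool: Tool) -> bool:
    """True if the tool is maintained, distributed, cross-platform and FOSS."""
    return (tool.maintained and tool.distributed
            and tool.cross_platform and tool.foss)


# First apply the yes/no criteria, then prefer the most popular survivor.
candidates = [t for t in TOOLS if meets_hard_criteria(t)]
best = max(candidates, key=lambda t: POPULARITY_RANK[t.popularity])

print([t.name for t in candidates])  # ['Git', 'Mercurial']
print(best.name)                     # Git
```

Only Git and Mercurial pass all four yes/no criteria; popularity then decides between them.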

Being distributed makes a versioning system somewhat more complex to use, however, which brings us to possible roadblocks on the way to successful Git use. Two come to mind: (1) learnability/usability and (2) large files.

Regarding (1): Finding out how hard it is for non-technical users to learn to use Git will be one of the outcomes of this project. Our working hypothesis is that for versioning research data, it is sufficient to learn a small subset of Git, which should not be too challenging. It is too early to report our experiences at this stage.

Regarding (2): Large files have given us quite a headache. Git was originally not intended to be used with large files. The same is true for most versioning systems. They are intended for tracking changes that are caused by intellectual efforts: these rarely result in large files directly. Still, we want to include large files such as video recordings when documenting research projects.

How large is large? GitHub warns people when pushing a file larger than 50 MB, and does not accept files larger than 100 MB. A video recording will often be larger than that.
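
To get a feeling for which files in a project would run into these limits, a check along the following lines could be run before pushing. It is a minimal sketch under stated assumptions: the function names `classify_size` and `find_large_files` are made up for this note, and one MB is taken to be 1024 × 1024 bytes, which may differ from how GitHub counts.

```python
from pathlib import Path

# Assumption: 1 MB = 1024 * 1024 bytes.
MB = 1024 * 1024
WARN_LIMIT = 50 * MB     # GitHub warns above this size
REJECT_LIMIT = 100 * MB  # GitHub refuses files above this size


def classify_size(size_in_bytes: int) -> str:
    """Return 'ok', 'warning' or 'rejected' for a given file size."""
    if size_in_bytes > REJECT_LIMIT:
        return "rejected"
    if size_in_bytes > WARN_LIMIT:
        return "warning"
    return "ok"


def find_large_files(root: str) -> list[tuple[str, str]]:
    """List (path, status) for every file under root that is not 'ok'."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            status = classify_size(path.stat().st_size)
            if status != "ok":
                results.append((str(path), status))
    return results


print(classify_size(250 * MB))  # rejected
```

Calling `find_large_files(".")` in a project directory would list the files that need special treatment. Note that the sketch also walks into the hidden `.git` directory if one exists.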

Fortunately, a free (MIT-licensed) and open-source extension to Git called Git Large File Storage (or Git LFS) can be used to alleviate this issue. It works around Git’s size limitations by uploading large files to a separate storage area while tracking only metadata about these large files inside Git.
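
The metadata that remains inside Git is a small text file, called a pointer, which records a SHA-256 checksum and the size of the real file. The following Python sketch shows roughly what such a pointer looks like. It is an illustration only: Git LFS creates these pointers itself, and the function name `lfs_pointer` is made up for this note.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


# A stand-in for a large video file; real content would be read from disk.
fake_video = b"\x00" * 1_000_000
print(lfs_pointer(fake_video))
```

However large the original file is, the pointer stays at three short lines, which is why the Git repository itself remains small. In practice, the `git lfs track` command records in a `.gitattributes` file which file patterns are handled this way.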

Using Git on the command line can be demanding. In our experience, graphical user interfaces (GUIs) that promise a more intuitive interaction style with Git often do not live up to expectations. Instead, we recommend a web interface. We will write more about web interfaces in Conquaire technical note no. 2.