6 February 2023

How I Leveraged Git to Get Accurate File Dates on CI/CD

By Djim Loic

Hive is a courseware management tool I developed a couple weeks back with the intention of streamlining courseware production and publishing.

Background

Each lecture in Hive has created_at and last_updated date values. These values were initially hard-written in the lecture metadata by courseware maintainers but were removed from metadata and acquired through the node fs.stat API for easier maintenance.

The point of the stat method is to read filesystem metadata for specified files, including the mentioned date values. It worked™ until it didn’t.

The problem

Hive uses GitHub Actions (CI/CD) to redeploy output assets whenever lectures course content changes. This is where the problem showed up.

I had previously tested the fs.stat API locally with files that were old and then updated in a more recent date and it worked. But things are different in a CI/CD throwaway virtual machine world.

When a build is started on CI/CD the contents of a course’s repository are checked out using Git on a virtual machine, and that's exactly the problem.

Git does not preserve metadata

Git does not preserve filesystem metadata to minimize the repository storage footprint and keep things fast. It generates that metadata on checkout, and dates are set to the current timestamp. Created, last updated, and last access time are all set to the same timestamp at the moment of checkout, which is problematic for Hive.

A rough idea

I came up with the idea of using the git commit log as my metadata source of truth. The created_at date would be the timestamp of the first commit of a file, and the last_updated would be the last commit with changes in said file.

But this comes with its own set of challenges. Reading, formatting, and mapping a git log to a suitable representation for each lecture file could take an exponential amount of time the bigger the log gets.

This contradicts the point of Hive especially given that string operations are slow in JavaScript. Therefore executing the git log command by hand in a subprocesss and using its output wasn't really feasible.

The solution

I decided to use something native and discovered nodegit. nodegit is an npm package with native bindings for libgit2, a pure C implementation of core Git methods that leverages super fast native data structures and libraries. It powers solutions like GitLab, GitHub, and several others.

This allowed me to easily traverse the git log for each file with almost no processing overhead and provided an easy-to-maintain interface. It now works like a charm thanks to the awesome effort the nodegit and libgit2 team have put into these libraries.

You can find the fix in this revision.

I have another challenge I'm facing but that's a story for another blog.


Written by Nabil Tharwat

Nabil Tharwat is a software engineer and mentor who's super in love with all things accessibility and performance. He's host of The Weekly Noob podcast and his content has reached thousands of people around the world.

Learn more about Nabil.