By Jonathan Corbet
February 3, 2020

The Git source-code management system is famously built on the SHA-1 hashing algorithm, which has become an increasingly weak foundation over the years. SHA-1 is now considered to be broken and, despite the fact that it does not yet seem to be so broken that it could be used to compromise Git repositories, users are increasingly worried about its security. The good news is that work on moving Git past SHA-1 has been underway for some time, and is slowly coming to fruition; there is a version of the code that can be looked at now.
How Git works, simplified
To understand why SHA-1 matters to Git, it helps to have an idea of how the underlying Git database works. What follows is an oversimplified view of how Git manages objects; readers who are already familiar with this material can skip it.
Git is often described as being built on a content-addressable filesystem – one where you can look up an object if you know that object’s contents. That may not seem particularly useful, but there’s more than one way to “know” those contents. In particular, you can substitute a cryptographic hash for the contents themselves; that hash is rather easier to work with and has some other useful properties.

Git stores a number of object types, using SHA-1 hashes to identify them. So, for example, the SHA-1 hash of drivers/block/floppy.c in a 5.6-merge-window kernel, as calculated by Git, is a string of 40 hexadecimal digits. Conceptually, at least, Git will store that version of floppy.c in a file, using that hash as its name; early versions of Git actually did that. If somebody makes a change to floppy.c, even just removing an extra space from the end of a line, the result will have a completely different SHA-1 hash and will be stored under a different name.

A Git repository is thus full of objects (often called "blobs") with SHA-1 names; since a new one is created for each revision of a file, they tend to proliferate. Your editor's kernel repository currently contains more than eight million objects.

But blobs are not the only types of objects stored in a Git repository. An individual file object holds a particular set of contents, but it has no information about where that file appears in the repository hierarchy. If floppy.c is moved to drivers/staging someday, its hash will remain the same, so its representation in the Git object database will not change. Keeping track of how files are organized into a directory hierarchy is the job of a "tree" object. Any given tree object can be thought of as a collection of blobs (each identified by its SHA-1 hash, of course) associated with their location in the directory tree. As one might expect, a tree object has an SHA-1 hash of its own that is used to store it in the repository.

Finally, a "commit" object records the state of the repository at a particular point in time. A commit contains some metadata (committer, date, etc.) along with the SHA-1 hash of a tree object reflecting the current state of the repository. With that information, Git can check out the repository at a given commit, reproducing the state of the files in the repository at that point.

Importantly, a commit also contains the hash of the previous commit (or multiple commits in the case of a merge); it thus records not just the state of the repository, but the previous state, making it possible to determine exactly what changed. Commits, too, have SHA-1 hashes, and the hash of the previous commit (or commits) is included in that calculation.
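The naming scheme described above can be sketched for the simplest object type, the blob. This is a minimal illustration rather than Git's actual code: it assumes the usual blob layout, in which Git hashes a short header (`blob`, the content length, and a NUL byte) followed by the file contents. The function name `git_blob_id` and the sample contents are made up for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 name a blob with these contents would get.

    Assumes the "blob <length>\\0" header that Git prepends before hashing.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two hypothetical revisions that differ only by a trailing space.
with_space = b"static int floppy_open(void); \n"
without_space = b"static int floppy_open(void);\n"

print(git_blob_id(with_space))
print(git_blob_id(without_space))

# The file's path never enters the calculation, so moving a file
# to another directory leaves its blob name unchanged.
```

The two printed names share nothing recognizable even though the inputs differ by a single byte, which is the property the rest of the design relies on.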
If two chains of development end up with the same file contents, the resulting commits will still have different hashes. Thus, unlike some other source-code management systems, Git does not (conceptually, at least) record "deltas" from one revision to the next. It thus forms a sort of blockchain, with each block containing the state of the repository at a given commit.

Why hash security matters
The compromise of kernel.org in 2011 created a fair amount of concern about the security of the kernel source repository. If an attacker were able to put a backdoor into the kernel code, the result could be the eventual compromise of vast numbers of deployed systems. Malicious code placed into the kernel’s build system could be run behind any number of corporate and government firewalls. It was not a pleasant scenario but, thanks to the use of Git, it was also not a particularly likely one.
Let us imagine that some attacker has gained control of kernel.org and wants to place some evil code into floppy.c – something unspeakable like a change that replaces random sectors with segments from Rick Astley videos, say. Somehow this change would have to be incorporated into the repository so that it would be included in subsequent pulls. But the change to floppy.c changes its SHA-1 hash; that, in turn, will change every tree object containing the evil floppy.c and every commit that includes it as well. The head commit for the repository would certainly change, as would older ones if the attacker tried to make the change appear to have happened in the distant past.

Somewhere out there is certainly some developer who actually memorizes SHA-1 hashes and would immediately notice a change like that. The rest of us probably would not, but Git will. The distributed nature of Git means that there are many copies of the repository out there; as soon as a developer tries to pull from or push to the corrupted repository, the operation will fail due to the mismatched hashes between the two repositories and the corruption will come to light.

Repository integrity is also protected by signed tags, which include the hash for a specific commit and a cryptographic signature. The chain of hashes leading up to a given tag cannot be changed without invalidating the tag itself. The use of signed tags is not universal in the kernel community (and rare to nonexistent in many other projects), but mainline kernel releases are signed that way. When one sees Linus Torvalds’s signature on a tag, one knows that the repository is in the state he intended when the tag was applied.

All of this depends on the strength of the hash used, though. If our attacker is able to modify floppy.c in such a way that its SHA-1 hash does not change, that modification could well go undetected. That is why the news of SHA-1 hash collisions creates concern; if SHA-1 cannot be trusted to detect hostile changes, then it is no longer assuring the integrity of the repository.

The world has not ended yet, fortunately. It is still reasonably expensive to create any sort of SHA-1 hash collision at all. Creating any new version of floppy.c with the same hash would be hard. An attacker would not just have to do that, though; this new version would have to contain the desired hostile code, still function as a working floppy driver, and not look like an obfuscated C code contest entry (at least not more than it already does). Creating such a beast is probably still infeasible. But the writing is clearly on the wall; the time when SHA-1 is too weak for Git is rapidly approaching.
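The way a change to one file ripples up through trees and commits can be sketched in a few lines. This is a simplified model under stated assumptions, not Git's implementation: it handles only a single flat directory, uses a fixed, made-up author and timestamp so the output is deterministic, and the helper names (`object_id`, `tree_id`, `commit_id`, `head_of`) and file contents are hypothetical. The byte layouts follow Git's general object format, but the results are not guaranteed to match what Git would produce for a real repository.

```python
import hashlib
from typing import Dict, List


def object_id(kind: str, body: bytes) -> str:
    """Hash an object as "<kind> <length>\\0<body>"."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files: Dict[str, bytes]) -> str:
    """Name a flat tree: one entry per file, sorted by name."""
    body = b""
    for name in sorted(files):
        blob = object_id("blob", files[name])
        body += b"100644 " + name.encode("utf-8") + b"\0" + bytes.fromhex(blob)
    return object_id("tree", body)


def commit_id(tree: str, parents: List[str], message: str) -> str:
    """Name a commit; the parent hashes are part of the hashed text."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    # Made-up identity and timestamp, fixed for reproducibility.
    lines.append("author A Developer <dev@example.com> 0 +0000")
    lines.append("committer A Developer <dev@example.com> 0 +0000")
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return object_id("commit", body.encode("utf-8"))


def head_of(history: List[Dict[str, bytes]]) -> str:
    """Commit each snapshot in order and return the final commit's name."""
    parents: List[str] = []
    for n, files in enumerate(history):
        parents = [commit_id(tree_id(files), parents, f"revision {n}")]
    return parents[0]


honest = [
    {"Makefile": b"obj-y += floppy.o\n", "floppy.c": b"/* driver v1 */\n"},
    {"Makefile": b"obj-y += floppy.o\n", "floppy.c": b"/* driver v2 */\n"},
]
# The attacker rewrites only the OLD revision of floppy.c.
tampered = [dict(honest[0], **{"floppy.c": b"/* driver v1, evil */\n"}), honest[1]]

print(tree_id(honest[1]) == tree_id(tampered[1]))  # latest files are identical
print(head_of(honest) == head_of(tampered))        # but the head commits differ
```

Even though the latest snapshot is byte-for-byte the same in both histories, the head commit names differ because the altered parent hash feeds into the newer commit. Comparing a single head hash against any honest clone is therefore enough to expose the change, as long as the attacker cannot produce a colliding hash.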
Moving to a stronger hash
Back in the early days of Git, Torvalds was unconcerned about the possibility of SHA-1 being broken. As a result, he never designed in the ability to switch to a different hash; SHA-1 is fundamental to how Git operates. As of