Git : Under the hood

Devon Wijesinghe
8 min read · Sep 30, 2018

Hey everyone😛, are you guys excited to learn how Git works internally? If the answer is yes, you are in the right place. In this blog article, I will explain in simple terms how Git actually functions.

Section 1: Quick Refresher

Before I dive into the internals of Git, let's have a quick refresher on version control and Git basics so that even a beginner can follow along. (Feel free to skip this section if you are confident with the Git basics.)

What is a Version Control System (VCS)?

In general terms, a Version Control System is a software tool that lets you keep track of the changes you make to your project files and gives you the ability to retrieve earlier versions of those files later.

Why do we need a VCS?

Does this screenshot below remind you of something? 😆

Hahaha I know 😏, we have all been there! At some point in our developer journey, we have created copies of the project files before making changes, either because we were scared that we would mess everything up or because we just wanted to keep a specific version of the project as a backup. But as you must have realized, it gets really confusing which copy is which. If you were smart enough, you might have used timestamps, but still, this is a huge pain. But no worries, Version Control is here to save the day 🙌

This is not the only use case for a VCS. It also makes life much easier for software development teams by letting them manage changes and contribute to the source code together.

Types of VCS

There are 3 main types of Version Control Systems:

1. Local VCS

  • Runs fully locally
  • Keeps patch sets and can recreate what any file looked like at a given point in time using those patch sets (this type of VCS cannot be used by collaborative teams)

2. Centralized VCS

  • A single centralized server contains all the version files
  • Users can check out a particular version of the file(s) from the server, work on it locally, and commit changes back to the server (they will not have all versions locally)

Source: https://code.snipcademy.com/tutorials/git/introduction/how-version-control-works

3. Distributed VCS

  • Usually has a ‘main server repository’ that contains all the versions of the source code
  • Users can pull the main server repository to their local machine and work with it without internet connectivity (they will have all versions locally). They can then push their changes/additions back to the main server repository.

Source: https://code.snipcademy.com/tutorials/git/introduction/how-version-control-works

What is Git?

Git is a distributed version control system that uses snapshots to keep track of changes in the source code. Many other VCSs keep track of the differences between versions, while Git handles it differently. To explain in simple terms, each time you save (commit) your source code to the repository, Git takes a kind of picture (snapshot) of what the source code looks like and assigns a unique value (a SHA-1 hash) to each of these snapshots to identify them.
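
To make the snapshot idea a bit more concrete, here is a tiny Python sketch. It is only a toy model and not how Git is actually implemented: the `snapshot` function and the file contents are made up for illustration, and it hashes the raw file content, whereas real Git adds a small header before hashing.

```python
import hashlib

def snapshot(files: dict) -> dict:
    """Map each file name to the SHA-1 of its content (a toy 'snapshot')."""
    return {name: hashlib.sha1(data).hexdigest() for name, data in files.items()}

# Two hypothetical commits: only index.js changes between them.
commit_1 = snapshot({"index.js": b"console.log(1);\n", "README.md": b"# Demo\n"})
commit_2 = snapshot({"index.js": b"console.log(2);\n", "README.md": b"# Demo\n"})

# The unchanged file has the same hash in both snapshots, so its content
# only needs to be stored once; the changed file gets a new hash.
print(commit_1["README.md"] == commit_2["README.md"])  # True
print(commit_1["index.js"] == commit_2["index.js"])    # False
```

So a snapshot does not have to mean a full copy of every file each time: content that did not change can simply be referenced again by its hash.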

Any Git project will have four main stages: the working directory, the staging area, the local repository, and the remote repository. To understand more about these stages and what the Git workflow is like, please watch the following video:

Before going any further, I recommend that you set up Git, try out some basic commands, and get used to the workflow if you haven't already!

Learning resource:

Git Cheatsheet:

You can use the following cheat sheet to get familiar with the Git commands

Section 2: Git Internals

Okay!!! Now comes the interesting part 😁. At this point, I assume that you have a basic understanding of what Git is and know the basic Git workflow.

If you navigate into any project that was initialized with git init and look at its hidden directories, you will come across a directory named .git. This is where all the magic happens 😉

Let's have a look at what is inside this .git directory.

The screenshot shows the contents of the .git directory. We will take a closer look at the highlighted files/directories because they are the most important ones for understanding how Git works internally. (Note: the ‘branches’ directory is deprecated and is not used by newer versions of Git.)

objects

  • Acts as a database and stores all the content (blobs, trees and commits, which we will look into) as key-value pairs, where the key is a hash derived from the content
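
If "key-value pairs" sounds abstract, the following Python sketch may help. It is a toy in-memory stand-in for the objects directory, not Git's real storage code: the `ObjectStore` class is hypothetical, and it hashes the raw content only (real Git prepends a small header and compresses the data before writing it to disk).

```python
import hashlib

class ObjectStore:
    """Toy in-memory stand-in for .git/objects: hash of content -> content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Simplified key: SHA-1 of the raw content (no Git header).
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)

store = ObjectStore()
key_a = store.put(b"hello\n")
key_b = store.put(b"hello\n")      # same content -> same key, stored once
print(key_a == key_b, len(store))  # True 1
print(store.get(key_a))            # b'hello\n'
```

Because the key is derived from the content, storing the same content twice does not create a second entry.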

refs

  • Stores pointers to commit objects (for example, branches and tags)

HEAD

  • Contains a reference that points to the branch you currently have checked out
Inside HEAD file
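
Here is a rough Python sketch of how HEAD and refs fit together. It assumes the usual text format of the HEAD file (a line starting with `ref: ` followed by the path of a ref); the `resolve_head` function, the `refs` dictionary and the hash value are made up for illustration.

```python
def resolve_head(head_text: str, refs: dict):
    """Return (ref_name, commit_hash) for the text found in a HEAD file."""
    head_text = head_text.strip()
    if head_text.startswith("ref: "):
        ref_name = head_text[len("ref: "):]
        # The ref maps to a commit hash; None if the branch has no commits yet.
        return ref_name, refs.get(ref_name)
    # Detached HEAD: the file holds a commit hash directly.
    return None, head_text

# Hypothetical stand-in for the files under .git/refs (path -> commit hash).
refs = {"refs/heads/master": "0123456789abcdef0123456789abcdef01234567"}

print(resolve_head("ref: refs/heads/master\n", refs))
```

In other words, HEAD usually points to a branch, and the branch in turn points to a commit.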

config

  • Contains project specific configurations
Inside the config file

hooks

  • Contains your client or server-side hook scripts

What are Plumbing commands?

If you take a look at commands like git add, git commit, or almost any of the other Git commands you use day to day (which are called “porcelain” commands), they are built by chaining together low-level core commands. These core commands are what we call “plumbing commands”.

Let's break it down!

I will explain in simple terms how changes to a file are tracked internally by Git! The diagram below will be used to explain steps 1, 2 and 3 clearly.

Step 1 (Setup):

After initializing a Git project, any file that is created in or copied into the project directory becomes part of the working tree. Git can see the file, but it does not track it yet.

At this point, if you run a git status command it would have the following output

Step 2 (Adding to staging):

When the file is staged, Git creates a blob that contains the content of the file in binary format. The blob is assigned a 40-character hash value (a SHA-1 hash computed from its content), which is used to uniquely identify it. The blob is then saved to the objects database in the Git repo.

The plumbing command git update-index --add index.js can be used to stage the file.

If you take a look at the objects directory now, you will see the following 40-character hash. The first two characters of the hash are used as the name of a subfolder, and the remaining 38 as the file name, to keep the objects more organised.
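
If you are curious where that hash comes from, the Python sketch below reproduces it outside of Git. It relies on the blob format described in the Git documentation (the text `blob`, a space, the content size, a null byte, then the content); the function names `blob_hash` and `object_path` and the sample content are my own and only for illustration.

```python
import hashlib

def blob_hash(content: bytes) -> str:
    """SHA-1 that Git assigns to a blob: hash of 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def object_path(sha: str) -> str:
    """Location under .git/objects: first 2 hex chars = folder, rest = file."""
    return f".git/objects/{sha[:2]}/{sha[2:]}"

sha = blob_hash(b"test content\n")
print(sha)               # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(object_path(sha))  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

You can compare the result with what Git itself reports by running echo 'test content' | git hash-object --stdin.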

The hash shown above is a reference to a blob that contains the binary data of the file we added. We can confirm this by running the following commands to see its type and contents.

$ git cat-file -t <HASH>
$ git cat-file blob <HASH>

Note: ‘-t’ flag is to get the type
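
As a rough idea of what git cat-file has to do, here is a Python sketch that packs and unpacks an object. It assumes the loose-object format (a `<type> <size>` header, a null byte, then the content, all zlib-compressed); `encode_object`, `decode_object` and the sample file content are hypothetical names and data, and the sketch works in memory instead of reading a real file from the objects directory.

```python
import zlib

def encode_object(obj_type: str, content: bytes) -> bytes:
    """Build the bytes stored on disk: zlib('<type> <size>' + NUL + content)."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return zlib.compress(header + content)

def decode_object(raw: bytes):
    """Inverse of encode_object: return (type, content)."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content

raw = encode_object("blob", b"console.log(1);\n")  # hypothetical index.js
obj_type, content = decode_object(raw)
print(obj_type)  # blob
print(content)   # b'console.log(1);\n'
```

The type printed by the -t flag is simply the first word of that header.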

Step 3 (Committing):

To prepare the blob to be committed, Git will create a tree. This is because there can be multiple blobs in the staging area at a given point in time, so a tree contains pointers to those blobs and acts as a snapshot of all the changes/files you want to commit.
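
To see what "pointers to those blobs" looks like in practice, here is a simplified Python sketch of a tree object. It assumes a flat project with regular files only (no subdirectories), where each entry is the file mode, the file name, a null byte and the 20 raw bytes of the blob hash; `tree_hash` and the file names are made up for illustration.

```python
import hashlib

def tree_hash(entries) -> str:
    """Hash a flat tree built from (mode, name, blob_sha_hex) tuples."""
    body = b""
    # Entries are stored sorted by name (files only in this sketch).
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1]):
        # Each entry: "<mode> <name>" + NUL + the 20 raw bytes of the hash.
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(sha_hex)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# Two hypothetical staged files, each pointing to a blob hash.
entries = [
    ("100644", "index.js", "d670460b4b4aece5915caf5c68d12f560a9fe3e4"),
    ("100644", "empty.txt", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
]
print(tree_hash(entries))
print(tree_hash([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
```

Notice that the tree stores names and hashes only. The file contents themselves stay in the blobs.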

Before we commit the files, we need to set the user's identity so that Git can track who made the commit. We can do this by using the following commands, which will update the config file.

$ git config user.name devon 
$ git config user.email wdevon99@gmail.com

You can verify that it was set correctly by having a look at the config file. It should look something like this.

The following plumbing commands can be used to commit the file, instead of git commit -m 'Message'

$ git write-tree

The above command creates a tree object in the objects directory and prints its hash.

Then, to add the commit message and finalize the commit, we will run the command below (the hash of the tree object is the one printed by git write-tree; it can also be found in the objects directory).

$ echo "my very first commit using plumbing commands" | git commit-tree <HASH>

Finally, to check that the commit was created successfully, run the following command with the commit hash printed by git commit-tree:

$ git cat-file -p <HASH>
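
The output of that command is plain text, and the Python sketch below builds the same kind of text by hand. It assumes the usual commit layout (a tree line, an optional parent line, author and committer lines, a blank line, then the message); `commit_object`, the example email address and the timestamp are placeholders for illustration, so the resulting hash will not match a commit in your repository.

```python
import hashlib

def commit_object(tree_sha, message, name, email, timestamp, tz="+0000", parent=None):
    """Build the text of a commit object and return (sha, content)."""
    lines = [f"tree {tree_sha}"]
    if parent:
        lines.append(f"parent {parent}")
    who = f"{name} <{email}> {timestamp} {tz}"
    lines += [f"author {who}", f"committer {who}", "", message]
    content = ("\n".join(lines) + "\n").encode()
    # Same scheme as other objects: hash of '<type> <size>' + NUL + content.
    header = b"commit " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest(), content

sha, content = commit_object(
    tree_sha="4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # the empty tree
    message="my very first commit using plumbing commands",
    name="devon",
    email="devon@example.com",  # placeholder address
    timestamp=1538265600,       # placeholder Unix timestamp
)
print(content.decode())
print(sha)
```

Since the author, the timestamp and the tree hash are all part of the hashed content, changing any of them produces a different commit hash.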

Conclusion

Hopefully, by now you have an idea of how Git handles your changes internally and how the commands you use in your day-to-day work life work under the hood.

To dive deeper into Git internals, have a look at the following links:

Thank you for reading :)
