
#+SETUPFILE: ../../config.org
#+TAGS: Git(g)
#+CATEGORY: Engineering
#+DATE: 2022-02-19
#+TITLE: Git Etiquette
#+DESC: A rather long essay on how to use Git in a civilized way

* Rationale                                                             :Git:
The ability to make and use tools is one of the many things that make us special,
and the skill to use tools properly is what makes some of us
elite.

As software engineers, interacting with [[https://git-scm.com/][Git]] is an important part of our daily life. These days
*Git* is the de facto standard of version control systems and almost everyone uses it. *Git* is
one of those special tools that every engineer has to be familiar with, since it's so widespread
in the tech world. It would be a big surprise to find a new project or company that is not using
*Git*.

As a free software contributor, I have spent my whole professional career in FOSS communities and
projects, and proper use of *Git* seems natural to me. But to my surprise, every now and then I
witness how some "commercial engineers" (air quotes) use Git, and it makes me sad that in a
commercial space, where people get paid to build technology, they do it so poorly. After a lot of
these incidents I decided to put together a document to help improve my team's *Git* workflows.
While there is plenty of reading material on the internet dedicated to *Git best practices*, I
thought it might be useful to publish that document to help others as well. For lack of a better
term I've chosen the title *"Git Etiquette"*. Following Git etiquette helps teams get more out of
their Git workflows and avoid frustration.

I'll try to keep it short and refer to essays from others who have explained it much better than
I can. I borrowed some words from others and listed most of the sources in the resources section to
the best of my ability, but since I wrote the original document long ago and it was supposed to be a
private doc for a few people, some of the sources might have been lost.

Also, I'll add more items to the list over time.

* Git Commits
Commits are the building blocks of version control with Git. Improving the quality of individual
commits improves the overall quality of the repository.

** Single purpose commits
Oftentimes engineers working on one particular thing get sidetracked into doing too many others.
For example, you are trying to fix one particular bug, you spot another one,
and you can’t resist the urge to fix that as well. And another one. Soon, it snowballs and you end
up with many unrelated changes all going together in one commit.

This is problematic, and it is better to keep commits as small and focused as possible for many
reasons, including:

- It makes it easier for other people in the team looking at your change, making code reviews
  more efficient.
- If the commit has to be rolled back completely, it’s far easier to do so.
- It's straightforward to track these changes with your ticketing system.
- It helps you mentally parse changes you’ve made using =git log=.

A commit should be a wrapper for related changes. For example, fixing two different bugs should
produce two separate commits. Small commits make it easier for other team members to understand
the changes and roll them back if something goes wrong. With tools like the staging area and the
ability to stage only parts of a file, Git makes it easy to create very granular commits.
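
As a minimal sketch of the idea, the following script builds a throwaway repository in which two
unrelated fixes are made in the same working session but recorded as two separate commits. The
file names, commit messages, and demo identity are made up for illustration. When both fixes touch
the same file, =git add -p= lets you pick individual hunks interactively instead.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; every name and value below is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'timeout=30\n' > network.conf
printf 'level=debgu\n' > logging.conf
git add network.conf logging.conf
git commit -q -m "Add initial configuration"

# Two unrelated fixes made in the same working session.
printf 'timeout=60\n' > network.conf
printf 'level=debug\n' > logging.conf

# Stage and commit each fix on its own.
git add network.conf
git commit -q -m "Increase network timeout to 60 seconds"

git add logging.conf
git commit -q -m "Fix typo in log level name"

git log --oneline
#+END_SRC

Each fix can now be reviewed, reverted, or cherry-picked without dragging the other one along.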

** Commit Messages
On many occasions we need to inspect the *Git* history to find something: a commit, specific changes,
clues about an error, or even the engineer who made a certain change. I have had bittersweet
experiences with commit messages in the Git history of projects. Let me
demonstrate with real examples.

I have seen many times in commercial teams that engineers don't bother to write a proper and useful
Git commit message. For some reason that is beyond my understanding, they think having *"I hate my
life!"* as the commit message for a commit with ~1200 lines of changes, in a repository with more than
~300k commits (at the time) that is used by about 200 engineers, is a cool thing to do. I came across
this commit message long ago when I was trying to figure out why a service was malfunctioning. The
message wasn't helpful at all and I had to read through the diff to figure out whether or not that
commit was the root of the issue. I can tell many stories like this one, but for the sake of this
essay one is enough.

But let's have a look at the real Git history of a repository that I don't like at all (using the
=--oneline= flag):

#+BEGIN_SRC
    2683332a333a Update tests
    3315442a4983e Remove icon from manage header
    aa234e8aa83f8 test fix
    29c35ba3adcee Class migration
    fbde3a265ab3f Migrate header styles
    01eaac4b4cc13 tests
    8d004a970eef7 fix tests
    d2890dfdc360 add tests
    91c2aa31720f2 add test for notice variable
    135a2df25e86a fix tests
    3aa4101546a93 refactor
    0eaae58006f51 add test for global variable
    3ae7ee7297104 remove unnecessary check
#+END_SRC

These commits are taken from a repository with more than 400k commits and many active contributors
in a commercial space (Don't worry, the SHAs are not the original SHAs).

On the other hand, a few weeks ago I pulled from the [[https://llvm.org/][LLVM]] repository and built it again (I do this weekly)
and tried to build the [[https://serene-lang.org][Serene compiler]] (a programming language that I'm working on) against it.
But the compilation failed with an error like "Identifier is unknown". I grepped the Git log of the LLVM
repository, found a commit, and all of a sudden smiled and praised the author in my mind. Here is
the commit message (I removed the commit details):

#+BEGIN_SRC
Date:   Wed Jan 12 11:20:18 2022 -0800

    [mlir] Finish removing Identifier from the C++ API

    There have been a few API pieces remaining to allow for a smooth transition for
    downstream users, but these have been up for a few months now. After this only
    the C API will have reference to "Identifier", but those will be reworked in a followup.

    The main updates are:
    * Identifier -> StringAttr
    * StringAttr::get requires the context as the first parameter
      - i.e. `Identifier::get("...", ctx)` -> `StringAttr::get(ctx, "...")`
#+END_SRC

It was so obvious how to fix my issue by looking at this fantastic commit message.

Which one would you rather read? Which one helps you understand what happened in a specific commit?

According to [[https://cbea.ms/git-commit/][Chris Beams]], a well-crafted Git commit message is the best way to communicate the context
of a change to other engineers (and our future selves). A diff will tell you what changed,
but only the commit message can properly tell you why.

Peter Hutterer [[https://who-t.blogspot.com/2009/12/on-commit-messages.html][makes this point]] well:

#+begin_quote
Re-establishing the context of a piece of code is wasteful. We can’t avoid it completely, so our
efforts should go to [[https://www.osnews.com/story/19266/wtfsm/][reducing it]] [as much] as possible. Commit messages can do exactly that and
as a result, a commit message shows whether a developer is a good collaborator.
#+end_quote

If you have ever used =git log= or any other Git subcommand that interacts with commits
(which many of them do), you'll understand what a valuable asset a well-written commit message
is.

The Git history is just a bunch of commits in a certain order. It's up to the engineers to make the
most of it. As a project grows, maintenance becomes an issue, and the messier your history
is, the harder it is to maintain the project. It also makes it painful for others to get involved in the
project.


There are seven easy rules that you can follow to rock your commit messages:

1. Separate subject from body with a blank line
2. Limit the subject line to 50 characters
3. Capitalize the subject line
4. Do not end the subject line with a period
5. Use the imperative mood in the subject line
6. Wrap the body at 72 characters
7. Use the body to explain what and why vs. how

I highly recommend reading the [[https://cbea.ms/git-commit/][How to Write a Git Commit Message]] post from Chris Beams, which
explains these rules in depth.
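
To make the rules concrete, here is a small sketch of a commit whose message follows all seven of
them. The repository, the file, and the wording of the message are invented for this example; only
the shape of the message matters.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; the file and the message are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'retries=3\n' > client.conf
git add client.conf

# "-F -" reads the whole commit message from standard input.
git commit -q -F - <<'EOF'
Add retry limit to the HTTP client config

Requests to the billing service used to be retried forever when
the service was down, which kept worker threads busy and hid the
outage from monitoring.

Cap the number of retries at three so that failures surface
quickly and callers can decide how to react.
EOF

# Print the subject line, then the full message.
git log -1 --format=%s
git log -1 --format=%B
#+END_SRC

The subject is short, capitalized, imperative, and has no trailing period. The body is separated by
a blank line, wrapped at 72 characters, and explains what changed and why rather than how.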

** Commit early, commit often
Git works best, and works in your favor, when you commit your work often. Instead of waiting to
make the commit perfect, it is better to work in small chunks and keep committing your work. Personally,
I have found it much easier to have smaller commits that group together related changes. This way
you can easily revert commits that you don't like, cherry-pick those that you want, and avoid dealing
with unnecessary changes that come along in a commit.

If you are working on a feature branch that could take some time to finish, committing in small steps
also makes it easier to keep your branch updated with the latest changes, so that you avoid large conflicts.

Also, Git only takes full responsibility for your data when you commit. Committing protects you from losing
work, lets you revert changes, and helps you trace what you did with =git reflog=.
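
As a rough illustration of that safety net, the sketch below commits twice, "accidentally" throws
the latest commit away with a hard reset, and then gets it back through the reflog. The repository
and file contents are hypothetical.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; the file and its contents are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'first draft\n' > notes.txt
git add notes.txt
git commit -q -m "Add first draft of notes"

printf 'second draft\n' > notes.txt
git commit -q -am "Rewrite notes"

# Simulate an accident: throw away the latest commit.
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD pointed before the reset.
git reflog

# HEAD@{1} is where HEAD pointed one move ago, before the reset.
git reset -q --hard 'HEAD@{1}'
cat notes.txt
#+END_SRC

This only works because the second draft was committed. Changes that were never committed do not
show up in the reflog.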


** Don’t commit generated files
This one is fairly obvious, but many times I have had to dig through the history to figure out who
committed an auto-generated file or a massive file to the repository.

Generally, only files that took manual effort to create and cannot be re-generated should be
committed. Generated files can be re-created at will, and they normally don’t
work well with line-based diff tracking. It is useful to add a =.gitignore= file to your
repository’s root to tell Git which files or paths you don’t want to track.
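
The following sketch shows the effect of a small =.gitignore=. The directory layout (=src/= for
hand-written code, =build/= for generated output) and the ignore patterns are assumptions chosen
for the example, not a recommendation for any particular project.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; the layout below is hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hand-written source next to generated build output and a log file.
mkdir -p src build
printf 'int main(void) { return 0; }\n' > src/main.c
printf 'generated\n' > build/main.o
printf 'generated\n' > debug.log

# Ignore the build directory and every log file.
cat > .gitignore <<'EOF'
build/
*.log
EOF

git add .
git commit -q -m "Add source tree and ignore rules"

# Only .gitignore and src/main.c end up being tracked.
git ls-files
#+END_SRC

Even with a blanket =git add .=, the generated files stay out of the history.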

* Don’t alter published history
Once a commit has been merged to an upstream default branch (and is visible to others), it is strongly
advised not to alter its history. Git and other VCS tools allow you to rewrite branch history, but doing so is
problematic for everyone who has access to the repository. While =git rebase= is a useful feature,
it should only be used on branches that only you are working with (private branches).

One of the key aspects of Git is its distributed nature: everyone can have their own
repository, push commits to their own fork, and send pull requests asking others to pull from
it. These days this process is centralized around Git hosting services. While they
provide forking functionality, forking is not common in commercial, closed source
projects, so engineers end up sharing feature branches. It
has happened to me many times, in different roles, that someone force pushed to a public (within the org)
branch and wrecked everyone's workflow. For the sake of your own peace of mind and others' sanity,
*DO NOT CHANGE THE PUBLIC HISTORY*.

It's kind of a joke, but if you are a public force pusher, I'll end my friendship with you.


Having said that, there will inevitably be occasions where a history rewrite is needed
on a published branch. Extreme care must be exercised while doing so.
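
One way to exercise that care is =git push --force-with-lease=, which refuses to overwrite a remote
branch that has moved since you last fetched it. The sketch below is only an illustration: the two
clones, the names =alice= and =bob=, and the local bare repository standing in for the hosting
service are all assumptions made for the example.

#+BEGIN_SRC shell
set -eu

# Demo identity for every repository created below (hypothetical).
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work="$(mktemp -d)"

# A local bare repository stands in for the shared remote.
git init -q --bare "$work/remote.git"

# Alice publishes the first commit.
git init -q "$work/alice"
cd "$work/alice"
printf 'v1\n' > app.txt
git add app.txt
git commit -q -m "Add app"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

# Bob clones the branch and publishes a commit on top of it.
git clone -q -b main "$work/remote.git" "$work/bob"
cd "$work/bob"
printf 'v2\n' > app.txt
git commit -q -am "Update app to v2"
git push -q origin main

# Alice rewrites her commit without fetching Bob's work first.
cd "$work/alice"
git commit -q --amend -m "Add the app"
if git push -q --force-with-lease origin main 2>/dev/null; then
    echo "push accepted"
else
    echo "push rejected: the remote branch moved since the last fetch"
fi
#+END_SRC

A plain =git push --force= at the same point would have silently discarded Bob's commit. The lease
check does not make rewriting published history polite, it only stops you from destroying work you
have not seen yet.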

* Merge VS Rebase

The golden rule is to never rebase on public branches and always merge to public branches.
When it comes to merge vs rebase, there are two simple rules.
*Note:* It's better to use squash and merge instead of a normal merge because, in projects with
many contributors, it is easier to maintain a Git history on the main branch that contains
one commit per feature.
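
Hosting services usually offer squash and merge as a button, but the same result can be sketched
locally with =git merge --squash=. The branch name =feature/login= and the commit messages below
are hypothetical.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; branch and file names are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'base\n' > app.txt
git add app.txt
git commit -q -m "Add app skeleton"
git branch -M main

# A feature branch with several work-in-progress commits.
git checkout -q -b feature/login
for step in 1 2 3; do
    printf 'login step %s\n' "$step" >> app.txt
    git commit -q -am "WIP login step $step"
done

# Squash the whole branch into a single commit on main.
git checkout -q main
git merge -q --squash feature/login
git commit -q -m "Add login flow"

git log --oneline main
#+END_SRC

The three work-in-progress commits stay on the feature branch, while =main= gains exactly one
commit describing the feature.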

** Don’t change other people’s history
You must never ever destroy other people's history. You must not rebase commits other people made.
Basically, if it is not your branch, you can't rebase it. Notice that this really is about other
people's history, not about other people's code. If you want to pull down some changes from other
developers into your branch, it’s fine to rebase, because it’s their code but it’s your history.
So you can go wild with rebasing it, even though you didn't write the code, as long as
the commit itself is your private one.

Minor clarification: once you've published your history in a public branch, other people may be
using it, and so now it's clearly not your private history anymore. So the minor clarification
really is that it's not just about *your commit*, it's also about it being private to your tree,
and you haven't pushed it out and announced it yet.
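
The sketch below shows the safe case: a private branch that was never pushed is replayed on top of
a =main= that moved in the meantime. The branch name =my-feature= and the files are made up for the
example.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; branch and file names are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'base\n' > base.txt
git add base.txt
git commit -q -m "Add base file"
git branch -M main

# A private branch that has not been pushed anywhere.
git checkout -q -b my-feature
printf 'feature\n' > feature.txt
git add feature.txt
git commit -q -m "Add feature file"

# Meanwhile, main moves forward with somebody else's change.
git checkout -q main
printf 'upstream\n' > upstream.txt
git add upstream.txt
git commit -q -m "Add upstream file"

# Replay the private commit on top of the new main.
git checkout -q my-feature
git rebase -q main

git log --oneline my-feature
#+END_SRC

After the rebase, =my-feature= is a straight line on top of =main=. Nobody else ever saw the old
version of the commit, so nobody's history was changed.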

** Don’t expose your unfinished work to the public
Keep your own history readable. Some people do this by just working things out in their head first
and not making mistakes, but that's very rare. The rest of us use =git rebase= etc.
while we work on our problems. So =git rebase= is not wrong. But it's right only if it's
*YOUR VERY OWN PRIVATE* git tree.

If you're still in the =git rebase= phase, you don't push it out. If it's not ready, you don't
tell the public at large about it. Don’t push your changes to a shared feature branch or the main
branch.
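
As one possible way to tidy up private work before anyone sees it, the sketch below folds a
follow-up correction into the commit it belongs to, using =git commit --fixup= together with
=git rebase -i --autosquash=. Setting =GIT_SEQUENCE_EDITOR=true= is a trick to accept the generated
todo list without opening an editor. The repository and file names are hypothetical.

#+BEGIN_SRC shell
set -eu

# Throwaway repository; file names are hypothetical.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

printf 'base\n' > app.txt
git add app.txt
git commit -q -m "Add app skeleton"

printf 'parser\n' > parser.txt
git add parser.txt
git commit -q -m "Add parser"

# A follow-up correction that belongs to the "Add parser" commit.
printf 'parser, fixed\n' > parser.txt
git commit -q -a --fixup=HEAD

# Fold the fixup into its target. GIT_SEQUENCE_EDITOR=true accepts the
# generated todo list unchanged, so no editor opens.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash HEAD~2

git log --oneline
#+END_SRC

The published history will then show a single clean "Add parser" commit instead of the commit plus
its correction.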

Don’t merge upstream changes at random points. If you’re working on a shared feature branch,
don’t pull down the changes when they are not verified and finalized. It will put your history
in an inconsistent state because your history will contain some changes which might get
removed upstream and later on when you push your changes you’re going to put back those removed
changes again.

* Conclusion
This essay was just a superficial attempt to explain some of the Git etiquette that we need to
follow when we're collaborating on a project with others. At the end of the day we are looking
to make it easier for ourselves to develop software, and following certain rules will help us
get there faster and make the process more pleasant.

* References and Resources
- https://www.kernel.org/doc/html/v4.10/process/submitting-patches.html
  The kernel community is one of the biggest communities of paid and volunteer contributors
  that use Git intensively and with really high traffic. In order to manage the development
  process and stay productive, it has really strict guidelines, some of which can
  be useful for us.


- https://chris.beams.io/posts/git-commit/
  Chris Beams researched best practices for commit messages by reviewing many projects.
  His article is one of the most referenced articles in this field.


- https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html
  Another short but widely referenced article on best practices around Git commit messages


- https://yarchive.net/comp/linux/commit_messages.html
  Who better to follow on Git best practices than Linus Torvalds himself?


- https://lwn.net/Articles/328438/
  A famous email from Linus Torvalds describing how to maintain a Git tree from a merge vs
  rebase perspective


- https://www.atlassian.com/git/tutorials/merging-vs-rebasing
  Atlassian's guidelines on merge vs rebase