Tuesday, December 7, 2010

Thou Shalt Not Lie: git rebase, amend, squash, and other lies

As I've used git more, and used more advanced features, my opinions about the merit and uses of certain features have changed. At work we're constantly re-evaluating our git workflow in light of the problems we encounter with the way we're doing things. I may yet re-evaluate these opinions, but rest assured they come not from some theoretical nit-pick, but from real experience.

Rewriting History is a Lie

Using git to rewrite history is a sin. It's called lying. Don't do it. With git there are several forms in which lying comes, but usually you know you're lying if you break the functionality of `git branch --contains some-branch`, `git blame`, `git bisect`, or if you have to use `git push -f origin some-branch`.

Squash merging is a lie

A squash merge (`git merge --squash some-branch`) takes all the commits from a topic branch, combines them into a single commit, and applies that commit. The history now looks as if you had a flash of brilliance, made a ton of changes, and did everything right the first time. It makes you look good, but it's a lie.

Since, you have not actually merged any of the work you did on the topic branch, but instead have merged another totally new commit, `git branch --contains some-branch` cannot tell you that you have merged your branch anywhere. When a QA person or your boss says, "Hey, is some-feature {merged into QA, deployed}" you have to resort to `git log` spelunking.

Also, anytime you are creating a new commit with the same changes as another commit, you are destroying `git blame`'s ability to tell you who to flog publicly. And as we all know, public floggings are the lifeblood of software development teams.

Finally, when you squash merge, the totally awesome rerere feature will not help you re-resolve conflicts. Instead you will have to re-resolve conflicts when you cherry pick the squash commit, or when you squash merge your topic branch to the UA branch or to master. Why? Because rerere depends on being able to detect that you are resolving conflicts between the same SHAs as before, but every time you squash merge it creates a totally new commit. It creates a totally new commit even if you squash merge, `git reset --hard HEAD^`, and then immediately squash merge again.

Rebasing is a lie

Rebasing using `git rebase foo` allows you to rebase your topic branch on foo, instead of whatever it was based on before. This makes it look like you were working from foo the whole time. However, each commit to your topic branch was birthed in a context and by a sequence of events that was unique to that time and that topic branch. You are yanking those commits out of their context and putting them into a totally new context.

Rebasing is effectively a retroactive merge. It is pretending that you "merged" foo into your topic branch 3 days ago, but you didn't. You merged it in today, and you are lying to everyone.

In fact, the commits from your topic branch don't exist anymore, because every single rebased commit gets remade into a totally new commit with a totally different SHA. Pick any one of those rebased commits: Does the commit message make sense anymore? Does the code compile? Do the tests pass?

The answer to these last two questions will tell you whether you can use the totally awesome bisect feature of git. This feature does a binary search through your history to help you discover exactly where some bug or problem was introduced into your code. If you cannot compile and run tests on every commit, then you cannot use `git bisect`, which is a shame.

Being able to compile and run tests for each commit is also useful for resolving conflicts. Say I'm trying to decide whether your change will work with my code, or my change with your code, or maybe some totally different code that "resolves" the conflict, how can I know "resolved" means resolved? The most basic thing I can do (though it's not foolproof) is compile the code and run the tests, but if your code doesn't even compile or run the tests by itself, then I can't intelligently resolve the conflict. Your lie has hurt me and the rest of the team.

Finally, if for some reason you wanted to cherry-pick a commit, but it doesn't compile or run the tests, then cherry-picking it into another branch would break that branch. Thanks for that! Your lie has now become destruction of property, which is a misdemeanor.

Amending a commit is a lie

Amending a commit (`git commit --amend`) to add changes, or to change the commit message is a lie. It breaks `git branch --contains some-branch`. It forces you to do a `git push -f origin some-branch`. It's a sin. See above arguments, enough said.

Selective, retroactive commits are lies

In this last form of lying git is not an accomplice, because you are lying to git. What happens here is that you have been working for a while (20 minutes, 30 minutes, an hour, whatever), and you decide to commit your work, but instead of committing all of your work as you have come to it naturally, you decide to break your work up into several small, "logical" commits. This makes you look good, but it's a lie.

You've changed both "foo.clj" and "bar.clj" and you think there is no dependency between them, so you commit "foo.clj" separately from "bar.clj". The problem is that you don't really know this unless you compile the code and run the tests. Since you don't know for sure that these selective commits can compile and run the tests, you've broken `git bisect`, undermined trust in conflict resolution, and break my stuff if I cherry-pick your commit.

You may think that your commits are "logical", but I would say if you are not committing your work the way that you come to it naturally, then it is not logical. Sure you started solving problem A and then discovered problem B, if they are related and you have to solve problem B in order to solve A, then commit the changes together. They are related so its logical to commit them at the same time.

If you want to solve them separately, then make a WIP commit or a stash, create another branch and solve B, then come back to your original branch and rebase it on B (gasp, see below for explanations of this inconsistency!).

Conclusion

There are several awesome features of git: blame, branch --contains, rerere, and bisect. These features are borked if you lie. If you're going to lie, you may as well use SVN, since these awesome features don't exist there anyway. But don't do that! Let's just all agree not to lie.

Epilogue

Some people may get the impression that I do not think you should ever use `git rebase`, `git commit --amend`, or `git merge --squash`, but that would not be true. These are powerful tools that may be necessary sometimes, but should be used sparingly and judiciously. Perhaps its OK to lie sometimes. If someone comes up to you and says, "How does this make me look?" You choose your words carefully.

One case for history rewriting is integration branches. At work we have integration branches for qa, ua, etc. We have a workflow where we move a single feature through the pipeline kanban style, so we have to be able to merge a feature to an integration branch, rebase it out, etc. In this case no one has an expectation that they should be able to see a clean history for these branches, and no one expects to base their work on one of them.

Other than integration branches, a good rule of thumb is that you should not rewrite history for things that are already push out into the world. This would limit these rewriting tools to uses locally to "fix" things up before pushing them. I still cannot whole-heartedly advise use of rewriting in this case, but if you need to, and the rewriting your doing is limited enough that you're sure the new commits will still compile and pass tests, then fine. Just do it carefully, and don't make it habitual. The End.

15 comments:

Mahmoud said...

Could you speak a little to what one _should_ do if these options are to be avoided?

Jlouis said...

The trick is to have rules on certain branches where the contract is you wont lie.

On other branches, personal ones for instance, you can lie to yourself to a certain extent. Blindly rebasing your own branches to "keep them up to date" is bounds for trouble though.

Douglas said...

Re: retroactive commits: you can use git stash to hide away bar while you prepare to commit foo. Much safer than committing blindly.

eat-a-git said...

Point 1:

Cherry-picking to produce commits is not lying. It is organization. Darcs lets you do this for each hunk it encounters and it is an incredibly useful feature which allows to make very concise patches rather than balls of mud.

Previously you bemoaned balls of mud patches from squashing, but to have work uncommitted and then to choose which files you're committing at one time is not lying.

Point 2:

Really most of this lying could go away if there was better annotation and organization, allowing you to abstract what happened while still allowing users to dig down into the history.

The article is fine otherwise.

paul said...

@Mahmoud IMO, what one should do instead of rebasing is just merge into your topic branch. You resolve all the conflicts at once, instead of having git repeatedly flog you with conflicts for each commit. And git can still easily tell you where commits came from and where they went.

@Jlouis I think this is kind what I was getting at with my talk about integration branches at the end.

@Douglas I agree, I think I did mention stashes as a way to set aside some changes to go make changes elsewhere.

@eat-a-git I don't think I ever said cherry-picking is a lie. Perhaps I implied it? Cherry-picking is useful, and having commits that compile and pass tests make cherry-picking much more pleasant.

I did not bemoan squashing because of their mud patch nature. I bemoaned them because they create new commits which destroys cool features of git. I bemoaned choosing to commit individual files out of a bunch of changes because it undermines the ability to compile and test commits, which is useful (essential?) in `git bisect` and in resolving conflicts.

And I guess I'm glad to know I have your seal-of-approval. Thanks, dude!

sigi said...

Some remarks on what eat-a-git said above:

Hunk based commits are supported by Git. I agree that they are useful for organizing your history. However, one must be disciplined about stashing before every commit ("git stash --keep-index") and running the test suite in order to make sure the partial commits are legit.

Otherwise one does break "git bisect", as the article correctly points out.

Fully agree on your second point.

Very nice article, worth thinking about.

sigi said...

"I bemoaned choosing to commit individual files out of a bunch of changes because it undermines the ability to compile and test commits, which is useful (essential?)"

It does not undermine anything: you can do a partial commit (to the index), stash the rest of the changes while keeping the index.

Then run the test suite to make sure that your (staged) commit is OK. Commit and repeat.

The same goes when splitting commits after the fact ("lying"). One can always make sure that the test suite completes after every new commit made.

Of course this is impractical with a long running test suite, but then in this case so is committing frequently.

paul said...

>It does not undermine anything: you can do a partial commit (to the index), stash the rest of the changes while keeping the index.

>Then run the test suite to make sure that your (staged) commit is OK. Commit and repeat.

@sigi uhhhh...I think we agree. The behavior that I said undermined `git bisect` and the ability to verify conflict resolution by compiling and running tests is breaking things into commits *without* running the test suite for each commit.

If you are running tests on (or at least trying to compile) each commit, then cherry-pick and bisect are much more useful, and you can trust your conflict resolution a little more than not.

chnicola said...

It is worth really stressing (which I missed in your entire post) that lies only matter if you are sharing a repository. If you are working locally these are useful and sensible tools to use and work with.

paul said...

@chnicola It certainly seems uncontroversial to say "don't rewrite public history", since even my critics agree with that. I did say this in the Epilogue.

However, there are things which even when done in private can make things harder for you, and others once you make it public.

For instance, rebasing a topic branch on a newer version of master. If there are lots of changes and conflicts, then it is certainly easier on yourself to just merge master in, instead of rebasing. That way you save yourself from the conflict flogging that git gives you for each commit that you are rebasing. Technically you could rebase because its not public, but the argument for merging the new master in is entirely practical and the argument for rebasing on the new master is entirely cosmetic.

Second, the last form of lying (retroactive commits), if done without verifying that those commits compile and/or pass tests can cause problems with `git bisect` and conflict resolution when those commits are made public later.

Ra'Shaun said...

This sounds to me as more of a Rant than anything. Take it from Linus Torvalds himself. he told me to tell you that you are doing it wrong. http://lwn.net/Articles/328438/

I think when people rant about rebasing being bad it's because they've never been in a situation of more than 2-3 developers where erroneous merge commits and endless feature branches pollute the repo. I for one (along with Linus) am for the single commit per feature. It's EXTREMELY EASY to revert a single commit. A feature doesn't do what is expected in production? Great yank it out (git revert) keep moving along. Try unwinding a merge. Things get ugly.

My $0.02

Snuggs

Innovative-Studios.com

paul said...

Thanks for sharing your thoughts. I appreciate you sharing Linus' opinion on the subject, but his opinion is not binding.

I work daily on a project with 9 other developers where we have lots of topic branches and merges, and they don't "pollute" the repo, or if the do its an issue of personal taste. Preserving the history of particular lines of code is important.

There are different opinions on the topic of one large commit vs. lots of small commits. Personally I find it easier to digest commits when the are complete ideas, instead of half-thoughts, so I tend to fall on the side of largeish feature commits (as you do).

By the way, it is totally unnecessary to have a single commit per feature if you are doing it only to have an easy way to revert that work. You can do `git rebase -i -p -s subtree SHA` and it will allow you to remove a no fast-forward merge all at once, in other words, you don't have to remove every commit individually.

This is one of the benefits of doing merges vs rebasing. If you rebase your work into mine and we've moved on to other changes *then* it is impossibly messy to revert your work later. However, if you do a no fast-forward merge, I can easily come back later and remove your whole branch from the history in one fell swoop, if you're into that sort of thing.

ashirazi said...

Excellent post Paul.

I have had misgivings about rebasing ever since our team began losing commits. It's a way to easily get in trouble, especially if you resolve conflicts on the fly. I had outlined some of these problems in this post:
http://blog.codesherpas.com/on_the_path/2010/10/git-rebase-in-anger.html

If I were to disagree with anything, I would say that amending a commit that has not been pushed is fine. It's the do-over for a commit that was a mistake (mostly if I ever forget to include some important details in a commit message).

Git is currently my preferred version control, but the ability to change repository history worries me. That's why I believe the next great version control utility will be something like git, but without the dangers of rewriting the repository history.

Jason Rogers said...

Thanks for the post. I know it's old, but I still refer back to it from time to time.

John said...

A counter to this article is izs's git-rebase.

Rebase is a tool that can be used to arbitrarily edit the commit history. If that sounds universally scary or bad, I’d argue that your understanding of “edit” and “history” are perhaps a bit limited.

tl;dr: Embrace the difference between draft & published history. Don't edit published history.