On mistakes, and the tools needed to learn from your mistakes
The excitement around releasing something was in the air that morning, and by the time the code was rolled out, everything seemed to be running smoothly and according to plan. This was also expected, as we have a staging site and a range of automated tests. And with that confidence, one would think a bug never makes it all the way to production, right? Wrong, unfortunately. Many of the solutions we build are of a certain complexity. And in these solutions, some parts might be very customized, or have a range of edge cases our tests do not cover. This morning it did not take long before we received reports of a bug that prevented a specific part of the site from working as it should.
One reaction to this could be to find the bug, assume that whoever wrote that specific part was responsible for it, complain publicly in Slack about this person, and hotfix it on production.
Another reaction would be to analyze how the bug was introduced, fix the bug while carefully making sure the intent of the change that introduced it is kept, and write some tests that make sure we do not roll out a release that breaks this part of the site again.
Let’s try the latter approach!
Analyze how the bug was introduced
We use git on all of our projects, which keeps a version history of all changes introduced. This means we can pinpoint what went wrong, and where. To find out when this bug was introduced, we use git bisect, a tool built into git. I first learned of this tool in a blog post from Webchick more than 10 (!) years ago, and have since used that blog post several times as a cheat sheet. So please don’t take down that site (or article), but just in case, here is a brief summary of the steps needed:
Start with finding a commit in the history where the functionality was working. In our case I found one at the SHA 0c6ec6e2330cf7fb89f1aee7bb059edf764fd695.
Then take a note of where it's not working. In our case, the tip of the production branch, which was the SHA a0c161c0067442fa028d80f19f3a5642c653b820. I could now start bisecting:
git bisect start
git bisect good 0c6ec6e2330cf7fb89f1aee7bb059edf764fd695
git bisect bad a0c161c0067442fa028d80f19f3a5642c653b820
This will start the bisect, and it will say something like this:
Bisecting: 25 revisions left to test after this (roughly 5 steps)
git finds the most effective path to pinpoint the commit that introduced the error. To investigate each step, we have to tell git what the current state is. I run the build step on this commit SHA, and see that the error is there. That is bad. Let’s tell git:
git bisect bad
git checks out another SHA, I build the project again, and check if the bug is there. Yay, it’s gone! Let’s tell git:
git bisect good
These steps will repeat until you find the commit that introduced the error. It will look something like this:
72624098ee091143d5b1318e0912c0e1c8a65406 is the first bad commit
commit 72624098ee091143d5b1318e0912c0e1c8a65406
Author: xxx <email@example.com>
Date:   Wed Feb 10 19:51:54 2021 +0100

    Commit message
This can give us enough info to blame someone, but that is really not the point. Especially since the author ended up being me.
Fix the bug in a careful way
Now that we know what introduced the bug, it might actually be easier to spot the error and fix the bug. More importantly, you now know why the change was introduced. This means you can fix the error while making sure you do not undo whatever the author wanted to achieve with the commit. This step will be left to the reader, since the process of fixing the bug in a careful way will vary depending on the bug and the contents of the commit that introduced it.
Analyze what went wrong
The bug was introduced, and we fixed it in a timely manner. But what actually got us in this situation? There can be several answers to this question:
- Missing test coverage
- Undocumented functionality
- Undocumented dependencies
- Misuse of functionality that introduced a bug when you fixed something else
- Misunderstandings in code reviews or pull requests
An analysis of the bug should be able to uncover if one or several of these faults were at play. However, it could also be that there are other structural issues with this project that need to be addressed. In my opinion, we should now take a step back, evaluate the situation, and make a plan to tackle as many issues as possible. In our case, one issue was missing test coverage; the change would never have been committed, had we known that it would make the specific functionality break. Which brings us to the last step of this story:
Add the missing test coverage, and make sure it covers the bug introduced
Fixing the bug is always priority number one, and in this case it was done quite fast. Next, we want to write a test that illustrates what went wrong, and confirms that the fix actually fixes the bug. In practice, this means we will perform these steps:
- Fix the bug and create a pull request
- Create a test, working from the develop branch without the fix applied
- Confirm that this test is failing locally
We write functional tests like this with Behat, which means running the test locally would look something like this:
As expected, our test fails. Now we try running it with the bug fixed:
Our next step is to push this branch to GitHub. We use GitHub for code, and GitHub Actions for continuous integration. Now to prove our test is actually testing the bug we have fixed, we revert the fix and create a pull request from this:
As we can see, the pull request succeeded with just the test. However, we want to make sure the added test would have prevented us from pushing the broken code to production. First, we remove the fix with a revert commit, which we created like this:
git revert ba31d6c9f4ffaa5508642a23a598b854124ac572 # <- Our commit sha for the fix.
Second, as the test failed like we expected it to, we add back the fix. One way to do this is by reverting the reverted commit, which looks something like this:
git revert 3d3866985e24a19a65c05abaaa1afab9242bc76d # <- Same SHA as the screenshot
Finally, let's go back to our pull request, and verify that it now passes the tests again.
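The revert-then-revert-again sequence can be tried out safely in a throwaway repository. This is only a sketch under stated assumptions: the app.txt file and the run_test function are hypothetical stand-ins for the real code and the real Behat suite, and the SHAs are whatever git generates locally.

```shell
#!/bin/sh
# Sketch: prove that a test fails without the fix and passes with it,
# using `git revert` twice. All names here are hypothetical.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Stand-in for the real test suite: passes only when the fix is present.
run_test() { grep -q "fixed" app.txt; }

echo "buggy" > app.txt
git add app.txt
git commit -q -m "Initial (buggy) state"

echo "fixed" > app.txt
git commit -q -am "Fix the bug"
fix_sha=$(git rev-parse HEAD)

# 1. Revert the fix. The test should now fail.
git revert --no-edit "$fix_sha" >/dev/null
revert_sha=$(git rev-parse HEAD)
if run_test; then without_fix=pass; else without_fix=fail; fi

# 2. Revert the revert. The fix is back, and the test should pass.
git revert --no-edit "$revert_sha" >/dev/null
if run_test; then with_fix=pass; else with_fix=fail; fi

echo "without fix: $without_fix, with fix: $with_fix"
# prints "without fix: fail, with fix: pass"
```

Using two revert commits instead of rewriting history keeps the whole experiment visible in the pull request: reviewers can see the red run without the fix and the green run with it.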
We can now merge the pull request containing the test. This means we have accomplished the following:
- Fixed the bug
- Analyzed why the bug was introduced
- Written a test that illustrates the problem
- Made sure the functionality in question will not be broken by releases in the future