I saw a question on LinkedIn recently about how to debug build failures. It was asked in response to a post my employer made about remote working tips. One of my fellow engineers, Lukas, shared a great tip about collaborating with your colleagues when working remotely and trying to troubleshoot problems. The advice was to set up an open call and invite other engineers to join, rather than direct messaging people. This has a twofold benefit. Firstly, more eyes on the problem hopefully means a quicker resolution. Your colleagues often have different skills and experience, and may be able to diagnose the issue more quickly than you could by yourself. Secondly, even if someone joins and doesn’t contribute directly to the fix, you are sharing knowledge and skills, which benefits your team. This helps you and your colleagues become more well-rounded engineers.
The post prompted the question:
Is there any tips on troubleshooting build failure, specific quick points to find out the bug/failure?
Veer
So I thought I’d jot down my tips for troubleshooting a broken build.
- Read the build logs! It sounds obvious, but I’ve been helping people troubleshoot builds for many years, and people often panic when they see a build has failed. Sometimes the answer is staring you in the face. Whilst the error message itself may not be transparent, copying the error into Google can often get you to an answer without much effort.
- If the logs don’t help then, if possible, increase their verbosity. Most build tools provide a useful level of information by default, but the logs are often not complete. This is to prevent massive log files and to reduce the amount of “noise” in the logs. Increasing the verbosity might give you that extra insight into the problem.
- Understand what’s changed. If you’re trying to diagnose a problem on a previously working build, what has changed? When doing this I tend to look at the code first. Have you or a colleague committed a change that broke something? Most build failures are down to code changes. That’s simply the build doing its job and alerting you to a problem. These kinds of failures tend to fall into the following categories.
  - The code doesn’t compile or is otherwise syntactically incorrect.
  - The linter is unhappy with the code structure.
  - The static code analysis tool is unhappy with something like how you’re casting from one type to another.
  - Your unit tests are failing.
  - There is a missing reference to a dependency.
  - Your dependency management step has failed to pull down a dependency.
  - The developer only committed a subset of the code that was changed.
  - The code or build configuration contains an absolute path based on the developer’s desktop setup, and the structure of the file system on the build agent is different.
  - Spaces in a path or file name.
  - You’ve updated your code to a newer version of a framework and didn’t inform the admins of the build environment, so the agents performing the build simply don’t have the tools necessary to perform the task! (I’ve seen this more than once!)
  - A typo.
- If the code is good, has the build environment changed? In modern build automation setups there are a lot of moving parts. Think about what makes up the build and which of those parts could be causing the problem.
  - Have the build agents been updated recently?
  - Has someone installed some new software on the agent?
  - Has someone removed some old software from the agent?
  - Is there a network issue?
  - Is your artifact repository available?
  - Has a previous build messed up the build agent’s working directory?
  - Does another process have a handle on the file system in the build agent’s working directory?
- Isolate the issue by removing steps in the build that don’t directly relate to the problem you’re trying to troubleshoot. This is useful for three reasons. Firstly, if your build process is long running, removing some of the steps will speed things up so you can test your fixes rapidly. Secondly, it can provide clarity on the issue by simplifying the process. Finally, sometimes it fixes the problem. That step you didn’t think was having an impact? Well, actually it was!
- Look at the build steps that immediately precede the step that’s broken, and then work backwards to the first step of the build. Working in a logical and methodical way can help you understand the issue.
- When troubleshooting, if you’re trying a fix, change one thing and one thing only. You don’t want to be in a situation where you’ve tried three things and now the build works, and you can’t be certain which of your changes solved the problem.
- Talk to a colleague. This was the gist of the post I mentioned earlier. If you’re in the office, you might just call someone over to get their opinion or advice. In that environment someone might overhear the conversation and provide a solution. Right now many of us are in isolation due to COVID-19 and we don’t have the immediacy of communication that working in an office brings. This means that we need to work harder at communication. Sticking a message into a group chat in Slack / Teams or inviting your colleagues on to a call can really help. Building software can be complicated and having a second pair of eyes is really useful. Sometimes just the act of explaining the problem to someone else can help you with your mental model of the problem, and you’ll come up with the solution halfway through the explanation.
- Take a break. Trying to diagnose a problem can be stressful, and as engineers we can sometimes have too much focus. Go and make a drink or, if possible, go for a short walk. Sometimes changing your environment, or applying yourself to a different task, can give your brain the space it needs to process the information it’s gathered about the problem you’re trying to solve.
- Ask for help in an online community. Tools like Stack Overflow, Google Groups and GitHub provide places for communities of interest to gather and help one another. No one can be expected to know every nuance of your build toolchain; even trivial applications these days use multiple languages, frameworks, dependencies and tools. There is a good chance that someone else out there has encountered your problem before. Even if Google doesn’t provide an answer, someone out there may be able to help. Don’t be afraid of asking for help. It’s not a weakness, it’s you showing your willingness to learn. If you do ask for help though, please give as much information as possible in your question: the exact text of any error messages, what languages, tools and frameworks are in the build (ideally with version numbers), and code snippets or screenshots of the build steps. All of these will help people answer your question quickly and fully.
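A couple of the checks above are easy to script. The following is a rough Python sketch, not a definitive tool: `find_error_lines` pulls the lines that look like errors out of a build log so you can read them first (or paste them into a search engine), and `missing_tools` reports which command line tools are not on the build agent’s PATH. The function names, the error keywords and the sample log are all my own assumptions for illustration, so adjust them to match your own build tool’s output. A keyword match like this will also throw up false positives such as “0 failed”.

```python
import re
import shutil

# Hypothetical keywords; tune them to the wording your build tool actually uses.
ERROR_PATTERN = re.compile(r"\b(error|failed|fatal|exception)\b", re.IGNORECASE)


def find_error_lines(log_text, context=2):
    """Return (line_number, snippet) pairs for log lines that look like errors.

    Each snippet includes up to `context` lines before and after the match,
    because the cause is often printed just above the error itself.
    """
    lines = log_text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits


def missing_tools(tool_names):
    """Return the names in tool_names that cannot be found on this machine's PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


if __name__ == "__main__":
    # A made-up log, purely for demonstration.
    sample_log = "\n".join([
        "Restoring packages",
        "Compiling src/app",
        "src/app/main.c:12: error: expected ';' before 'return'",
        "Build FAILED",
    ])
    for line_number, snippet in find_error_lines(sample_log, context=0):
        print(f"line {line_number}: {snippet}")

    # Swap in whatever tools your build actually needs.
    print("Missing tools:", missing_tools(["git", "python3"]))
```

Running `missing_tools` on both your own machine and the build agent, then comparing the results, is a quick way to spot the “nobody told the admins about the new framework” problem.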
Those are my 10 tips for troubleshooting broken builds. Hopefully you find them useful. Is there anything I’ve missed that you use as part of your troubleshooting process?