The Difficulty of Debugging

August 31, 2010
Debugging is a routine part of software development. Strictly speaking, though, "debugging" is not the process of eliminating bugs. Debugging itself doesn’t eliminate a bug; it is the process of finding out what caused it.
 
There are many ways to eliminate software bugs, but whichever we use, we first need to find out what caused the buggy behavior. The usual techniques are code review, unit testing, logging, and debugging. Code review identifies bugs by spotting illogical or incorrect code. Unit testing identifies bugs by exercising individual functions in depth. Logging lets us see the flow of the program and identify bugs while it runs. Debugging, in the narrow sense, means using a debugger such as Visual Studio or WinDbg to diagnose a program, and it is better than logging alone at finding ad hoc bugs.
 
However, even with all these techniques, bug investigation is still hard. Firstly, code review can eliminate almost all logic bugs except those that involve an external dependency, but it is tedious: it only pays off when applied to a large enough portion of code that almost all the bugs in that portion are caught. It is not an effective way to eliminate a single bug. Secondly, when reviewing the code is impractical, writing new unit tests is impractical too, so we can only rely on existing unit tests. However, existing unit tests are sometimes based on prototype code, and prototype code does not necessarily reflect the real code. Even when a unit test is based on real code, it may still fail to reveal the buggy behavior if it is not comprehensive enough. Thirdly, some products lack logging, so in those products we can’t benefit much from it. Fourthly, debugging makes it easy to see the code logic at the point where the buggy behavior appears, but that alone is not enough to identify the bug.
 
From the buggy behavior and the code logic immediately before it, we can learn the proximate cause of the bug. Getting to the root cause takes more work. The possible approaches are: guess the root cause directly; reproduce the bug multiple times and debug backwards; guess how the code works, read the code to check the guess, and then find the root cause; or ask the code owner how the code works and debug to find out exactly how it behaves.
 
Guessing the root cause directly relies on experience and on what we see during debugging. We can rarely do it, but when we can, it is very effective. For example, for a memory leak of a particular type of object, we can use umdh or a similar tool to identify the leaking allocation directly, without having to understand the whole logic. Another example is a common programming mistake, such as failing to keep a reference count balanced. By comparison, this kind of bug is easily found by code review, though we don’t review code often. The probability of finding a bug through code review is high, but so is the share of review time that uncovers no bug. Guessing the root cause takes little time, but the chance that a guess can be made at all, or that it is correct, is also low unless the bug is of a particular kind.
 
Debugging backwards is a nightmare if we can’t see the overall program logic. One way to discover the program logic is to check the stack trace in the debugger and see how the higher-level functions work. However, the higher-level functions are not guaranteed to contain the workflow. Sometimes the workflow lives in data, and the higher-level function merely interprets it (this is especially the case when the function is an interpreter for a high-level language). In that case, unless we understand the data, we can’t easily find the problem through debugging. We can reproduce the issue multiple times with the debugger attached, but pinpointing the issue by binary search is not that easy: we need to record how many steps the interpreter function has executed, and we need counted breakpoints, that is, breakpoints that trigger only after a given number of hits. With these, we can inspect the current data at each step and check it for errors, but we still need to understand the program to know which data can reveal the bug. If we can’t guess which data reveals the bug, we have to start nearer to the point of the buggy behavior and work backwards. If the bug is not reproducible, finding the root cause is even harder. TTT is a great tool for effective backtracing: it records a timeline of the whole run, which we can then analyze statically, after the fact, and that makes it a great tool against non-reproducible bugs.
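
The binary search over interpreter steps can be sketched as follows. This is a minimal illustration under stated assumptions: the toy instruction set, `run_until` (standing in for a counted breakpoint), and the "balance must not go negative" check are all hypothetical, and the search assumes that once the state turns bad it stays bad.

```python
from typing import Callable, List, Tuple

Instruction = Tuple[str, int]  # hypothetical (opcode, operand) pairs


def run_until(program: List[Instruction], steps: int) -> int:
    """Re-run the program from scratch and stop after `steps` instructions,
    the way a counted breakpoint would. Returns the interpreter state."""
    balance = 0
    for opcode, operand in program[:steps]:
        if opcode == "add":
            balance += operand
        elif opcode == "sub":
            balance -= operand
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return balance


def first_bad_step(program: List[Instruction],
                   is_bad: Callable[[int], bool]) -> int:
    """Return the smallest step count after which is_bad(state) is true.
    Assumes the state is good after 0 steps and bad after the full run."""
    lo, hi = 0, len(program)  # invariant: good after lo steps, bad after hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(run_until(program, mid)):
            hi = mid
        else:
            lo = mid
    return hi


program = [("add", 5), ("add", 3), ("sub", 20), ("sub", 1), ("sub", 2)]
culprit = first_bad_step(program, lambda balance: balance < 0)
print(culprit, program[culprit - 1])  # prints: 3 ('sub', 20)
```

Each probe costs one full reproduction, so the search needs about log2(N) runs rather than N. The hard part, as noted above, is choosing an `is_bad` check that actually reflects the bug.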
 
Guessing how the code works relies on the developer’s past experience and design ability. Begin by setting up several models through which the program logic could be implemented. Then read the code to see which model its patterns match. If we are lucky, one model matches, and we can go deeper into the code to check whether the code path related to the buggy behavior is implemented correctly. If we are unlucky, the actual code either matches none of the models we imagined or is much more complex than them. If the code matches no model, we need to read it to understand its own model, and that is much slower than designing one. If the code matches a model but is much more complex, we need to read the code, add the missing parts to our model, and then work out the code path that leads to the buggy behavior. After that, debug along the code path to see where it went wrong. Binary search can be used to find the erring code between the beginning and the end of the path. However, even once the model is settled, the error may lie not on a single code path but somewhere in a "backtrace tree" of several candidate paths. In any case, this modeling process, when successful, helps us understand the problem better.
 
If we own the code and still remember how it works, we should be able to identify the problem quite quickly. Often, however, that is not the case: the developer doing the maintenance is not the feature-team developer who wrote the code. In that case, one way to approach the code owner is:
 
1. Find the code owner. Usually we can check who changed the related file in the source control system; the owner’s name will be there.
2. Ask whether he/she can take some time to explain, in general terms, how the code works overall.
3. If he/she replies willingly, go on to describe the bug you are working on and the behavior you see. Otherwise, continue investigating on your own.
4. If he/she responds with a guess about the bug, that is a good opening to explore deeper. If not, we should form our own guess about how the bug can arise, based on the overall working mechanism.
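
Step 1 can be partly automated. The sketch below assumes the source control system can print one author name per change to a file (with Git, for instance, `git log --format=%an -- <path>` does this); the function name and the sample names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def rank_authors(log_output: str) -> List[Tuple[str, int]]:
    """Count how many changes each author made, most frequent first.
    Expects one author name per line; blank lines are ignored."""
    names = [line.strip() for line in log_output.splitlines() if line.strip()]
    return Counter(names).most_common()


# Hypothetical log output for one file.
sample = "Alice\nBob\nAlice\n\nCarol\nAlice\nBob\n"
print(rank_authors(sample))  # prints: [('Alice', 3), ('Bob', 2), ('Carol', 1)]
```

The most frequent committer is only a first guess at the owner; the person who made the original or the most recent substantial change may be a better contact.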
 
In any case, if the code owner is available, he/she can be a great help whenever we are blocked by a question about the original design or have made progress in the investigation. Keep asking questions (but ones that won’t take too much time to answer) as long as he/she is willing to answer.
 
One important point is that we sometimes overlook components that are relevant to a bug. A bug can be the result of component interaction. The root cause may be in one component, component A, yet when examined alone that component shows no apparent bug; it misbehaves only when it interacts with another component, component B. It is possible that A was written with an incorrect assumption about component B, and that we too have misunderstood B’s expected behavior. By fixing the code in component A to conform to component B, we eliminate the bug.
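
A small, purely hypothetical illustration of such an interaction bug: component B reports a timeout in milliseconds, while component A was written on the assumption that the value is in seconds. Read alone, A’s comparison looks reasonable.

```python
def component_b_timeout() -> int:
    """Component B (hypothetical): returns the timeout in milliseconds."""
    return 1500


def component_a_is_expired_buggy(elapsed_seconds: float) -> bool:
    # Wrong assumption about B: its value is treated as seconds.
    return elapsed_seconds > component_b_timeout()


def component_a_is_expired_fixed(elapsed_seconds: float) -> bool:
    # A now conforms to B: convert milliseconds to seconds before comparing.
    return elapsed_seconds > component_b_timeout() / 1000.0


print(component_a_is_expired_buggy(2.0))  # prints: False
print(component_a_is_expired_fixed(2.0))  # prints: True
```

Nothing in B changes; the fix is entirely in how A interprets what B returns.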
 