Failure Sketching

Failure sketching is an automated debugging technique that provides developers with an explanation (a “failure sketch”) of the root cause of a failure that occurred in production. A failure sketch contains only the program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause.

Classical debugging

Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.

Failure sketching

Failure sketching automates the detective work of developers by mimicking the manual debugging process using a combination of static analysis and dynamic crowdsourced analysis.
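As a rough illustration of the crowdsourced side, the sketch below ranks observations gathered from many runs by how well they separate failing runs from successful ones. The predicate strings, the `rank_predicates` helper, and the choice of an F-measure as the score are assumptions made for this example; it is a minimal sketch, not a description of how our prototype is implemented.

```python
from collections import Counter


def rank_predicates(failing_runs, successful_runs):
    """Rank observed predicates by how well they predict failure.

    Each run is a set of predicate strings (a hypothetical encoding), such
    as an observed ordering of two statements or an observed data value.
    """
    in_failing = Counter(p for run in failing_runs for p in run)
    in_successful = Counter(p for run in successful_runs for p in run)

    scores = {}
    for pred, fail_count in in_failing.items():
        # Precision: of the runs where the predicate held, how many failed?
        precision = fail_count / (fail_count + in_successful[pred])
        # Recall: of the failing runs, in how many did the predicate hold?
        recall = fail_count / len(failing_runs)
        # F-measure: harmonic mean of precision and recall.
        scores[pred] = 2 * precision * recall / (precision + recall)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up observations from two failing and three successful runs.
failing_runs = [
    {"free(buf) before read(buf)", "len == 0"},
    {"free(buf) before read(buf)", "len == 4"},
]
successful_runs = [
    {"read(buf) before free(buf)", "len == 4"},
    {"read(buf) before free(buf)", "len == 0"},
    {"read(buf) before free(buf)", "len == 4"},
]

ranking = rank_predicates(failing_runs, successful_runs)
for pred, score in ranking:
    print(f"{score:.2f}  {pred}")
```

In this made-up data, the ordering "free(buf) before read(buf)" appears in every failing run and in no successful run, so it ranks first, while the data values appear in both kinds of runs and score lower.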

Key insights

Only a few program statements and a small number of program properties (e.g., the order of certain instructions, data values, etc.) are necessary for understanding the root cause of a failure.

Heavyweight static analysis performed in-house allows the subsequent dynamic analysis to be lightweight and therefore efficient, and efficient dynamic analysis improves the user experience. This hybrid static-dynamic approach can determine the statements and information relevant to the failure.
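The hybrid idea can be sketched on a toy program as follows. The `Stmt` encoding, the six-line program, and both helper functions are hypothetical and exist only for this example; the static step tracks data dependencies only and deliberately over-approximates, and the dynamic step simply keeps the slice statements that a failing run actually executed. This is a simplified sketch under those assumptions, not our analyzer.

```python
from typing import FrozenSet, NamedTuple, Optional


class Stmt(NamedTuple):
    line: int
    text: str
    defines: Optional[str]   # variable written by the statement, if any
    uses: FrozenSet[str]     # variables read by the statement


# Hypothetical toy program; line 6 is the failing statement.
PROGRAM = [
    Stmt(1, "p = alloc()", "p", frozenset()),
    Stmt(2, "n = read_len()", "n", frozenset()),
    Stmt(3, "p = fallback()", "p", frozenset()),   # runs only on a rare path
    Stmt(4, "p = None", "p", frozenset()),         # runs only on the error path
    Stmt(5, "q = p", "q", frozenset({"p"})),
    Stmt(6, "deref(q)", None, frozenset({"q"})),
]


def static_backward_slice(program, failing_line):
    """Conservative data-dependence slice computed without running the program.

    Definitions never "kill" a variable here, because statically we do not
    know which of them execute; every earlier definition is kept.
    """
    relevant = set()   # variables the failing statement may depend on
    in_slice = set()   # line numbers in the slice
    for stmt in reversed(program):
        if stmt.line > failing_line:
            continue
        if stmt.line == failing_line or stmt.defines in relevant:
            in_slice.add(stmt.line)
            relevant |= stmt.uses
    return in_slice


def refine_with_trace(static_slice, failing_trace):
    """Keep only slice statements observed in a failing run, in run order."""
    return [line for line in failing_trace if line in static_slice]


static_slice = static_backward_slice(PROGRAM, failing_line=6)
failing_trace = [1, 2, 4, 5, 6]   # made-up control flow of one failing run
sketch = refine_with_trace(static_slice, failing_trace)

by_line = {s.line: s.text for s in PROGRAM}
for line in sketch:
    print(line, by_line[line])
```

Here the static slice contains lines 1, 3, 4, 5, and 6, and the observed trace removes line 3, which never ran in the failing execution. The expensive part (slicing) needs no execution at all, so only cheap observation is left for run time.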


To help developers pinpoint root causes, we showed in our HotOS’13 paper that, for some bugs, we can perform what we call reverse execution synthesis (RES). RES is a technique that takes a coredump obtained after a failure and automatically computes the suffix of an execution that leads to that coredump.

We initially proposed the idea of failure sketches in our HotOS’15 paper, where we validated the feasibility of failure sketching using a hardware simulator.

Our SOSP’15 paper presents the detailed design and formalization of our failure sketching technique. We report on our prototype implementation, called Gist, which uses real hardware (Intel Processor Trace). We evaluate the prototype on real-world systems and describe the insights we gained from the design and implementation effort.

More recently, together with Cristiano Pereira and Gilles Pokam from Intel, we have been looking into using failure sketches and hardware tracing for security auditing.


Resources

  • Bugbase : Bugbase is the framework we used to reproduce all the failures we examined for evaluating failure sketching. Gist, our failure sketching prototype, can operate as a plugin to Bugbase. More details can be found in the documentation.

  • Gist’s static analyzer : Gist’s static analyzer computes a static backward slice starting from a failing instruction. Navigate to the directory lib/Transforms/StaticSlicer/ for details.

  • Intel PT driver source code : We used a custom Intel PT kernel driver in order to manage Intel PT from user space.

  • Intel PT decoder library : The Intel PT driver uses the Intel PT decoder library to decode Intel PT traces. This library is available as a standalone distribution as well.

Contact

Feel free to contact Baris Kasikci with questions or comments.