The past decade has seen a large amount of research on how to create end-to-end traces of distributed systems’ activity.  Such traces show the workflow of causally related activity (e.g., the activity required to service a request) across every component of a distributed system.  For example, one end-to-end trace might show the functions executed by a request as it traverses a front-end gateway, a load balancer, a database, and the local filesystem where the requested data is stored.  The trace might also show detailed timing information, such as the overall response time of the associated request and the execution times of individual functions.  Examples of tracing-related research efforts include Magpie (OSDI 2004), Stardust (Sigmetrics 2006), and X-Trace (NSDI 2007).  Recently, several industry implementations have also emerged, including Google’s Dapper and Twitter’s Zipkin.  This year’s NSDI included two papers that could be classified as end-to-end tracing infrastructures: NetSight and FlowTags.
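To make this concrete, here is a minimal sketch of what such a trace might look like as data.  All names and numbers are illustrative (a hypothetical `Span` record, not the format of any particular tracing system): each unit of work records its timing and a pointer to the causally preceding work.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One unit of causally related work in a trace (illustrative)."""
    name: str
    start_us: int                  # start timestamp, microseconds
    duration_us: int               # execution time of this component
    parent: Optional[str] = None   # name of the causally preceding span

# A toy trace for the example request: gateway -> load balancer ->
# database -> local filesystem read.
trace = [
    Span("gateway", 0, 1500),
    Span("load_balancer", 100, 1300, parent="gateway"),
    Span("database", 300, 900, parent="load_balancer"),
    Span("filesystem_read", 500, 400, parent="database"),
]

# The overall response time is the root span's duration; per-component
# execution times come from the individual spans.
root = next(s for s in trace if s.parent is None)
print(root.duration_us)  # 1500
```

Real infrastructures propagate unique trace and span IDs with the request rather than names, but the shape of the recovered workflow is the same.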

As one might expect, end-to-end traces can be invaluable for many use cases, such as steady-state-problem diagnosis (e.g., understanding why the response times of a large number of requests are slow), anomaly detection (e.g., understanding why one request out of 10,000 is excessively slow), correctness debugging (e.g., understanding why a request has failed), and resource attribution (e.g., understanding how much to charge a client for work done several components deep in the distributed system).  Unfortunately, most existing literature treats end-to-end tracing as a “one-size-fits-all” solution for all of these use cases.  Our experiences developing Stardust and X-Trace, and building tools for Google’s Dapper, indicate that it is not.  For example, traces that are useful for resource attribution will not necessarily show critical paths, and so will not be useful for diagnosis.  Tracing infrastructures used primarily for steady-state-problem diagnosis can use sampling techniques to reduce overhead, whereas those used primarily for anomaly detection cannot easily use sampling.
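The sampling trade-off above can be sketched in a few lines.  This is a hedged illustration of head-based (coherent) sampling, not any system’s actual API; the function names and the 1% rate are assumptions.  The decision is made once at the ingress point and propagated with the request, so each trace is captured whole or not at all; this keeps overhead low for steady-state diagnosis, but a rare anomalous request is overwhelmingly likely to fall in the unsampled 99%.

```python
import random

SAMPLE_RATE = 0.01  # trace 1 in 100 requests (illustrative)

def start_trace():
    """Head-based sampling: decide once, at the ingress point."""
    return {"trace_id": random.getrandbits(64),
            "sampled": random.random() < SAMPLE_RATE}

def record(ctx, event):
    # Downstream components inherit the decision from the propagated
    # context, so either every span of a trace is kept or none are.
    if ctx["sampled"]:
        print(f"{ctx['trace_id']:016x} {event}")

ctx = start_trace()
record(ctx, "gateway: received request")
record(ctx, "database: query executed")
```

Anomaly detection instead wants to keep the interesting traces, which are only identifiable after the fact; that pushes such systems toward capturing everything or deciding at the end of a request, at much higher overhead.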

We don’t think there’s such a thing as a generic tracing infrastructure.  Rather, key tracing design axes, and the choices made for them, dictate the utility of end-to-end tracing for its various important use cases.  These axes are not well understood today.

To help, we’ve prepared a paper that tries to distill the key design axes of end-to-end tracing.  In addition to identifying these axes, we describe the design choices for each axis best suited to particular use cases.  The paper is based on our collective experiences building tracing infrastructures.  Check it out and let us know what you think in the comments. :)

Here’s the paper: So, you want to trace your distributed system?  Key design insights from years of practical experience.

Update (October 2016): We updated the above paper with information about recent tracing infrastructures (e.g., PivotTracing) and published it in SoCC’16. You can find it here: Principled workflow-centric tracing of distributed systems.