The past 10 years or so have seen a large amount of research on how to create end-to-end traces of activity in distributed systems. Such traces show the workflow of causally related activity (e.g., the activity required to service a request) across every component of the distributed system. For example, one end-to-end trace might show the functions executed on behalf of a request as it traverses a front-end gateway, a load balancer, a database, and the local filesystem where the requested data is stored. The trace might also show detailed timing information, such as the overall response time of the associated request and the execution time of each individual function. Examples of tracing-related research efforts include Magpie (OSDI 2004), Stardust (Sigmetrics 2006), and X-Trace (NSDI 2007). Recently, several industry implementations have also emerged, including Google’s Dapper and Twitter’s Zipkin. This year’s NSDI included two papers that could be classified as end-to-end tracing infrastructures: NetSight and FlowTags.
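To make the idea of an end-to-end trace a bit more concrete, here is a minimal sketch in Python of how one might represent such a trace as a tree of timed spans. This is only an illustration and not how Magpie, Stardust, X-Trace, Dapper, or Zipkin actually store their data. The names `Span`, `walk`, and `build_example_trace`, along with the component names and all of the timings, are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    """One unit of work (e.g., a function execution) inside one component.

    Hypothetical structure for illustration; real tracing systems differ.
    """
    component: str
    function: str
    start_ms: float
    end_ms: float
    # Work that this span caused in other components (causal children).
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        """Execution time of this span, including time spent in children."""
        return self.end_ms - self.start_ms


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Yield (depth, span) pairs in causal order, starting from the root."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def build_example_trace() -> Span:
    """Build a made-up trace for one request: gateway -> LB -> DB -> filesystem."""
    fs = Span("filesystem", "read_block", 12.0, 30.0)
    db = Span("database", "lookup", 8.0, 35.0, [fs])
    lb = Span("load_balancer", "route", 3.0, 38.0, [db])
    return Span("gateway", "handle_request", 0.0, 40.0, [lb])


if __name__ == "__main__":
    trace = build_example_trace()
    # The root span's duration is the overall response time of the request.
    for depth, span in walk(trace):
        indent = "  " * depth
        print(f"{indent}{span.component}.{span.function}: {span.duration_ms:.1f} ms")
```

The root span covers the whole request, so its duration stands in for the overall response time, while each nested span gives the execution time of an individual function in one component.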
This is an experiment in live blogging, so beware ;). My immediate impressions: the focus on advertising systems was interesting. Large-scale analysis was a focus of many talks and, as one would expect, the answer was always to use a map-reduce-style infrastructure.