Live blog of debugging complex systems session at NSDI’14

This is an experiment in live blogging, so beware ;). My immediate impressions: the focus on advertising systems was interesting. Large-scale analysis was a theme of many talks and (as one would expect) the answer was always to use a map-reduce-style infrastructure.

Attributor: revenue debugging in advertising systems

Overview

  • Advertising systems are complex.  Composed of users, publishers (e.g., bing.com), lots of services (publisher interfaces, auctioning algorithms, etc.), and advertisers
  • Want to answer: why is revenue (or revenue-per-search) anomalously down?
  • Need to detect that the anomaly is occurring and then find its root cause.  The talk focuses on root-cause analysis
  • Some example root causes:
    • Datacenter in Dublin had latency issues that resulted in fewer ads being served
    • More people searched for non-monetizable terms (papal election!)
  • Contributions: novel algorithms for root cause analysis in ad systems, attribution for derived measures, the tool itself.

Data used & algorithm

  • Presenter says aggregate performance counters must be used because of sheer scale of ad systems.  Not sure I buy this, since there’s been lots of work on getting detailed data at scale (Dapper, X-Trace, etc.)
  • Aggregate performance counters can include fundamental metrics and ‘derived metrics.’  There are lots of metrics
  • Algorithm basics: the chosen root cause should explain as much of the change as possible and should be as simple as possible.
  • When choosing between multiple candidate root causes, look at ‘surprise,’ computed from the entropy between the actual and expected results.
  • This is a multi-objective optimization problem that tries to maximize surprise (a rough sketch of an entropy-based surprise score follows this list).
  • There is something interesting about attributing root causes for derived measures, but I didn’t quite get the intuition.  Will need to look at paper.  I wish the presenter had spent more time on this…it seems like it’s a key element of the talk.
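
To make the ‘surprise’ idea a bit more concrete, here is a minimal sketch of an entropy-based surprise score. This is my own illustration under assumptions (the per-datacenter breakdown, the Jensen-Shannon-style divergence, and all names are mine); Attributor's actual formulation also weighs explanatory power and simplicity and handles derived measures — see the paper.

```python
import math

def normalize(counts):
    """Turn raw per-dimension-value counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def surprise(expected, actual):
    """Hypothetical surprise score: a Jensen-Shannon-style divergence between the
    expected and actual distributions of a metric across one dimension's values
    (e.g., revenue per datacenter).  Higher = more surprising."""
    p, q = normalize(expected), normalize(actual)
    score = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        m = 0.5 * (pk + qk)
        if pk > 0:
            score += 0.5 * pk * math.log(pk / m)
        if qk > 0:
            score += 0.5 * qk * math.log(qk / m)
    return score

# Toy example: expected vs. actual revenue broken down by datacenter.
expected = {"dublin": 100, "chicago": 100, "singapore": 100}
actual   = {"dublin":  40, "chicago": 100, "singapore": 100}
print(surprise(expected, actual))  # the Dublin drop makes this breakdown surprising
```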

Evaluation

  • Presenter points out that the tool identified the right root cause in some cases in which operators performing a manual analysis had identified an incorrect root cause

Questions

  • How would you generalize this to more markets?  Answer: the evaluation is on four markets, but it could be used on more.
  • Can the data collected by the tool be used to identify ways to maximize advertisers’ revenue?  Answer: There are lots of tools that can be created.
  • Can you apply this system to regular online services?  Answer: tool is not necessarily specific to advertising.  Need experience applying it to different systems before making any definitive statements

DECAF: Detecting and Characterizing Ad Fraud in Mobile Apps

Overview

  • Ad network (Microsoft's, in this case) provides an ad plugin, which a mobile app developer includes in their app.  The ad plugin displays ads based on the user's interests, etc.
  • Ad network pays developer based on number of ads delivered or click throughs
  • Developers have an incentive to commit fraud by inflating clicks and views.  This is a big problem given the size of the ad market.
  • 1 billion dollars lost due to ad fraud in 2013 (wow…I wish some of that money were going my way!)
  • Ways to create ad fraud: manipulate the visual layout to create invisible impressions or unintentional clicks.  For example, a developer can cram too many ads into a single screen, or lay things out so that legitimate elements the user wants to click ‘cover up’ an ad, causing the user to mistakenly click on the ad.  (Are there more??)
  • Want to create an automated system for detecting ad fraud

Challenges/overview for creating an automated system

  • Challenge 1: Scale.  Need to handle many visually complex apps.  A single page of an application can contain hundreds of clickable elements.
  • Challenge 2: Accurately and quickly identify fraud.  A screen may contain many ads, but only a subset may be visible at a time.  Need to identify whether any visible subset violates the ad provider's rules on the number of ads that can be displayed at once.  This is called the ‘sliding window’ problem (a rough sketch of such a check follows this list).
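
Here is a rough sketch of what a sliding-window check could look like — purely my illustration of the idea, with a made-up bounding-box format, scroll step, and ad limit (the real ad-network rules and DECAF's checker are described in the paper):

```python
from typing import List, Tuple

MAX_ADS = 1  # assumed limit on ads visible per screenful; the real rule differs

def overlaps(ad, window):
    """An ad counts as visible if its bounding box overlaps the window at all."""
    ax, ay, aw, ah = ad
    wx, wy, ww, wh = window
    return ax < wx + ww and ax + aw > wx and ay < wy + wh and ay + ah > wy

def violates_sliding_window(ads: List[Tuple[int, int, int, int]],
                            page_height: int, screen_w: int, screen_h: int) -> bool:
    """Slide a screen-sized window down the page (vertical scrolling only) and
    check whether any position shows more than MAX_ADS ads at once."""
    for top in range(0, max(1, page_height - screen_h + 1), 10):
        window = (0, top, screen_w, screen_h)
        if sum(overlaps(ad, window) for ad in ads) > MAX_ADS:
            return True
    return False

# Two ads stacked close together are simultaneously visible in some window position.
ads = [(0, 100, 480, 80), (0, 220, 480, 80)]
print(violates_sliding_window(ads, page_height=2000, screen_w=480, screen_h=800))  # True
```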

Approach & tool (DECAF)

  • DECAF includes an automated UI navigation layer, which feeds visible pages into a fraud checker that looks for different classes of ad fraud
  • UI navigator extracts the UI structure (what can be clicked, swiped, or scrolled, and what happens next — this is a tree).  The UI navigator performs each action and feeds the result to the checker.  The search space can be huge.
  • How to reduce the search space?  DECAF uses the observation that structurally similar pages need not be revisited even if their content differs.
  • To determine structural similarity, a feature vector is used that encodes the level and type of each structural element.  Is there a graph isomorphism problem here?  (A rough sketch of this kind of feature-vector comparison follows this list.)
  • The fraud checker checks for ‘too many ads in one display screen,’ the ‘sliding window’ problem (and perhaps more)
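
A rough sketch of how I understand the structural-similarity idea — encode each page's UI tree as a bag of (depth, element type) features and compare — which also sidesteps graph isomorphism. The node format, threshold, and function names are my assumptions, not DECAF's:

```python
from collections import Counter

def features(node, depth=0, acc=None):
    """node = {"type": "Button", "children": [...]}; returns a Counter of (depth, type)."""
    if acc is None:
        acc = Counter()
    acc[(depth, node["type"])] += 1
    for child in node.get("children", []):
        features(child, depth + 1, acc)
    return acc

def structurally_similar(page_a, page_b, threshold=0.9):
    """Pages whose feature vectors mostly overlap are treated as 'the same page'."""
    fa, fb = features(page_a), features(page_b)
    shared = sum((fa & fb).values())
    total = max(sum(fa.values()), sum(fb.values()))
    return shared / total >= threshold

# Two article pages with identical layout but different text: the navigator
# could explore the first and skip the second.
page1 = {"type": "Panel", "children": [{"type": "Image", "children": []},
                                       {"type": "TextBlock", "children": []}]}
page2 = {"type": "Panel", "children": [{"type": "Image", "children": []},
                                       {"type": "TextBlock", "children": []}]}
print(structurally_similar(page1, page2))  # True
```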

Evaluation

  • Ran DECAF on many applications for 20 minutes and measured how prevalent different fraud types are.  They also compare a basic UI navigator to one with the optimizations (skipping structurally similar pages, etc.)
  • Found that a small number of developers are responsible for a lot of the ad fraud

Questions

  • Can the UI navigator extract all types of actions?  What about custom actions?  Answer: It can't extract all of them
  • How will application developers circumvent DECAF?  Answer: The checker can always be improved, so it's a cat-and-mouse game.  Questioner: worried more about the Monkey — can developers hide ads from it better?
  • Do you need to use structural similarity?  Can you just use the page class?  Answer: You have to define similarity some way.  [I didn't quite get the rest of the answer]

I know what your packet did last hop

Overview

  • Presenter starts out with a few examples of network problems.
  • States that existing tools (tcpdump, ping, traceroute) aren't sufficient.  They don't give much visibility into the network — the goal is to see information about every event that ever happened to every packet.

Approach & tool (NetSight)

  • Obtain packet history (switches/routers a packet was routed through, forwarding state of those routers, etc.)
  • NetSight adds rules to SDN switches that tell them to forward ‘postcards’ of packets to a special repository.  A postcard includes the packet header, switch ID, output port, and the ‘version of the flow table.’
  • Need to reconstruct packet paths (packet histories) by analyzing postcard headers.
  • NetSight allows querying over packet histories.  Some applications built on top of it: ndb (like gdb for networks), netwatch (live network invariant monitor), nprof (hierarchical network profiler), netshark (??).  (A toy sketch of stitching postcards into histories follows this list.)
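
A toy sketch of how postcards might be stitched back into packet histories, assuming each postcard can be tied to a packet ID and that the topology (mapping a switch's output port to the next switch) is known. This is only my illustration; NetSight's actual reconstruction algorithm is in the paper:

```python
from collections import defaultdict

def build_histories(postcards, topology):
    """postcards: dicts with packet_id, switch_id, out_port, flow_table_version.
    topology: {(switch_id, out_port): next_switch_id}.  Returns packet_id -> ordered hops."""
    by_packet = defaultdict(dict)  # packet_id -> {switch_id: postcard}
    for pc in postcards:
        by_packet[pc["packet_id"]][pc["switch_id"]] = pc

    histories = {}
    for pkt, hops in by_packet.items():
        # The first hop is the switch that no other hop forwards into.
        successors = {topology.get((h["switch_id"], h["out_port"])) for h in hops.values()}
        start = next(s for s in hops if s not in successors)
        path, cur = [], start
        while cur in hops:
            path.append(hops[cur])
            cur = topology.get((cur, hops[cur]["out_port"]))
        histories[pkt] = path
    return histories

# Example: a packet that traversed s1 -> s2 -> s3.
topology = {("s1", 1): "s2", ("s2", 2): "s3"}
postcards = [{"packet_id": "p1", "switch_id": "s2", "out_port": 2, "flow_table_version": 7},
             {"packet_id": "p1", "switch_id": "s1", "out_port": 1, "flow_table_version": 3},
             {"packet_id": "p1", "switch_id": "s3", "out_port": 9, "flow_table_version": 5}]
print([hop["switch_id"] for hop in build_histories(postcards, topology)["p1"]])  # ['s1', 's2', 's3']
```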

Scaling

  • Each postcard is 64 bytes; the Stanford network averages 5 hops with 1031-byte packets.  A naive implementation can result in 31% network overhead
  • Their answer for collecting data at scale is to use a map-reduce style architecture.
  • HostAssist: the hypervisor inserts a unique ID into each packet at the source, which lets postcards omit the header (the hypervisor stores the header separately under the same ID).  Reduces network overhead from 31% to 7% (back-of-the-envelope arithmetic below).
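
Back-of-the-envelope check on those overhead numbers (the 64-byte postcard, 5 hops, and 1031-byte average packet are from the talk; the size of an ID-only postcard is my guess):

```python
avg_packet_bytes = 1031   # average packet size on the Stanford network (from the talk)
hops = 5                  # average path length (from the talk)

naive_postcard_bytes = 64
print(f"naive overhead: {hops * naive_postcard_bytes / avg_packet_bytes:.0%}")   # ~31%

# With HostAssist the postcard can omit the header; a roughly 15-byte postcard
# (assumed, not stated in the talk) would give the quoted ~7%.
small_postcard_bytes = 15
print(f"HostAssist overhead: {hops * small_postcard_bytes / avg_packet_bytes:.0%}")  # ~7%
```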

Questions

  • If you use IDs to shrink postcards by omitting the header, how do you deal with malicious switches stripping the ID?  Answer: this has to be taken care of somehow — for example, the proxy on the switches has to guarantee it.
  • Question: How do you use this information?  Answer: forwarding loops, reachability, etc.  Follow-up: Are there standard queries?  Answer: You have the power of regular expressions.

Libra: Divide and Conquer to verify forwarding tables in huge networks

Overview

  • Data plane problems in data centers are common — missing forwarding rules, loops, and blackholes happen frequently at Google.
  • Getting forwarding right is hard because of complex interactions between protocols and different states in different switches.  Also, writers of forwarding rules usually don't coordinate with each other.
  • One approach to deal with this: static data-plane verification — it tells you whether loops or blackholes can occur.  However, it doesn't tell you why the incorrect rules were installed (that would require debugging the control plane).
  • Data-plane verification is hard because rules change often (perhaps due to failures, etc.), so periodic snapshots of forwarding table rules may be inconsistent.  

Approach to avoiding inconsistent forwarding table states

  • Only capture forwarding table state when the network is stable (no new forwarding rules are being sent or installed).  The authors claim the network is stable 99.9% of the time.  (Where does this data come from?  I think Google…)

How to scalably verify forwarding table state?

  • 10,000 switches, 1000s of rules, lots of destinations…
  • Forwarding graph: how the network routes packets to a given destination.  Changing forwarding rules amounts to adding and deleting edges in this graph.
  • Forwarding graphs help deal with scale: each forwarding graph contains only the subset of rules relevant to forwarding to a given destination, so the graphs can be verified independently (a small loop/blackhole check is sketched after this list).
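
A small sketch of what verifying one destination's forwarding graph independently might look like — my own toy version (graph format and names assumed), not Libra's implementation:

```python
def check_destination(next_hop, sources, dest):
    """next_hop: {switch: next switch or None} for a single destination subnet.
    Reports a blackhole (missing rule) or loop encountered from each source."""
    problems = []
    for src in sources:
        seen, cur = set(), src
        while cur != dest:
            if cur in seen:
                problems.append((src, "loop"))
                break
            seen.add(cur)
            nxt = next_hop.get(cur)
            if nxt is None:
                problems.append((src, "blackhole"))
                break
            cur = nxt
    return problems

# Example: s3 is missing its rule for this subnet, so traffic from s1 blackholes.
next_hop = {"s1": "s2", "s2": "s3", "s3": None}
print(check_destination(next_hop, sources=["s1"], dest="d1"))  # [('s1', 'blackhole')]
```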

Libra tool

  • Libra is a map-reduce-like framework that captures forwarding-table state, matches each forwarding graph to its relevant rules, and determines whether the forwarding graphs satisfy the rules (a sketch of the map/reduce split follows).
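
And a sketch of the map-reduce split as I understood it (all names are mine, not Libra's): the map phase shards every switch's rules by destination subnet, and each reduce sees one subnet's complete forwarding graph, on which it can run a check like the loop/blackhole sketch above:

```python
from collections import defaultdict

def map_rules(switch_id, rules):
    """rules: [(dest_subnet, next_hop_switch)] from one switch's forwarding table."""
    for subnet, next_hop in rules:
        yield subnet, (switch_id, next_hop)

def reduce_subnet(subnet, edges):
    """edges: all (switch, next_hop) pairs for one subnet, i.e. its forwarding graph."""
    graph = dict(edges)
    # ...run per-destination verification (loops, blackholes, reachability) on `graph`...
    return subnet, graph

snapshot = {"s1": [("10.0.1.0/24", "s2")], "s2": [("10.0.1.0/24", "s3")]}
shards = defaultdict(list)
for switch, rules in snapshot.items():        # "map" + shuffle
    for subnet, edge in map_rules(switch, rules):
        shards[subnet].append(edge)
print([reduce_subnet(s, e) for s, e in shards.items()])  # one graph per subnet, checked independently
```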
