Who Knows Where I’ve Been?
March 27, 2011
Just how much of our behavior online is being tracked, collated, and data-mined has been a subject of some public debate recently. The Department of Commerce, at the urging of consumer privacy advocates like the Electronic Frontier Foundation, has been discussing requiring advertisers to honor consumer opt-outs. Meanwhile, Google and a host of other large advertising networks have pushed a cookie-based opt-out mechanism (which advertising networks would voluntarily comply with), Internet Explorer 9 has implemented a more aggressive third-party resource filter, and Firefox has announced a plan for a different, also-voluntary opt-out mechanism that is distinct from the cookie-based approach supported by Google and others. The industry seems likely to accept some voluntary limits. Yet little empirical data quantifying the extent of behavioral tracking exists.
In July of last year, The Wall Street Journal published a survey of the prevalence of behavioral tracking networks among the top fifty websites, called What They Know. (A subsequent piece explored the privacy implications of popular smartphone apps.) The Journal’s manual approach yielded deep insight into the data gathered–including a detailed view of the privacy and data-retention policies of the most prevalent networks–but it also limited the data gathered to a very small subset of the entire World Wide Web.
I have attempted to gather data in a more scalable way, to answer the question of just how much of all online behavior is visible to a handful of advertising networks.
What the data represent
Looking at the top websites, it is clear that content owned by Google–primarily hosted at Google-Analytics.com, GoogleAdServices.com, GoogleSyndication.com, and 2mdn.net, but also including YouTube, Blogger, and more–utterly dominates the list of top content networks. Google-owned content is on over half of the top 100,000 websites; the next nearest competitor, Facebook, appears on fewer than a tenth as many.
|Network||% of webpages|
A total of twelve content networks are present on at least 1% of the top 100,000 webpages; the number rises to 26 when we consider networks present on at least half a percent (500 webpages or more). Yet nobody is on as many webpages as Google.
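As a rough illustration of how such prevalence figures can be tallied, here is a minimal Python sketch. The function names, the input shape (a mapping from page to the set of networks seen on it), and the toy data are all assumptions made for the example; none of them come from the survey itself.

```python
from collections import Counter

def network_prevalence(pages):
    """Return the fraction of pages on which each network appears.

    `pages` maps a page (hypothetical identifier) to the collection of
    content networks found on it.
    """
    counts = Counter()
    for networks in pages.values():
        counts.update(set(networks))  # count each network once per page
    total = len(pages)
    return {name: n / total for name, n in counts.items()}

def networks_above(prevalence, threshold):
    """List the networks present on at least `threshold` of all pages."""
    return sorted(name for name, share in prevalence.items() if share >= threshold)

# Hypothetical toy data, not taken from the survey.
sample_pages = {
    "a.example": {"Google", "Facebook"},
    "b.example": {"Google"},
    "c.example": set(),
    "d.example": {"Google", "Quantcast"},
}
prevalence = network_prevalence(sample_pages)
print(networks_above(prevalence, 0.5))
```

With 100,000 pages, a threshold of 0.01 corresponds to 1,000 pages and a threshold of 0.005 to the 500 pages mentioned above.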
Not all of Google’s content is necessarily designed to track usage, though it’s hard to know just what data Google retains.
Segregating Google’s content networks into those which are designed for, and described publicly as, advertising or behavioral tracking networks (to wit, 2mdn.net, google-analytics.com, googleadservices.com, googlesyndication.com, and doubleclick.net) shows that 95% of all webpages that contain any Google content contain one or more of these Google trackers.
Another interesting detail is the relative frequencies of Google’s various advertising and tracking products. Breaking apart webpages with at least one of those products into those with Google Analytics on them and those without, it turns out that 88% of all those webpages contain the tracking bug from Google Analytics. Google Analytics is particularly interesting, because it’s a product provided to webmasters for free that allows them to gather data on the makeup of their audience. By giving webmasters an easy way to gather data on what users do while on a given webpage, Google has found perhaps the perfect way to gather data itself on a much wider range of behavior.
Tracking each click
Unfortunately, looking at the prevalence of tracking bugs per webpage doesn’t give a great idea of how much of our online behavior is actually visible to a given advertising network. It may be that Google sees half of the top 100,000 webpages, but most people probably spend most of their time on a much smaller set of that list.
A much better approach is to normalize the data by the number of page views spent on each website. Fortunately, for a small fee Alexa will share these data, too, in the form of “page views per million” per website. Page views per million is just Alexa’s estimation of the number of page views a given website receives out of one million random page views on the Web. (More on Alexa’s methodology here.) To pick on Google again, Google.com, Blogspot.com, and YouTube.com together receive, according to Alexa, 93,790 views per million, or about 9.4% of all web traffic.
Here it is worth noting that because Google.com, Blogspot.com, and YouTube.com do not have third party content from a different Google network on their homepages, I have instead summed page views on all the pages which are either Google-owned or contain Google’s third-party resources. This means that in addition to the approximately 25% of all clicks directed at non-Google webpages that contain Google resources, I also include the 14% of traffic directed at Google.com, YouTube.com, and various other Google properties. This sum represents the total amount of web traffic that Google may be aware of, directly or via third-party resources. (The relative balance of these parts–first- vs. third-party traffic–obviously can vary immensely for different networks. Google and Facebook generate a lot of first-party traffic, whereas, for example, Quantserve.com doesn’t generate any.)
Another methodological issue here is that the top 100,000 webpages only account for about 75% of Alexa’s reported page views. I can therefore reliably state a lower bound on the share of all page views a given ad network might be able to capture, but I cannot say for certain what the true number is. For accuracy I have given, again for the top ten networks, the number of page views that network sees among the top 100,000 webpages. To derive the lower bound as a share of all web traffic, divide that number by 1,000,000; to express it as a share of the page views we know about, divide instead by 749,554.25, the number of page views per 1,000,000 that go to any of the top 100,000 websites.
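The arithmetic described in the last few paragraphs can be sketched in Python as follows. This is only an illustration under assumptions: the record layout (`owner`, `third_party`, `views_per_million`), the function names, and the toy sites are hypothetical, and only the two denominators and the 93,790 figure come from the text.

```python
# Page views per million that go to any of the top 100,000 websites
# (figure quoted in the text).
TOP_SITES_VIEWS_PER_MILLION = 749_554.25

def visible_page_views(sites, network):
    """Sum page views per million over sites the network owns or is embedded on.

    Each site is counted once, whether the network is visible as the
    first party, as a third party, or both.
    """
    return sum(
        site["views_per_million"]
        for site in sites
        if site["owner"] == network or network in site["third_party"]
    )

def share_bounds(views_per_million):
    """Return (lower bound on share of all traffic, share of measured traffic)."""
    lower_bound = views_per_million / 1_000_000
    share_of_measured = views_per_million / TOP_SITES_VIEWS_PER_MILLION
    return lower_bound, share_of_measured

# Hypothetical toy records, not real Alexa data.
sites = [
    {"owner": "Google", "third_party": set(), "views_per_million": 100.0},
    {"owner": None, "third_party": {"Google", "Facebook"}, "views_per_million": 50.0},
    {"owner": None, "third_party": set(), "views_per_million": 25.0},
]
google_views = visible_page_views(sites, "Google")

# The 93,790 views per million quoted earlier for Google.com, Blogspot.com,
# and YouTube.com, expressed both ways.
lower, measured = share_bounds(93_790)
print(f"{lower:.1%} of all traffic, {measured:.1%} of measured traffic")
```

The second share is always the larger of the two, because it ignores the roughly one quarter of page views that fall outside the top 100,000 websites.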
|Network||Page views per 749,554.25|
Surprisingly, most of the numbers drop. In fact, as a portion of pageviews directed to the top 100,000 websites, the top 10 content networks saw only a combined 57.19%.
My hypothesis is that the very large sites tend to negotiate directly with advertisers and thus have fewer large ad networks present on their homepages. For example, cnn.com receives 0.073% of all traffic (nearly 0.1% of the traffic in the top 100,000), yet contains no third party trackers.
So what does it all mean?
And do I care if they know that? Eric Schmidt isn’t looking over my personalized file and laughing at me. Google engineers don’t have access to anything personally identifiable as mine. Employers can’t pay Google to give them the inside scoop on me and any embarrassing diseases I might have contracted. Even governments, at least at times, have had to fight an uphill battle to get Google to hand over information. If all my secrets are laid bare, only to be placed in some database and occasionally analyzed, summarized, and aggregated in a bland statistical report by some artificial intelligence, do I have any reason to be afraid?
The thought still gives me just a little twinge, but I suspect a slightly younger generation–the Facebook generation, just in high school or even younger right now–might not understand my reservations.
On a more pragmatic note, these trackers are implemented entirely client-side. The browser is yours. Using any of the major browsers, you can block these and take back at least a tiny shred of your privacy. I highly recommend it.
Methodology, in depth
In order to calculate ad networks per page view, I downloaded page view data from the Alexa Web Information Service on March 26, 2011. Pages for which Alexa data were unavailable are assumed to have zero page views, which again implies the reported data should be taken as a lower bound.
I unified content networks that use multiple domain names by manually writing canonicalization rules. For example, “doubleclick.net,” “googleadservices.com,” and “google-analytics.com” all canonicalize to “Google.” The possibility that I have missed a given alias should, again, be taken to imply that the reported data are a lower bound.
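A canonicalization step of this kind might look something like the Python sketch below. It is not the code used for the survey: the rule table contains only the three examples given above, and the function name, the suffix-matching strategy, and the choice to fall back to the bare host name for unknown domains are all assumptions.

```python
from urllib.parse import urlsplit

# Illustrative rule table; only these three aliases are named in the text.
CANONICAL_RULES = {
    "doubleclick.net": "Google",
    "googleadservices.com": "Google",
    "google-analytics.com": "Google",
}

def canonical_network(resource_url):
    """Map a third-party resource URL to a network name.

    Tries the full host name first, then each parent domain, so that
    "www.google-analytics.com" matches the "google-analytics.com" rule.
    Hosts with no matching rule are returned unchanged (an assumption),
    which treats every unknown host as its own network.
    """
    host = (urlsplit(resource_url).hostname or "").lower()
    labels = host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in CANONICAL_RULES:
            return CANONICAL_RULES[candidate]
    return host

print(canonical_network("http://www.google-analytics.com/ga.js"))
```

Because unmatched hosts fall through unchanged, a missed alias splits one network's count across several names rather than inflating it, which is consistent with treating the reported figures as a lower bound.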
My data are available for download or querying here for records of third party resources and here for the list of all URLs successfully analyzed. Alternatively, the database, in SQLite3 format, can be downloaded from here. I have not reproduced Alexa’s pageview data due to licensing concerns. If you do anything interesting with the data, please let me know.