Who Knows Where I’ve Been?

March 27, 2011

Figure: Fraction of top 100,000 webpages which contain elements from each network

Just how much of our behavior online is being tracked, collated, and data-mined has been a subject of some public debate recently. The Department of Commerce, at the urging of consumer privacy advocates like the Electronic Frontier Foundation, has been discussing requiring advertisers to honor consumer opt-outs. Meanwhile, Google and a host of other large advertising networks have pushed a cookie-based opt-out mechanism (which advertising networks would voluntarily comply with), Internet Explorer 9 has implemented a more aggressive third-party resource filter, and Firefox has announced a plan for its own, also-voluntary opt-out mechanism, separate from the cookie-based approach supported by Google and others. The industry seems likely to accept some voluntary limits. Yet little empirical data exists that quantifies the extent of behavioral tracking.

In July of last year, The Wall Street Journal published a survey of the prevalence of behavioral tracking networks among the top fifty websites, called What They Know. (A subsequent piece explored the privacy implications of popular smartphone apps.) The Journal’s manual approach yielded deep insight into the data gathered, including a detailed view of the privacy and data-retention policies of the most prevalent networks, but it also limited the data gathered to a very small subset of the entire World Wide Web.

I have attempted to gather data in a more scalable way, to answer the question of just how much of all online behavior is visible to a handful of advertising networks.

What the data represent

For every site in Alexa’s top one hundred thousand webpages (a ranking of webpage popularity by traffic), I rendered the page in Internet Explorer and looked for the presence of blocked third-party content, as identified by IE9’s “tracking protection” filter. This includes JavaScript bugs, the most common mechanism of behavioral tracking. It does not, however, provide any insight into what cookies a webpage sets or views, or more subtle web bugs such as “Flash cookies.” The most prevalent content networks are well-known behavioral trackers; for many others, we can only guess what data they collect.

Googling

Looking at the top websites, it is clear that content owned by Google (primarily hosted at Google-Analytics.com, GoogleAdServices.com, GoogleSyndication.com, and 2mdn.net, but also including YouTube, Blogger, and more) utterly dominates the list of top content networks. Google-owned content is on over half of the top 100,000 websites; the next nearest competitor, Facebook, is on fewer than a tenth as many.

Network              % of webpages
Google               52.66%
Facebook             4.46%
QuantCast            3.55%
AddThis              3.14%
Cnzz.com             2.54%
ScoreCardResearch    2%
Twitter              2%
Yahoo!               1.85%
StatCounter.com      1.74%
Baidu                1.57%

A total of twelve content networks are present on at least 1% of the top 100,000 webpages; the number rises to 26 when we consider networks present on at least half a percent (500 webpages or more). Yet nobody is on as many webpages as Google.
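As a rough illustration of how per-network figures like these can be derived from a crawl log, here is a minimal sketch in Python. It assumes the log is a list of (page, network) pairs and that the denominator is the full list of pages crawled (the "500 webpages" remark suggests 100,000 was used); the function names and the toy data are hypothetical, not taken from my crawler.

```python
from collections import defaultdict

def prevalence(blocked_log, total_pages):
    """Percent of pages on which each network appears at least once.

    blocked_log: iterable of (page_url, network) pairs, one per blocked resource.
    total_pages: denominator; assumed here to be the size of the crawl list.
    """
    pages_by_network = defaultdict(set)
    for page_url, network in blocked_log:
        # A set counts each page once, however many resources it loads.
        pages_by_network[network].add(page_url)
    return {network: 100.0 * len(pages) / total_pages
            for network, pages in pages_by_network.items()}

def networks_at_or_above(shares, threshold_percent):
    """Networks whose prevalence meets the threshold, most prevalent first."""
    hits = [n for n, s in shares.items() if s >= threshold_percent]
    return sorted(hits, key=lambda n: shares[n], reverse=True)

# Toy log (hypothetical): four pages crawled, one of them with no trackers.
toy_log = [
    ("a.example", "Google"),
    ("a.example", "Google"),    # second Google resource on the same page
    ("a.example", "Facebook"),
    ("b.example", "Google"),
    ("c.example", "QuantCast"),
]
toy_shares = prevalence(toy_log, total_pages=4)
print(toy_shares)
print(networks_at_or_above(toy_shares, 30.0))
```

Counting distinct pages rather than log entries matters: a page that loads several resources from one network should still count only once toward that network's share.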

Not all of Google’s content is necessarily designed to track usage, though it’s hard to know just what data Google retains.

Breakdown of Google content by type

Segregating Google’s content networks into those which are designed for, and described publicly as, advertising or behavioral tracking networks (to wit, 2mdn.net, google-analytics.com, googleadservices.com, googlesyndication.com, and doubleclick.net) shows that 95% of all webpages that contain any Google content contain one or more of these Google trackers.

In other words, even if we assume that Google does not correlate data from embedded YouTube videos, Blogger content, JavaScript libraries that Google hosts for public use at googleapis.com, or any of the other Google products that web developers often embed on their webpages, products designed by Google to track user behavior are present on 50.22% of the top 100,000 websites.

Another interesting detail is the relative frequencies of Google’s various advertising and tracking products. Breaking apart webpages with at least one of those products into those with Google Analytics on them and those without, it turns out that 88% of all those websites contain the tracking bug from Google Analytics. Google Analytics is particularly interesting, because it’s a product provided to webmasters for free that allows them to gather data on the makeup of their audience. By giving webmasters an easy way to gather data on what users do while on a given webpage, Google has found perhaps the perfect way to gather data of its own on a much wider range of behavior.

Tracking each click

Unfortunately, looking at the prevalence of tracking bugs per webpage doesn’t give a great idea of how much of our online behavior is actually visible to a given advertising network. It may be that Google sees half of the top 100,000 webpages, but most people probably spend most of their time on a much smaller set of that list.

A much better approach is to normalize the data by the number of page views spent on each website. Fortunately, for a small fee Alexa will share these data, too, in the form of “page views per million” per website. Page views per million is just Alexa’s estimation of the number of page views a given website receives out of one million random page views on the Web. (More on Alexa’s methodology here.) To pick on Google again, Google.com, Blogspot.com, and YouTube.com together receive, according to Alexa, 93,790 views per million, or about 9.4% of all web traffic.

Here it is worth noting that because Google.com, Blogspot.com, and YouTube.com do not have third party content from a different Google network on their homepages, I have instead summed page views on all the pages which are either Google-owned or contain Google’s third-party resources. This means that in addition to the approximately 25% of all clicks directed at non-Google webpages that contain Google resources, I also include the 14% of traffic directed at Google.com, YouTube.com, and various other Google properties. This sum represents the total amount of web traffic that Google may be aware of, directly or via third-party resources. (The relative balance of these parts–first- vs. third-party traffic–obviously can vary immensely for different networks. Google and Facebook generate a lot of first-party traffic, whereas, for example, Quantserve.com doesn’t generate any.)
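The summation described above can be sketched as follows. This is a minimal Python illustration under assumptions of my own choosing: the Page record, its field names, and the toy numbers are hypothetical, and each page is assumed to carry an Alexa-style "page views per million" figure.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    views_per_million: float   # Alexa-style estimate; 0.0 when unavailable
    owner: str = ""            # network that owns the site itself, if any
    third_party: set = field(default_factory=set)  # networks embedded on the page

def visible_views(pages, network):
    """Split the traffic a network may see into first- and third-party parts."""
    first_party = sum(p.views_per_million for p in pages if p.owner == network)
    third_party = sum(p.views_per_million for p in pages
                      if p.owner != network and network in p.third_party)
    return first_party, third_party

# Toy data (hypothetical), not Alexa's figures.
pages = [
    Page("search.example", 100.0, owner="Google"),
    Page("news.example", 50.0, third_party={"Google", "Facebook"}),
    Page("shop.example", 25.0, third_party={"Facebook"}),
    Page("blog.example", 25.0),
]
first, third = visible_views(pages, "Google")
total = sum(p.views_per_million for p in pages)
share = 100.0 * (first + third) / total
print(first, third, share)
```

Keeping the first- and third-party sums separate makes it easy to see how the balance differs between networks, as with Google versus Quantserve.com above.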

Another methodological issue here is that the top 100,000 webpages account for only about 75% of Alexa’s reported page views. I can therefore reliably state a lower bound on the share of all page views a given ad network might be able to capture, but I cannot say for certain what the true figure is. For accuracy, I have given, again for the top ten networks, the share of page views that each network sees among the top 100,000 webpages. To derive the lower bound on all web traffic, divide a network’s page views by 1,000,000; to get its share of the traffic we are aware of, divide instead by 749,554.25, the number of page views per 1,000,000 which go to any of the top 100,000 websites.

Network                  % of page views (out of 749,554.25 per million)
Google                   38.32%
Facebook                 8.00%
Yahoo                    4.43%
Baidu                    2.96%
Quantserve.com           2.53%
ScoreCardResearch.com    2.28%
Ebay                     1.59%
Microsoft                1.46%
AddThis.com              1.22%
Twitter                  1.12%

Surprisingly, most of the numbers drop. In fact, as a portion of pageviews directed to the top 100,000 websites, the top 10 content networks saw only a combined 57.19%.

My hypothesis is that the very large sites tend to negotiate directly with advertisers and thus have fewer large ad networks present on their homepages. For example, cnn.com receives 0.073% of all traffic (nearly 0.1% of the traffic in the top 100,000), yet contains no third-party trackers.
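The two denominators used in this section (all web traffic versus traffic to the top 100,000 sites) differ by a constant factor, so figures can be converted in either direction. A small Python sketch of that arithmetic, using only numbers quoted above (the function names are my own, for illustration):

```python
# Page views per million that go to any of the top 100,000 websites (from the text).
TOP_100K_VIEWS_PER_MILLION = 749_554.25
COVERAGE = TOP_100K_VIEWS_PER_MILLION / 1_000_000   # about 0.75

def share_of_all_traffic(share_of_top_100k_percent):
    """Lower bound on the share of ALL page views, given a share of top-100k views."""
    return share_of_top_100k_percent * COVERAGE

def share_of_top_100k(share_of_all_traffic_percent):
    """Share of top-100k page views, given a share of all page views."""
    return share_of_all_traffic_percent / COVERAGE

google_lower_bound = share_of_all_traffic(38.32)   # Google's row in the table
cnn_within_top_100k = share_of_top_100k(0.073)     # the cnn.com example

print(f"Google, lower bound on all traffic: {google_lower_bound:.2f}%")
print(f"cnn.com, share of top-100k traffic: {cnn_within_top_100k:.4f}%")
```

Read this way, Google's 38.32% of top-100,000 page views corresponds to a lower bound of roughly 28.7% of all page views, and cnn.com's 0.073% of all traffic is about 0.097% of the traffic within the top 100,000.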

So what does it all mean?

I hope this information can encourage you to ask the right questions, but it’s not very good at answering any of them. If you wanted to know, “How much does Google know about me?” the best you can glean from these data is a lower bound. After all, if you’re anything like me, Google knows not just 25% of the webpages you visit, but potentially most of them (if you find them by a Google search), not to mention who your friends are, whom you call on the phone, where you live and work and where you are at any given point in time, what’s in your stock portfolio and what books you’re reading. By that point, I don’t know if I even care if Google has a bit of JavaScript running on 25% of the webpages I visit.

And do I care if they know that? Eric Schmidt isn’t looking over my personalized file and laughing at me. Google engineers don’t have access to anything personally identifiable as mine. Employers can’t pay Google to give them the inside scoop on me and any embarrassing diseases I might have contracted. Even governments, at least at times, have had to fight an uphill battle to get Google to hand over information. If all my secrets are laid bare, only to be placed in some database and occasionally analyzed, summarized, and aggregated in a bland statistical report by some artificial intelligence, do I have any reason to be afraid?

The thought still gives me just a little twinge, but I suspect a slightly younger generation–the Facebook generation, just in high school or even younger right now–might not understand my reservations.

On a more pragmatic note, these trackers are implemented entirely client-side. The browser is yours. Using any of the major browsers, you can block these and take back at least a tiny shred of your privacy. I highly recommend it.

Methodology, in depth

I gathered the data representing the presence of ad networks on a given page between March 20, 2011 and March 24, 2011. I created a custom Web crawler based on Internet Explorer 9, which used IE9’s “tracking protection” setting to detect blocked third-party resources. I enabled this setting with the “personalized” protection list, such that third-party resources appearing on at least 3 different pages (the minimum setting allowed) would be blocked. This has the obvious flaw of failing to record the first two instances of any given resource. I then hooked the ThirdPartyUrlBlocked event to create a log of all blocked third-party content. The types of content detected thus include not only statically imported third-party JavaScript and frames, but also content inserted into the DOM at runtime (as ad trackers commonly do). I seeded the crawler with the top 100,000 URLs according to Alexa as of March 12, 2011. Of the 100,000 initial URLs, 92,157 loaded with no errors. Because of the possibility of loading errors, these data should be considered a lower bound: false negatives are a possibility.
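To make the threshold flaw concrete, here is a deliberately simplified Python model of the behavior described above. It is not IE9's actual implementation: it only assumes that a resource starts being blocked (and therefore logged) on the third distinct page where it appears. The function name and toy crawl are hypothetical.

```python
from collections import defaultdict

def simulate_blocked_log(crawl, threshold=3):
    """Return the (page, resource) pairs that would be logged as blocked.

    crawl: list of (page, [third-party resources]) in visiting order.
    A resource is logged only once it has been seen on `threshold`
    distinct pages, counting the current one (a simplifying assumption).
    """
    seen_on = defaultdict(set)
    logged = []
    for page, resources in crawl:
        for resource in resources:
            seen_on[resource].add(page)
            if len(seen_on[resource]) >= threshold:
                logged.append((page, resource))
    return logged

# Toy crawl (hypothetical): the same tracker appears on five pages.
crawl = [(f"p{i}", ["tracker.example"]) for i in range(1, 6)]
logged = simulate_blocked_log(crawl)
print(logged)
```

In this model a tracker present on five pages is logged on only three of them, so each resource is undercounted by two pages. That error is negligible for a network on thousands of pages, but it hides networks present on only one or two.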

In order to calculate ad networks per page view, I downloaded page view data from the Alexa Web Information Service on March 26, 2011. Pages for which Alexa data were unavailable are assumed to have zero page views, which again implies the reported data should be taken as a lower bound.

I unified content networks that use multiple domain names by manually authoring canonicalization rules. For example, “doubleclick.net,” “googleadservices.com,” and “google-analytics.com” all canonicalize to “Google.” The possibility that I have missed a given alias should, again, be taken to imply that the reported data are a lower bound.
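A canonicalization step of this kind might look like the following Python sketch. The rule table contains only Google domains named in this post and is illustrative, not my full rule set; the suffix-matching strategy and the function name are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Illustrative rules only: domains named in this post, mapped to one network.
CANONICAL_RULES = {
    "doubleclick.net": "Google",
    "googleadservices.com": "Google",
    "google-analytics.com": "Google",
    "googlesyndication.com": "Google",
    "2mdn.net": "Google",
}

def canonical_network(url):
    """Map a resource URL to a network name via domain-suffix rules.

    Unknown hosts are returned unchanged so they can be reviewed by hand.
    URLs without a scheme yield an empty hostname and so return "".
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Try the full host first, then progressively shorter suffixes.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in CANONICAL_RULES:
            return CANONICAL_RULES[suffix]
    return host

print(canonical_network("http://ad.doubleclick.net/adj/example"))
print(canonical_network("http://www.google-analytics.com/ga.js"))
print(canonical_network("http://cdn.unknown.example/t.js"))
```

Returning the raw hostname for unmatched URLs keeps missed aliases visible in the output, which is how gaps in a hand-written rule set tend to get noticed.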

My data are available for download or querying here for records of third party resources and here for the list of all URLs successfully analyzed. Alternatively, the database, in SQLite3 format, can be downloaded from here. I have not reproduced Alexa’s pageview data due to licensing concerns. If you do anything interesting with the data, please let me know.
