Methods – 2012

Data were collected on the top 100, top 1,000, and top 25,000 websites as ranked by Quantcast. Data were collected using two processes: a shallow crawl of the top 25,000 sites, which consisted of visiting only the home page of each domain in Quantcast's ranking, and a deep crawl of the top 100 and top 1,000 sites, which consisted of visiting the home page of each domain and then traversing up to six random links from that page, intended to simulate some level of activity at the website.

The shallow and deep crawls collected the same types of information at each webpage: HTTP cookies, Flash cookies, calls to HTML5 local storage, calls to Flash that may be used for browser fingerprinting, and metadata about the webpage and the crawl.

The Crawler

One purpose of the Web Privacy Census was to develop a crawling process that could be used to take regular samples of the tracking ecosystem. The list of domains was crawled by a distributed crawling system built for this purpose. Each crawler node was functionally identical and consisted of an Ubuntu host running a full Firefox browser with a modified version of the FourthParty extension for extended data capture and analysis. A MozMill script controlled the browser. Automated deployment scripts created each crawling node, and the domains to be crawled were provided by a control node that also monitored the status of the crawling nodes and collected the resulting crawl data. The deployment scripts also set the crawl parameters, such as DNT headers and the number of links to follow.
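As an illustration, the crawl parameters handed out by the control node can be thought of as a small configuration object. The sketch below is a hypothetical Python rendering of those settings; the field names (send_dnt_header, links_to_follow, and so on) are assumptions, not the census's actual parameter names.

    from dataclasses import dataclass

    @dataclass
    class CrawlSettings:
        """Hypothetical crawl parameters a control node might send to a crawler node."""
        send_dnt_header: bool = False   # whether Firefox should send the DNT: 1 header
        links_to_follow: int = 0        # 0 for a shallow crawl, up to 6 for a deep crawl
        page_load_timeout_s: int = 30   # assumed timeout for waiting on the homepage

    # Example: settings for a deep-crawl node with Do Not Track enabled
    deep_settings = CrawlSettings(send_dnt_header=True, links_to_follow=6)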

Modifications were made to FourthParty to expand its scope of data capture and to improve the ability to connect the stored data on browser activity with the visited domain. Each crawler collected the following information from each crawled domain visit: HTTP cookies, Flash cookies, local storage access, JavaScript calls, the raw HTML provided to the browser, and HTTP requests and responses (including headers). Before moving on to the next domain, the data collected at each site was stored, the browser state and cache were cleared, and the Flash SOL storage directory in the file system was cleared to "reset" the crawler node. After the crawler nodes finished, the data was gathered on the control node, where it was normalized and summarized to enable efficient analysis by crawl and by domain.
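The per-domain "reset" amounts to deleting the browser's stored data and the Flash local shared object (SOL) directories. A minimal sketch of that cleanup on an Ubuntu host follows, assuming the conventional Flash storage paths under ~/.macromedia and a hypothetical profile directory; the actual crawler's paths and cleanup code are not reproduced here.

    import os
    import shutil

    # Conventional Flash SOL locations on Linux; the profile path is an assumption.
    FLASH_SOL_DIRS = [
        os.path.expanduser("~/.macromedia/Flash_Player/#SharedObjects"),
        os.path.expanduser("~/.macromedia/Flash_Player/macromedia.com/support/flashplayer/sys"),
    ]
    FIREFOX_PROFILE_DIR = os.path.expanduser("~/crawler-profile")  # hypothetical profile location

    def reset_crawler_state():
        """Clear browser state, cache, and Flash SOL storage between domain visits."""
        for path in FLASH_SOL_DIRS + [FIREFOX_PROFILE_DIR]:
            if os.path.isdir(path):
                shutil.rmtree(path)       # drop cookies, cache, local storage, SOL files
        os.makedirs(FIREFOX_PROFILE_DIR)  # start the next visit with an empty profile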

Crawl Types

Shallow Crawl

For the shallow crawl, each domain in the crawl list was visited with a fresh browser with clean data directories and the about:blank URL loaded. The crawler then navigated directly to its control node and received the URL of the domain to be crawled and the browser and crawl settings to be used. Only the domain was visited; no links were clicked or followed. After waiting for the domain's homepage to load, all the data was stored and the crawler was cleaned out before visiting the next domain.
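Taken together, the shallow crawl is a simple visit, store, reset loop per domain. The sketch below uses Selenium's Firefox driver to stand in for the MozMill-controlled browser, reuses the reset_crawler_state helper sketched above, and treats the control-node calls (get_next_domain, store_crawl_data) as hypothetical placeholders for the census's internal interface.

    from selenium import webdriver  # stands in for the MozMill-driven Firefox instance

    def shallow_crawl(control_node):
        """Visit only the homepage of each assigned domain, store the data, then reset."""
        while True:
            domain = control_node.get_next_domain()      # hypothetical control-node call
            if domain is None:
                break
            driver = webdriver.Firefox()                 # fresh browser, clean data directories
            driver.get("about:blank")
            driver.get("http://" + domain + "/")         # homepage only; no links followed
            control_node.store_crawl_data(domain, driver.page_source)  # hypothetical upload
            driver.quit()
            reset_crawler_state()                        # clear cache and Flash SOL storage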

Deep Crawl

For the deep crawl, each domain in the crawl list was visited with a fresh browser. Once the domain's homepage had loaded, links were followed by selecting randomly from the set of HTML anchor tags on the page that 1) matched the domain being crawled and 2) were not "javascript:" links. Once the links had been filtered, the remaining array of links was randomized and the first six elements were selected to be followed. This resulted in up to six links being followed, and the same link would not be visited twice unless there was more than one link to that destination URL from the homepage of the domain. For example, a deep crawl of google.com would consist of visiting google.com and selecting at random up to six links that were all part of google.com, including apps.google.com but not doubleclick.com. The browser and file system were cleared between each deep-crawled domain, after the links had been followed.
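The link-selection rule described above (keep anchors that match the crawled domain, drop "javascript:" links, shuffle, and take the first six) can be sketched in a few lines. This is an illustrative reimplementation in Python, not the crawler's actual MozMill code.

    import random
    from urllib.parse import urljoin, urlparse

    def select_deep_crawl_links(crawl_domain, homepage_url, anchor_hrefs, max_links=6):
        """Pick up to max_links same-domain, non-"javascript:" links at random."""
        candidates = []
        for href in anchor_hrefs:
            if href.strip().lower().startswith("javascript:"):
                continue                                # rule 2: skip "javascript:" links
            absolute = urljoin(homepage_url, href)      # resolve relative links
            host = urlparse(absolute).hostname or ""
            if host == crawl_domain or host.endswith("." + crawl_domain):
                candidates.append(absolute)             # rule 1: same domain, subdomains included
        random.shuffle(candidates)                      # randomize the filtered links...
        return candidates[:max_links]                   # ...and keep up to six to follow

    # Example: picks links within google.com (including apps.google.com), never doubleclick.com
    links = select_deep_crawl_links("google.com", "http://google.com/",
                                    ["/search", "http://apps.google.com/", "javascript:void(0)"])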

Limitations

Limitations of crawling method

The crawler architecture was limited to using the Firefox browser, so any differences between browser types were not captured in this iteration. Additionally, because the crawler automated the selection of URLs for deep crawls, any retargeting that was based on a human action (e.g., adding items to a shopping cart) was not necessarily captured in this crawl. Deep crawls were also limited to the HTML anchor tags found on the page and did not follow links set by JavaScript. Additionally, links obtained by the deep crawler were selected at random from the links on the page, so the selection process did not take page layout or visual prominence into account. The crawler also did not access content on sites that require logins, so any content and trackers that existed behind a login were not recorded. Relatedly, the crawler did not log in and maintain an identity while traversing sites; for example, the crawler did not log into a Facebook account and then attempt to visit websites in this iteration.

Limitations of data collection methods

Identification and classification of third-party and first-party cookies can be complicated. Many sites are owned by companies that operate other sites under different domain names; for example, DoubleClick is owned by Google. For consistency in categorizing third-party cookies, we leveraged the public suffix list to determine each site's top-level (registrable) domain, consistent with how other authors in the space have done, and used the rule that cookies from that top-level domain were classified as first party, while cookies from a domain outside of that top-level domain were classified as third party. Our analysis of third-party domains is therefore limited to domains that are syntactically third parties, and is not reflective of any underlying agreements or connections that may exist between multiple domains.
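As a concrete illustration of this rule, the sketch below uses the tldextract Python package, which applies the public suffix list, to reduce each hostname to its registrable domain before comparing a cookie's domain with the crawled site's. This is a hedged example of the classification logic, not the code used to produce the census data.

    import tldextract  # splits hostnames using the public suffix list

    def registrable_domain(host):
        """Collapse a hostname to its public-suffix-list registrable domain,
        e.g. 'metrics.example.co.uk' -> 'example.co.uk'."""
        parts = tldextract.extract(host)
        return parts.domain + "." + parts.suffix

    def classify_cookie(site_host, cookie_domain):
        """First party if the cookie's registrable domain matches the crawled site's,
        otherwise third party (a purely syntactic test)."""
        if registrable_domain(cookie_domain.lstrip(".")) == registrable_domain(site_host):
            return "first party"
        return "third party"

    # A DoubleClick cookie observed while crawling google.com is syntactically third party,
    # even though DoubleClick is owned by Google.
    print(classify_cookie("www.google.com", ".doubleclick.net"))     # -> third party
    print(classify_cookie("www.google.com", "accounts.google.com"))  # -> first party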