Data were collected on the top 100, top 1,000, and top 25,000 websites as ranked by Quantcast, using two processes: a shallow crawl of the 25,000 sites, which consisted of visiting only the home page of each domain in Quantcast's ranking, and a deep crawl of the 100 and 1,000 sites, which consisted of visiting the home page of each domain and then traversing up to six random links from that page, intended to simulate some level of user activity at the website.
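As a rough illustration of the deep crawl's traversal step, the sketch below loads a domain's home page and follows up to six randomly chosen links. It uses Selenium rather than the MozMill tooling described below, and the scheme and link-filtering details are assumptions for illustration, not the Census's actual code.

```python
# Illustrative sketch of the deep crawl's link traversal (hypothetical
# Selenium code, not the MozMill implementation described below): load
# the home page, then follow up to six randomly chosen links.
import random

from selenium import webdriver
from selenium.webdriver.common.by import By

def deep_crawl(domain: str, max_links: int = 6) -> None:
    driver = webdriver.Firefox()
    try:
        driver.get(f"http://{domain}")
        for _ in range(max_links):
            # Gather candidate links on the current page.
            hrefs = [a.get_attribute("href")
                     for a in driver.find_elements(By.TAG_NAME, "a")]
            hrefs = [h for h in hrefs if h and h.startswith("http")]
            if not hrefs:
                break
            # Follow one at random to simulate some level of user activity.
            driver.get(random.choice(hrefs))
    finally:
        driver.quit()
```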
The shallow and deep crawls collected the same types of information at each webpage: HTTP cookies, Flash cookies, calls to HTML5 local storage, calls to Flash that may be used for browser fingerprinting, and metadata about the webpage and the crawl.
One purpose of the Web Privacy Census was to develop a crawling process that could be used to take regular samples of the tracking ecosystem. The list of domains was crawled by a distributed crawling system built for this purpose. Each crawler node was functionally identical and consisted of an Ubuntu host running a full Firefox browser with a modified version of the FourthParty extension for extended data capture and analysis. A MozMill script controlled the browser. Automated deployment scripts created each crawling node, and the domains to be crawled were provided by a control node that also monitored the status of the crawling nodes and collected the resulting crawl data. The deployment scripts also set the crawl parameters, such as DNT headers and the number of links to follow.
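The control node's protocol is not specified in detail here; the following minimal sketch shows one way such a dispatcher could work. The /next endpoint, JSON shape, port, and placeholder domains are illustrative assumptions, not the system's actual interface.

```python
# Minimal sketch of a control node that hands each crawler the next domain
# plus crawl settings, serving the domain list from a queue. Endpoint name,
# JSON shape, and port are illustrative assumptions.
import json
import queue
from http.server import BaseHTTPRequestHandler, HTTPServer

domains: queue.Queue = queue.Queue()
for d in ["example.com", "example.org"]:  # in practice, the ranked domain list
    domains.put(d)

# Crawl parameters pushed to every node; the text names DNT headers and the
# number of links to follow as configurable settings.
CRAWL_SETTINGS = {"dnt_header": False, "links_to_follow": 0}

class ControlNode(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/next":
            try:
                job = {"domain": domains.get_nowait(),
                       "settings": CRAWL_SETTINGS}
                body = json.dumps(job).encode()
            except queue.Empty:
                body = b'{"done": true}'  # list exhausted
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 8000), ControlNode).serve_forever()
```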
For the shallow crawl, each domain in the crawl list was visited with a fresh browser with clean data directories and the about:blank URL loaded. The crawler then navigated directly to its control node and received the URL of the domain to be crawled, along with the browser and crawl settings to be used. Only the domain was visited; no links were clicked or followed. After waiting for the domain's homepage to load, all of the data were stored and the crawler was cleaned out before visiting the next domain.
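The per-domain isolation can be sketched as follows, again using Selenium as a stand-in for the MozMill and FourthParty instrumentation (an assumption for illustration, not the authors' implementation), and capturing only HTTP cookies for brevity.

```python
# Sketch of the per-domain isolation described above. Each domain gets a
# fresh browser with clean data directories, and all state is discarded
# before the next domain is visited.
from selenium import webdriver

def crawl_domain(domain: str) -> dict:
    driver = webdriver.Firefox()  # new instance, clean profile and storage
    try:
        driver.get("about:blank")       # start from a blank page
        driver.get(f"http://{domain}")  # visit only the home page
        # The real crawler stored the full FourthParty capture; HTTP
        # cookies serve as a simple stand-in here.
        return {"domain": domain, "cookies": driver.get_cookies()}
    finally:
        driver.quit()  # discard all browser state before the next domain

if __name__ == "__main__":
    for d in ["example.com", "example.org"]:
        record = crawl_domain(d)
        print(record["domain"], len(record["cookies"]))
```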
Limitations of the crawling and data collection methods
Identification and classification of third-party and first-party cookies can be complicated. Many sites are owned by the same company as sister sites with different domain names; for example, DoubleClick is owned by Google. For consistency in categorizing third-party cookies, we leveraged the public suffix list to combine suffixes, consistent with how other authors in the space have done, and used the rule that cookies from the visited page's top-level (registered) domain were classified as first party, while cookies from a domain outside that top-level domain were classified as third party. Our analysis of third-party domains is therefore limited to domains that are syntactically third parties, and does not reflect any underlying agreements or connections that may exist between multiple domains.
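As an illustration of this rule, the comparison can be expressed with a public-suffix-aware library such as tldextract. This is a sketch of the rule as stated, not the Census's actual tooling.

```python
# Sketch of the first-/third-party classification rule using the public
# suffix list via the tldextract library (pip install tldextract).
import tldextract

def registered_domain(host: str) -> str:
    """Collapse a hostname or URL to its registered domain using the
    public suffix list, e.g. 'metrics.example.co.uk' -> 'example.co.uk'."""
    ext = tldextract.extract(host.lstrip("."))  # cookie domains may lead with '.'
    return ext.registered_domain

def is_third_party(cookie_domain: str, page_url: str) -> bool:
    """First party if the cookie's registered domain matches the visited
    page's registered domain; third party otherwise."""
    return registered_domain(cookie_domain) != registered_domain(page_url)

if __name__ == "__main__":
    # Syntactically third party even though DoubleClick is owned by Google:
    print(is_third_party(".doubleclick.net", "http://www.google.com/"))  # True
    print(is_third_party(".google.com", "http://www.google.com/"))       # False
```

Note that, consistent with the limitation described above, this check captures only the syntactic relationship between domains; corporate ownership, as in the DoubleClick and Google example, is invisible to it.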