Data were collected on the top 100, 1,000, and 25,000 websites in the United States, as ranked in Quantcast’s list of the top 1 million U.S. websites. Two processes were used: 1) a shallow automated crawl of the top 100, 1,000, and 25,000 sites, which visited only the homepage of each domain obtained from Quantcast’s rankings, and 2) a deep automated crawl of the top 100 and 1,000 sites, which visited the homepage and two randomly selected links from the homepage. After visiting the first link, the crawler returned to the homepage before selecting the second link. Both links were on the same domain as the homepage.
The crawler used was OpenWPM, a flexible and scalable platform written in Python. It can collect HTTP cookies, Flash cookies, and HTML5 local storage objects, and it can perform deep crawls by visiting links. OpenWPM allows a crawl to be run in either Firefox or Chrome, with or without add-ons.
All crawls were run using Firefox version 39 with no add-ons, with Flash turned on, and in headless mode. The following information was collected from each visit to a crawled domain: HTTP cookies, HTML5 local storage objects, Flash cookies, and HTTP requests and responses (including headers). Each crawl was run four times, and the results for each tracking method were averaged across the four runs.
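The averaging step can be illustrated with a minimal sketch. It assumes each run is summarized as one count per tracking method; the function name `average_runs`, the dictionary keys, and the numbers are hypothetical placeholders, not values from the study.

```python
from statistics import mean

# Hypothetical per-run counts for a single crawl configuration.
# These numbers are placeholders for illustration, not study results.
runs = [
    {"http_cookies": 10, "flash_cookies": 2, "local_storage": 4},
    {"http_cookies": 12, "flash_cookies": 2, "local_storage": 5},
    {"http_cookies": 11, "flash_cookies": 3, "local_storage": 5},
    {"http_cookies": 12, "flash_cookies": 1, "local_storage": 6},
]


def average_runs(runs):
    """Average each tracking method's count across repeated crawl runs."""
    methods = runs[0].keys()
    return {method: mean(run[method] for run in runs) for method in methods}


if __name__ == "__main__":
    for method, value in average_runs(runs).items():
        print(method, value)
```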
Shallow Automated Crawl
The shallow crawls were run with a clean browser instance, cleared of all tracking data. The crawler visited each homepage URL, waited for the page to load, and then wrote all tracking data obtained from that URL to a database. The crawler then closed the browser tab, opened a new tab, and repeated the process with the next URL on the Quantcast list.
Deep Automated Crawl
The deep crawls were run with a clean browser instance, cleared of all tracking data. The crawler visited each homepage URL and waited for the page to load. It then randomly selected a link on the homepage and visited that page. After the linked page finished loading, the crawler returned to the homepage and visited a second randomly selected link. After the second linked page finished loading, the crawler wrote all tracking data obtained from those three URLs to a database. The crawler then closed the browser tab, opened a new tab, and repeated the process with the next URL on the Quantcast list.
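The link-selection step of the deep crawl can be sketched as follows. This is an illustrative approximation using only the Python standard library, not OpenWPM's own implementation. The names `LinkCollector`, `same_domain_links`, and `pick_links` are hypothetical, and "same domain" is simplified here to an exact hostname match with the homepage.

```python
import random
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(homepage_url, html):
    """Return unique absolute http(s) links whose host matches the homepage's."""
    collector = LinkCollector()
    collector.feed(html)
    home_host = urlparse(homepage_url).hostname
    links = []
    for href in collector.hrefs:
        absolute = urljoin(homepage_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        # Assumption: exact hostname equality stands in for "same domain".
        if parsed.hostname == home_host and absolute not in links:
            links.append(absolute)
    return links


def pick_links(homepage_url, html, count=2, rng=random):
    """Randomly choose up to `count` distinct same-domain links."""
    candidates = same_domain_links(homepage_url, html)
    return rng.sample(candidates, min(count, len(candidates)))
```

Passing a seeded `random.Random` instance as `rng` makes the selection repeatable, which can help when comparing repeated runs.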
Limitations of crawler methods
Limitations of data collection methods
The identification and classification of first- and third-party cookies is a complex task. Many tracking and advertising companies are owned by other companies that operate under different domain names. For example, DoubleClick is owned by Google. For consistency in categorizing third-party cookies, the Public Suffix List was used to group domains by suffix, consistent with previous work. Cookies set by the domain of the visited site were classified as first party, while cookies set by any other domain were classified as third party. Analysis of third-party domains is therefore limited to domains that are syntactically third parties, and does not reflect any underlying agreements or connections that may exist between multiple domains. One example is “DNS aliasing,” where a primary domain assigns a subdomain to a tracking company. Under such an arrangement, what would ordinarily be third-party cookies are set in a first-party fashion.
The ranking list used was Quantcast’s top 1 million sites in the United States. This ranking may be different in other countries.
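The syntactic first-party/third-party classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study. `PUBLIC_SUFFIXES` is a tiny hand-picked subset standing in for the full Public Suffix List, and the names `registered_domain` and `classify_cookie` are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: a tiny illustrative subset of the Public Suffix List.
# A real analysis would load the complete list instead.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def registered_domain(host):
    """Return the public suffix plus one label, e.g. 'example.co.uk'."""
    labels = host.lower().rstrip(".").split(".")
    # Check the longest candidate suffix first so 'co.uk' is matched whole.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            # If the host is itself a public suffix, there is no
            # registrable domain to return.
            return ".".join(labels[i - 1:]) if i > 0 else None
    return None  # suffix not in the (partial) list


def classify_cookie(page_url, cookie_domain):
    """Label a cookie by comparing registered domains only."""
    site = registered_domain(urlparse(page_url).hostname or "")
    owner = registered_domain(cookie_domain.lstrip("."))
    if site is None or owner is None:
        return "unknown"
    return "first party" if site == owner else "third party"


if __name__ == "__main__":
    # Different registered domains: third party, regardless of ownership.
    print(classify_cookie("https://www.google.com/", ".doubleclick.net"))
    # A subdomain of the visited site: first party, even if the subdomain
    # were aliased to a tracking company.
    print(classify_cookie("https://www.example.com/", "metrics.example.com"))
```

Because the comparison looks only at domain names, it reproduces both limitations noted above: a DoubleClick cookie on a Google page is labeled third party despite the shared ownership, and a cookie set through an aliased subdomain is labeled first party.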
This work was supported by TRUST, Team for Research in Ubiquitous Secure Technology, which receives support from the National Science Foundation (NSF award number CCF-0424422).