Andreas Weigend | Social Data Revolution | Fall 2012
School of Information | University of California at Berkeley | INFO 290A-3

Class 6: October 8, 2012

Responsible for initial page (up by 10pm on Thursday after class):
  • Corey Hyllested

Class materials:

Pete Warden


Pete gave a presentation on several companies, mostly start-ups, that he believes are changing the way we think about data. Most of them are looking at uses of geographical data. Pete also spoke about his background and his journey from Apple to Mailana. He provided some insight into how his interactions with Facebook and Apple came from trying to find new and useful data sources. He stumbled onto the "spyPhone" data while he was looking for interesting geographical information. He seemed somewhat shocked by the outcome, compared with his experiment mining Facebook data, which caused a lot of consternation at Facebook.

Companies Thinking about Data


Cazoodle is a big-geo corporation that creates geographical data. It finds publicly available images and uses the geographical information encoded in them, in addition to crowd-sourced free text (e.g. Flickr comments and tags), and mashes them up to create highly specific maps. These maps can be used by organizations performing physical reconnaissance. The US military is using such maps to identify mosques in Iraq and Afghanistan.


Bundle is a company that has partnered with Citibank to provide restaurant reviews. By collecting the restaurant, the amount spent, and the age and address of the spender, Bundle is able to make suggestions. Bundle can tell you whether the average "hipster" comes back to this pizzeria or tends towards another pizzeria. To get a fuller sense of each restaurant, Bundle incorporates (and links to) Foursquare and Yelp data. Bundle on Cheeseboard Pizza.
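To make the "does this group come back?" idea concrete, here is a minimal sketch of how a repeat-visit rate per demographic segment could be computed from transaction records. This is not Bundle's actual method; the function name, the record layout (customer, segment, restaurant), and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def return_rate_by_segment(transactions):
    """Hypothetical helper: transactions is an iterable of
    (customer_id, segment, restaurant) tuples, one per purchase.
    Returns {restaurant: {segment: share of customers with more than one visit}}.
    """
    # Count visits per (restaurant, segment, customer).
    visits = defaultdict(int)
    for customer_id, segment, restaurant in transactions:
        visits[(restaurant, segment, customer_id)] += 1

    # For each (restaurant, segment), count customers and repeat customers.
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for (restaurant, segment, _customer), count in visits.items():
        totals[(restaurant, segment)] += 1
        if count > 1:
            repeats[(restaurant, segment)] += 1

    rates = defaultdict(dict)
    for (restaurant, segment), total in totals.items():
        rates[restaurant][segment] = repeats[(restaurant, segment)] / total
    return dict(rates)


# Made-up sample data, for illustration only.
sample = [
    ("c1", "18-29", "Cheeseboard Pizza"),
    ("c1", "18-29", "Cheeseboard Pizza"),
    ("c2", "18-29", "Cheeseboard Pizza"),
    ("c3", "30-44", "Cheeseboard Pizza"),
    ("c3", "30-44", "Cheeseboard Pizza"),
]
rates = return_rate_by_segment(sample)
```

Comparing these rates across two pizzerias for the same segment is one simple way to say that a group "tends towards" one of them.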


OpenPaths wants to let users know where they have been and where they are going. By showing users where they have been, it can provide a [...] Interestingly, it gives users access to their own location information and some maps. They claim the data will be stored encrypted and that the admins do not have access to your data unless requested. Having such openly strong terms and conditions for personal data use seems intuitive from a privacy standpoint. However, unless a company has access to the data, it cannot actually monetize it and keep the service up and running. An interesting trade-off.


PlaceIQ is a service that gives mobile advertisers information about the people in a location by determining location demography in real time. This information is gleaned from real-time feeds of public data sets. Thus, it is attempting to create context by inspecting mostly anonymous data. One example from their website is being able to tell that an event is occurring. The dispersal of the people at the event should prompt mobile advertisers to target them in the context of that event, be it hockey, football, opera, or a public speech.
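As a rough illustration of "being able to tell that an event is occurring" from anonymous data, the sketch below flags hours in which the number of devices seen at a location jumps far above its recent baseline. This is a generic spike check and an assumption on our part, not PlaceIQ's actual method; the function name, the window and threshold values, and the counts are made up.

```python
from statistics import mean, stdev


def find_spikes(counts, window=6, threshold=3.0):
    """Hypothetical helper: return the indices where a count exceeds the
    mean of the previous `window` counts by more than `threshold`
    sample standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            sigma = 1.0  # avoid dividing by zero on a perfectly flat baseline
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Made-up hourly device counts near a venue; the jump marks the start of an event.
hourly_counts = [20, 22, 19, 21, 20, 23, 21, 180, 175, 22]
event_hours = find_spikes(hourly_counts)
```

Note that only the first hour of the jump is flagged here: once the spike enters the trailing window it inflates the baseline, which is one reason a real system would likely use a more robust baseline.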


GNIP sells access to social media data, including reselling access to Twitter data. It is one of the few companies with access to the Twitter firehose (since 2010), and it has the entire Twitter archive. GNIP is under contract with the Library of Congress to preserve tweets.


Jetpac was the obligatory entry: the dogfood for this course was to use it. Jetpac works as a Facebook "application" that helps users find locations they would like to visit, based on friends' images. The general consensus: very cool, nice UI, but it doesn't lend itself to repeat usage. Pete laughed and said he's working on it.


Our discussion centered on three topics.


A tradeoff always exists. Where the line falls differs from person to person. Pete said he has been surprised by where that line falls for the general populace. He pointed out that no one seems to care that credit card purchase information is being given away by Citibank to Bundle. Other cases, such as the "spyPhone" case, are more obvious. Pete suggested blogging can be useful to get feedback. The ability to visualize geographical data makes the loss of privacy more intuitive and more accessible. In a sense, this increases the "shock" effect and makes news of privacy concerns go viral. In the case of Apple's spyPhone, it's easy to see what information Apple collected. In the Facebook case, it's very easy to visualize what information Facebook has.

Visualization of geo-location data stored on iPhone.

Social network analysis of Facebook users' mobility, friendships, and 'likes'.

Process of Finding Data

Pete was adamant that there wasn't any magic to it, but rather a hacker mentality. Look for where free and open data may already exist. Can you scrape it? Is it perhaps available somewhere else? He brought up a few examples of open data sets with APIs. Common Crawl is an archive of web pages. Freebase provides an API allowing users to inspect the graph of related information.

Social Media Archives

GNIP and, to a lesser degree, Splunk are archiving social media's "big data" feeds (e.g. Twitter, Foursquare). This gives them the ability to resell access to the data corpus, in addition to providing corpus search and metrics on feeds. This has research potential, but are these inherently biased samples? Not everyone has made the trade-off to share information.

In the case of GNIP and Twitter, GNIP is purchasing the data. It's curious that Twitter would partner with another company to archive its tweets. Is this a callback to The Future of Work, where Twitter isn't interested in being a social media archive? Twitter is able to sell its data and assume that others, i.e. GNIP, will provide effective, efficient, long-term storage for those interested in the data, such as the Library of Congress.

This raises another interesting question. If the LoC acts as a library, will it have an API ... to search this corpus of tweets? Would that be outsourced to GNIP, as it appears the actual storage is?