Andreas Weigend | Social Data Revolution | Fall 2012
School of Information | University of California at Berkeley | INFO 290A-3

Class 5: October 1, 2012 (4:15-5:45pm)


Responsible for initial page (up by 10pm on Thursday after class):
  • Eunkwang Joo

Class materials:



What really is social data?

Human interactions are becoming observable. They can be categorized by who is involved:
  • The individual
  • Pairs of individuals
  • Groups of individuals

As social creatures, people have been creating social data for a long time. Data are created, captured, and sometimes disappear. Now we are capturing more of them. If we record a conversation or a lecture, the recording becomes part of social data, because it persists. Availability is then a question of whether this persistent data is private or public. If we hide it on Facebook under a private setting, we have persistent social data that is not very accessible; if we make it public, it becomes very accessible. If it happens to get promoted, it is both very accessible and very persistent. Any data can be made social by somebody, so all data can be social.


What is data science, and a data scientist?

Data is always incomplete.
A data scientist's role is to have a dialog with other people and to tell them stories about the data, so storytelling is an important skill for data scientists. Data scientists do not worry much about finding data; they can usually find it somewhere. They focus more on what problem we want to solve, whereas a CIO's view is more concerned with getting certain kinds of data.


According to Wikipedia (http://en.wikipedia.org/wiki/Data_science)

Data science is a discipline that incorporates varying elements and builds on techniques and theories from many fields, including Math, Statistics, Data Engineering, Pattern Recognition and Learning, Advanced Computing, Visualization, Uncertainty Modeling, Data Warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. Data Science is a novel term that is often used interchangeably with "Competitive Intelligence" or "Business Analytics," although it is becoming more common. Data Science seeks to use all available and relevant data to effectively tell a story that can be easily understood by non-practitioners.
A practitioner of Data Science is called a Data Scientist. Data Scientists solve complex data problems through employing deep expertise in some scientific discipline. It is generally expected that Data Scientists are able to work with various elements of mathematics, statistics and computer science, although expertise in these subjects is not required. However, a data scientist is most likely to be an expert in only one or two of these disciplines and proficient in another two or three. There is probably no living person who is an expert in all of these disciplines and an extremely rare person would be proficient in all of these disciplines. This means that data science must be practiced as a team, where across the membership of the team there is expertise and proficiency across all the disciplines.
Good Data Scientists are able to apply their skills to achieve a broad spectrum of end results. Some of these include the ability to find and interpret rich data sources, manage large amounts of data despite hardware, software and bandwidth constraints, merge data sources together, ensure consistency of data-sets, create visualizations to aid in understanding data and build rich tools that enable others to work effectively. According to some experts(?), the best data scientists tend to be "hard scientists," particularly physicists, rather than those with backgrounds in computer science. The skill-sets and competencies that Data Scientists employ vary widely. Data scientists are an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and analysis, that can help businesses gain a competitive edge.[1]
A major goal of Data Science is to make it easier for others to find and coalesce data. Data science technologies impact how we access data and conduct research across various domains, including the biological sciences, medical informatics, Social Sciences and the humanities. “From intelligence search that integrates better understanding of the text and the user’s intentions, to integrating multiple modalities when accessing information, to the ability to actually aggregate information from multiple sources and answer users queries, the possibilities are endless.”

Research areas of data science include:
  • Cloud Computing
  • Databases and Information Integration
  • Learning, Natural Language Processing and Information Extraction
  • Computer Vision
  • Information Retrieval and Web Information Access
  • Knowledge Discovery in Social and Information Networks


[Figure: Data Science Venn diagram, from drewconway.com]


According to Mike Loukides (http://radar.oreilly.com/2010/06/what-is-data-science.html)

A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.

Google is a master at creating data products. Here’s a few examples:
  • Google’s breakthrough was realizing that a search engine could use input other than the text on the page. Google’s PageRank algorithm was among the first to use data outside of the page itself, in particular, the number of links pointing to a page. Tracking links made Google searches much more useful, and PageRank has been a key ingredient to the company’s success.
  • Spell checking isn’t a terribly difficult problem, but by suggesting corrections to misspelled searches, and observing what the user clicks in response, Google made it much more accurate. They’ve built a dictionary of common misspellings, their corrections, and the contexts in which they occur.
  • Speech recognition has always been a hard problem, and it remains difficult. But Google has made huge strides by using the voice data they’ve collected, and has been able to integrate voice search into their core search engine.
  • During the Swine Flu epidemic of 2009, Google was able to track the progress of the epidemic by following searches for flu-related topics.
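
The link-counting idea in the first example can be illustrated with a small sketch. This is a simplified, textbook-style power iteration over a toy link graph, not Google's production algorithm; the function name `pagerank`, the damping value 0.85, the iteration count, and the four-page `toy_web` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from link structure alone.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
best = max(ranks, key=ranks.get)
print(best, round(ranks[best], 3))
```

In this toy graph the page with the most incoming links ends up with the highest score, even though the sketch never looks at the text of any page.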

[Figure: Flu trends]
Google was able to spot trends in the Swine Flu epidemic roughly two weeks before the Center for Disease Control by analyzing searches that people were making in different regions of the country.


Facebook, Twitter, and LinkedIn also build data products: they use data about a user's existing connections to suggest new ones.
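
One common way to think about such suggestions is "friends of friends": rank people the user is not yet connected to by how many connections they share. The sketch below is only an illustration of that idea and is not how any of these companies actually implement it; the function `suggest_friends` and the names in `toy_graph` are hypothetical.

```python
from collections import Counter


def suggest_friends(friends, user, top_n=3):
    """Suggest new connections for `user` by counting mutual friends.

    `friends` maps each person to the set of people they are connected to.
    """
    direct = friends.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user and people they already know.
            if candidate != user and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties are broken alphabetically.
    ranked = sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical friendship graph.
toy_graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
suggestions = suggest_friends(toy_graph, "ann")
print(suggestions)
```

Here "dan" is suggested ahead of "eve" because he shares two friends with "ann" and she shares only one.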

Where data comes from

Data is everywhere: your government, your web server, your business partners, even your body. While we aren’t drowning in a sea of data, we’re finding that almost everything can (or has) been instrumented. Sites like Infochimps and Factual provide access to many large datasets, including climate data, MySpace activity streams, and game logs from sporting events. Factual enlists users to update and improve its datasets, which cover topics as diverse as endocrinologists to hiking trails.
As storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.” The most meaningful definition I’ve heard: “big data” is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.


Data scientists

Data science requires skills ranging from traditional computer science to mathematics to art. Describing the data science group he put together at Facebook (possibly the first data science group at a consumer-oriented web property), Jeff Hammerbacher said:
  • … on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization. [3]
Where do you find the people this versatile? According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.
Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”
The future belongs to the companies who figure out how to collect and use data successfully. Google, Amazon, Facebook, and LinkedIn have all tapped into their datastreams and made that the core of their success. They were the vanguard, but newer companies like bit.ly are following their path. Whether it’s mining your personal biology, building maps from the shared experience of millions of travellers, or studying the URLs that people pass to others, the next generation of successful businesses will be built around data. The part of Hal Varian’s quote that nobody remembers says it all:
  • The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.
Data is indeed the new Intel Inside.