Andreas Weigend | Social Data Revolution | Fall 2012
School of Information | University of California at Berkeley | INFO 290A-3

Class 3: September 17, 2012

Responsible for initial page (up by 10pm on Thursday after class):
  • Naehee Kim (
  • Venkat Mynampati
  • Student 3

Class materials:

The Future of Work


Social data allows us to have a record of what people do, wherever they do it. Most employment is based on showing up at a particular point in time, sitting there, and going home in the evening. This makes sense in the world of manufacturing. Our school system is good for compliance, but perhaps not for the creativity that drives our society forward. We are going to talk about ways we see the future of work.

BranchOut - Career Networking on Facebook

  • A Facebook application designed for finding jobs, networking professionally, and recruiting employees. (Wikipedia)
  • They expected to get millions of users immediately: 22-25m active users on Facebook, roughly 1/2 billion coverage
  • Data is quite poor; most people don't fill out the majority of job details, but machine learning can be used to make predictions. The full-time job market can move on from what we have now (LinkedIn) to something more social-data based.
  • Do people like to keep personal and professional separate? The verdict is undecided; people are horny, lazy, and greedy

What are properties of tasks that are easy to outsource?

  • Communication cost matters: a task is easy to outsource only if explaining it costs less than doing it yourself.
  • How can social data help you decide to do a task for someone or outsource your task?

Astrid (Lecturer: Jon Paris, CEO of

Social Data Revolution is changing the world again in some incredible ways.

The numbers on social network services (e.g. Facebook friends, LinkedIn connections, Twitter followers) are indicators of whether a person would be a good employee or not. In my experience, an intern with high numbers was much more effective at research, working with the web, and sourcing leads than an MBA graduate with none.

The profiles and social activities on Facebook and Twitter give us a lot of information about a person. Likewise, a productivity profile is another way we generate information. Astrid provides three types of profiles: (1) completed tasks, (2) inspired tasks, (3) supported tasks. Different people care about different aspects; a CEO is likely to care about completion rate.

Nowadays, people are more comfortable outsourcing a wider range of tasks, for example through TaskRabbit and Zaarly.

Astrid builds a dashboard for delegation.

  • User base - busy moms overwhelmed by different roles.
  • Helped people complete 45m tasks; people complete tasks 30% faster if tasks are shared
  • Problems in delegating tasks: (1) pains of communication (so much time to get it done) (2) pains of negotiation (3) pains of identification (who can do it for me?)
  • Goal: help people get things done through social sharing (delegation).
  • Specialization continues beyond changing oil to upgrading public presence and brand. How do we find the people with this expertise? As the world transitions to a gig economy, social data will help the world navigate these problems. It will solve problems through trust. If you are a friend of my friend, I am more open to letting you update my Facebook profile or clean my car.
  • Astrid can work for low-skill to high-skill tasks, such as family chores or managing the public presence of a businessman. Elance and other freelance companies are more about technical problems.


  • How do I make sure the person does what they're assigned?
    • Face counting on you - We work with all social networks to pull faces of all people assigned
    • Reminders
    • We did an experiment of the wall of fame and wall of shame.
  • What features make Astrid popular?
    • Fast, simple, easy to use
    • Differentiation point: personalization, e.g. the reminder system
  • Is gamification a good idea?
    • Some of the stats are helpful, e.g. completion rate, top 10 inspirational people
  • How to manage the complexity of your app when adding new features?
    • A/B test - drop some features.
  • C2C model? B2B model?
    • Astrid service is versatile. It is already used by a number of small organizations.
  • Discrimination is a concern
    • Social data can be used for discrimination. For example, information such as age, social status, and company name can be inferred from your email address. Hopefully the rating system in Astrid will help to counteract discrimination.

Social data is crucial in the creation of trust.
  • Social distance: social networks can help increase or decrease the distance among people
  • Finding commonality by using social network services; the level of trust increases.
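
As a rough illustration of social distance, the sketch below counts the number of friendship hops between two people in a small graph using breadth-first search. The graph, the names, and the function `social_distance` are hypothetical; they are not taken from Astrid or from the lecture.

```python
from collections import deque

def social_distance(graph, start, goal):
    """Return the number of friendship hops from start to goal, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in graph.get(person, ()):
            if friend == goal:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None

# Hypothetical undirected friendship graph (each edge listed in both directions).
friends = {
    "ann": ["bob"],
    "bob": ["ann", "cat"],
    "cat": ["bob", "dan"],
    "dan": ["cat"],
    "eve": [],
}

print(social_distance(friends, "ann", "cat"))  # a friend of a friend
print(social_distance(friends, "ann", "eve"))  # no known connection
```

A distance of 2 corresponds to the "friend of my friend" case mentioned in the Astrid discussion; a result of `None` means the data shows no path at all.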

Talk by Jeremy Carr of Palantir

What is Palantir

Palantir - A palantír (pl. palantíri) is a magical artifact from J. R. R. Tolkien's fiction. A palantír (sometimes translated as "Seeing Stone" but literally meaning "Farsighted" or "One that Sees from Afar"; cf. English television) is a spherical stone that functions somewhat like a crystal ball. (Source: Wikipedia)

What does Palantir do? (source: Redefining Search. The Palantir Play: A Blend of Open and Closed. By: Arnold, Stephen E., Information Today, 87556286, Sep2010, Vol. 27, Issue 8)

Palantir is in the business of building platforms for human-driven analysis and has the potential to become increasingly familiar in the information retrieval world with its new approach to building contextually grounded visual analytics environments.

"To simplify, think of Palantir as a platform for integrating, visualizing, and analyzing the world's information. If this sounds similar to Google, the resonance is in harmony. The founders of Palantir have their roots deeply entwined in Silicon Valley, and a number of the founders are alums of Stanford University. Since the company's inception in 2004, Palantir has emerged as one of the few outfits that can ingest structured, unstructured, relational, temporal, and geospatial data. Once in the system, Palantir makes it possible to perform sophisticated analyses."

Palantir straddles both the commercial and governmental sectors, and its technology is used to deter threats, be it fraud in commercial operations such as PayPal or banks, or national security threats. (Reference: Palantir's tool is called most effective to probe terrorist networks, intelligence analysts. Gorman, Siobhan. Wall Street Journal [Brussels] 08 Sep 2009: 16.)
"Some analysts say Palantir's strength is helping analysts draw inferences when confronted with an enormous amount of disparate data. Palantir's tool is getting a thumbs-up from officers using it. 'It is much simpler to understand the results of inquiries, and provides more in-depth database links than the current programs in use by the Army today,' says Captain James King, an Army intelligence officer."

Case study from JP Morgan Chase...(reference: Investment Weekly News (Oct 22, 2011): 720.)
Palantir was recognized by the IT Executive Team at JPMC for the impact its technology has made on the business, the disruptive nature of its technology, and the ability of its partnership to drive business value within JPMC. Palantir was honored with the Hall of Innovation Award at the 3rd Annual J. P. Morgan Chase Technology Innovation Symposium on Tuesday, October 4, 2010 at the Rosewood Sand Hill in Menlo Park, California.
"Palantir provides a window into valuable data both broadly and directly affecting our customers," noted JPMorgan Chase CIO, Guy Chiarello. "These insights allow JPMorgan Chase to provide greater value to our clients."
The Palantir Platform is ideally suited for working in exceedingly data-rich environments, such as consumer banks that depend on understanding their data to drive enterprise critical results. The Palantir Platform enables technical and non-technical users to work across the enterprise to achieve critical results. Palantir is widely deployed in many of the world's most important financial, commercial and mission critical organizations.

Now about the talk by Jeremy

Jeremy covered four things related to fraud detection, in the context of explaining that it is relatively easy to build a payment system or service but really hard to detect and prevent fraud. This was in relation to the efforts at PayPal to fight fraud; apparently PayPal was losing close to $10M USD a month (? seems high).

Key questions asked during this effort were:
1. How do you combat fraud?
2. What data do we have to detect and fight fraud?
3. What data needs to be collected to detect and fight fraud?
4. How do we make sense of data?
5. What infrastructure is needed to make sense of data?

So the first order of business is to collect data, all kinds of data, so that associations can be made between these data points and patterns can be discerned.
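
One way to picture "associations between data points" is to link accounts that share an attribute, such as an IP address or a card number, and then look at the resulting clusters. The sketch below does this with a small union-find structure. The account records and field names are invented for illustration and do not describe PayPal's or Palantir's actual approach.

```python
from collections import defaultdict

def link_accounts(records, keys):
    """Group account ids that share any value for the given attribute keys."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent pointers to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (key, value) -> first account id seen with that value
    for r in records:
        for k in keys:
            v = r.get(k)
            if v is None:
                continue
            if (k, v) in first_seen:
                parent[find(r["id"])] = find(first_seen[(k, v)])
            else:
                first_seen[(k, v)] = r["id"]

    groups = defaultdict(set)
    for r in records:
        groups[find(r["id"])].add(r["id"])
    return sorted(sorted(g) for g in groups.values())

# Hypothetical accounts: a1 and a2 share an IP, a2 and a3 share a card.
accounts = [
    {"id": "a1", "ip": "10.0.0.1", "card": "1111"},
    {"id": "a2", "ip": "10.0.0.1", "card": "2222"},
    {"id": "a3", "ip": "10.0.0.9", "card": "2222"},
    {"id": "a4", "ip": "10.0.0.7", "card": "3333"},
]
clusters = link_accounts(accounts, ["ip", "card"])
print(clusters)
```

Here a1 and a3 never share an attribute directly, yet they end up in the same cluster through a2. That kind of indirect association only becomes visible once data from different sources has been collected in one place.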

The speaker listed 4 core concepts for collecting and collating data:

1. Data Integration: Potato and potahto, i.e. multiple ways of saying the same thing. When data is aggregated from different systems, it needs to mean the same thing; refer to MDM (Master Data Management).
2. Search and Discovery: Given a large data set, how do you search for anything? You start with a problem, then come up with key words (concepts) that describe the problem, and then search for those key concepts (aka associations).
3. Knowledge Management: Knowledge is not in systems; it is in the processes or in people's heads. How do we ensure that this 'trapped' (aka tribal) knowledge is spread out and/or made available to folks without that rich background? (Question: Can folks get knowledge without knowing context?)
4. Collaboration: How can teams or machines collaborate? What is the unit of collaboration?

Data Integration

Most of the data captured in systems is in the form of rows and columns. OLTP systems are good at this: Oracle, MySQL, etc. are the workhorses of transactional systems and they capture tons of data (user account information, payment transaction details, order histories, etc.). But structured storage forces a structure on data for ease of capturing, storing, and querying it in databases, whereas the real world is full of unstructured data: speech, emails, documents (Word, PDF, photos, etc.). Representing this data in a structured form is a nightmare. Unless we find a mechanism to structure this real-world data and associate it with existing structured data (in databases), we cannot get a comprehensive picture of the true interactions between individuals and/or individual transactions. (Side note: the movie Eagle Eye depicted this, but in a somewhat negative manner. Also refer to TIA, the Total Information Awareness program by the DOD.)

+ Refer to Data Integration Flows for Business Intelligence by Umeshwar Dayal et al., 2009, ACM.

So how do we create a structure for unstructured data? We create an ontology, or map of entities. For example, we may extract nouns, or we may extract people, places, names of things, and action items from unstructured data, then create a map or hierarchy of these entities. The forming of this structure should be guided by the problem we are trying to solve. I think this is what the speaker implied when he mentioned too much vs. too little tagging: if we tag or map everything in the pile of text, we may lose the forest for the trees. In essence, the speaker implied that it's tricky to do unstructured data analysis through traditional means. Apparently, Palantir developed tricks and techniques to match structured and unstructured data; it is not clear whether these are proprietary.
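
To make the idea of mapping unstructured text onto structured data concrete, here is a minimal sketch that pulls two entity types (email addresses and dollar amounts) out of free text with regular expressions and links them to rows in a structured table. The regular expressions, the sample note, and the `customers` table are assumptions made for illustration. Real entity extraction is considerably more involved, and nothing here reflects Palantir's techniques.

```python
import re

# Deliberately narrow patterns: only the entities the problem needs are tagged.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AMOUNT_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

def extract_entities(text):
    """Extract a small, problem-driven set of entities from free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }

def link_to_customers(entities, customers):
    """Match extracted emails against a structured customer table keyed by email."""
    return [customers[e] for e in entities["emails"] if e in customers]

# Hypothetical structured data (e.g. rows from a customer database).
customers = {
    "pat@example.com": {"customer_id": 17, "name": "Pat"},
}
# Hypothetical unstructured data (e.g. a support note).
note = "Pat (pat@example.com) disputed a charge of $1,250.00 on Monday."

entities = extract_entities(note)
print(entities)
print(link_to_customers(entities, customers))
```

Tagging only emails and amounts is a choice driven by the (assumed) fraud problem. Tagging every noun in the note would produce far more entities without making the link to the customer table any clearer.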

Data integration between systems, a CRM example: this shows integration of data between structured systems. The point is that bringing disparate data into one single system takes time and effort to streamline the various data items.
[Figure: Data integration between systems - A CRM example]
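
The "multiple ways of saying the same thing" problem can be illustrated with a minimal sketch: two systems describe the same customer with different field names and formats, and a mapping step brings both into one canonical schema before merging. The system names (`crm`, `billing`), field names, and records are hypothetical, and the merge assumes every record carries an email address.

```python
# Per-system mapping from source field name to canonical field name (assumed).
FIELD_MAPS = {
    "crm": {"cust_email": "email", "full_name": "name", "phone_no": "phone"},
    "billing": {"EmailAddress": "email", "CustomerName": "name", "Tel": "phone"},
}

def normalize(record, system):
    """Rename fields to the canonical schema and standardize their values."""
    out = {}
    for src, dst in FIELD_MAPS[system].items():
        if src in record:
            out[dst] = record[src]
    if "email" in out:
        out["email"] = out["email"].strip().lower()
    if "phone" in out:
        out["phone"] = "".join(ch for ch in out["phone"] if ch.isdigit())
    if "name" in out:
        out["name"] = " ".join(out["name"].split())  # collapse extra whitespace
    return out

def merge_by_email(records):
    """Merge normalized records that share an email; earlier values win."""
    merged = {}
    for rec in records:
        target = merged.setdefault(rec["email"], {})
        for key, value in rec.items():
            target.setdefault(key, value)
    return merged

crm_row = {"cust_email": "Pat@Example.com ", "full_name": "Pat  Lee"}
billing_row = {"EmailAddress": "pat@example.com", "Tel": "(510) 555-0100"}

unified = merge_by_email([normalize(crm_row, "crm"), normalize(billing_row, "billing")])
print(unified)
```

Even in this toy case, most of the code is mapping and cleanup rather than merging, which matches the point that streamlining data items across systems takes time and effort.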

Big data processing is characterized by volume, velocity, variety and variability
[Figure: Big data]

Refer to this Master data management doc... to understand intricacies and utility of data management across an enterprise.

Search and Discovery

Imagine a small bookshelf holding about 50 books: it is easy to look for a particular book, or to find patterns of organization (or the lack of it). Now imagine the Berkeley central library, holding hundreds of thousands of books. It's not easy to look for a book unless you know something about it: title, author, subject, publication year, or publisher. Without these you could spend years going through the shelves serially, or, if you are lucky, you will find it in a day. Now magnify that problem a thousand or a million times in a big petabyte database. If you want to find any meaningful information or patterns in the data, you first need to identify what you are trying to figure out or solve: a basic definition of the problem. Just as with title, subject, and author, we have to start with a problem to solve, then understand the components of that problem and map those components to data subject areas. Note that this organization of data into subject areas is accomplished during data integration.

Once you identify the data subject areas, it is easy to drill down into the individual data items that can help solve the problem we defined. Back to our library example: once we know the title and subject, we can narrow down the area in which to search for the book. Taking this to the automation level, once we know the data subject area, we can get to the individual fields.
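
The library analogy can be sketched as a tiny inverted index: records are first organized by subject area (as during data integration), and a query stated in terms of the problem's key concepts narrows the search to the relevant areas before drilling down to individual items. The subject areas, keywords, and records below are made up for illustration.

```python
from collections import defaultdict

def build_index(records):
    """Map each keyword to the set of (subject_area, record_id) pairs containing it."""
    index = defaultdict(set)
    for rec in records:
        for word in rec["text"].lower().split():
            index[word].add((rec["area"], rec["id"]))
    return index

def search(index, concepts, area=None):
    """Return ids of records matching all concepts, optionally within one subject area."""
    hits = None
    for concept in concepts:
        found = index.get(concept.lower(), set())
        hits = found if hits is None else hits & found
    hits = hits or set()
    return sorted(rid for a, rid in hits if area is None or a == area)

# Hypothetical records, already tagged with a subject area.
records = [
    {"id": 1, "area": "payments", "text": "chargeback on stolen card"},
    {"id": 2, "area": "payments", "text": "refund issued to card"},
    {"id": 3, "area": "accounts", "text": "new account opened with stolen identity"},
]
index = build_index(records)

print(search(index, ["stolen"]))                   # every area
print(search(index, ["stolen"], area="payments"))  # one subject area only
print(search(index, ["stolen", "card"]))           # all concepts must match
```

The lookup touches only the index entries for the query's concepts instead of scanning every record, which is the automated counterpart of going straight to the right shelf.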

Per the speaker, Palantir does this search and discovery far better than existing systems (? double check this).

In any case, there are existing methods to make meaning out of big data. Most of these methods were used in astronomy and have been ported to tackle the problem of big data in the commercial realm. Loosely, these are classified as supervised vs. unsupervised learning, and the topic is data mining; refer to this 101.

The point is, once we build a massive data store of both structured and un/semi-structured data, how do we find nuggets of information or knowledge? This problem is tackled by data mining or machine learning; refer to chapter 14 (Unsupervised Learning) of "The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics)" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. But even if we believe in machines finding patterns (or ghosts) in a pile of data, it is always a human who should make the decision, not a machine. At least for the foreseeable future, machine-learned results are not to be blindly trusted (well, at least until we reach the singularity; refer to Ray Kurzweil's book The Singularity Is Near).
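
As a small illustration of unsupervised learning, the sketch below runs one-dimensional k-means on made-up transaction amounts to separate typical payments from unusually large ones without any labels. The data, the choice of two clusters, and the starting centers are assumptions. In keeping with the point above, the output is a suggestion for a human to review, not a decision.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain 1-D k-means: assign each value to the nearest center, then recompute centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep an empty cluster's center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical transaction amounts in dollars.
amounts = [12.0, 15.0, 9.0, 14.0, 980.0, 1020.0, 11.0]
centers, clusters = kmeans_1d(amounts, centers=[min(amounts), max(amounts)])
print(centers)
print(clusters)
```

No one told the algorithm which payments were "large"; the grouping comes from the data alone, which is what distinguishes this from supervised learning, where labeled examples would be needed.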

Knowledge Management

Jeremy spoke about the importance of knowledge management, especially in knowledge driven industries.

Side note: unlike in mechanized industries, where decades of experience got codified into processes and/or standards (technological, design, or organizational), new information-driven industries (financial powerhouses, software producers, ?) are characterized by knowledge concentrated in specific individuals or groups of individuals; i.e., this knowledge is not codified yet. Any knowledge that can be codified, be it an algorithm or a technology process, can be cut into pieces and shipped to different places for efficient and/or parallel processing. (I think it is no coincidence that the automobile industry got outsourced as soon as we could codify the car-manufacturing processes, and of course the collapse of the Bretton Woods system hastened this trend.)

The speaker mentioned that the key to knowledge management lies in 'access controls': who can view the information, and who can edit or update it. In this context Jeremy mentioned 2 things:
1. Where does the expertise go after people leave or quit?
2. How to broadcast knowledge to the right individuals at the right time.

In this context, Jeremy mentioned:
1. the need to control information flow @ granular level
2. the need to ensure information privacy

Once the granularity of information control is known and access controls are in place, it becomes easy to share information across the organization, which in turn helps collaboration between teams and increases productivity.
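
A minimal sketch of granular access control, assuming a simple model in which each piece of information carries its own list of who may view it and who may edit it. The `Item` class, the role names, and the sample data are hypothetical and are not a description of Palantir's access-control model.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """A unit of information with its own view/edit permissions."""
    content: str
    viewers: set = field(default_factory=set)
    editors: set = field(default_factory=set)

    def can_view(self, user):
        # Editors can always view what they are allowed to edit.
        return user in self.viewers or user in self.editors

    def can_edit(self, user):
        return user in self.editors

def visible_items(items, user):
    """Return only the content this user is allowed to see."""
    return [item.content for item in items if item.can_view(user)]

# Hypothetical case file: permissions are set per item, not per file.
case_file = [
    Item("Summary of flagged transactions", viewers={"analyst", "manager"}, editors={"analyst"}),
    Item("Customer identity details", viewers={"manager"}),
    Item("Public fraud-prevention tips", viewers={"analyst", "manager", "intern"}),
]

print(visible_items(case_file, "intern"))
print(visible_items(case_file, "analyst"))
print(case_file[0].can_edit("manager"))
```

Because permissions live on each item, the same case file can be shared with everyone while each person sees only their slice, which is what makes sharing across teams practical without giving up privacy.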

Now, to look at components of KM...
source: Knowledge Management of Internal Best Practices # Best Practices Benchmarking Report (2000)
[Figure: Knowledge Management concepts]

Refer to this doc for Best Practices in KM, aka knowledge sharing.


Speaker mentioned that collaboration can be at 2 levels:
1. At individual level
2. At project level

The premise of collaboration is age-old; as the saying goes, two heads are better than one. It's no coincidence that most of the fundamental breakthroughs in science occurred through collaboration or, as Newton said, by 'standing on the shoulders of giants'. In the business world that standing part may not go down well, but it's a fact that a small team can work wonders; Skunk Works comes to mind.

With respect to knowledge management and larger data-related problems, collaboration between teams or individuals becomes critical. Unless there is a system in place to pass knowledge, expertise, lessons learned, and tactics to the next team or generation, the new team is likely to spend as much time as the preceding one solving the same old problems, when that time could be saved and the new team could build the next best thing ('standing on the shoulders of giants'!). The question is how to build a system that passes on knowledge, context, expertise, and tactics. Knowledge is highly contextual and hence cannot be codified, so some companies have implemented mentorship programs. In software development, we see pair programming, where a junior person is paired with a senior one and gets to see first-hand how the senior makes design choices or coding decisions.

An interesting thing happened along the way towards collaboration: companies started realizing that there are millions of smart folks 'outside' the company. So imagine this: P&G has ~50K R&D personnel, but outside of P&G, in the same subject areas, there are >2M scientists. If only P&G could 'collaborate' with these folks to solve its problems fast enough. This hinges on the ability of P&G teams to chunk up problems into neat packages; if they can do that, it would be easy to bring that external knowledge into P&G. This was the thought that led to creation of:

A similar thing happened with the Netflix Prize: for a paltry sum of $1M (chump change for Netflix), it got the best minds in data science to compete to improve its algorithm.

Prof. Weigend wrapped the discussion by eliciting commentary from Jeremy on the following points:

Eight Rules for Data
  1. Collect everything
  2. Give data to get data
  3. Start with the problem, not with the data
  4. Focus on metrics that matter to your customers
  5. Drop irrelevant constraints
  6. Embrace transparency
  7. Make it trivially easy for people to connect, contribute, and collaborate
  8. Let people do what people are good at, and computers do what computers are good at