Shutdown morality

October 15th, 2013 by Brad

The shutdown impasse has led to a great deal of speculation about the underlying causes of governmental dysfunction in the United States.

It has become clear that the parties have separated on moral issues. As politics has become more about engaging the mass public (as opposed to an earlier era of patronage politics), the parties have adopted increasingly divergent moral positions.

Moral diversity within parties

But the picture isn’t as clear cut as liberals on one side and conservatives on the other. While most metrics do show a better sorting of members of Congress and the mass public into parties by ideology (conservatives are much more likely to also be Republicans, and liberals much more likely to be Democrats, than has been the case historically), a great deal of heterogeneity still exists within each party.

The shutdown fight is an interesting case of intraparty disagreement. It seems clear that Speaker Boehner would have preferred to keep the budget negotiations on the subject of entitlement reform and other long-term Republican goals. However, conflicts within the Republican Party have put him into a difficult position. The (very) conservative wing of his party has made the current imbroglio about health care.

In other work, I have generated estimates of the moral foundations at the Congressional District level. Does variation in the moral foundations of Republicans’ constituents help us to explain the current conflict?

The poison pill

One way to gauge support for the shutdown is to examine the cosponsors of Rep. Tom Graves’s Continuing Resolution (CR). Graves included language in his version of the budget resolution that would ‘defund’ the Affordable Care Act – a non-starter for the administration and Senate Democrats, and by all accounts contrary to Boehner’s wishes. Which moral foundations correlate with support for Graves’s CR among Republicans?

One hundred and thirty-six House Republicans (nearly 60 percent) cosponsored Graves’s CR.

It turns out that even after controlling for Obama’s share of the vote in their districts (which explains a very significant portion of the variation), Republicans who represent districts that are particularly low on the Care/Harm foundation are most likely to support the CR.

The relationship (holding all of the other foundations and presidential support in the district at their mean values) [1] looks like this:

The plot shows the predicted probability of supporting the CR across the range of scores for the Harm/Care foundation in Republican districts. The points at the top and bottom of the plot show the actual positions of each district. The points at the top show the districts that cosponsored the CR and the points at the bottom show those that did not cosponsor it.

According to my model, a legislator who represents a district at the mean level of each foundation, with Obama’s average vote share across Republican districts, has a 60 percent probability of supporting the CR. An increase of one standard deviation in the Care/Harm foundation lowers that probability by a little over 10 percentage points, making it roughly dead even. A decrease of one standard deviation in the Care/Harm foundation raises the probability of cosponsoring the CR by about the same amount, to roughly 70 percent.

Given two districts that are equally Republican (gave the same proportion of the vote to Obama in 2012), variation in the Care/Harm foundation explains some of the remaining variance in position taking on the CR.

We can think about this finding in two ways. First, it might be that Republicans who represent individuals who place more emphasis on the morality of care and harm are less likely to go on the record as wanting to take away health coverage from those who will be added under the president’s new initiative. Perhaps their constituents would punish them for perceived cold-heartedness.

On the other hand, it could be that the effect is mainly driven by the fact that many Republicans represent districts whose constituents do not place particularly high emphasis on concerns about care and harm to individuals. Their mix of moral concerns tends to direct their attention toward the (perceived or actual) loss of individual freedom that is implied by compelling individuals to receive health coverage. In the weighing of costs and benefits of the new proposal, there is obviously a deep moral divide between the two sides.

Given that this is a single action at a single point in time, it is difficult to say definitively one way or the other. As with most things, the correct answer is surely some mix of the two explanations (along with a host of other unaccounted for factors).

Discussion

The above analysis suggests that some of the conflict within the Republican Party can be explained by variation in the moral concerns of constituents. There is nothing inevitable or foreordained about the relationship outlined above, but it does perhaps tell us something about a way forward.

For the most strident parts of the opposition, arguments focusing on the harm being done by the shutdown aren’t likely to be very persuasive. Interestingly, there is some suggestion in my analysis that the Loyalty/Betrayal foundation is negatively associated with support for the CR (although the effect did not reach conventional levels of statistical significance). It is possible that framing the debate in terms of patriotism (our national obligation to pay our bills) or even good-old partisanship (the need of the party to stick together behind Speaker Boehner) would be more successful.


[1] I estimated a probit regression with cosponsorship of the CR as the dependent variable. The model included the five moral foundations and Obama’s share of the vote in the 2012 election as independent variables.
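For readers curious about the mechanics, here is a back-of-the-envelope check of the probabilities reported above using the probit link from this model. The baseline probability comes from the text; the Care/Harm coefficient is a hypothetical value chosen to reproduce the reported swing, not an estimate from the actual data.

```python
# Sketch: how a probit model turns a latent index into the probabilities
# reported in the post. The -0.25 coefficient per standard deviation of
# Care/Harm is illustrative only.
from math import erf, sqrt

def probit_cdf(z: float) -> float:
    """Standard normal CDF, the inverse link of a probit model."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

baseline_index = 0.2533   # Phi(0.2533) ~ 0.60, the reported baseline
beta_care = -0.25         # hypothetical effect per 1 SD of Care/Harm

for sd_shift in (-1, 0, +1):
    p = probit_cdf(baseline_index + beta_care * sd_shift)
    print(f"Care/Harm {sd_shift:+d} SD -> P(cosponsor) = {p:.2f}")
```

With these numbers, one standard deviation down in Care/Harm gives roughly 0.69, the mean gives 0.60, and one standard deviation up gives roughly 0.50, matching the pattern described in the text.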

Posted in Uncategorized | 6 Comments »

Progress on Debt Ceiling shows us how Moderates can temper the Dark Side of Moral Conviction

October 14th, 2013 by Ravi Iyer

Reposted from this post on the Civil Politics Blog

As I write this, reports indicate that moderate Republicans (Susan Collins) and Democrats (Joe Manchin) appear to be spearheading bipartisan talks to avoid the economic consequences of a debt default and also end the partial government shutdown.  This is in marked contrast to partisans on the left who are willing to endure some economic hardship to regain political power and partisans on the right who are willing to endure some economic hardship to achieve policy goals.

Strong partisans tend to have strong moral convictions, which can certainly lead to pro-social behavior in many cases.  One principle of practicing political civility is to try to accept the genuine good intentions of others, and I have little doubt that those on the left and right have good intentions for the country.  Yet psychological research on moral conviction shows how it is precisely those individuals with the strongest moral desires who are often willing to overlook the consequences of their actions (e.g. shutting down the government and threatening to breach the debt ceiling) in service of their goals.  

This review paper by Linda Skitka and Elizabeth Mullen provides a nice overview of this research:  

Although moral mandates may sometimes lead people to engage in prosocial behaviors, they can also lead people to disregard procedural safeguards. This article briefly reviews research that indicates that people become very unconcerned with how moral mandates are achieved, so long as they are achieved. In short, we find that commitments to procedural safeguards that generally protect civil society become psychologically eroded when people are pursuing a morally mandated end. Understanding the “dark side” of moral conviction may provide some insight into the motivational underpinnings of engaging in extreme acts like terrorism, as well as people’s willingness to forego civil liberties in their pursuit of those who do.


It is precisely for these reasons that partisan gerrymandering, which makes politicians accountable to the extremes who vote in primaries as opposed to moderates, threatens to lead to more future crises.  If moderates like Olympia Snowe leave and their influence is replaced by hardliners like Ted Cruz, we are all bound to suffer the consequences of their moral conviction.

- Ravi Iyer


Posted in civil politics | No Comments »

Values Based Matching in the Labor Market

September 23rd, 2013 by Ravi Iyer

Recently, I received news that Reid Hoffman, founder of LinkedIn, has decided to indirectly support some of the work I do with Jonathan Haidt, because he is a fan of Jon’s work.  The news got me thinking a bit about the confluence of moral psychology and what LinkedIn does.  As Ranker’s Principal Data Scientist, I will often find myself at data conferences with members of LinkedIn’s data team, which is one of the most visible and productive, given the wealth of data they have to mine.  It may seem that what they do and what moral psychologists do are separate worlds, but as I argued in my 2012 SXSW talk, there is likely to be a convergence.  Big data is merely a tool that inevitably will be used to answer the questions we care about, and for those of us with first world problems where crime is going down, war is less of an issue, and obesity is a bigger problem than hunger, the questions we increasingly ask tend to be more existential.  It is only natural that we use the tools of the age (data) to answer the questions of the age, like how we can live a meaningful life.

Unlike some who are more dogmatic in their belief, I don’t believe that data can tell you what a meaningful life is or how you should live.  In short, moral psychology cannot tell you what ought to be…but it can tell you what is.  For example, in a partnership with Zenzi Communications, we have been working on describing the narratives that people of different value types tend to resonate with.  I can’t tell you whether action or comedy movies are better, but I can provide a probabilistic fit between a person’s value orientation and the kinds of stories that they are likely to enjoy.

Companies and prospective employees are looking for such fits as well.  Zappos wants people whose values tend toward the unconventional.  LinkedIn wants people who will put their members first.  In moral psychology terms, Zappos is looking for individuals who are high on traits like openness to experience and stimulation, while LinkedIn might be looking for individuals who are more self-transcendent in terms of putting others (their members) first.

These values are measurable both with self-characterizations and endorsements, just like LinkedIn skills, but also through more subtle means such as music tastes or use of certain words.  Further, following on work by Shalom Schwartz, values can be organized as oppositional (see above graph using Zenzi’s value typology which is based on Schwartz’s work), so that self-presentation effects (e.g. “I’m high on all values!”) can be mitigated.  In this way, companies and employees may potentially be able to find employment matches that are not just a job, or even a career, but are closer to a calling, leading to happier and more productive employees.
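As a rough illustration of how such self-presentation effects can be mitigated, here is a minimal sketch of within-person centering (sometimes called ipsatization), which keeps only a respondent’s relative priorities. The value labels and ratings are invented, and Zenzi’s actual typology is surely more sophisticated.

```python
# Within-person centering: subtract each respondent's own mean rating so
# that "I'm high on all values!" response styles wash out and only the
# relative ordering of values remains.

def ipsatize(ratings: dict) -> dict:
    """Subtract the respondent's own mean from each value score."""
    mean = sum(ratings.values()) / len(ratings)
    return {value: score - mean for value, score in ratings.items()}

# A respondent who rates everything highly...
inflated = {"openness": 6.5, "security": 6.0, "benevolence": 7.0, "power": 6.5}
print(ipsatize(inflated))
# ...ends up with the same relative profile as a more restrained respondent.
restrained = {"openness": 3.5, "security": 3.0, "benevolence": 4.0, "power": 3.5}
print(ipsatize(restrained))
```

Both respondents come out with identical centered profiles, which is the point: the oppositional structure of values, not their absolute level, is what carries the signal.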

We get a lot of resumes at Ranker.  They honestly all start to blend together at some point.  The employment market needs more nuance, more subtlety, and more data to help create better matches.  I don’t know if LinkedIn or some other organization will lead the charge, but I’m confident that this type of values-based matching will indeed be part of the future of human resources recruiting, and it already is part of many forward thinking organizations’ processes.  Just as the internet and information technology have made matching in other domains (e.g. classified ads or home buying) more efficient, values-based matching in the labor market is bound to be a problem tackled by data science teams at online companies in the near future.

- Ravi Iyer

Posted in big data, business of psychology, consumer psychology, data science, jonathan haidt, linkedin, moral psychology, openness to experience, ranker, technology, technology business, yourmorals.org, zappos | No Comments »

Rankings are the Future of Mobile Search

July 16th, 2013 by Ravi Iyer

Did you know that Ranker is one of the top 100 web destinations for mobile per Quantcast, ahead of household names like The Onion and People magazine?  We are ranked #520 in the non-mobile world.  Why do we do better with mobile users than with people using a desktop computer?  I’ve made this argument for a while, but I’m hardly an authority, so I was heartened to see Google making a similar argument.

This embrace of mobile computing impacts search behavior in a number of important ways.

First, it makes the process of refining search queries much more tiresome. …While refining queries is never a great user experience, on a mobile device (and particularly on a mobile phone) it is especially onerous.  This has provided the search engines with a compelling incentive to ensure that the right search results are delivered to users on the first go, freeing them of laborious refinements.

Second, the process of navigating to web pages (is) a royal pain on a hand-held mobile device.

This situation provides a compelling incentive for the search engines to circumvent additional web page visits altogether, and instead present answers to queries – especially straightforward informational queries – directly in the search results.  While many in the search marketing field have suggested that the search engines have increasingly introduced direct answers in the search results to rob publishers of clicks, there’s more than a trivial case to be made that this is in the best interest of mobile users.  Is it really a good thing to compel an iPhone user to browse to a web page – which may or may not be optimized for mobile – and wait for it to load in order to learn the height of the Eiffel Tower?

As a result, if you ask your mobile phone for the height of a famous building (Taipei 101 in the below case), it doesn’t direct you to a web page.  Instead it answers the question itself.

That’s great for a question that has a single answer, but an increasing number of searches are not for objective facts with a single answer, but rather for subjective opinions where a ranked list is the best result.  Consider the below chart showing the increase in searches for the term “best”.  A similar pattern can be found for most any adjective.

So if consumers are increasingly doing searches on mobile phones, requiring a concise list of potential answers to questions with more than one answer, they naturally are going to end up at sites which have ranked lists…like Ranker. As such, a lot of Ranker’s future growth is likely to parallel the growth of mobile and the growth of searches for opinion based questions.

- Ravi Iyer

Posted in data science, ranker | No Comments »

All Scientific Research Should Be Crowdsourced

June 11th, 2013 by Ravi Iyer

Very few days go by without a new article describing the limits of published scientific research.  The headline cases are about scientists who plagiarize or completely fabricate data.  Yet, in my experience, most scientists are actually quite ethical, meticulous, hard-working, and genuinely concerned with finding the truth.  Still, non-scientists would likely be surprised to know that a large number of scientific studies are actually false.  An Amgen study found that 46 out of 53 studies with ‘landmark’ findings could not be replicated.  A team at Bayer found a slightly more optimistic picture, where 43 out of 65 studies revealed inconsistencies when tested independently.  Scientific journals continue to accept articles based on the novelty and projected impact of the submission, yet simulations illustrate how this bias toward publishing novel results likely leads to an environment where most published results are actually false.  My home discipline of psychology is currently doing some soul searching; it is a relatively open secret that many results are difficult to reproduce, and a systematic reproducibility project is now under way.

Crowdsourcing is, and always has been, the solution.  Indeed, the phrase at the bottom of Google Scholar, “standing on the shoulders of giants”, acknowledges that science has always been about crowdsourcing, as every scholar is collaborating with the scholars before them.  Findings are not produced in a vacuum; they build upon (or challenge) previous findings.  Replication by others, which effectively crowdsources the verification of results, is at the heart of the scientific method.  It is perhaps a sign of the narcissism of our age that scientists feel compelled to believe that they discover things largely independently, and to attack when their findings are challenged.  Yet a willingness to be wrong is essential to learning, just as we can’t learn to walk without falling or learn about relationships without heartbreak.  When science becomes more about ego, career, and grant money, it naturally becomes less accurate.  Insisting that findings be crowdsourced solves this.  No single study, paper, or research group can prove anything by itself.

Crowdsourcing is not simply averaging the opinions of the masses, whatever those arguing against that straw man would have you believe.  Mathematically, crowdsourcing is about reducing the influence of sources of error, and there is a great deal of academic research on this topic.  A good crowdsourcing algorithm does not weight all inputs equally, but instead seeks to identify clustered sources of error, which explains why aggregating across people with diverse personalities, perspectives, or job functions produces better results.  Inputs need to carry some signal relative to noise, and their errors need to be uncorrelated.  The unfortunate assumption in most research is that error is uncorrelated statistical noise that can be dealt with using statistical tests.  Yet error also arises from the unconscious biases of researchers, the sheer number of researchers chasing novel findings, the degrees of freedom a researcher has in trying to prove a hypothesis, the non-randomness of sampling, and the volume of statistical tests a researcher can choose from.  Given all these other sources of error, it is no wonder that many findings are false.  A good crowdsourcing algorithm would weight results such that a finding counts as true only when shown by multiple researchers using multiple methods, multiple samples, multiple statistical tests, and multiple paradigms.  This requires crowdsourcing, as no single person can do all of this, and even if they could, they would still represent a single source of error.
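As a toy illustration of the weighting idea (not anyone’s actual algorithm), here is a sketch in which studies sharing a likely source of correlated error – here, the lab that produced them – get one collective vote rather than one vote per study. The labs and effect sizes are invented.

```python
# Clustered aggregation: many correlated reports from one source count as
# a single input, so one prolific lab cannot dominate the estimate.
from collections import defaultdict

def clustered_mean(estimates):
    """estimates: list of (cluster_id, value) pairs. Each cluster gets
    equal weight regardless of how many estimates it contributed."""
    clusters = defaultdict(list)
    for cluster_id, value in estimates:
        clusters[cluster_id].append(value)
    cluster_means = [sum(v) / len(v) for v in clusters.values()]
    return sum(cluster_means) / len(cluster_means)

studies = [
    ("lab_a", 0.80), ("lab_a", 0.78), ("lab_a", 0.82),  # one enthusiastic lab
    ("lab_b", 0.10),                                     # independent replication
    ("lab_c", 0.12),                                     # another replication
]
print(sum(v for _, v in studies) / len(studies))  # naive mean, pulled toward lab_a
print(clustered_mean(studies))                    # one vote per lab
```

The naive average is dragged toward the lab that published three times; the clustered average treats the three independent sources symmetrically, which is the behavior the paragraph above argues for.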

Technology enables crowdsourcing to be conducted far more efficiently, as demonstrated by successful science crowdsourcing projects like GalaxyZoo, FoldIt, SETI@home, and psychology’s reproducibility project.  Trends like citizen science, the quantified self, open access publishing, and interdisciplinarity improve the diversity of perspectives, which mathematically improves the ability to find truth.  Every meta-analysis result, and Nate Silver’s success in aggregating polls in the last election, takes advantage of the mathematical principles that underlie crowdsourcing, specifically that aggregating across independent sources of error brings us closer to the truth.  In our daily lives, we all crowdsource knowledge that we are uncertain about, looking for confirmation from multiple independent sources when we are skeptical.  This same skepticism serves scientists well, and scientists should embrace being wrong, confident that the broader truth will be revealed when all data are aggregated intelligently and all perspectives are valued.  Crowdsourcing is not some new technique that threatens to fundamentally change scientific research.  Rather, it is an extension of the collective effort of knowledge aggregation that is at the heart of science, and scientists should embrace it as such.

- Ravi Iyer

Posted in business of psychology, crowdsourcing, data science, replications of other studies, reproducibility, scientific method, standing on the shoulders of giants, technology, yourmorals.org | 5 Comments »

Personality Types in Business: Conscientious CEOs & Open Technologists

May 7th, 2013 by Ravi Iyer

Part of my job at Ranker is to talk to other companies about our data.  While people often talk about how “big data” is revolutionizing everything, the reality of the data marketplace is that it still largely revolves around sales, marketing, and advertising.  Huge infrastructures exist to make sure that the most optimal ad for the right product gets to the right person, leveraging as much data as possible.  For example, I recently presented at a data conference at the Westin St. Francis in San Francisco, which meant that I spent some time on their website.  For the past few weeks, long after the conference, I’ve been getting ads specifically for the Westin St. Francis on various websites.  At some level, this is an impressive use of data, but at another level, it’s a failure, as I’m no longer in the market for a hotel room.  The data to solve this problem exists: someone could have tracked my visit to the conference website, noted the date of the conference, and better understood my intent in visiting the Westin’s site.  However, that level of analysis doesn’t scale for an ad that costs pennies, and so nobody does this kind of behavioral targeting.

I bring up this story because I believe this illustrates a difference between how people who think of themselves as businesspeople and people who think of themselves as technologists often think.  When talking about Ranker data, I often see this dichotomy.  People who are more traditionally business minded want a clear business reason to use data, while people who think of themselves as technologists seem more open to trying to envision a world where data does all sorts of neat things that data should be used for.  For example, I recently graphed opinions about beer, illustrating that Miller Lite drinkers were closer to Guinness drinkers than to Chimay drinkers.  As a technologist, I’m certain that a world will soon exist where bartenders can use data about me and others like me (e.g. the beer graph), to recommend a beer.  I don’t worry as much about the immediate path from the conception of such data to monetization.  I know that the beer graph should exist and I’m happy to help contribute to it, confident of my vision of the future.

This division between people who think like businesspeople and people who think like technologists is important for anyone who does business development or business to business sales, especially for those of us in the technology world where the lines are often blurry.  Mark Zuckerberg is a CEO, but clearly he thinks like a technologist.  My guess is that a lot of the CTOs of big companies actually think more like businesspeople than technologists.  If I were trying to sell Mark Zuckerberg on something, I would try to sell him on how whatever I was offering could make a huge difference to something he cared about.  I would sell the dream.  But if I were selling a more traditional businessperson, I would try to sell the benefits versus the costs.  I would have a detailed plan and sell the details.

I actually have a bit of data from YourMorals.org to support this assertion.  We have started collecting data on visitors’ professions and below I compare businesspeople to technologists on two of the Big Five personality dimensions that are said to underlie much of personality: Conscientiousness and Openness to Experience.  As you can see, businesspeople are more conscientious (detail oriented, fastidious, responsible), while technologists score higher on openness which is indicative of enjoying exploring new ideas and thinking of new possibilities.
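For those who like to see the comparison made concrete, here is a minimal sketch of the kind of group comparison described above: group means plus a standardized effect size (Cohen’s d) on one Big Five dimension. The scores below are fabricated stand-ins, not the actual YourMorals survey data.

```python
# Comparing two professions on a personality dimension with a
# standardized mean difference (Cohen's d, pooled standard deviation).
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Standardized mean difference using a pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical conscientiousness scores (1-5 scale) for illustration.
business = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8]
technologists = [3.6, 3.4, 3.9, 3.5, 3.7, 3.3]

print(round(mean(business), 2), round(mean(technologists), 2))
print(round(cohens_d(business, technologists), 2))
```

Effect sizes like this are a more interpretable way to report group differences than raw means alone, since they express the gap in units of the groups’ own variability.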

Personality Types in Business

The reality is that every business needs a balance between those who are detail oriented and precise (Conscientious) and those who think about a vision for the future (Openness to Experience).  Often, technologists who start a company will eventually hire professional businesspeople who provide this balance (e.g. Sheryl Sandberg or Eric Schmidt).  Clearly, the best sales pitch will be both detailed and forward thinking.  However, if you’re talking to someone and have limited time and attention, considering whether you are speaking to someone who is more of a businessperson or more of a technologist may give you better insight into how to frame your pitch.

- Ravi Iyer

ps. Crossposted on Zenzi Communications’ blog here, which is using a data-driven approach to improving communications strategies.

Posted in big data, conscientiousness, consumer psychology, data science, marketing, openness to experience, ranker, technology, technology business, unpublished results, yourmorals.org | 1 Comment »

An Opinion Graph of the World’s Beers

May 1st, 2013 by Ravi Iyer

One of the strengths of Ranker’s data is that we collect such a wide variety of opinions from users that we can put opinions about a wide variety of subjects into a graph format.  Graphs are useful as they let you go beyond the individual relationships between items and see overall patterns.  In anticipation of Cinco de Mayo, I produced the below opinion graph of beers, based on votes on lists such as our Best World Beers list.  Connections in this graph represent significant correlations between sentiment towards connected beers, which vary in terms of strength.  A layout algorithm (ForceAtlas in Gephi) placed beers that were more related closer to each other and beers that had fewer/weaker connections further apart.  I also ran a classification algorithm that clustered beers according to preference and colored the graph according to these clusters.  Click on the below graph to expand it.

Ranker's Beer Opinion Graph
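For the curious, here is a simplified sketch of how an opinion graph like this can be built from raw votes: compute the correlation between each pair of items’ vote vectors and keep an edge wherever sentiment is strongly positively correlated. The vote matrix below is invented, and the real pipeline also involves significance testing, the ForceAtlas layout, and community detection.

```python
# Build opinion-graph edges from a tiny vote matrix
# (+1 upvote, -1 downvote, +1/-1 per voter per beer).

def pearson(x, y):
    """Pearson correlation between two equal-length vote vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

votes = {  # rows: beers, columns: five hypothetical voters
    "Miller Lite": [+1, +1, -1, -1, +1],
    "Coors Light": [+1, +1, -1, -1, +1],
    "Stone IPA":   [-1, -1, +1, +1, -1],
    "Chimay":      [-1, -1, +1, +1, +1],
}

names = list(votes)
edges = [
    (a, b, round(pearson(votes[a], votes[b]), 2))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if pearson(votes[a], votes[b]) > 0.5
]
print(edges)
```

Even in this toy example the structure of the real graph shows up: the two light beers correlate strongly with each other, the craft-style beers correlate with each other, and the two camps are negatively related, so no edge connects them.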

One of the fun things about graphs is that different people will see different patterns.  Among the things I learned from this exercise are:

  • The opposite of light beer, from a taste perspective, isn’t dark beer.  Rather, light beers like Miller Lite are most opposite craft beers like Stone IPA and Chimay.
  • Coors Light is the light beer that is closest to the mainstream cluster.  Stella Artois, Corona, and Heineken are also reasonable bridge beers between the main cluster and the light beer world.
  • The classification algorithm revealed six main taste/opinion clusters, which I would label: Really Light Beers (e.g. Natural Light), Lighter Mainstream Beers (e.g. Blue Moon), Stout Beers (e.g. Guinness), Craft Beers (e.g. Stone IPA), Darker European Beers (e.g. Chimay), and Lighter European Beers (e.g. Leffe Blonde).  The most interesting parts of the classification are the cases on the edge, such as how Newcastle Brown Ale appeals to both Guinness and Heineken drinkers.
  • Seeing beers graphed according to opinions made me wonder if companies consciously position their beers accordingly.  Is Pyramid Hefeweizen successfully appealing to the Sam Adams drinker who wants a bit of European flavor?  Is Anchor Steam supposed to appeal to both the Guinness drinker and the craft beer drinker?  I’m not sure if I know enough about the marketing of beers to know the answer to this, but I’d be curious if beer companies place their beers in the same space that this opinion graph does.

These are just a few observations based on my own limited beer drinking experience.  I tend to be more of a whiskey drinker, and hope more of you will vote on our Best Tasting Whiskey list, so I can graph that next.  I’d love to hear comments about other observations that you might make from this graph.

- Ravi Iyer


Posted in data science, ranker | No Comments »

Ranker Uses Big Data to Rank the World’s 25 Best Film Schools

April 28th, 2013 by Ravi Iyer

NYU, USC, UCLA, Yale, Juilliard, Columbia, and Harvard top the rankings.

Does USC or NYU have a better film school?  “Big data” can provide an answer to this question by linking data about movies and the actors, directors, and producers who have worked on specific movies, to data about universities and the graduates of those universities.  As such, one can use semantic data from sources like Freebase, DBPedia, and IMDB to figure out which schools have produced the most working graduates.  However, what if you cared about the quality of the movies they worked on rather than just the quantity?  Educating a student who went on to work on The Godfather must certainly be worth more than producing a student who received a credit on Gigli.

Leveraging opinion data from Ranker’s Best Movies of All-Time list in addition to widely available semantic data, Ranker recently produced a ranked list of the world’s 25 best film schools, based on credits on movies within the top 500 movies of all-time.  USC produces the most film credits by graduates overall, but when film quality is taken into account, NYU (208 credits) actually produces more credits among the top 500 movies of all-time, compared to USC (186 credits).  UCLA, Yale, Juilliard, Columbia, and Harvard take places 3 through 7 on Ranker’s list.  Several professional schools that focus on the arts also place in the top 25 (e.g. London’s Royal Academy of Dramatic Art), as well as some well-located high schools (New York’s Fiorello H. LaGuardia High School & Beverly Hills High School).
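The counting behind a ranking like this can be sketched in a few lines: join person-to-school and person-to-movie facts, then count only credits on films that appear in a top-movies list. All of the data below is made up to show the mechanics; the real computation runs over Freebase/IMDB-scale semantic data.

```python
# Quality-weighted credit counting: credits only count toward a school
# if the film appears in the top-movies list.
from collections import Counter

attended = {"alice": "NYU", "bob": "USC", "carol": "NYU"}
credits = [("alice", "The Godfather"), ("bob", "Gigli"),
           ("carol", "The Godfather"), ("bob", "Chinatown")]
top_movies = {"The Godfather", "Chinatown"}  # stand-in for the top-500 list

quality_credits = Counter(
    attended[person] for person, movie in credits if movie in top_movies
)
print(quality_credits.most_common())  # -> [('NYU', 2), ('USC', 1)]
```

Note how the Gigli credit is simply filtered out: restricting the count to the opinion-ranked top list is exactly what separates this measure from a raw count of working graduates.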

The World’s Top 25 Film Schools

  1. New York University (208 credits)
  2. University of Southern California (186 credits)
  3. University of California – Los Angeles (165 credits)
  4. Yale University (110 credits)
  5. Juilliard School (106 credits)
  6. Columbia University (100 credits)
  7. Harvard University (90 credits)
  8. Royal Academy of Dramatic Art (86 credits)
  9. Fiorello H. LaGuardia High School of Music & Art (64 credits)
  10. American Academy of Dramatic Arts (51 credits)
  11. London Academy of Music and Dramatic Art (51 credits)
  12. Stanford University (50 credits)
  13. HB Studio (49 credits)
  14. Northwestern University (47 credits)
  15. The Actors Studio (44 credits)
  16. Brown University (43 credits)
  17. University of Texas – Austin (40 credits)
  18. Central School of Speech and Drama (39 credits)
  19. Cornell University (39 credits)
  20. Guildhall School of Music and Drama (38 credits)
  21. University of California – Berkeley (38 credits)
  22. California Institute of the Arts (38 credits)
  23. University of Michigan (37 credits)
  24. Beverly Hills High School (36 credits)
  25. Boston University (35 credits)

“Clearly, there is a huge effect of geography, as prominent New York and Los Angeles based high schools appear to produce more graduates who work on quality films compared to many colleges and universities,” says Ravi Iyer, Ranker’s Principal Data Scientist, a graduate of the University of Southern California.

Ranker is able to combine factual semantic data with an opinion layer because Ranker is powered by a Virtuoso triple store with over 700 million triples of information that are processed into an entertaining list format for users on Ranker’s consumer facing website, Ranker.com.  Each month, over 7 million unique users interact with this data – ranking, listing and voting on various objects – effectively adding a layer of opinion data on top of the factual data from Ranker’s triple store. The result is a continually growing opinion graph that connects factual and opinion data.  As of January 2013, Ranker’s opinion graph included over 30,000 nodes with over 5 million edges connecting these nodes.

- Ravi Iyer

Posted in data science, ranker

Big Data Stocks? Invest in Data, not in Tools.

April 25th, 2013 by Ravi Iyer

I want to invest in “big data stocks”. After all, everyone is saying that big data is the future of health care, education, government, and business, and will literally change the world. As someone who works with data both as an academic at USC and as the principal data scientist at Ranker, I am exactly the type of person likely to make and believe such hyperbolic claims. I recently put money into my IRA and needed to invest it, and since I believe in investing in what I know, I naturally wanted to invest in our data-driven future.

Where should I invest?  If you look around the internet, you’ll find a number of recommendations from places like Forbes or The Street.  The general consensus appears to be to take the “picks and shovels” approach to investing in big data, where you invest in the companies that make the tools that enable people to use data, rather than in the data itself.  I’m writing this post because I think this is absolutely the wrong approach.  I believe in investing in data, not in tools.  Why do I believe that?

- My experience in academia has taught me that simple statistics and tools are often the most reliable. If there is signal to be detected, almost any analysis or tool should be able to find it. Many people turn to more complex statistics only when simple statistics fail to show the relationship they expected. In psychology, researchers are finding that the use of more complex models (e.g. adding covariates) is often an indicator that a study’s results are less likely to be reliable. Given the size of the datasets we typically have in data science, we rarely need special statistical techniques to find relationships: there is so much statistical power that most tools and techniques will give convergent results. Put simply, the tools matter less than the data.

- The most popular tools and techniques are often open source. You can do a lot with R, Python, Gephi, Mahout, etc.

- Yes, there are advantages to using particular distributions of open-source tools (e.g. Hadoop distributions that ship with extra features), but so many companies offer different flavors of products that do essentially the same thing that I can’t see how any one of them will be the next Apple or Google in terms of stock growth. There are no barriers to entry in the tools market. Perhaps a company will be the next Red Hat, which may be a fine business to be in, but I don’t believe that is the revolutionary wave that investors in big data stocks are looking for.
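The statistical-power point above can be demonstrated directly: with a big enough sample, a plain Pearson correlation detects even a weak signal, no sophisticated tooling required. This is a stdlib-only sketch on synthetic data with a known true correlation of about 0.05.

```python
# With n = 100,000, the standard error of r is roughly 1/sqrt(n) ~ 0.003,
# so even a true correlation of ~0.05 is detected easily (z ~ 16).
import math
import random

random.seed(42)

n = 100_000                                   # "big data" sized sample
x = [random.gauss(0, 1) for _ in range(n)]
# y carries a weak signal from x plus unit-variance noise
y = [0.05 * xi + random.gauss(0, 1) for xi in x]

def pearson(a, b):
    """Plain Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    sa = math.sqrt(sum((ai - ma) ** 2 for ai in a))
    sb = math.sqrt(sum((bi - mb) ** 2 for bi in b))
    return cov / (sa * sb)

r = pearson(x, y)
print(round(r, 3))   # close to the true value of ~0.05
```

The same correlation would be invisible in a sample of a few dozen, which is exactly why complex models proliferate in small-sample fields and matter so much less when the data is large.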

So what should you do if you want to invest in big data? Buy stock in companies that have the best, biggest, most unique sets of data and/or the most defensible ways of collecting that data. I invested my IRA money in Facebook, which has the biggest and best dataset of human behavior that has ever existed. I invest my academic time in scalable data collection projects such as YourMorals, BeyondThePurchase, and ExploringMyReligion, confident that they will lead to the most long-term knowledge. And I invest my professional time in Ranker, which has a scalable process for collecting an opinion graph that will be essential for the kinds of intelligent applications big data futurists have been promising us.

Do you want to invest in big data?  Generally, you’ll get better returns if you invest your money, time, and energy in data, rather than in tools.

- Ravi Iyer

Posted in academia, big data, business of psychology, data science, ranker, technology

Predicting Box Office Success a Year in Advance from Ranker Data

April 22nd, 2013 by Ravi Iyer

A number of data scientists have attempted to predict movie box office success from various datasets. For example, researchers at HP Labs used tweets around the release date plus the number of theaters a movie was released in to predict 97.3% of the variance in first-weekend box office revenue. The Hollywood Stock Exchange, which lets participants bet on box office revenues and infers a prediction, explains 96.5% of the variance in opening-weekend revenue. Wikipedia activity predicts 77% of the variance according to a collaboration of European researchers. Ranker runs lists of anticipated movies each year, often more than a year in advance, so the question I wanted to analyze in our data was how predictive Ranker data is of box office success.

However, since the above researchers have already shown that online activity at the time of the opening weekend predicts box office success during that weekend, I wanted to build upon that work and see if Ranker data could predict box office receipts well in advance of opening weekend.  Below is a simple scatterplot of results, showing that Ranker data from the previous year predicts 82% of variance in movie box office revenue for movies released in the next year.

[Scatterplot: Predicting Box Office Success from Ranker Data]

The above graph uses votes cast in 2011 to predict revenues from our Most Anticipated 2012 Films list. While our data is not as predictive as Twitter data collected leading up to opening weekend, the remarkable thing about this result is that most votes (8,200 votes from 1,146 voters) were cast 7-13 months before the actual release date. I look forward to running the same analysis on our Most Anticipated 2013 Films list at the end of this year.
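The analysis behind a "percent of variance explained" figure like the 82% above is just an ordinary least-squares fit and its R². Here is a sketch of that computation; the vote counts and revenues below are invented example points, not Ranker's actual 2011-2012 data.

```python
# Hypothetical sketch: regress box-office revenue on pre-release votes
# and report the share of variance explained (R^2). Data points are made up.

# (anticipation votes cast the prior year, revenue in $M)
data = [(120, 150), (300, 410), (80, 90), (500, 620), (220, 260), (60, 95)]
votes = [v for v, _ in data]
revenue = [r for _, r in data]

def ols_r2(x, y):
    """Least-squares fit y = a + b*x; return R^2 (variance explained)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

r2 = ols_r2(votes, revenue)
print(round(r2, 2))
```

An R² of 0.82, as in the real result, would mean the prior-year vote counts account for 82% of the spread in next-year revenues, with the remaining 18% left to everything the votes don't capture.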

- Ravi Iyer

Posted in data science, ranker