[This is the first in a multi-part series on creating representative samples from convenience sampling data]

Earlier Jon Haidt discussed the “problem” of representativeness of the YourMorals data and concluded that it wasn’t such a problem after all. Convenience samples drawn from the internet can produce reliable data. This is particularly true when we are more interested in taking valid measurements than in painting a representative picture of some underlying population.

But what if we would also like to know something about the underlying population? If we had data that were representative of the country as a whole, we would be able to ask a new set of questions. Does knowing where the states fall in terms of their Moral Foundations tell us anything about voting behavior? We might expect scores on the purity foundation to explain state-level attitudes about gay marriage or the fairness foundation to explain attitudes about tax policy. To answer these kinds of questions, we need representative samples (also see Jesse Graham’s comment in the above link).

In sampling theory, the gold standard is the probability sample. When every individual in the population has a known, nonzero (but not necessarily equal) probability of being selected into the sample, we can construct reliable estimates of the population parameters and, given sufficient sample size, be confident that these estimates are within some distance of the true values in the population. Convenience sampling violates these central assumptions, because the selection probabilities are unknown (but see this discussion of the representation problems in traditional “random” sample polls).
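To make the role of known inclusion probabilities concrete, here is a minimal sketch of an inverse-probability-weighted mean (often called the Hájek estimator). The function name and the toy numbers are made up for illustration; nothing here comes from the YourMorals data.

```python
def weighted_mean(values, inclusion_probs):
    """Inverse-probability-weighted (Hajek) estimate of a population mean.

    Each respondent is weighted by 1 / (probability of being sampled),
    so people who were unlikely to be sampled count for more.
    """
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have equal length")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical toy sample: 1 = identifies as liberal, 0 = does not.
# Liberals were (by assumption) four times as likely to be sampled.
sample_values = [1, 1, 1, 1, 0]
inclusion_probs = [0.4, 0.4, 0.4, 0.4, 0.1]

naive = sum(sample_values) / len(sample_values)
weighted = weighted_mean(sample_values, inclusion_probs)
print(f"naive share: {naive:.2f}, weighted share: {weighted:.2f}")
```

In this toy case the raw share is 0.8, while the weighted share is about 0.5. The correction only works because the inclusion probabilities are known, which is exactly what a convenience sample lacks.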

First, we would like to get a sense of how the YourMorals data stacks up against other population measures. We collected data on several demographic characteristics of individuals in the YourMorals dataset. We can easily compare these against population values collected from the census or other representative samples.

One area where we can clearly see the representation problems in the YourMorals data is self-reported ideology. Considering only U.S. respondents for the time being (as all of the following analyses do), recent national samples put the proportion of people who consider themselves “liberal” at between 18 and 22 percent. In the YourMorals data, this figure is nearly 65 percent.* Given this skew, we should be hesitant to make inferences about the general population from a sample that looks so different.

The figures below show how the YourMorals data compares with the population values across a handful of demographic and attitudinal variables.

Source: Pew Center for the People and the Press, 2001-2008

This figure shows that, even with a significant intercept shift (almost 50 points), the rank ordering of the states stays nearly the same. This is encouraging, as it means we are not drawing the same type of individual from each state. Put differently, knowing the state that an individual resides in tells us something about the probability that he or she identifies as a liberal. What we would not want to see here is a horizontal line (indicating no relationship).
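The two ideas in this comparison, a level shift and a preserved rank ordering, can be checked separately. The sketch below uses only the Python standard library and entirely hypothetical state-level percentages (the five values are invented, not taken from Pew or YourMorals); the helper names are also just for illustration.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# HYPOTHETICAL percent-liberal values for five unnamed states.
population_pct = [15, 18, 20, 23, 27]   # e.g. from a national survey
sample_pct = [60, 64, 63, 70, 75]       # e.g. from a convenience sample

mean_shift = sum(s - p for s, p in zip(sample_pct, population_pct)) / len(sample_pct)
intercept, slope = ols_line(population_pct, sample_pct)
rho = spearman(population_pct, sample_pct)
print(f"mean shift: {mean_shift:.1f} points, slope: {slope:.2f}, rank corr: {rho:.2f}")
```

A large mean shift together with a positive slope and a rank correlation near 1 corresponds to the encouraging pattern described above; a slope near zero would be the horizontal line we do not want to see.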

Source: American Community Survey, 2006-2008

Source: American Community Survey, 2006-2008

With race, it is much the same story as with ideology. For whites, there is a substantial intercept shift (almost 70 points), but states with larger white populations are also proportionally more white in the YourMorals data. The data for African Americans are noisier (there were fewer than 900 in the sample of over 60,000) but show the same pattern. Here there is not a large intercept shift (as we have reached the floor of the data), but we see the same kind of increasing pattern.

Source: American Community Survey, 2006-2008

With respect to education, the data are further afield. The figure shows that the YourMorals sample is significantly more educated than the general population, but it becomes more difficult to draw a convincing trend line through the data. Individuals who came from states with higher levels of education were only marginally more likely to be highly educated themselves.

So where does all of this leave us? It is obvious from the plots that the individuals who self-selected into the YourMorals data look very different from the general population. It would clearly be inappropriate to use the raw data to make inferences about general population parameters (average levels of a particular foundation in a particular state, for example). The sample is much more liberal, highly educated, and white than the general population. But it is not *as* bad as it could be. The worst-case scenario would show a uniformly weird sample across the states. Instead, what we saw in the figures above is a picture that is more-or-less proportionally correct. It is encouraging that the general relationships hold up.

All of this is not to say that we should throw out the analyses presented elsewhere in this blog and in publications based on the YourMorals data. If we condition on ideology (which we saw was particularly skewed) and make statements like “Liberals generally score higher than conservatives on the Harm/Care and Fairness/Reciprocity foundations,” we are probably treading on safe ground.

In the next few posts, I will be revisiting the question of how to construct a representative picture from a convenience sample.

*Beyond the obvious sampling issues, there are a few other problems with directly comparing the measure of ideology in YourMorals with that in nationally representative samples. First, there is a mode difference that could account for some of the discrepancy (although certainly not all or even a very significant portion of it). Another (and more serious) difference between nationally representative samples and the YourMorals data is the choice of a seven-point scale rather than a five-point scale. Five-point scales are used more regularly in telephone samples, with the options being “Very Conservative,” “Conservative,” “Moderate,” “Liberal,” and “Very Liberal.” The YourMorals data includes options for “Slightly liberal” and “Slightly Conservative” as well as “Libertarian” and “other” categories. The 65 percent figure lumps all of the “liberals” together. If you believe that the “slightly liberal” respondents might have self-identified as “Moderate” given fewer options, the proportion turns out to be just over 50 percent.
