Big Data Will Blind You

Not all of us are scientists, but all of us today are consumers of science. And I mean science, not technology. When we want to lose weight, or make more money, or find that perfect someone, we don’t go to gurus, and we don’t go with our guts. We look at the latest studies.

It’s been said that Generation X has a deep need for data. Certainly a lot of people my age long ago lost our last vestiges of idealism, and are most interested in knowing, as pragmatically as possible, exactly what works and what doesn’t. We no longer believe in Dr. Spock’s intuitions or Oprah’s platitudes. We want to see what science says. We’re only interested in practical, proven methods. We haven’t given up trying to explain the world, but we’ve stopped trying to make beautiful, abstract theories workable. In the same vein, companies like Amazon, Google and Facebook are proud to call themselves ‘data-driven’: they make no claim to being led by ‘visionaries’, but act based on rigorous analysis of consumer activity. (Of course, there is a minority of companies, such as Apple, that do claim to be led by visionaries, but these are the exception, and their stock prices are more volatile.)

Part of this zeitgeist is the modern tech industry’s excitement about the possibilities of ‘Big Data’, a rapidly emerging state in which we’ll have so much data on so many people and so many financial transactions that we’ll cross some kind of singularity into perfect knowledge, a threshold beyond which we’ll find new markets, new products, and vast new vistas of profit.

Maybe so. But there’s a big pitfall that comes with Big Data. If you’re given a big pile of facts, you start to imagine that you know more than you did before; that you can just crunch some equations and run some statistics, and the numbers will tell you what to do. You’re tempted to believe that you don’t need to get the ‘how’ and ‘why’ of things, as long as you have enough ‘what’.

A little knowledge is a dangerous thing. But knowledge without understanding is even more dangerous. Here are some examples of why.

Object Lesson: Ejectives and Altitude

It was recently discovered that languages spoken at high altitudes are more likely to have ejectives (a type of consonant which is spoken with a certain forcefulness of air pressure). This isn’t a hard and fast correlation, but it’s strongly statistically significant. Why should this be?

The author of the paper, an anthropologist at the University of Miami, suggests that it’s because of the thin air at high altitudes. It’s claimed that ejective consonants are easier to hear in low pressure areas, and the closure of the glottis during pronunciation assists the speaker in remaining hydrated.

Are you suspicious of this conclusion? You should be. The author has noticed a strong correlation, and taken a record-breaking high-flying leap to a conclusion. He has not gone out and tested hydration levels of various speakers of these languages, nor checked out how well ejectives can be heard versus other sounds.

In fact, ejectives are slightly easier to hear than non-ejectives, but they’re not the easiest consonants to hear. By far the most audible consonants are sibilants, like ‘s’. (You can’t whisper an ‘s’.) Why don’t these languages have more sibilants? As for preventing dehydration, you lose most moisture when you’re pronouncing vowels, and your mouth is wide open; so you’d expect fewer vowels, not more ejectives. After all, when you speak, vowels make up about 80-90% of the length of a word.

Nor has he checked to see if there are other correlations of linguistic features with altitude. Turns out there are! High-altitude languages also tend to have objects before verbs in their sentences, and there is also a relationship between the order of verbs and objects and the order of nouns and adjectives. What are we to make of this, then? Does high altitude encourage some kinds of syntax, perhaps because of its effect on brain oxygenation? Perhaps air-starved brains are more likely to push their verbs to the ends of sentences. Or maybe the speakers of these languages rush to get the all-important predicate nouns out of their mouths before they run out of breath.

So… Many… Correlations…

That’s nonsense, of course. But in this situation, and many others, people are inclined to think that correlation must equal causation. For example, recently researchers at UPenn found (among many other fascinating things) that people who talk about sports on Facebook are less likely to be neurotic. The researchers then went on to speculate that maybe playing sports helps with depression, or something like that. Well, certainly other (more careful) scientists have shown that physical activity helps with depression. But I notice that the methodology of the UPenn study makes no distinction between playing sports and watching sports. Personally, given the choice between neuroticism and watching football, I’ll take my chances with the neuroticism. Better the devil you know… But again, correlation does NOT mean causation.

So if there’s no causation involved (if high altitude doesn’t necessarily cause ejectives, and watching sports doesn’t necessarily make you happy), what’s really going on? What’s causing the correlation? Well, as far as the ejectives go, Mark Liberman at Language Log points out that there are hundreds of linguistic features, and thousands of languages; and in a data-rich environment like that, just by chance, there are bound to be some correlations that don’t have any causal link at all. To understand this intuitively, suppose there are a dozen children on a playground, of which six are girls, and all the girls are in the sandbox. In this case, you might be justified in thinking that boys are avoiding the sandbox for some reason. But if instead there are a hundred children, of which three are wearing black shoes, and two of those are in the sandbox, a causative relationship between black shoes and sandboxes is much less likely. Come back in ten minutes and maybe just the three kids in red shirts will be in the sandbox. There are just too many variations of clothing, and too large a sample set, to draw any conclusions.
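To make the "just by chance" point concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than Liberman's analysis or the paper's data: the function names, the 500 made-up languages, the 200 made-up features, the coin-flip "high altitude" flag, and the simple two-proportion z-test used as the significance check. Every feature is generated independently of altitude, so any "significant" result is a false alarm by construction.

```python
import random
from math import erf, sqrt


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


def count_chance_hits(n_languages=500, n_features=200, alpha=0.05, seed=0):
    """Count features that look 'significantly' tied to altitude by luck.

    All inputs are invented: altitude and every feature are independent
    random draws, so no real relationship exists in this data.
    """
    rng = random.Random(seed)
    # Hypothetical altitude flag: each language is "high" on a coin flip.
    high = [rng.random() < 0.5 for _ in range(n_languages)]
    n_high = sum(high)
    n_low = n_languages - n_high
    if n_high == 0 or n_low == 0:
        return 0  # nothing to compare

    hits = 0
    for _ in range(n_features):
        # A rare feature (15% of languages), unrelated to altitude.
        has = [rng.random() < 0.15 for _ in range(n_languages)]
        in_high = sum(1 for h, f in zip(high, has) if h and f)
        in_low = sum(1 for h, f in zip(high, has) if not h and f)

        # Two-proportion z-test: does the feature rate differ by altitude?
        pooled = (in_high + in_low) / n_languages
        se = sqrt(pooled * (1 - pooled) * (1 / n_high + 1 / n_low))
        if se == 0:
            continue
        z = (in_high / n_high - in_low / n_low) / se
        if two_sided_p(z) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_chance_hits(), "of 200 unrelated features passed p < 0.05")
```

With a threshold of 0.05, roughly one unrelated feature in twenty should clear the bar on average, so testing hundreds of features practically guarantees a handful of impressive-looking correlations with nothing behind them.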

Another example was one I discussed in my Toxic Society post. Crime rates in the United States have been dropping precipitously, and up till recently no one really knew why. In the past, drops in crime have been associated with good economic times and higher rates of incarceration, so it’s been assumed that poor economies and empty prisons lead to more crime. But as the US economy has struggled through the Great Recession, crime rates have continued to plummet, not just here but all over the world, regardless of incarceration rate. Another apparent correlation/causation link is broken.

So data can fool you into thinking you know more than you do. Even worse, you can use it to bolster ideas you’re already inclined to believe. But even worse than either of these: data can keep you from digging further to find the real causes of what’s going on.

Assume You’re Blinded

It turns out that the drop in crime rates comes not from the economy or the police work, but from environmental regulation 20 years earlier. These regulations lowered the incidence of lead in children’s brains, making them better at impulse control when they got old enough to be tempted to commit crimes. This would never have been discovered if economist Rick Nevin hadn’t followed a hunch that something was wrong with the conventional ‘data-driven’ wisdom, and undertaken a massive project to uncover the truth. He didn’t find this by looking at huge amounts of data, but by going back and questioning his assumptions.

Let’s look back at the high-altitude ejectives, and try to peel off our cultural blinders. Ejectives are found in about 15% of the world’s languages, but it so happens that none of those languages are English, Spanish, Arabic, or any other widespread language of a culture that is or was an imperialist or colonialist power. Imperialist powers tend to take over lowland areas, since they’re easily accessible from water (i.e. easier to reach with your gunboats), and generally support larger populations, are richer agriculturally, and so on. Therefore, one would expect to find languages with ejectives located in high elevations, deserts, and other relatively resource-poor and inaccessible areas.

If I’m right, then you could pick just about any linguistic feature that appears with relatively low frequency (such as object-first sentential structure, or ergative constructions) and find exactly the same geographic distribution. Object-first structure, for example, is found almost exclusively in the foothills of the Andes mountains, deep in the Amazon rainforest. Ergative languages are found in the Basque country (mountainous), the Caucasus mountains, southwestern Iran (mountainous), the mountainous Pacific Northwest, mountainous Central America and the northern Andes mountains, the largely mountainous Arctic, the mixed desert-and-mountains of the Australian outback, and Tibet. (Note that, ironically enough, there are no ejective languages in Tibet; it’s the largest exception to the ejective/elevation correlation.)

I think it would be very hard indeed to make a convincing case that sentential structure or ergativity is ‘caused’ by geographic features like elevation. Of course, no doubt somebody could come up with something plausible, because cultural biases are extremely strong.

All that said: I do think geography has an effect on linguistic sounds, but very indirectly, in more subtle ways. I think generally the path leads through culture. Geography has all kinds of effects on culture, and culture has effects on language. For example, English has (for the most part) a simpler set of consonant sounds and clusters than other Germanic languages, and it definitely has a much simpler syntax and morphology. This is because England was, for over a thousand years, subject to waves of invasions by people speaking various dialects of Germanic, and what you ended up with was sort of the simplest common denominator of them all. And England was subject to these invasions because it was an easily accessible, poorly defended island, wealthy in land and natural resources like lumber and tin.

(Even more subtly, I think the spiritual nature of the land has an effect on the spiritual nature of the language. But this is something I feel — I don’t really have any data, big or otherwise, to back that up…)

Seeing Past the Data

So why didn’t the anthropology professor, the linguists, or the statisticians see the link between ejectives and our imperialist history? Because they were blinded by their own cultural assumptions. They simply assumed that linguistic features were scattered randomly among the languages of the world. They didn’t stop to remember that the world’s languages were part of cultures — cultures influenced by hundreds of years of imperialism, of which they are the beneficiaries. I’m not accusing anyone of prejudice. But as George Orwell said, to see what is in front of one’s nose needs a constant struggle.

Nevin arrived at the connection between crime and lead not by looking at data, but by questioning basic economic assumptions (that environmental regulation has nothing to do with crime). I came to the connection between ejectives and imperialism by questioning common cultural assumptions. These assumptions are easy to fall into if you don’t know your history. And Big Data isn’t going to save you from that. It’ll be just another tooth on the old saw: lies, damned lies, statistics… and Big Data.

The bottom line is that, as essential as data is, it does not answer any question by itself. Whether in linguistics, business, science, or our own lives, the raw data of our experience has to be analyzed for patterns; and we’ll never see those patterns unless we have unblinkered our eyes.