I just read “Limits of Predictability in Human Mobility” by Chaoming Song et al. (Science, Vol. 327, 2010). The paper reports an analysis of a really amazing dataset: three months of cell phone records for ~10 million customers of a large European carrier (“anonymized by the data source”). These records include which cell phone towers people connected to when they placed calls, when the calls were placed, and how long they lasted.
The research question/motivation for the analysis stated in the paper is, “What is the role of randomness in human behavior and to what degree are individual human actions predictable?” The main finding is that people’s daily mobility patterns are very predictable, even when they travel quite far on a daily basis.
I suppose researchers have not been able to model human mobility quite like this before—that’s a LOT of data, even the sample of 50,000 they used for the analysis reported in this paper. But I’m not sure I understand the claim that this finding is surprising: “Yet it is not the 93% predictability that we find the most surprising. Rather, it is the lack of variability in predictability across the population.” How surprising is it that humans are creatures of routine and habit? What was more surprising to me about this paper was the idea that “current models of human activity are fundamentally stochastic”, i.e. the previous status quo was the assumption that there is an important random component to human activity.
The paper seems to be saying “look, the assumptions made by people who study this kind of thing are WRONG, and we can prove it”. I typically like those kinds of papers. But my feeling about this one is that, while it is useful to know that people’s mobility patterns aren’t random if you study these kinds of things, it also seems like a giant WELL, DUH. I suspect this paper is receiving attention because of the sexy dataset, and not because it presents counterintuitive results. (I first heard about the paper on Twitter.)
For me, the most interesting aspect of the paper is how it glosses over what could very well be a huge sampling bias. This is not unique to this particular paper: many analyses of large “social computing” datasets have similar sampling biases due to technical constraints or dataset limitations. When one analyzes an existing dataset, one must make do with the information one is given. (A related example is using IP geolocation to identify users’ locations when the only information you have about them is their IP address. There’s systematic bias in that data for sure, but in many cases there’s just no better way to do it.)
So for example, the cell phone mobility dataset contains location information only for instances when phone calls were placed (or received?)—i.e., the phone had to be in communication with a tower for the tower location to be recorded. This means the dataset contains locations for people *only when they’re using their phones*. If a person doesn’t make any calls, their location is not captured in the dataset. Are there systematic differences in mobility behavior between people who make calls and those who don’t? I’m someone who makes maybe one phone call a day, although I send a lot of text messages and use the packet data service quite a lot. Could it be possible that people who make a lot of calls have more predictable mobility patterns? While I am *sure* this possibility has occurred to the authors, the paper doesn’t address that question. I also wonder what systematic differences exist between people who have cell phones and those who don’t. But the paper doesn’t include a discussion of sampling bias, or any other potential threats to validity.
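To make the worry concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the hourly location log, the call hours, and the helper `share_in_top_k` are my own assumptions, not anything from the paper or its data. It simply shows how a person who only places calls from home or work would look far more routine in call records than they really are.

```python
from collections import Counter

# Hypothetical hourly location log for one person over a single day.
# Hours 0-9 at home, 10-17 at work, 18-23 at six different one-off places.
true_locations = ["home"] * 10 + ["work"] * 8 + [
    "cafe", "gym", "store", "park", "friend", "bar"
]

# Hypothetical hours at which this person placed a call. The assumption
# baked in here is that they only ever call from home or from work.
call_hours = [8, 9, 11, 14, 16]

# What call records would capture: a location only at the call hours.
observed_locations = [true_locations[hour] for hour in call_hours]


def share_in_top_k(locations, k):
    """Fraction of records that fall in the k most frequent locations."""
    counts = Counter(locations)
    top_total = sum(count for _, count in counts.most_common(k))
    return top_total / len(locations)


# Distinct places visited, and share of records at the two most common places.
print(len(set(true_locations)), share_in_top_k(true_locations, 2))          # 8 0.75
print(len(set(observed_locations)), share_in_top_k(observed_locations, 2))  # 2 1.0
```

In this invented example the person visits eight places over the day, but the call records show only two. Whether real callers behave anything like this is exactly the question I would like the paper to have addressed.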
This is a big problem, I think, in the reporting of results from super large datasets. The datasets are SO big that they are thought of as more like population data than sample data, and the results are taken for “truth” without being subjected to appropriate scrutiny. Take this blurb written about the article, from an NPR story:
A new study used cell phone billing data for 50,000 people in a European country to show that people’s travel patterns are extremely predictable. That’s true for both homebodies and jet setters. Regardless of age, language group, etc, people’s movements were predictable 93 percent of the time. The study shows the emerging power of using cell phone data for social science research. (from http://www.npr.org/templates/story/story.php?storyId=123879603)
I think it is extremely important when reporting analyses of large datasets to be exceedingly clear about issues like sampling bias and generalizability, and I’d like to see a requirement that papers address these issues. For example, this particular paper might have reported statistics on what proportion of the dataset had to be excluded due to lack of location data. Or, the authors could have undertaken a secondary data collection to try to find out whether those excluded people differed from the analyzed sample in some systematic way.
I’m not saying I think the findings of this particular paper are invalid—the results make perfect sense, and perhaps that’s why the paper doesn’t even mention threats to validity. But then, how is the finding counterintuitive if it makes so much sense we don’t even question it a little bit?