Emilee Rader

“getting the most out of twitter”

There was an interesting article the other day on nytimes.com titled “Getting the Most out of Twitter” which essentially argues that lurking (i.e., reading others’ posts and not posting yourself) is the way to go:

Even the most prolific users say Twitter has become more useful as a way to tap in to the discussions of the day than to broadcast their own thoughts. And once you get pulled in, you might just find you have something to say after all.

Biz Stone, Twitter’s co-founder, suggests that naysayers simply log on to Twitter’s home page and search for a topic they are interested in, whether it’s their favorite sports team, the name of their company or a topic in the news.

Within a minute, they understand the appeal, he said.

This is interesting to me, for a couple of reasons. First, it seems like up to now, most of the hype in the press about Twitter has been focused on production—posting Tweets—rather than consumption. This article seems to take the information available on Twitter as a given, and focuses on the potential benefit information consumers might receive by using Twitter for “social filtering”. It even provides some advice for how to get better results from one’s “social filtering” endeavors on Twitter.

Second, in taking the information on Twitter as a given, it sidesteps questions about the incentives mismatch—if the real benefit of Twitter is in consumption and “social filtering”, who are the information producers and what are *their* motivations? What influences their choices about what to contribute, and how do these influences shape the information available via Twitter?

The article quotes someone named Dan Zarrella, who has a blog called “The Social Media Scientist”. On March 1st he posted a comparison of link-sharing on Twitter and Facebook, “Data Shows: ‘Twitter’-Centric Stories are Not Heavily Shared on Facebook”. I was initially excited about this: I’m all for cross-site comparisons, and I’m really interested in comparing “social filtering” across different systems. However, Dan does not reveal where his data came from, and only briefly mentions sampling:

I’ve begun by capturing links posted to social media sites from 10 extremely popular news outlets. Some of the top blogs, both mainstream and geeky, as well as a handful of the most web-enabled newspapers of record. Then I’m counting the number of times those links are shared on Facebook (in three different ways) and on Twitter (through good old ReTweets).

This was disappointing. A social media dataset is only as good as one’s data collection and sampling methods, and without detailed information about such things, the results and any conclusions based on them are suspect. Even more disappointing: only two commenters (out of 23) and zero “tweeters” (out of 189, as tracked by DISQUS Comments) asked where the data came from.

adventures in social filtering

I’ve been thinking a lot lately about “social filtering”, or the practice of discovering information by paying attention to what others are paying attention to. Evidence of the attention of others is explicitly captured and aggregated by various social media applications like Digg, delicious, and Twitter / TweetMeme. This is hardly a new concept (see “Edit wear and read wear”, Hill et al., CHI 1992); however, rather than tracking passive traces of use, these applications collect and aggregate explicit actions: posting a link to delicious or Twitter is essentially a user endorsement of the content.

I can’t quite put my finger on it yet, but I feel like there’s something fundamentally different between social filtering as a side effect of saving a story or bookmark or reference for oneself, and social filtering that comes from posting it so that it will be broadcast to others.

For example, I have been using Mendeley for the past 6 months as my reference manager; it is both a desktop application and a social media system. Today I received the March newsletter from Mendeley, which pointed out the “Top 10 most read articles on Mendeley published in 2009”. Number one on the list is the following: Alon, Uri (2009). How to Choose a Good Scientific Problem. Molecular Cell, 35(6), 726-8. I was intrigued by the title, so I looked up the paper, and came across another by the same author titled “How to Build a Motivated Research Group”. Both of these papers contain interesting and valuable insights, and I expect that I will return to them multiple times as I progress in my career.

It seems like the primary use case for Mendeley is storing and organizing references, and sharing them with a group of collaborators. The Mendeley Desktop app synchronizes automatically with the server, so the usage data that was aggregated to produce the “Top 10 in 2009” list is more like the “read wear” of Hill et al. than the active endorsements of Twitter posts. I doubt I follow anyone on Twitter who reads the journal “Molecular Cell”, so I probably would never have come across these papers if I hadn’t seen them in the newsletter email today. Were the Mendeley users who read this paper even aware that their actions might contribute to the information discovery of others? Are these the kind of content items that anyone would choose to post to Twitter?

Endorsements are different from “read wear” in that they require an extra action on the part of the user, beyond reading it or saving it for themselves, to share the content. How does the “read wear” vs. endorsement distinction, as incorporated in a social media system, affect the content available to users via the system?

supplemental statistics

I came across a really interesting paper recently after seeing it referred to in a news story: Female teachers’ math anxiety affects girls’ math achievement (Beilock et al., 2010, PNAS, with Supplemental information)

The researchers recruited 17 first- and second-grade teachers (all female) and assessed the math achievement of the students in their classrooms at the beginning and end of the year, as well as the teachers’ anxiety level. They also measured “students’ beliefs about gender and academic success in domains like math”. They found that higher teacher math anxiety was associated with an increase in girls’ tendency to adhere to “boys are good at math, girls are good at reading” gender stereotypes. They also found that girls who were more likely to hold “boys are good at math, girls are good at reading” stereotypes had lower end-of-year math achievement scores. Interestingly, when they put these predictors together in one regression (teacher anxiety and gender stereotypes predicting math achievement), teacher anxiety was “no longer a significant predictor” (and the coefficient decreased from -3.33 to -2.48). The paper presents a “mediation analysis” called “bias-corrected bootstrapping” that suggests math anxiety in female teachers affects girls’ gender stereotypes, which affects math achievement scores. I don’t know much about this analysis method, so I dug up a couple of papers so I can learn more about it. Yay, stats!
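To make the mediation idea more concrete for myself, here is a minimal sketch of how an indirect effect and a bootstrap confidence interval can be computed. It is not the authors' analysis: it uses a plain percentile bootstrap rather than the bias-corrected variant named in the paper, it ignores the fact that students are nested within classrooms, and the data are synthetic numbers generated inside the script. The function names (`cross_dev`, `mediation_paths`, `bootstrap_ci`) are made up for this illustration.

```python
import random


def cross_dev(xs, ys):
    """Sum of cross-products of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def mediation_paths(x, m, y):
    """Return (total, direct, indirect) effects of x on y through m."""
    sxx, smm = cross_dev(x, x), cross_dev(m, m)
    sxm, sxy, smy = cross_dev(x, m), cross_dev(x, y), cross_dev(m, y)
    det = sxx * smm - sxm ** 2
    a = sxm / sxx                            # predictor -> mediator
    b = (sxx * smy - sxm * sxy) / det        # mediator -> outcome, controlling for x
    direct = (smm * sxy - sxm * smy) / det   # predictor -> outcome, controlling for m
    total = sxy / sxx                        # predictor -> outcome, ignoring m
    return total, direct, a * b


def bootstrap_ci(x, m, y, n_boot=1000, seed=0):
    """95% percentile bootstrap interval for the indirect effect a*b."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        _, _, indirect = mediation_paths(
            [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]
        )
        estimates.append(indirect)
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


# Synthetic data (an assumption for illustration only): x plays the role of
# teacher anxiety, m of stereotype endorsement, y of math achievement.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
m = [0.6 * xi + data_rng.gauss(0, 1) for xi in x]
y = [-0.5 * mi + data_rng.gauss(0, 1) for mi in m]

total, direct, indirect = mediation_paths(x, m, y)
lo, hi = bootstrap_ci(x, m, y)
print(f"total={total:.2f} direct={direct:.2f} indirect={indirect:.2f}")
print(f"95% percentile bootstrap CI for indirect effect: [{lo:.2f}, {hi:.2f}]")
```

In this setup the total effect splits exactly into the direct effect plus the indirect effect, which mirrors the pattern described above: once the mediator is in the regression, the coefficient on the original predictor shrinks. If the bootstrap interval for the indirect effect excludes zero, that is usually read as evidence of mediation.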

I have two issues with the way the results are presented in this paper. First, it took me way too long to figure out what they actually did. I didn’t notice the supplemental material initially, which is where all of the analyses are described, and the text of the actual article is too vague about the statistics for me to believe the results from just that part of the text. I realize that PNAS limits submissions to 6 pages, but I feel that for this particular paper the supplemental material is not supplemental at all; it is essential. After reading the supplement, it is pretty clear that the analysis was adequate.

But my second issue is that the interpretation of the result seems more concerned with sign and significance than with effect size. The paper doesn’t ground the numbers in real-world implications, nor does it present descriptive statistics on the instruments used (these are relegated to the online appendix as well). For example, it is impossible to interpret the coefficient in this statement without having some idea what the units mean for both teacher anxiety and math achievement: “In addition, the more girls at the end of the year endorsed the notion that boys are good at math and girls are good at reading, the lower was their math achievement (r = −0.28, P = 0.025).” This oversight surprises me. So what if gender stereotype belief is a significant predictor of math achievement for girls, if this is only associated with very small differences in test scores? Yes, it is still interesting that the effect was present in girls and not in boys, but if the magnitude of the effect is small, in my mind the implications of this particular study are more about gender stereotypes and behavior modeling, and less about figuring out how to help girls do better in math. (There’s a brief acknowledgement of effect size in the third to last paragraph: “It is important to note that the effects reported in the current work, although significant, are small.”)
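One rough, unit-free way to gauge the magnitude of the quoted correlation is to square it, which gives the proportion of variance the two measures share. This is only back-of-the-envelope arithmetic on the single number quoted above, not a reanalysis of the paper's data.

```python
# Correlation quoted in the paper between stereotype endorsement and
# girls' end-of-year math achievement.
r = -0.28

# Squaring a correlation gives the proportion of shared variance.
r_squared = r ** 2

print(f"Shared variance: {r_squared:.1%}")  # prints "Shared variance: 7.8%"
```

So stereotype endorsement accounts for a bit under 8% of the variance in math achievement in that sample, which fits the authors' own remark that the effects are small.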

Finally, it’s interesting to me, and a bit depressing, that in a paper about math anxiety and achievement, the complicated statistics are relegated to an appendix. There is no way to know if the authors expected the statistical analyses to be transparent enough that they didn’t need to include the details in the paper, or if they felt the paper would be more understandable for readers without the stats. This is something I struggle with: how to appropriately describe complicated quantitative analyses for a multi-disciplinary audience that may or may not understand what I’m talking about, or even want to learn. I’m not sure I like the stats appendix solution, but I like it a lot more than two other alternatives I’ve seen: 1) the “sink or swim” approach, which describes the analyses as if to an expert and leaves less experienced readers to flounder; and 2) only using stats one believes most members of the community should be familiar with.

web applications

Alina Lungeanu and I started collecting data last week on our experiment! I don’t want to say too much about the hypotheses, etc. in case potential participants google me and find this blog post, so instead today I’m writing about why I’m glad I’m not a web application developer.

For the experiment we are using the same web application created for my dissertation research, with a few small tweaks, and a new set of materials. Whenever you’re doing a study that involves participants using a prototype or other system built specifically for the experiment, it is imperative to do a lot of testing. The last thing you want is for the results of the study to reflect bugs or usability problems and not the actual phenomena of interest. So, before using the experiment app for my dissertation research, I set aside plenty of time for testing and recruited people to bang on the system and try to break it.

This time around, the tweaks to the system were so minor that I basically tested use cases that involved the new features, and nothing else. I figured not much had changed, so I could assume what worked before would still be working. This, as it turns out, is an assumption that doesn’t hold true in the wonderful world of web application development. With a web application, it isn’t just the application code itself you have to worry about. About a year has gone by since my initial data collection, and in that time web browsers have gone through several rounds of updates and major releases. Also, we’re using a different web server this time around. And finally, there’s been an update to one of the toolkits the application uses for the file-and-folder interface. So in reality, a LOT has changed from a year ago.

Fortunately, in the first experiment session we uncovered a minor “race condition” bug that hadn’t presented itself in either my dissertation data collection, or testing for this experiment (I say “fortunately” because we discovered the problem early). A race condition exists when multiple related (but separate) requests are sent from the client to the web server. Because these are *separate* requests, there’s no explicit sequencing, and unpredictable or undesirable application behavior can result if/when these requests are processed in the wrong order. This was a simple bug to fix, and so far no other bugs have presented themselves.
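Here is a toy sketch of that kind of ordering problem, and of one common remedy: tagging each request with a sequence number so the server can hold back a request that arrives early. This is not the code from the experiment app, and it is not necessarily how I fixed the bug there; the class names and the two example actions are hypothetical.

```python
class NaiveServer:
    """Applies each request as soon as it arrives, ignoring its sequence number."""

    def __init__(self):
        self.log = []

    def receive(self, seq, action):
        self.log.append(action)


class SequencedServer:
    """Buffers requests until every earlier sequence number has been applied."""

    def __init__(self):
        self.log = []
        self.pending = {}
        self.next_seq = 0

    def receive(self, seq, action):
        self.pending[seq] = action
        # Apply as many consecutive requests as are now available.
        while self.next_seq in self.pending:
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


# The client sent request 0 and then request 1, but they arrive reversed.
arrivals = [(1, "move file into folder"), (0, "create folder")]

naive, sequenced = NaiveServer(), SequencedServer()
for seq, action in arrivals:
    naive.receive(seq, action)
    sequenced.receive(seq, action)

print(naive.log)      # ['move file into folder', 'create folder']
print(sequenced.log)  # ['create folder', 'move file into folder']
```

The naive server tries to move a file into a folder that does not exist yet, while the sequenced server waits for the missing request. Another option is to have the client send the second request only after the first one has been acknowledged.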

The reason I am glad I’m not a web application developer is that, with all these infrastructural components that can change (browsers, servers, toolkits…), keeping a web application working seems to be like hitting a moving target. Firefox 3.6 included optimizations to speed up JavaScript, for example, which may have contributed to the race condition bug in the experiment app. A new version of Internet Explorer was released, and the toolkit the experiment app uses also released a new version with changes based on the changes to IE. It amazes me that Gmail and all those other web apps I use on a daily basis continue to work at all!

So my advice to anyone considering using a home-grown web application in their research is this: come up with a test suite, document it, and run through all the test cases *every time* you intend to use the application in a new study, even if the application itself hasn’t changed.

large datasets and threats to validity

I just read “Limits of Predictability in Human Mobility” by Chaoming Song et al. (Science, Vol. 327, 2010). The paper reports an analysis of a really amazing dataset: three months of cell phone records for ~10 million customers of a large European carrier (“anonymized by the data source”). These records include information about the cell phone towers people connected to when they placed calls, what time calls were placed, and how long the calls lasted.

The research question/motivation for the analysis stated in the paper is, “What is the role of randomness in human behavior and to what degree are individual human actions predictable?” The main finding is that people’s daily mobility patterns are very predictable, even when they travel quite far on a daily basis.

I suppose researchers have not been able to model human mobility quite like this before—that’s a LOT of data, even the sample of 50,000 they used for the analysis reported in this paper. But I’m not sure I understand the claim that this finding is surprising: “Yet it is not the 93% predictability that we find the most surprising. Rather, it is the lack of variability in predictability across the population.” How surprising is it that humans are creatures of routine and habit? What was more surprising to me about this paper was the idea that “current models of human activity are fundamentally stochastic”, i.e. the previous status quo was the assumption that there is an important random component to human activity.

The paper seems to be saying “look, the assumptions made by people who study this kind of thing are WRONG, and we can prove it”. I typically like those kinds of papers. But my feeling about this one is, while it is useful to know that people’s mobility patterns aren’t random if you study these kinds of things, it also seems like a giant WELL, DUH. I suspect this paper is receiving attention because of the sexy dataset, and not because it presents counterintuitive results. (I first heard about the paper on Twitter.)

For me, the most interesting aspect of the paper is how it glosses over what could very well be a huge sampling bias. This is not unique to this particular paper; many analyses of large “social computing” datasets have similar sampling biases due to technical constraints or dataset limitations. When one analyzes an existing dataset, one must make do with the information one is given. (A related example is using IP geolocation to identify users’ locations when the only information you have about them is their IP address. There’s systematic bias in that data for sure, but in many cases there’s just no better way to do it.)

So for example, the cell phone mobility dataset contains location information only for instances when phone calls were placed (or received?)—i.e., the phone had to be in communication with a tower for the tower location to be recorded. This means the dataset contains locations for people *only when they’re using their phones*. If a person doesn’t make any calls, their location is not captured in the dataset. Are there systematic differences in mobility behavior between people who make calls and those who don’t? I’m someone who makes maybe one phone call a day, although I send a lot of text messages and use the packet data service quite a lot. Could it be possible that people who make a lot of calls have more predictable mobility patterns? While I am *sure* this possibility has occurred to the authors, the paper doesn’t address that question. I also wonder what systematic differences exist between people who have cell phones and those who don’t. But the paper doesn’t include a discussion of sampling bias, or any other potential threats to validity.

This is a big problem, I think, in the reporting of results from super large datasets. The datasets are SO big that they are thought of as more like population data than sample data, and the results are taken for “truth” without being subjected to appropriate scrutiny. Take this blurb written about the article, from an NPR story:

A new study used cell phone billing data for 50,000 people in a European country to show that people’s travel patterns are extremely predictable. That’s true for both homebodies and jet setters. Regardless of age, language group, etc, people’s movements were predictable 93 percent of the time. The study shows the emerging power of using cell phone data for social science research. (from http://www.npr.org/templates/story/story.php?storyId=123879603)

I think it is extremely important when reporting analyses of large datasets to be exceedingly clear about issues like sampling bias and generalizability, and I’d like to see a requirement that papers address these issues. For example, this particular paper might have reported statistics on what proportion of the dataset had to be excluded due to lack of location data. Or, the authors could have undertaken a secondary data collection to try to find out whether those excluded people differed from the analyzed sample in some systematic way.
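As a sketch of what that kind of reporting could look like, the snippet below computes the share of users dropped by an inclusion rule and compares one covariate between the kept and dropped groups. Everything in it is invented for illustration: the eight records, the `MIN_CALLS_PER_DAY` threshold, and the choice of age as the covariate are my assumptions, not values or criteria from the paper.

```python
from statistics import mean

# Made-up records for illustration: (user_id, calls_per_day, age).
users = [
    ("u1", 14, 34), ("u2", 9, 41), ("u3", 1, 67), ("u4", 0, 72),
    ("u5", 11, 29), ("u6", 2, 58), ("u7", 8, 45), ("u8", 0, 63),
]

# Hypothetical inclusion rule: keep only users with enough calls
# for their locations to be observed regularly.
MIN_CALLS_PER_DAY = 5

included = [u for u in users if u[1] >= MIN_CALLS_PER_DAY]
excluded = [u for u in users if u[1] < MIN_CALLS_PER_DAY]

excluded_share = len(excluded) / len(users)
mean_age_included = mean(u[2] for u in included)
mean_age_excluded = mean(u[2] for u in excluded)

print(f"Excluded: {excluded_share:.0%} of users")
print(f"Mean age, included: {mean_age_included}")
print(f"Mean age, excluded: {mean_age_excluded}")
```

In this invented example, half the users are dropped and the dropped group is much older than the analyzed one, which is exactly the kind of systematic difference a reader would want to know about before generalizing the findings.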

I’m not saying I think the findings of this particular paper are invalid—the results make perfect sense, and perhaps that’s why the paper doesn’t even mention threats to validity. But then, how is the finding counterintuitive if it makes so much sense we don’t even question it a little bit?