<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Emilee Rader &#187; data</title>
	<atom:link href="http://bierdoctor.com/category/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://bierdoctor.com</link>
	<description>Assistant Professor, Technology &#38; Social Behavior @ Northwestern University</description>
	<lastBuildDate>Thu, 02 Sep 2010 04:50:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>statistics. sigh.</title>
		<link>http://bierdoctor.com/2010/09/01/statistics-sigh/</link>
		<comments>http://bierdoctor.com/2010/09/01/statistics-sigh/#comments</comments>
		<pubDate>Thu, 02 Sep 2010 04:50:39 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[infrastructure]]></category>
		<category><![CDATA[research design]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=595</guid>
		<description><![CDATA[I find myself once again this week reading stats papers that range from &#8220;slightly over my head&#8221; to &#8220;I have no idea what you people are talking about,&#8221; in an attempt to figure out the right thing to do with a dataset involving observations that are not independent. The dataset consists of conversations between dyads [...]]]></description>
			<content:encoded><![CDATA[<p>I find myself once again this week reading stats papers that range from &#8220;slightly over my head&#8221; to &#8220;I have no idea what you people are talking about,&#8221; in an attempt to figure out the right thing to do with a dataset involving observations that are not independent.</p>
<p>The dataset consists of conversations between dyads that took place while they completed two different interactive tasks. The conversations were recorded, transcribed, and segmented into utterances according to some criteria. This means that there are repeated utterances from each participant, and from each dyad. Different research areas use different terms to refer to this kind of setup: repeated measures, panel data, clustered data, etc. The analysis is further complicated by the fact that the predictor and response variables are all categorical. Some are binary, the presence or absence of something. The more interesting variables have more than two categories (in some cases, MANY more).</p>
<p>I am trying to estimate the strength with which each of a set of 15+ utterance goals is associated with one of three roles participants assumed as part of the study. To do this, I need to specify a mixed-effects multinomial logit model, with a set of fixed-effects categorical predictors and a hierarchical random effects control for participant within dyad. This involves choosing a reference category of the response variable, and then running a series of binomial logit models that compare all the other levels of the response variable in turn with the reference category.</p>
<p>Here is where I am running into a situation, again, where I am pushing up against what mainstream statistical software packages are reliably capable of, and even R does not seem to be able to do what I want without more programming than my meager statistical background has prepared me for. The problem as I understand it is, each one of the binomial logit models that makes up the multinomial results uses a different subset of the data, excluding those observations that are related to the levels of the response variable not included in the model. This means that the random effects are estimated differently for each binomial logit model, depending on which observations are included in the subset. The upshot of all of this is the overall multinomial model estimates come out differently, depending substantially on which category is chosen as the reference category.</p>
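<p>To make the subsetting issue concrete, here is a minimal sketch in Python. The toy utterances, the dyad IDs, and the function names are all hypothetical and are not from my dataset; the sketch only shows which dyads end up in each binomial comparison, and it fits no model.</p>

```python
# Hypothetical toy data: one (dyad_id, utterance_goal) pair per utterance.
utterances = [
    ("d1", "ask"), ("d1", "ask"), ("d1", "inform"),
    ("d2", "inform"), ("d2", "confirm"),
    ("d3", "ask"), ("d3", "confirm"), ("d3", "confirm"),
    ("d4", "inform"), ("d4", "inform"),
]

def binary_subsets(rows, reference):
    """For each non-reference category, keep only the rows whose response
    is either that category or the reference category."""
    categories = sorted({goal for _, goal in rows})
    subsets = {}
    for category in categories:
        if category == reference:
            continue
        keep = (reference, category)
        subsets[category] = [row for row in rows if row[1] in keep]
    return subsets

def dyads_per_model(rows, reference):
    """List the dyads that contribute data to each binomial comparison."""
    result = {}
    for category, subset in binary_subsets(rows, reference).items():
        result[category] = sorted({dyad for dyad, _ in subset})
    return result

# Changing the reference category changes which dyads each model sees.
for reference in ("ask", "inform"):
    print(reference, dyads_per_model(utterances, reference))
```

<p>In this toy example, dyad d4 never uses the "ask" or "confirm" goals, so it drops out of the "confirm" comparison when "ask" is the reference category but stays in every comparison when "inform" is the reference. Any random effect for dyad is therefore estimated from a different set of dyads depending on that choice.</p>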
<p>So that&#8217;s the problem. However, I did not write this to whine about how I am stuck. I&#8217;ve been trying to figure out a solution that I can live with&#8230; do I bail completely? Hire a real statistician? How can I figure out how biased the results would be if I were to do a purely fixed-effects model? (Without random effects controls, any results produced might in fact be due to some unique aspect of the conversation within a particular dyad in a particular role, rather than indicative of something that shows up across all of the dyads.)</p>
<p>Researchers in many fields work with categorical data, and at least some of them over the years must have encountered this problem, whether they knew it or not, and were faced with the same tradeoffs. In order to get the paper out the door they had to just pick a compromise and go with it. But, any results reached due to a compromise are biased in some way. Models like this are just now becoming possible for people like me, with just enough stats knowledge to be dangerous, to run using fairly standard statistical software packages. But what about all the research that has come before &#8212; how accurate are those models, and the results they produced? How much do people allow what is statistically feasible to determine their research design, vs. compromising on the analysis after the fact? We all stand on the shoulders of giants, but how often were the giants using naive or incorrect statistics?</p>
Copyright &copy; 2010 <strong><a href="http://bierdoctor.com/">Emilee Rader</a></strong>]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/09/01/statistics-sigh/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>digg censorship</title>
		<link>http://bierdoctor.com/2010/08/15/digg-censorship/</link>
		<comments>http://bierdoctor.com/2010/08/15/digg-censorship/#comments</comments>
		<pubDate>Sun, 15 Aug 2010 19:46:04 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[in the news]]></category>
		<category><![CDATA[social filtering]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=581</guid>
		<description><![CDATA[In a recent post, I mentioned Facebook&#8217;s &#8220;Like&#8221; button for the web, and wrote about how using the information contributed through all those &#8220;Like&#8221; button presses is more complicated than just inferring that a &#8220;Like&#8221; means that someone likes the web page. I recently came across mention of an alleged &#8220;censorship&#8221; controversy related to Digg.com [see [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://bierdoctor.com/2010/08/03/traffic-accidents-and-social-media-part-iii/">recent post</a>, I mentioned Facebook&#8217;s &#8220;Like&#8221; button for the web, and wrote about how using the information contributed through all those &#8220;Like&#8221; button presses is more complicated than just inferring that a &#8220;Like&#8221; means that someone likes the web page.</p>
<p>I recently came across mention of an alleged <a href="http://blogs.alternet.org/oleoleolson/2010/08/05/massive-censorship-of-digg-uncovered/">&#8220;censorship&#8221; controversy</a> related to <a href="http://digg.com/">Digg.com</a> [see <a href="http://www.guardian.co.uk/technology/2010/aug/06/digg-investigates-claims-conservative-censorship">here</a>, and <a href="http://www.fastcompany.com/1678342/digg-censorship-wikileaks-conservatives">here</a> for mentions in mainstream media], in which a group of coordinated users apparently succeeded in preventing certain stories they found politically objectionable from reaching the front page of Digg, so these stories would not receive wide exposure. The users achieved this end through what is essentially a thumbs up / thumbs down mechanism fundamental to the way Digg works, by which users vote on whether stories should be promoted or buried. As <a href="http://www.examiner.com/conservative-in-national/the-digg-censorship-controversy-an-alternative-view">one blogger</a> points out, opinions about censorship aside, these users were operating within Digg&#8217;s available functionality and did not necessarily violate any rules. People who are upset about this use of Digg&#8217;s voting mechanism claim the group of users were gaming the system &#8212; coordinating which stories to target via another social media application (Yahoo! Groups).</p>
<p>The &#8220;gaming the system&#8221; and &#8220;censorship&#8221; aspects of this controversy are less interesting to me personally than the flexibility of such a simple voting mechanism, which was used here to express an entire political agenda rather than individual, personal preferences. This is an instance of the point I was trying to make (badly?) a few days ago &#8212; that even simple mechanisms can be tools for expressing a wide variety of meaning, but that meaning is not obvious from single contributions. In this case, the coordinated intentions of the group only became apparent in aggregate, and only after people who were pissed off about having their stories consistently targeted for &#8220;burial&#8221; were motivated enough to figure out what was going on. In other words, the meaning behind these actions was not present in the aggregate voting data; it was only visible if you already knew where to look.</p>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/08/15/digg-censorship/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>now that&#8217;s a lot of link shorteners</title>
		<link>http://bierdoctor.com/2010/05/01/now-thats-a-lot-of-link-shorteners/</link>
		<comments>http://bierdoctor.com/2010/05/01/now-thats-a-lot-of-link-shorteners/#comments</comments>
		<pubDate>Sat, 01 May 2010 20:56:15 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[methods]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[sampling]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=530</guid>
		<description><![CDATA[I&#8217;m writing a script to parse links out of tweets on Twitter (for example, this tweet contains a link), and then look up the URLs in other social media applications like delicious.com and digg.com. One challenge I&#8217;m facing is the plethora of URL shorteners available to people who post to Twitter. A URL shortener is [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m writing a script to parse links out of tweets on Twitter (for example, <a href="http://twitter.com/tfinholt/status/13027426535">this tweet</a> contains a <a href="http://cli.gs/baYUq">link</a>), and then look up the URLs in other social media applications like <a href="http://delicious.com/">delicious.com</a> and <a href="http://digg.com/">digg.com</a>. One challenge I&#8217;m facing is the plethora of URL shorteners available to people who post to Twitter.</p>
<p>A URL shortener is an online service that assigns a short URL to a web page&#8217;s original address, such that the web page can be accessed either by the original URL or by the new, shorter URL. People use these services in conjunction with Twitter, when they&#8217;re including a URL in a tweet. Because tweets can only be 140 characters long, a shorter URL means more characters left over for saying other stuff.</p>
<p>When I say there is a plethora of URL shorteners available, I truly do mean it in the &#8220;extreme excess&#8221; sense of the word. For example, there&#8217;s a Flickr set of screen captures that contains images of the home pages of <a href="http://www.flickr.com/photos/factoryjoe/sets/72157602178338004/">129 different link shorteners</a>, including <a href="http://www.shadyurl.com/">shadyurl.com</a> for when you want your &#8220;shortened&#8221; link to be suspicious and frightening. Bizarre.</p>
<p>Speaking of shady URLs, one issue I have with URL shorteners is that I am never sure what I am going to get when I click on a link from a service like <a href="http://bit.ly/">bit.ly</a>, which is the &#8220;default&#8221; URL shortener of Twitter. Phishing attacks in social network applications (like Facebook and Twitter) are becoming more common, and tricking people into clicking on links that execute code intended to steal passwords, etc., is often the goal. As a result, I rarely ever follow links that show up in my Twitter feed, or in my email. It just seems too risky to me to click on links from URL shorteners that disguise the ultimate destination.</p>
<p>Interestingly, the distribution of Twitter posts using different URL shorteners has shifted quite a bit over time, as described by a blogger on TechCrunch.com in &#8220;<a href="http://techcrunch.com/2010/01/06/bit-ly-market-share/">What happened to bit.ly&#8217;s market share?</a>&#8221; The article describes a new &#8220;pro&#8221; service offered by <a href="http://bit.ly/">bit.ly</a> that allows content providers to offer shortened links that resemble the actual domain name more closely (i.e., TechCrunch.com becomes <a href="http://twitter.com/#search?q=tcrn.ch">tcrn.ch</a>), and how this makes it appear on the surface that the proportion of tweets using bit.ly has decreased quite a bit.</p>
<p><a href="http://bit.ly/">bit.ly</a> is still the most-used link shortener on Twitter, according to this <a href="http://tweetmeme.com/about/statistics" class="broken_link">lovely pie-chart</a> created by tweetmeme.com and refreshed daily. But bit.ly accounts for 50% or so of shortened links on Twitter, and the next most popular service, <a href="http://tinyurl.com/">tinyurl.com</a>, only 5%. So, back to thinking about my data collection script, I have a couple of options for resolving the URLs I am parsing out of tweets. I can stick with the most popular link shorteners as reported by tweetmeme.com, use the available APIs provided by those services to look up the original URLs, and end up throwing out 30-50% of links. Or, I can resolve all of the links I find, to figure out if they redirect to a different place or not. I&#8217;m leaning towards resolving all the links at this point, but will have to do some testing to make sure it will actually work the way I think it will work.</p>
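<p>The "resolve everything" option might look something like the Python sketch below. This is a sketch under assumptions, not tested against real shorteners: it assumes a service answers a HEAD request with a redirect status and a Location header, which not every service is guaranteed to do. The function names and the example.com-style hosts are made up.</p>

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)

def fetch_location(url, timeout=10):
    """Send one HEAD request; return the redirect target, or None if the
    response is not a redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in REDIRECT_CODES:
            location = response.getheader("Location")
            # Location may be relative, so resolve it against the current URL.
            return urljoin(url, location) if location else None
        return None
    finally:
        conn.close()

def resolve(url, fetch=fetch_location, max_hops=10):
    """Follow redirects one hop at a time and return the last URL reached.
    Stops on a non-redirect, a redirect loop, or after max_hops hops."""
    seen = {url}
    for _ in range(max_hops):
        target = fetch(url)
        if target is None or target in seen:
            return url
        seen.add(target)
        url = target
    return url

# Offline usage with a stand-in for the network (hypothetical hosts).
fake_hops = {
    "http://short.example/a": "http://mid.example/b",
    "http://mid.example/b": "http://dest.example/page",
}
print(resolve("http://short.example/a", fetch=fake_hops.get))
```

<p>Passing the fetch step in as a parameter means the hop-following logic can be checked with a dictionary before any real requests are sent. A link that comes back unchanged was not a redirect in the first place.</p>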
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/05/01/now-thats-a-lot-of-link-shorteners/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>datasets available online</title>
		<link>http://bierdoctor.com/2010/04/28/datasets-available-online/</link>
		<comments>http://bierdoctor.com/2010/04/28/datasets-available-online/#comments</comments>
		<pubDate>Thu, 29 Apr 2010 04:09:23 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[sampling]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=525</guid>
		<description><![CDATA[This is a mini-rant about datasets. Specifically, other people&#8217;s datasets that they&#8217;ve made available online. In the past few days I&#8217;ve taken a look at Twitter datasets made available on Infochimps.com, and a tagging dataset made available by Yahoo! Labs through its Sandbox website. The first part of my rant is about how the people [...]]]></description>
			<content:encoded><![CDATA[<p>This is a mini-rant about datasets. Specifically, other people&#8217;s datasets that they&#8217;ve made available online. In the past few days I&#8217;ve taken a look at <a href="http://infochimps.org/collections/twitter-census">Twitter datasets made available on Infochimps.com</a>, and a <a href="http://webscope.sandbox.yahoo.com/">tagging dataset made available by Yahoo! Labs</a> through its Sandbox website.</p>
<p>The first part of my rant is about how the people providing these datasets don&#8217;t give me enough information to decide whether or not the dataset will be useful to me before shelling out hundreds of dollars or filling out a lengthy form or pestering my supervisor for approval.</p>
<p>For example, take a look at the &#8220;<a href="http://infochimps.org/datasets/twitter-census-hashtags-urls-smileys-by-hour">Twitter Census: Hashtags, URLs, Smileys by Hour</a>&#8221; dataset on Infochimps. They want $300 for this dataset without providing a clear description of what it contains, or even how big it is. Nor is there much documentation about how these data were collected, and what, if any, sampling bias may exist.</p>
<p>And the free datasets aren&#8217;t really usable either&#8212;the &#8220;<a href="http://infochimps.org/datasets/twitter-census-tweets-by-hour-tweeted">Twitter Census: Tweets by Hour Tweeted</a>&#8221; dataset has the same problem as the &#8220;Hashtags&#8230;&#8221; one; in addition, when you download the file, you get a tab-delimited text file that contains four columns with no labels! Come on, people, what am I supposed to do with this? Guess?</p>
<p>Regarding the Yahoo! Labs data, one must fill out a lengthy form, including providing a phone number, and describe the &#8220;proposed research project&#8221;, before even being told what types of data might be available. To actually obtain access to a dataset, one must enter the name and email address of their department head, presumably to provide some kind of accountability or oversight of how the data will be used. I definitely don&#8217;t want to bug my boss before I have even seen a snippet of the data.</p>
<p>Places like <a href="http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp">ICPSR</a> at the University of Michigan have been <a href="http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp">making data available</a> for a long time; there should be plenty of examples for organizations like these to follow regarding what kind of information researchers need to determine whether a dataset will be useful.</p>
<p>I also want to briefly consider the difference between &#8220;raw data&#8221;, &#8220;dataset&#8221;, and &#8220;sample&#8221;. I don&#8217;t know how others think about using these terms to represent chunks of data they happen to be working with. Take this description of the Yahoo! Labs delicious.com dataset:</p>
<blockquote><p>This dataset represents 100,000 URLs that were bookmarked on Delicious by users of the service. Each URL has been saved at least 100 times. For each URL, the date that it was first bookmarked by a Delicious user is indicated, along with the total number of saves. Also indicated are the ten most commonly used tags for each URL, along with the number of times each tag was used. This dataset provides a view into the nature of popular content in the Delicious social bookmarking system, including how users apply tags to individual items.</p></blockquote>
<p>To me, this describes a very specific, targeted sample taken from the entire delicious.com database (i.e., the &#8220;raw data&#8221;). I think of a &#8220;dataset&#8221; as something smaller than the whole database, but without the degree of aggregation seen in the &#8220;samples&#8221; I mentioned above (i.e., counts of tweets per hour rather than the actual tweets, and tag usage counts rather than the actual bookmarking history). Maybe I am the only one thinking about it in this way.</p>
<p>Ultimately, the Yahoo! Labs delicious.com sample is not very useful to me. For our <a href="http://bierdoctor.com/papers/delicious-cscw-logistic+simulations.pdf">CSCW 2008 tagging paper</a>, we also limited our sample of URLs to those that were bookmarked by 100+ people. But, we scraped delicious.com to obtain the *complete* bookmark histories for each URL, including the timestamp and ALL tags applied by every single user that had bookmarked the web page. The temporal information and complete history were very important for our analysis. However, since we were scraping, we only obtained the data for 30 URLs; I&#8217;d love to be able to do this kind of analysis with a bigger sample, so that we might be able to find clusters of people who all bookmarked the same URLs. But alas, this dataset from Yahoo! Labs is not suitable for this purpose.</p>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/04/28/datasets-available-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>managing data analysis scripts</title>
		<link>http://bierdoctor.com/2010/04/13/managing-data-analysis-scripts/</link>
		<comments>http://bierdoctor.com/2010/04/13/managing-data-analysis-scripts/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 06:25:00 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[software tools]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=521</guid>
		<description><![CDATA[I&#8217;ve been revisiting the various scripts I wrote to analyze my thesis data, so I can use them again on a new dataset. The problem is, I&#8217;m finding it both easier and harder than I expected to reconstruct what I did. The &#8220;easy&#8221; part is due to the fact that I was apparently totally anal [...]]]></description>
			<content:encoded><![CDATA[<div>I&#8217;ve been revisiting the various scripts I wrote to analyze my thesis data, so I can use them again on a new dataset. The problem is, I&#8217;m finding it both easier and harder than I expected to reconstruct what I did. The &#8220;easy&#8221; part is due to the fact that I was apparently totally anal about writing down EVERYTHING I was doing, and sometimes even why I was doing it. The &#8220;hard&#8221; part is because I wasn&#8217;t always as consistent as I should have been, and I recorded a lot of useless stuff along with what I really needed to keep track of.</div>
<p><div>For this project, I ended up writing scripts in both Ruby and R, and lots of SQL both incorporated into the scripts and in standalone text files. The experiment application I used to collect the data has its own specific implementation details that exist only in the head of the developer (not me), the experiment itself has a structure that is important for the analysis and is incorporated into the structure of the backend database, and I used a bunch of different R packages for connecting to the database and for specific analyses that have their own requirements and constraints. This is all stuff I had to document and keep track of, in addition to the actual analysis scripts. I also kept a detailed record of ALL the data cleaning I did, so that if I ever had to re-create the final dataset, it would actually be possible.</div>
<p><div>As I worked through the analysis over a period of 4-5 months, I was apparently pretty obsessed with keeping a record of *everything* I tried&#8212;meaning every script, data file, graph, or other product of analysis, even if it didn&#8217;t work out very well&#8212;on the off chance I might want to use it later. I thought I was doing myself a favor, and indeed, it is WAY better to have gone a little overboard with this than not to have done it at all.</div>
<p><div>However, one problem I&#8217;m running into is that while I have documentation (of varying levels of detail) in nearly every script file, the intermediate data files are not themselves commented. So I have to make guesses based on which script file names go with what data file names (also an area where I was pretty consistent, but not 100%) and go crawling through various scripts to figure out which one produced a particular data file and which other one takes it as input. I kept a &#8220;lab notebook&#8221; of sorts&#8212;just a text file, stored in the Mac app <a href="http://www.barebones.com/products/Yojimbo/">Yojimbo</a> with the rest of my research-related notes and ideas&#8212;but this is yet ANOTHER separate file I have to look at, and it doesn&#8217;t have all the information I need about dependencies.</div>
<p><div>Another problem I&#8217;m having is that I didn&#8217;t know exactly what might end up being garbage and what would actually be useful while I was doing the analysis; typically, I don&#8217;t figure this out until a non-trivial chunk of time has passed after I have written a paper that used a particular set of scripts and data files. But months after a paper has been submitted, it is really hard to go back and separate the useful from the useless bits of analysis; enough time has passed that I don&#8217;t remember off the top of my head what actually ended up being used, and both useful and useless code is mixed together in the same files, so it would require re-acquainting myself with everything before I&#8217;d be able to separate things out. The impetus to do this housekeeping work just doesn&#8217;t exist at any point in the research cycle for me, I guess.</div>
<p><div>What I&#8217;m looking for is a better way to manage all of this information, that isn&#8217;t too onerous when I&#8217;m in the throes of analysis, but also makes it relatively painless to reconstruct what I did at a later time. For example, after spending several hours poring over my &#8220;lab notebook&#8221; and files, I feel like I probably have all of the information I need to reconstruct my thesis analysis; but, that reconstruction is going to hurt.</div>
<p><div>I&#8217;m not even sure how to ask the Internets for a solution to my problem&#8230; version control might help with part of it, but to my knowledge that type of system won&#8217;t help me manage dependencies between three different file types and a bunch of intermediate data files. Maybe my analysis process is at fault&#8212;my scripts are too big and cumbersome (i.e., they try to do too many things), and my (irrational?) need to save out *every* data table to a file rather than re-computing it when I need it just confuses things. Anybody solve this problem for themselves, and want to give me some tips?</div>
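<p>One lightweight idea for the uncommented data files, sketched in Python under assumptions: have each script write a small "sidecar" file next to every data file it saves, recording which script produced it and which inputs it read. The file names, the ".prov.json" suffix, and both function names are hypothetical; this is not something I have adopted, just an illustration of the bookkeeping.</p>

```python
import json
import os
import time

def write_provenance(output_path, script, inputs, note=""):
    """Write a JSON sidecar next to a data file, recording which script
    produced it and which input files it was derived from."""
    record = {
        "output": os.path.basename(output_path),
        "script": script,
        "inputs": list(inputs),
        "note": note,
        "created": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    sidecar = output_path + ".prov.json"
    with open(sidecar, "w") as handle:
        json.dump(record, handle, indent=2)
    return sidecar

def producers_of(directory):
    """Map each data file in a directory to the script and inputs behind it,
    based on whatever sidecar files are present."""
    lineage = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".prov.json"):
            with open(os.path.join(directory, name)) as handle:
                record = json.load(handle)
            lineage[record["output"]] = (record["script"], record["inputs"])
    return lineage
```

<p>A Ruby or R script could write the same JSON just as easily, so the record does not depend on which language produced the data file.</p>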
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/04/13/managing-data-analysis-scripts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>&#8220;participatory news&#8221; descriptives</title>
		<link>http://bierdoctor.com/2010/03/24/participatory-news-descriptives/</link>
		<comments>http://bierdoctor.com/2010/03/24/participatory-news-descriptives/#comments</comments>
		<pubDate>Wed, 24 Mar 2010 19:14:39 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[data]]></category>
		<category><![CDATA[social filtering]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=494</guid>
		<description><![CDATA[The Pew Internet &#38; American Life Project recently published a report, Understanding the Participatory News Consumer, that contains some descriptive statistics about the prevalence of what I&#8217;ve been calling social filtering or link sharing. The data for this report were collected between December 28, 2009 and January 19, 2010. N=2259 English-speaking adults 18 or older  [...]]]></description>
			<content:encoded><![CDATA[<p>The Pew Internet &amp; American Life Project recently published a report, <a href="http://www.pewinternet.org/Reports/2010/Online-News.aspx">Understanding the Participatory News Consumer</a>, that contains some descriptive statistics about the prevalence of what I&#8217;ve been calling <em><a href="http://bierdoctor.com/2010/03/02/adventures-in-social-filtering/">social filtering</a> </em>or <em>link sharing</em>. The data for this report were collected between December 28, 2009 and January 19, 2010. N=2259 English-speaking adults 18 or older  (1675 internet users). The bullet points below are either direct quotes or paraphrases from the report.</p>
<p>Regarding the prevalence of online news consumption, social networking site use, and status update use:</p>
<ul>
<li>61% of Americans get news online in a typical day&#8230; just 2% rely exclusively on the internet for their daily news.</li>
<li>43% of Americans use social networking sites such as Facebook, MySpace or LinkedIn – and 97% of them are online news consumers.</li>
<li>14% of Americans use Twitter or other status update functions. Virtually all (99%) are online news consumers.</li>
</ul>
<p>Regarding where internet users (75% of survey respondents) get their &#8220;socially filtered&#8221; news online:</p>
<ul>
<li>71% get news forwarded to them through email or posts on social networking sites</li>
<li>28% get news from people they follow on social networking sites like Facebook, on a typical day</li>
<li>6% of all internet users get news via Twitter feeds</li>
</ul>
<p>Regarding where internet users share news online:</p>
<ul>
<li>48% pass along email links to news stories or videos</li>
<li>17% have posted links and thoughts about news on a social networking site like Facebook. That translates into 30% of social network site users.</li>
<li>3% have used Twitter to post or re-Tweet a link to a news story or blog. That amounts to 18% of Twitter users.</li>
</ul>
<p>And finally, a breakdown of where &#8220;online news users&#8221; (71% of survey respondents) get their news:</p>
<ul>
<li>75% get news forwarded to them through email</li>
<li>30% get news from someone they follow on a social networking site like Facebook</li>
<li>6% get news from someone they follow on Twitter</li>
<li>7% get news from a news website such as Digg where users rank stories</li>
<li>11% get news from the website of an individual blogger who does not work for a major news organization</li>
</ul>
<p>One difficulty I had when reading this report was keeping the segmentation of the sample straight with respect to the percentages reported. For example, in the bullet points above, I mentioned &#8220;Americans&#8221;, &#8220;internet users&#8221;, and &#8220;online news users&#8221;, and sometimes it wasn&#8217;t clear to me which sub-sample a particular percentage referred to. So if you see something that seems like it doesn&#8217;t quite add up, that&#8217;s probably why.</p>
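<p>A quick way to keep the bases straight is to back out the implied size of each sub-sample from the paired percentages. The Python sketch below uses only figures quoted in this post; the function and variable names are my own, and because the inputs are rounded percentages the results are approximate.</p>

```python
def implied_base_share(pct_of_parent, pct_of_subgroup):
    """If the same people make up pct_of_parent of a parent group and
    pct_of_subgroup of a subgroup, the subgroup is this share of the parent."""
    return pct_of_parent / pct_of_subgroup

# Raw counts quoted from the report.
total_respondents = 2259
internet_users = 1675
internet_share = internet_users / total_respondents

# 17% of internet users = 30% of social networking site users.
sns_share_of_internet_users = implied_base_share(0.17, 0.30)

# 3% of internet users = 18% of Twitter users.
twitter_share_of_internet_users = implied_base_share(0.03, 0.18)

print(f"internet users among respondents: {internet_share:.1%}")
print(f"SNS users among internet users: {sns_share_of_internet_users:.1%}")
print(f"Twitter users among internet users: {twitter_share_of_internet_users:.1%}")
```

<p>The raw counts give a slightly lower share of internet users than the 75% quoted above. My guess, and it is only a guess, is that the published percentages are weighted.</p>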
<p>Another confusing thing is that Twitter is in its own category for most of the results in the report, separate from social networking sites. This is interesting to me, since Facebook is lumped together with other social networking sites (like MySpace and LinkedIn). But there was at least one question about &#8220;status update functions&#8221;. I haven&#8217;t looked up <a href="http://www.pewinternet.org/~/media//Files/Questionnaire/2010/Online_News_Topline.pdf">how people were asked about this</a>, so I don&#8217;t know whether it lumps Facebook and Twitter together.</p>
<p>What stood out most to me in the report was the contrast between my perception of the Twitter hype in the news media (i.e. OMG! Twitter! Everybody&#8217;s doing it!), vs. what percentage of Americans are actually using Twitter. I also found it interesting that there&#8217;s such a large difference between the proportion of internet users who get news via Facebook vs. Twitter (28% vs. 6% of internet users). The report includes some information about what these so-called &#8220;news participators&#8221; (37% of internet users) are like:</p>
<blockquote>
<div>News participators are information omnivores and technophiles. They stand out from the pack in the same way as those who have set up their cell phones to be &#8220;on alert.&#8221; In fact, among news participators, 19% have news alerts sent to their cell phones. News participators are fond of social media: 76% of news participators use social networking sites; 34% of news participators use Twitter, and 26% of news participators are bloggers. The average participator uses 4-6 media platforms on a typical day; seeks out nine or more news topics online; and surfs 3-5 different kinds of news websites on a typical day.
</div>
<div>The typical online news participator is white, 36 years-old, politically moderate and Independent, employed full-time with a college degree and an annual income of $50,000 or more. Interestingly, while white adults make up the bulk of the online news participator population, black internet users are significantly more likely to be news participators than their white and Hispanic counterparts. Almost half of black internet users (47%) are news participators, compared with just 36% of white internet users and 33% of Hispanic internet users. Not surprisingly, the youngest internet users (18-29 year-olds) are more likely than their older counterparts to be online news participators, with just under half of that age group (46%) contributing to the creation, commentary, or dissemination of news online. Men and women are equally likely to participate in online news production.</div>
</blockquote>
Copyright &copy; 2010 <strong><a href="http://bierdoctor.com/">Emilee Rader</a></strong>]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/03/24/participatory-news-descriptives/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>large datasets and threats to validity</title>
		<link>http://bierdoctor.com/2010/02/23/large-datasets-and-threats-to-validity/</link>
		<comments>http://bierdoctor.com/2010/02/23/large-datasets-and-threats-to-validity/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 16:05:29 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[in the news]]></category>
		<category><![CDATA[research design]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=464</guid>
		<description><![CDATA[I just read &#8220;Limits of Predictability in Human Mobility&#8221; by Chaoming Song, et al. (Science, Vol. 327, 2010). The paper reports an analysis of a really amazing dataset: three months of cell phone records for ~10 million customers of a large European carrier (&#8220;anonymized by the data source&#8221;). These records include information about the cell [...]]]></description>
			<content:encoded><![CDATA[<p>I just read &#8220;<a href="http://www.sciencemag.org/cgi/content/abstract/327/5968/1018">Limits of Predictability in Human Mobility</a>&#8221; by Chaoming Song, et al. (Science, Vol. 327, 2010). The paper reports an analysis of a really amazing dataset: three months of cell phone records for ~10 million customers of a large European carrier (&#8220;anonymized by the data source&#8221;). These records include information about the cell phone towers people connected to when they placed calls, what time calls were placed, and how long the calls lasted.</p>
<p>The research question/motivation for the analysis stated in the paper is, &#8220;What is the role of randomness in human behavior and to what degree are individual human actions predictable?&#8221; The main finding is that people&#8217;s daily mobility patterns are very predictable, even when they travel quite far on a daily basis.</p>
<p>I suppose researchers have not been able to model human mobility quite like this before&#8212;that&#8217;s a LOT of data, even the sample of 50,000 they used for the analysis reported in this paper. But I&#8217;m not sure I understand the claim that this finding is surprising: &#8220;Yet it is not the 93% predictability that we find the most surprising. Rather, it is the lack of variability in predictability across the population.&#8221; How surprising is it that humans are creatures of routine and habit? What was more surprising to me about this paper was the idea that &#8220;current models of human activity are fundamentally stochastic&#8221;, i.e. the previous status quo was the assumption that there is an important random component to human activity.</p>
<p>The paper seems to be saying &#8220;look, the assumptions made by people who study this kind of thing are WRONG, and we can prove it&#8221;. I typically like those kinds of papers. But my feeling about this one is that, while it is useful to know that people&#8217;s mobility patterns aren&#8217;t random if you study these kinds of things, it also seems like a giant WELL, DUH. I suspect this paper is receiving attention because of the sexy dataset, and not because it presents counterintuitive results. (I first heard about the paper on Twitter.)</p>
<p>For me, the most interesting aspect of the paper is how it glosses over what could very well be a huge sampling bias. This is not unique to this particular paper&#8212;many analyses of large &#8220;social computing&#8221; datasets have similar sampling biases due to technical constraints or dataset limitations. When one analyzes an existing dataset, one must make do with the information one is given. (A related example is using <a href="http://www.ip2location.com/">IP geolocation</a> to identify users&#8217; locations when the only information you have about them is their IP address&#8212;there&#8217;s systematic bias in that data for sure, but in many cases there&#8217;s just no better way to do it.)</p>
<p>So for example, the cell phone mobility dataset contains location information only for instances when phone calls were placed (or received?)&#8212;i.e., the phone had to be in communication with a tower for the tower location to be recorded. This means the dataset contains locations for people *only when they&#8217;re using their phones*. If a person doesn&#8217;t make any calls, their location is not captured in the dataset. Are there systematic differences in mobility behavior between people who make calls and those who don&#8217;t? I&#8217;m someone who makes maybe one phone call a day, although I send a lot of text messages and use the packet data service quite a lot. Could it be possible that people who make a lot of calls have more predictable mobility patterns? While I am *sure* this possibility has occurred to the authors, the paper doesn&#8217;t address that question. I also wonder what systematic differences exist between people who have cell phones and those who don&#8217;t. But the paper doesn&#8217;t include a discussion of sampling bias, or any other potential threats to validity.</p>
<p>This is a big problem, I think, in the reporting of results from super large datasets. The datasets are SO big that they are treated more like population data than sample data, and the results are taken for &#8220;truth&#8221; without being subjected to appropriate scrutiny. Take this blurb written about the article, from an NPR story:</p>
<blockquote><p>A new study used cell phone billing data for 50,000 people in a European country to show that people&#8217;s travel patterns are extremely predictable. That&#8217;s true for both homebodies and jet setters. Regardless of age, language group, etc, people&#8217;s movements were predictable 93 percent of the time. The study shows the emerging power of using cell phone data for social science research. (from <a href="http://www.npr.org/templates/story/story.php?storyId=123879603">http://www.npr.org/templates/story/story.php?storyId=123879603</a>)</p></blockquote>
<p>I think it is extremely important when reporting analyses of large datasets to be exceedingly clear about issues like sampling bias and generalizability, and I&#8217;d like to see a requirement that papers address these issues. For example, this particular paper might have reported statistics on what proportion of the dataset had to be excluded due to lack of location data. Or, the authors could have undertaken a secondary data collection to try to find out whether those excluded people differed from the analyzed sample in some systematic way.</p>
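<p>As a rough illustration of the first suggestion, here is a minimal Python sketch of the kind of exclusion statistic I have in mind. Everything in it is hypothetical: the per-user call counts are made up, and <code>min_calls</code> is an assumed inclusion cutoff, not something taken from the paper.</p>

```python
from statistics import median


def exclusion_report(calls_per_user, min_calls):
    """Summarize how many users fall below a (hypothetical) inclusion cutoff.

    calls_per_user maps a user id to the number of calls logged for that user.
    """
    total = len(calls_per_user)
    if total == 0:
        raise ValueError("calls_per_user is empty")
    kept = [n for n in calls_per_user.values() if n >= min_calls]
    dropped = [n for n in calls_per_user.values() if not n >= min_calls]
    return {
        "total": total,
        "excluded": len(dropped),
        "excluded_share": len(dropped) / total,
        # Median activity of the analyzed users, to compare against the excluded ones.
        "median_calls_kept": median(kept) if kept else None,
        "median_calls_excluded": median(dropped) if dropped else None,
    }


# Made-up counts for four users; two of them fall below the cutoff of 10 calls.
toy_counts = {"u1": 0, "u2": 1, "u3": 40, "u4": 12}
report = exclusion_report(toy_counts, min_calls=10)
print(report["excluded_share"])
```

<p>Even a table this simple would let a reader judge how much of the population the analyzed sample leaves out, and how different the excluded users look on the one variable the dataset does record for them.</p>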
<p>I&#8217;m not saying I think the findings of this particular paper are invalid&#8212;the results make perfect sense, and perhaps that&#8217;s why the paper doesn&#8217;t even mention threats to validity. But then, how is the finding counterintuitive if it makes so much sense we don&#8217;t even question it a little bit?</p>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/02/23/large-datasets-and-threats-to-validity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>google buzz</title>
		<link>http://bierdoctor.com/2010/02/14/google-buzz/</link>
		<comments>http://bierdoctor.com/2010/02/14/google-buzz/#comments</comments>
		<pubDate>Sun, 14 Feb 2010 21:08:14 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[advice]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[in the news]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[tangential]]></category>

		<guid isPermaLink="false">http://bierdoctor.com/?p=448</guid>
		<description><![CDATA[I am dismayed by the way Google has rolled out Buzz, and I am not alone. Many bloggers and news organizations have raised issues with Google&#8217;s misguided assumption that email contacts form the same kind of social network as users of Facebook and Twitter (etc.) have built up over time. For example, a NY Times [...]]]></description>
			<content:encoded><![CDATA[<p>I am dismayed by the way Google has rolled out Buzz, and I am not alone. Many bloggers and news organizations have raised issues with Google&#8217;s misguided assumption that email contacts form the same kind of social network as users of Facebook and Twitter (etc.) have built up over time. For example, a NY Times article, <a href="http://www.nytimes.com/2010/02/13/technology/internet/13google.html">Critics Say Google Invades Privacy With New Service,</a> makes the following point:</p>
<blockquote>
<div>
<p>“People thought what they had was an address book for an e-mail program, and Google decided to turn that into a friends list for a new social network,” said Marc Rotenberg, executive director of the Electronic Privacy Information Center, an advocacy group in Washington. “E-mail is one of the few things that people understand to be private.”</p>
<p>Mr. Rotenberg said that his organization planned to file a complaint with the Federal Trade Commission claiming that Google’s use of e-mail conversations to build a social network was unfair and deceptive.</p>
</div>
</blockquote>
<div>
<p>I use Gmail and many other Google products. In fact, several times a week I get unsolicited email from strangers that is NOT spam &#8212; it is more like &#8220;wrong number&#8221; email. I suppose Google Buzz would include those people in my social network, eh?</p>
<p>Whenever I thought about all the data about me that was in Google&#8217;s possession, I always felt a twinge of discomfort. But I believed them when they said protecting my privacy was of the utmost importance. In fact, Google lists five privacy principles on its <a href="http://www.google.com/privacy.html">Privacy Center</a> webpage that sound pretty good:</p>
</div>
<blockquote>
<div>1. Use information to provide our users with valuable products and services.</div>
<div>2. Develop products that reflect strong privacy standards and practices.</div>
<div>3. Make the collection of personal information transparent.</div>
<div>4. Give users meaningful choices to protect their privacy.</div>
<div>5. Be a responsible steward of the information we hold.</div>
</blockquote>
<div>Unfortunately, it seems to me that Google has violated pretty much all of their privacy principles with the rollout of Buzz. I rationalized my discomfort with allowing Google access to pretty much every type of private, personal data I can think of by telling myself that they could be trusted with this responsibility.
</div>
<div>However, their choice to jumpstart critical mass for Buzz seems to have been motivated by a desire to compete with Twitter and Facebook, NOT to provide a valuable service while protecting privacy. Disappointing, to say the least. I no longer feel like I can trust Google with my data. I wonder how many other people feel this way too, and how much time it would take to extract myself from all the Google services I use&#8230;
</div>
<div>If you want to stop using Buzz, Gmail Help has <a href="http://mail.google.com/support/bin/answer.py?hl=en&amp;answer=171460">some instructions</a>, which have changed at least once in the past 24 hours as Google responds to the public outcry (<a href="http://bierdoctor.com/website/wp-content/uploads/2010/02/disabling_buzz-feb-12.png">Feb 12 2010</a> | <a href="http://bierdoctor.com/website/wp-content/uploads/2010/02/disabling-buzz_feb-13.png">Feb 13 2010</a>). Simply hiding the Buzz link in Gmail is NOT enough &#8212; the key is modifying one&#8217;s Google Profile in 4 steps, or deleting the profile altogether. And for those of you who have a public Google *Groups* profile, this seems to be a *separate* Google profile from the Google capital-P Profile. Confusing? You betcha.</div>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2010/02/14/google-buzz/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>productivity</title>
		<link>http://bierdoctor.com/2009/06/02/productivity/</link>
		<comments>http://bierdoctor.com/2009/06/02/productivity/#comments</comments>
		<pubDate>Tue, 02 Jun 2009 18:27:48 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[administrivia]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[reflection]]></category>
		<category><![CDATA[software tools]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://madmission.bierdoctor.com/2009/06/02/productivity/</guid>
		<description><![CDATA[well, the paper is submitted. but man, i NEVER want to do that again. and by &#8220;that&#8221; i mean write a single-author paper in about a week. i&#8217;d been working on analysis (along with all my other dissertation- and work-related stuff) for months, but when we returned from the holiday weekend &#8212; where i tried [...]]]></description>
			<content:encoded><![CDATA[<p>well, the paper is submitted. but man, i NEVER want to do that again. and by &#8220;that&#8221; i mean write a single-author paper in about a week. i&#8217;d been working on analysis (along with all my other dissertation- and work-related stuff) for months, but when we returned from the holiday weekend &#8212; where i tried and failed to write &#8212; all i had done was a bunch of statistics, graphs, and notes.</p>
<p>i&#8217;ve been using this service called <a href="http://www.rescuetime.com/">RescueTime</a> for the past several weeks as a way to track my hours for different projects i am working on, and as an indicator of my productivity in general. basically, you install a little app on your computer, and it sends data about what applications are active to the RescueTime server. you can log in and see reports of how much time you are spending looking at which apps and web pages (for $8/mo. you can get reports broken down by window title, not just application).</p>
<p>i have been happy to learn that i don&#8217;t &#8220;waste&#8221; as much time as i might have thought. but this past week isn&#8217;t a very accurate indication of my normal work habits. i went from notes and graphs to a 10-page ACM-format paper in a week:</p>
<p><a href="http://bierdoctor.com/images/gif/rescuetime.gif" target="_blank"><img src="http://bierdoctor.com/images/png/0526.png" border="0" height="481" width="391" /></a><br />
(<a href="http://bierdoctor.com/images/gif/rescuetime.gif" target="_blank">click for animated gif</a> showing May 26 &#8211; June 1)</p>
<p>it&#8217;s nice to see that i&#8217;ve still &#8220;got it&#8221;, i guess. but that was not a fun week.</p>
<p>i highly recommend RescueTime, if like me you want to be more meta about how you spend your time, and like looking at data.</p>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2009/06/02/productivity/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>hierarchies and semantics</title>
		<link>http://bierdoctor.com/2009/05/26/hierarchies-and-semantics/</link>
		<comments>http://bierdoctor.com/2009/05/26/hierarchies-and-semantics/#comments</comments>
		<pubDate>Tue, 26 May 2009 20:13:23 +0000</pubDate>
		<dc:creator>emilee</dc:creator>
				<category><![CDATA[analysis]]></category>
		<category><![CDATA[audience design]]></category>
		<category><![CDATA[common ground]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[dissertation]]></category>
		<category><![CDATA[measures]]></category>

		<guid isPermaLink="false">http://madmission.bierdoctor.com/2009/05/26/hierarchies-and-semantics/</guid>
		<description><![CDATA[In my dissertation experiment, I asked ~60 people from two different graduate schools (or &#8220;communities&#8221;) on campus to label and organize a set of short documents into a hierarchy (tree structure). They used a web-based interface created specifically for the experiment, that closely resembled the file-and-folder metaphor everybody is used to in Microsoft Windows and [...]]]></description>
			<content:encoded><![CDATA[<p>In my dissertation experiment, I asked ~60 people from two different graduate schools (or &#8220;communities&#8221;) on campus to label and organize a set of short documents into a hierarchy (tree structure). They used a web-based interface, created specifically for the experiment, that closely resembled the file-and-folder metaphor everybody is used to in Microsoft Windows and MacOS.</p>
<p>Each person was instructed to organize the documents with a different &#8220;target audience&#8221; in mind: for themselves, for somebody in the same graduate program, and for somebody in the other graduate program. Twenty people were randomly assigned to each &#8220;target audience&#8221; group, ~10 from each community, and each person organized the documents once, for a single target audience. This resulted in the creation of 6 different &#8220;types&#8221; of file-and-folder hierarchies, by PRODUCER and AUDIENCE; the N in the chart below represents both the number of participants and the number of hierarchies created, by type:</p>
<p><img src="http://bierdoctor.com/images/png/organizing-task-N.png" border="0" height="150" width="500" /></p>
<p>I have been exploring different ways to analyze the hierarchies participants created, and I am starting to think there are three types of measures:</p>
<ol>
<li><strong>vocabulary</strong> &#8212; word-level measures, like label agreement, average word rank, number of unique words, length of labels</li>
<li><strong>&#8220;topology&#8221;</strong> &#8212; structural measures, like number and size of folders, average path length, etc.</li>
<li><strong>semantics</strong> &#8212; this one is a little harder to measure than the others. I wanted to know whether the conceptual groupings of files might look different based on the community of the hierarchy creator, and the target audience</li>
</ol>
<p>I used multidimensional scaling (MDS), <a href="http://madmission.bierdoctor.com/2009/05/06/more-fun-with-mds-and-r/">which I wrote about a few days ago</a>, and it seemed to show that there were indeed meaningful patterns in the way documents were grouped together. But I lost too much information with this technique &#8212; the MDS showed three distinct conceptual groups, but it was hard to determine whether structure existed within those groups.</p>
<p>Based on previous categorization research (<a href="http://dx.doi.org/10.1016/0010-0285(76)90013-X">Rosch et al. 1976</a>), I expected that students from CS would create more nuanced conceptual structures for the CS-related documents, and MSI students would do the same for the information-science-related documents. But MDS was not the right technique to use for this &#8212; so I used hierarchical cluster analysis instead.</p>
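<p>For readers who haven&#8217;t used hierarchical clustering, here is a minimal pure-Python sketch of the general idea. It is not my actual analysis code: the distance measure (the share of participants who put two documents in different folders), the average-linkage rule, and the four toy documents are all assumptions chosen to keep the example small.</p>

```python
from itertools import combinations


def cooccurrence_distance(hierarchies):
    """Distance between two documents = share of participants who filed them apart.

    hierarchies is a list with one dict per participant, mapping each document
    name to the folder that participant put it in. Every participant is assumed
    to have organized the same set of documents.
    """
    docs = sorted(hierarchies[0])
    dist = {}
    for a, b in combinations(docs, 2):
        apart = sum(1 for h in hierarchies if h[a] != h[b])
        dist[frozenset((a, b))] = apart / len(hierarchies)
    return dist


def average_linkage(dist, docs):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [frozenset([d]) for d in docs]
    merges = []

    def gap(pair):
        # Average pairwise distance between the members of two clusters.
        left, right = pair
        total = sum(dist[frozenset((a, b))] for a in left for b in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=gap)
        merges.append((set(left), set(right), gap((left, right))))
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return merges


# Three made-up participants; folder labels only matter within one participant.
toy = [
    {"sorting": "cs", "compilers": "cs", "cataloging": "info", "passwords": "security"},
    {"sorting": "algorithms", "compilers": "algorithms", "cataloging": "libraries", "passwords": "systems"},
    {"sorting": "tech", "compilers": "tech", "cataloging": "info", "passwords": "tech"},
]
for left, right, d in average_linkage(cooccurrence_distance(toy), sorted(toy[0])):
    print(sorted(left), sorted(right), round(d, 2))
```

<p>The list of merges is exactly what a dendrogram draws: documents that are joined early, at a small distance, are the ones most participants filed together.</p>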
<p>Below are two <a href="http://en.wikipedia.org/wiki/Dendrogram">dendrograms</a>, one that represents the clustering based on data from all of the CS students, and the other that represents data from all of the MSI students. The same three groups from the MDS are also represented here: CS, Information Science, and Security.</p>
<p>In the aggregate MSI student dendrogram, the Information Science cluster is broken into two parts:</p>
<p><img src="http://bierdoctor.com/images/png/msi.all.labeled.png" border="0" height="500" width="500" /></p>
<p>In the aggregate CS student dendrogram, the same documents that make up Info Sci 1 and 2 above are merged into one cluster, while the same documents that make up the CS cluster above are broken into two groups:</p>
<p><img src="http://bierdoctor.com/images/png/cse.all.labeled.png" border="0" height="500" width="500" /></p>
<p>My next analysis steps will be to figure out how to use this information to systematically examine all of the hierarchies for evidence of these clusters. Ideally, I would like some kind of quantitative measure that indicates to what extent individual participants created structures with these same kinds of patterns &#8212; but I&#8217;m not sure how to do that yet. My ultimate goal is to be able to compare hierarchies along all three dimensions mentioned in this post: vocabulary, topology, and semantics, and find out whether differences exist according to common ground and audience design factors.</p>
]]></content:encoded>
			<wfw:commentRss>http://bierdoctor.com/2009/05/26/hierarchies-and-semantics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
