I find myself once again this week reading stats papers that range from “slightly over my head” to “I have no idea what you people are talking about,” in an attempt to figure out the right thing to do with a dataset involving observations that are not independent.
The dataset consists of conversations between dyads that took place while they completed two different interactive tasks. The conversations were recorded, transcribed, and segmented into utterances according to some criteria. This means that there are repeated utterances from each participant, and from each dyad. Different research areas use different terms to refer to this kind of setup: repeated measures, panel data, clustered data, etc. The analysis is further complicated by the fact that the predictors and response variables are all categorical. Some are binary, marking the presence or absence of something. The more interesting variables have more than two categories (in some cases, MANY more).
I am trying to estimate the strength with which each of a set of 15+ utterance goals is associated with one of three roles participants assumed as part of the study. To do this, I need to specify a mixed-effects multinomial logit model, with a set of fixed-effects categorical predictors and a hierarchical random effects control for participant within dyad. This involves choosing a reference category of the response variable, and then running a series of binomial logit models that compare all the other levels of the response variable in turn with the reference category.
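For concreteness, a model of this kind can be written in baseline-category form. The notation is an assumption made for illustration and does not come from any particular paper: $Y_{ijd}$ is the response for utterance $i$ from participant $j$ in dyad $d$, $K$ is the reference category, $x_{ijd}$ holds the fixed-effects predictors, and $u$ and $v$ are random intercepts for dyad and for participant within dyad, assumed normally distributed.

```latex
\log \frac{P(Y_{ijd} = k)}{P(Y_{ijd} = K)}
  = x_{ijd}^{\top} \beta_k + u_{dk} + v_{jdk},
  \qquad k = 1, \dots, K - 1,
\qquad
u_{dk} \sim N(0, \sigma^2_{u,k}), \quad
v_{jdk} \sim N(0, \sigma^2_{v,k}).
```

Each value of $k$ gives one of the binomial comparisons against the reference category described above.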
Here is where I am running into a situation, again, where I am pushing up against what mainstream statistical software packages are reliably capable of, and even R does not seem to be able to do what I want without more programming than my meager statistical background has prepared me for. The problem, as I understand it, is that each of the binomial logit models that make up the multinomial results uses a different subset of the data: it excludes every observation whose response falls in a category other than the two being compared. This means that the random effects are estimated differently for each binomial logit model, depending on which observations (and therefore which participants and dyads) end up in the subset. The upshot is that the overall multinomial model estimates come out differently, and substantially so, depending on which category is chosen as the reference category.
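The subsetting issue is easy to see on a toy example. What follows is a minimal sketch in Python, not the actual analysis: the rows, the category labels `A`/`B`/`C`, and the helper names `binary_subsets` and `summarize` are all made up for illustration. It fits no model; it only shows which observations and participants each binary comparison would get to see under two different choices of reference category.

```python
# Hypothetical toy data: (dyad, participant, response_category).
ROWS = [
    ("d1", "p1", "A"), ("d1", "p1", "B"), ("d1", "p2", "A"),
    ("d2", "p3", "B"), ("d2", "p3", "C"), ("d2", "p4", "C"),
    ("d3", "p5", "A"), ("d3", "p6", "C"), ("d3", "p6", "C"),
]

def binary_subsets(rows, reference):
    """For each non-reference category, keep only the rows whose response
    is either that category or the reference category."""
    categories = sorted({r[2] for r in rows} - {reference})
    return {
        cat: [r for r in rows if r[2] in (cat, reference)]
        for cat in categories
    }

def summarize(rows, reference):
    """Report subset size and participants present for each comparison."""
    out = {}
    for cat, subset in binary_subsets(rows, reference).items():
        out[(cat, reference)] = {
            "n": len(subset),
            "participants": sorted({r[1] for r in subset}),
        }
    return out

if __name__ == "__main__":
    # Compare the subsets produced by two different reference categories.
    for ref in ("A", "C"):
        for pair, info in summarize(ROWS, ref).items():
            print(pair, info)
```

With `A` as the reference, the `B` comparison never sees participants `p4` or `p6`; with `C` as the reference, the `B` comparison never sees `p2` or `p5`. Any random effect for participant or dyad is therefore estimated from a different set of people in each binary model, which is the mechanism behind the reference-category sensitivity described above.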
So that’s the problem. However, I did not write this to whine about how I am stuck. I’ve been trying to figure out a solution that I can live with… do I bail completely? Hire a real statistician? How can I figure out how biased the results would be if I were to do a purely fixed-effects model? (Without random effects controls, any results produced might in fact be due to some unique aspect of the conversation within a particular dyad in a particular role, rather than indicative of something that shows up across all of the dyads.)
Researchers in many fields work with categorical data, and at least some of them over the years must have encountered this problem, whether they knew it or not, and faced the same tradeoffs. To get the paper out the door, they had to pick a compromise and go with it. But any results reached through a compromise are biased in some way. Models like this are only now becoming possible for people like me, with just enough stats knowledge to be dangerous, to run using fairly standard statistical software packages. But what about all the research that has come before: how accurate are those models, and the results they produced? How much do people allow what is statistically feasible to determine their research design, vs. compromising on the analysis after the fact? We all stand on the shoulders of giants, but how often were the giants using naive or incorrect statistics?