in my dissertation experiment, participants recruited from two different intellectual communities on campus organized a set of documents into file-and-folder hierarchies (single-categorization tree structures), using an online system created specifically for the experiment. as part of my analysis, i’m interested in comparing the hierarchies they created, to find out whether there are any reliable differences between the “community membership” groups.
i needed to select a couple of dimensions on which to compare the hierarchies. so far, i’ve picked three:
- file path and label (do people pick the same words for the same files)
- file grouping (which files are grouped together in the same folder)
- breadth vs. depth (how structurally “complex” are the hierarchies)
i calculated the dissimilarity between all possible pairs of participants on the first two of the measures above (i have not yet come up with a measure i am happy with for the third). the measures represent the percent of the time two users did not choose the exact same words (#1), or did not group the same two files together in a folder (#2).
i used multidimensional scaling (MDS) to visualize these dissimilarity values. MDS does not assign anyone to a cluster; it produces a visual representation that plots each participant in relation to every other participant, according to the similarity/dissimilarity information, and any clusters are something the reader picks out by eye. a really nice, informative web page about MDS can be found here.
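to make the two measures concrete, here is a minimal sketch in Python. it is only one plausible reading of the descriptions above, not the exact calculation from the experiment: the data layout (a dict mapping each file to its folder path), the function names, and the sample participants are all hypothetical, and the labeling measure is simplified to "is the whole folder path identical".

```python
from itertools import combinations

def grouping_dissimilarity(folders_a, folders_b):
    """Fraction of file pairs on which two participants disagree about
    whether the two files sit in the same folder (assumed reading of #2)."""
    files = sorted(set(folders_a) & set(folders_b))
    pairs = list(combinations(files, 2))
    disagreements = sum(
        (folders_a[f] == folders_a[g]) != (folders_b[f] == folders_b[g])
        for f, g in pairs
    )
    return disagreements / len(pairs)

def label_dissimilarity(folders_a, folders_b):
    """Fraction of files whose folder path differs between two participants
    (a simplified, assumed reading of #1)."""
    files = sorted(set(folders_a) & set(folders_b))
    return sum(folders_a[f] != folders_b[f] for f in files) / len(files)

# hypothetical participants: file name -> folder path
alice = {"doc1": "research/methods", "doc2": "research/methods",
         "doc3": "teaching", "doc4": "teaching"}
bob = {"doc1": "studies", "doc2": "studies",
       "doc3": "studies", "doc4": "courses"}

print(grouping_dissimilarity(alice, bob))  # 0.5: they disagree on 3 of 6 file pairs
print(label_dissimilarity(alice, bob))     # 1.0: no folder path matches
```

running either function over every pair of participants fills in the square dissimilarity matrix that MDS takes as input.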
MDS takes a set of proximities that can reflect any number of underlying dimensions (i.e., many factors might have contributed to the particular pattern of proximities observed), and iteratively adjusts the position of each point in a low-dimensional space until the distances between points match the original proximities as closely as possible. in practice this means smushing everything into 2 dimensions so that we humans can make sense of the resulting graph (3d plots are just hard to parse). each MDS solution has an associated stress value, indicating how much distortion occurred as part of this “smushing” process. this is kind of similar to the distortion in the size of various continents that shows up in maps when moving from a 3d representation (a globe) to a 2d representation like the Mercator projection. the lower the stress value in MDS, the less distortion.
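for a feel of what a stress value measures, here is a small sketch that scores a given set of plotted coordinates against the dissimilarities they are supposed to represent. it uses one common definition, Kruskal's stress-1, in its metric form (plotted distances compared directly with the raw dissimilarities); the software used for the plots in this post may compute a different variant, and the function name and toy data are made up for illustration.

```python
import math
from itertools import combinations

def stress1(dissim, coords):
    """Kruskal's stress-1: sqrt(sum of squared (distance - dissimilarity)
    over all pairs, divided by the sum of squared plotted distances)."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])   # distance in the plot
        num += (d - dissim[i][j]) ** 2
        den += d ** 2
    return math.sqrt(num / den)

# toy dissimilarities: the four corners of a unit square
r2 = math.sqrt(2)
square = [
    [0, 1, r2, 1],
    [1, 0, 1, r2],
    [r2, 1, 0, 1],
    [1, r2, 1, 0],
]

faithful = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]  # fits in 2 dimensions
flattened = [(0.0,), (1.0,), (2.0,), (3.0,)]                 # smushed onto a line

print(stress1(square, faithful))   # essentially zero: nothing was distorted
print(stress1(square, flattened))  # clearly above zero: the line cannot hold a square
```

the square fits perfectly in 2 dimensions, so its stress is essentially zero; forcing the same four points onto a line distorts some of the distances, and the stress goes up accordingly.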
due to all this dimensional reduction and smushing, the specific coordinate system in an MDS plot usually has no relationship to the real world (unless you started with data that could already be represented accurately in 2 dimensions); when interpreting an MDS plot, it is the relative distances between the points, not their absolute positions in the coordinate system, that are important. MDS is a complicated mathematical analysis technique, but interpreting the results is usually a qualitative activity; one looks at the plots and tries to identify clusters or patterns that are meaningful in the context of the data and research questions.
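a quick way to convince yourself that the coordinates themselves carry no meaning: rotate and shift an entire plot, and every distance between points stays exactly the same. the sketch below checks this on a handful of made-up points (the coordinates, angle, and offsets are arbitrary, not taken from the plots in this post).

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Distances between every pair of 2d points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(coords, 2)]

def rotate_and_shift(coords, angle, dx, dy):
    """Rotate all points about the origin by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in coords]

# arbitrary made-up plot coordinates
points = [(0.0, 0.0), (2.0, 0.5), (-1.0, 1.5), (0.5, -2.0)]
moved = rotate_and_shift(points, math.radians(70), 10.0, -3.0)

same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(points), pairwise_distances(moved))
)
print(same)  # True: the coordinates changed, the relative distances did not
```

this is why two runs of MDS on the same data can look rotated or flipped relative to each other and still be the same solution.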
ok, so, below is an MDS plot for the file grouping measure (there are no axis labels because coordinates are not meaningful) — these are SVG images; apologies if your browser cannot display them:
to me, it seems like there aren’t any clear patterns in this graph. i interpret this to mean there aren’t any consistent similarities or differences that are unique to one community or the other, in the files participants chose to group together into the same folder. the stress on this one is a little high, but not horrible.
now take a look at the MDS plot for the labeling measure:
this one looks more like it has a pattern; there’s some overlap but i can definitely see more blue circles on top, and more red triangles toward the bottom. and, the stress on this one is much lower.
so from these graphs, it seems like participants from the same community are more similar to each other in the vocabulary they use than participants from different communities are. however, when deciding which files “go together” in the same folder, there doesn’t seem to be a clear pattern based on community membership. i also did some statistical tests, which confirm this qualitative assessment.