Noun-Phrase Analysis. Huh?

As I’m building the theoretical framework for my methodology, I’m going over some worthwhile texts on data mining in online communities. In other words, how do we, as social scientists, make use of all that ‘stuff’ that’s going on online? One elementary text is “A Noun Phrase Analysis Tool for Mining Online Community Conversations,” by Haythornthwaite and Gruzd (quite a mouthful, so H&G from here on).

The key phrase to take away from what follows is H&G’s pinpointing sentence:

“The underlying assumption in this kind of analysis is that language can reveal characteristics of community.” (2)

In a nutshell, they look at data collected over five years from undergrads, and unleash the Natural Language Toolkit (NLTK), a Python library, to see what it yields. To do so they use a “noun-phrase extraction method,” meaning a tool that counts a noun together with the words that qualify it rather than in isolation. Example: the word information can mean a variety of things depending on the words nearest to it. The noun phrase information science, they claim, is much more meaningful. The benefit of this noun-phrase contraption is two-fold: noun phrases are the more informative part of a sentence, and they can be extracted effectively. (3)
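
To make the idea concrete for myself, here is a minimal sketch in plain Python. It is not H&G’s actual pipeline (they use NLTK). The tiny TOY_TAGS lexicon, the function names, and the rule “a run of adjectives and nouns that ends in a noun” are all my own assumptions for illustration.

```python
# Toy part-of-speech lexicon. This is an assumption for illustration only;
# a real pipeline would rely on a proper tagger instead of a lookup table.
TOY_TAGS = {
    "information": "NOUN", "science": "NOUN", "library": "NOUN",
    "database": "NOUN", "students": "NOUN",
    "digital": "ADJ", "new": "ADJ",
}
PUNCT = ".,;:!?\"'()"


def _flush(current, phrases):
    """Close the phrase being built and store it if it ends in a noun."""
    # Drop trailing adjectives so every stored phrase ends in a noun.
    while current and current[-1][1] != "NOUN":
        current.pop()
    if current:
        phrases.append(" ".join(word for word, _ in current))
    current.clear()


def extract_noun_phrases(text):
    """Return runs of adjacent adjectives/nouns as candidate noun phrases."""
    phrases, current = [], []
    for raw in text.lower().split():
        word = raw.strip(PUNCT)
        tag = TOY_TAGS.get(word)
        if tag is None:
            # Any word outside the lexicon ends the current phrase.
            _flush(current, phrases)
            continue
        current.append((word, tag))
        if raw[-1] in PUNCT:
            # Trailing punctuation also ends the current phrase.
            _flush(current, phrases)
    _flush(current, phrases)
    return phrases


print(extract_noun_phrases("We study information science."))
print(extract_noun_phrases("The digital library database is new."))
```

On a sentence like “We study information science.” this returns the single phrase information science instead of separate counts for information and science, which is the whole point of the method.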

So far, so good. The next logical step after counting what shows up most is to categorize it. In their sample, H&G observed, for instance, ‘profession’ as the unifying theme for words like “library/libraries, information, book/s, librarian/s,” etc. The “Internet Community Text Analyzer (ICTA)” is a fancy name for a tag cloud tool that displays more frequent words in larger type.
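
The tag cloud idea itself fits in a few lines of Python. This is only a sketch: I don’t know how ICTA actually scales its fonts, so the linear scaling, the function name tag_cloud_sizes, and the 10 to 40 point range are assumptions of mine.

```python
from collections import Counter


def tag_cloud_sizes(terms, min_size=10, max_size=40):
    """Map each term to a font size that grows linearly with its count."""
    counts = Counter(terms)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = high - low
    sizes = {}
    for term, n in counts.items():
        # If every term is equally frequent, all get the smallest size.
        scale = (n - low) / span if span else 0.0
        sizes[term] = round(min_size + scale * (max_size - min_size))
    return sizes


# Made-up term list, not H&G's data.
terms = ["library"] * 5 + ["book"] * 3 + ["librarian"]
print(tag_cloud_sizes(terms))
```

The most frequent term gets the largest size and the rarest the smallest, with everything else spaced in between.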

What’s missing is a more solid connection between the frequency of a specific word (or theme) and the nature of a community. They suggest, for example, that the declining occurrence of the word ‘database’ indicates an increased familiarity with the concept. But how do I know that this is a trend rather than a chance fluctuation?
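
One rough way to start answering that question is to compare the share of messages containing the word in two periods with a two-proportion z-test. The sketch below is my own assumption about how one might check this, not anything H&G do, and the counts in the usage line are invented. It also treats messages as independent, which forum posts rarely are, and two periods do not make a trend, so I’d read the result as a sanity check at most.

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    p_a, p_b = hits_a / total_a, hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # The word occurs in none or in all messages: no difference to test.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: messages mentioning 'database' out of all messages
# in an early and a late period. These are NOT H&G's figures.
z, p = two_proportion_z_test(60, 1000, 30, 1000)
print(z, p)
```

A small p-value says the drop is unlikely to be pure chance under these assumptions. It says nothing about why the word became rarer.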

A more substantial observation is the pervasive presence of “don’t”: don’t know, don’t have, etc. It suggests that people use the message board to inform each other and actively ask questions. That is, of course, precisely what this board was intended for in the first place, so the relation between the words and the nature of the community seems pretty straightforward, albeit entirely unproven.

Lastly, the use of emoticons raises the question of whether it carries any weight. This study shows that about 5% of all messages carry one of four standard smileys or frownies. Elsewhere, researchers found a 10% presence in a similar environment (Sixl-Daniell & Williams, 2005). I’d like to see this related more closely to work by people like Douglas Thomas (“Hacker Culture”), who explains that how you write communicates as much as what you write.

The exploratory nature of this paper leaves a lot to be desired, but noun-phrase extraction offers two useful insights:

1. “one can quickly grasp important issues in a community by just simply skimming terms in its tag cloud.” (16)
2. Since the word “library/ies” showed up many times, H&G “display the most frequent noun phrases that include this word,” instead of having “to examine all 1,543 occurrences.” I think that this is where some sampling and stats would come in handy (see Krippendorff).

In my own research, I will therefore adopt a similar tag cloud tool (courtesy of BAS) and, if I find the time to master me some Python, some NLTK as well, and then combine it with a sampling method. Particularly when you’re looking for a binary positive or negative judgment of a concept, place, person or thing, I believe sampling will enable you to generalize beyond the restrictions of a “time consuming procedure.” (17) Discuss.
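
Here is roughly what I have in mind with sampling, as a hedged Python sketch. Instead of reading every occurrence, draw a simple random sample, hand-code each sampled message as positive or not, and report the share with a margin of error. The function name, the is_positive callback (a stand-in for a human coder), and the normal-approximation 95% interval are my assumptions, not a method from the paper.

```python
import math
import random


def sample_and_estimate(messages, is_positive, sample_size, seed=0):
    """Estimate the share of positive messages from a simple random sample.

    Returns (estimated proportion, 95% margin of error).
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    sample = rng.sample(messages, sample_size)
    positives = sum(1 for message in sample if is_positive(message))
    p_hat = positives / sample_size
    # Normal-approximation margin of error; crude for small samples or
    # for proportions very close to 0 or 1.
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin


# Made-up corpus in which half the messages are 'positive'.
corpus = ["good"] * 500 + ["bad"] * 500
print(sample_and_estimate(corpus, lambda m: m == "good", sample_size=100))
```

Coding 100 messages by hand instead of 1,000 is the time saved. The margin of error is the price you pay for it.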

Haythornthwaite, C., Gruzd, A. (2007) “A Noun Phrase Analysis Tool for Mining Online Community Conversations,” presented at the 3rd International Conference on Communities and Technologies, Michigan State University, East Lansing, Michigan, June 28 – 30, 2007.
