Tuesday, May 7, 2013

Data Mining for Emotions

So I took a hiatus after my last post because it was massive, and because I've acquired a new job doing social media marketing for a couple of local businesses. I'm sure some digital humanities content will stem from what I'm doing in this new work.

But you don't care about that right now, so let's get started.

In my last post, I took a look at Stanley Fish's half-hearted argument sort-of against the digital humanities. The first sentence in this paragraph is intentionally wishy-washy to reflect Fish's attitude toward the DH. His conclusion (after three gruelling posts) was that he basically has no need for it, nor it for him. I don't want to spend the majority of my time on this blog attacking Stanley Fish. I do, however, want to open up with a real example of something that he discusses in one of his posts.

In "Mind Your P's and B's: The Digital Humanities and Interpretation", Fish talks about data mining in the digital humanities. Data mining is essentially using an algorithm to scan digitized texts for patterns. These patterns can, some digital humanists argue, lead to some kind of meaning. It places the computer slightly above the interpretive capacities of the human mind, for a moment, in order to process vast quantities of information. The example given in the post is from a researcher who mined mid-nineteenth century literature and found an increase in the number of foreign countries, cities, etc. mentioned, and was thus led to hypothesize that literature during that time was more "diversely outward-looking" than had been noticed before. Fish argues that the data itself cannot lead us to that conclusion; to understand the contextual framework, we need legitimate analysis of the text itself -- to put it simply, we need to read it.
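To make "scanning for patterns" a little more concrete, here is a minimal sketch in Python of what counting place-name mentions by decade could look like. This is not Wilkens's actual method; the three-sentence corpus, the list of place names, and the function name are all made up for illustration, and a real study would run over thousands of digitized books with a much smarter way of recognizing places.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (publication year, text).
CORPUS = [
    (1850, "She sailed from Boston to London and then on to Paris."),
    (1851, "The farm lay quiet under the snow."),
    (1862, "Letters came from Cairo, from Paris, and from Boston."),
]

# Hypothetical list of place names to look for.
PLACES = {"boston", "london", "paris", "cairo"}


def place_rate_by_decade(corpus, places):
    """Return {decade: place-name mentions per 1,000 words}."""
    mentions = Counter()
    words = Counter()
    for year, text in corpus:
        decade = year // 10 * 10
        # Lowercase the text and split it into alphabetic tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        words[decade] += len(tokens)
        mentions[decade] += sum(1 for token in tokens if token in places)
    return {d: 1000 * mentions[d] / words[d] for d in sorted(words)}


if __name__ == "__main__":
    for decade, rate in place_rate_by_decade(CORPUS, PLACES).items():
        print(decade, round(rate, 1))
```

Notice that the output is only a rate per decade. Whether a rising rate means literature became more "outward-looking" is exactly the interpretive step Fish says the numbers cannot take for us.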

The researcher, Matthew Wilkens, argues no, we don't have to read every single text. That would take forever. Not to mention, we keep reading the same texts over and over again to prove slightly different points. He argues that by scanning a multitude of other texts, we may be able to find many more meanings and interpretations that would have taken years to find. Then, once the patterns are detected, more close reading can be done.

Something that seems to bother Fish and upset his form of literary criticism is the way that digital humanists go about their research. Whereas Fish follows the traditional path of reading, developing a hypothesis, and then using the text to defend his hypothesis, digital humanists -- particularly data miners -- fire algorithms at the text, seemingly at random. Then, once the dust has settled and the numbers come out, they look for trends and patterns and then formulate a hypothesis. This is why Fish finally comes to the conclusion that the digital humanities have no use for him and his superior literary theory.

Difference between z-scores of Joy and Sadness for years from 1900 to 2000 (raw data and smoothed trend). Values above zero indicate generally ‘happy’ periods, and values below zero indicate generally ‘sad’ periods.
A perfect example of data mining and the style of analysis Fish has a problem with comes from an NPR story published in early April entitled "Mining Books to Map Emotions Through a Century." Perfect title, huh? Several years ago, a team of researchers went mining through millions of texts digitized by Google. They started at the beginning of the 20th century and went until 2008. They started "with lists of 'emotion' words: 146 different words that connote anger; 92 words for fear; 224 for joy; 115 for sadness; 30 for disgust; and 41 words for surprise. All were from standardized word lists used in linguistic research." They were looking at the usage of these words over time to see if any of them increased in popularity across the English language. Click here to read the whole report.
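For the curious, here is a rough sketch in Python of the kind of calculation behind a "Joy minus Sadness" score: count what share of each year's words come from an emotion list, standardize those shares across years as z-scores, and subtract. It is a simplified illustration under my own assumptions, not the researchers' actual code; the tiny word lists, the three made-up "years" of text, and the function names are all hypothetical.

```python
import re
from statistics import mean, pstdev

# Hypothetical, tiny stand-ins for the standardized emotion word lists.
JOY = {"happy", "joy", "delight"}
SADNESS = {"sad", "grief", "sorrow"}

# Hypothetical toy data: year -> all text published that year.
TEXT_BY_YEAR = {
    1925: "a happy year full of joy and delight with little sorrow",
    1941: "grief and sorrow everywhere a sad year with little joy",
    1960: "the manual explains how to repair the engine a happy task",
}


def share(text, lexicon):
    """Fraction of the tokens in text that appear in lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in lexicon for token in tokens) / len(tokens)


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1.

    Assumes the values are not all identical (otherwise sigma is 0).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]


def joy_minus_sadness(text_by_year):
    """Return {year: z-score of Joy minus z-score of Sadness}."""
    years = sorted(text_by_year)
    joy = z_scores([share(text_by_year[y], JOY) for y in years])
    sad = z_scores([share(text_by_year[y], SADNESS) for y in years])
    return {y: j - s for y, j, s in zip(years, joy, sad)}


if __name__ == "__main__":
    for year, score in joy_minus_sadness(TEXT_BY_YEAR).items():
        print(year, "happy" if score > 0 else "sad", round(score, 2))
```

With this toy data, the made-up 1925 text comes out above zero and the made-up 1941 text below it, which is the same reading the figure caption describes: above zero is a 'happy' period, below zero a 'sad' one.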

What they found was that the usage of certain emotion words was highly correlated with major historical economic, social, and political trends. The 1920s, they reported, was the happiest decade in terms of positive emotion words. The time during WWII (particularly 1941) was the saddest. What's more, they found a steady decline in the use of emotion words in general.
"'Generally speaking, the usage of these commonly known emotion words has been in decline over the 20th century,' [Alex] Bentley says. We used words that expressed our emotions less in the year 2000 than we did 100 years earlier — words about sadness and joy and anger and disgust and surprise."
Difference between z-scores of the six emotions and of a random sample of stems (see Methods) for years from 1900 to 2000 (raw data and smoothed trend). Red: the trend for Fear (raw data and smoothed trend), the emotion with the highest final value. Blue: the trend for Disgust (raw data and smoothed trend), the emotion with the lowest final value.
This is surprising, the reporter writes, in a world that seems to be teeming with feelings from blogs, Facebook, advertisements, and the like. James Pennebaker, a psychologist at the University of Texas at Austin, thinks it is a little "too soon" to come to any hard conclusions about the decline in emotion words, but he also thinks that the data is extremely interesting and could yield useful results in the fields of psychology and history. Using language analysis, Pennebaker thinks it's possible to tap into the emotional consciousness and cultural attitudes of bygone eras.
"That's why this language analysis seems so promising to him — as a new window that might offer a different, maybe even more objective, view into our culture. Because, he says, it's difficult for people today to guess the emotions of people of different times."
What really interested me about this story was the content that the researchers skimmed through.
"...the books the computers searched in the Google database included an incredibly wide range of topics. They weren't just novels or books about current events, Bentley says. Many were books without clear emotional content — technical manuals about plants and animals, for example, or automotive repair guides."
The field of technical writing is supposed to be devoid of bias and of emotion -- it seems hard to put emotion into technical journals and user manuals. But this is exactly why it interests me in the wake of Fish's analysis of the digital humanities. If we were to take this same study and, instead of data mining with a computer, do a close reading of texts in an effort to come to some similar conclusion, it makes sense that a researcher would stick (primarily) to literature. Great literature, at that. Who would have the time, energy, sanity, and, most importantly, forethought to research the language in the technical manual for a 1950s refrigerator, or the introduction from an Audubon collection from 1926?

These are, of course, just examples, but they make a strong argument for the use of data mining in certain contexts. It would be interesting to know how much the data would change if all of the technical material were left out -- presumably as it would be if close reading were performed by a literary researcher. Data mining in this context can allow us to observe certain trends across a myriad of different media that we might ignore if doing traditional research.

Another reason this story is particularly interesting to me is that it allows me to segue into more of what I want to focus on in posts to come: the application and purpose of the digital humanities. What I think Fish missed (or ignored, at least) was the interdisciplinary nature of the digital humanities. He presented it in a relatively 2D format where text / (x + y) = a result or something like it, and he only vaguely mentioned its application in scenarios outside of data-crunching. The author of the NPR article mentions that Pennebaker wants to use data mining and distant reading in an effort to practice language analysis. The practice employs various aspects of literature, literary theory, linguistics, psychology, sociology, history, computer information, coding, and statistics, just to get started. The digital humanities, when viewed through this lens, all of a sudden seem very necessary -- a way to tie everything together to embrace many fields of study in varying, limitless combinations.
