2. building a database of music reviews ; index ; 4. discussing the results

3. making sense of the data

  1. Generating useful statistics for each word
  2. Understanding word usage
  3. Classifying the words
  4. Finding important words through word relationships

Generating useful statistics for each word

The statistics
Before I designed the database, I sat down and thought about the kind of statistics that might help me understand how the reviewers used each word. I wanted to get a sense of how often a given word was used, and also whether that word was being used in positive or negative reviews.

After a considerable amount of deliberation and experimentation, I settled on the following ten statistics.

name of statistic   description of this statistic
tot                 Total number of occurrences of a particular word in the database
pos                 Number of occurrences of the word in positive articles
neu                 Number of occurrences of the word in neutral articles
neg                 Number of occurrences of the word in negative articles
pwr                 Positive Word Ratio (pos / tot)
nwr                 Negative Word Ratio (neg / tot)
avg                 Average rating of articles containing the word
art                 Number of articles containing the word
aut                 Number of authors who have used the word
wordscore           A metric that combines tot and avg into one number; very useful when comparing words.
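
As an illustration, here is a minimal Python sketch of how most of these statistics could be computed. It is not the make_statistics.pl script: the Occurrence record and the function name are hypothetical, ratings are assumed to be on the 100-point scale with the positive/negative boundaries given in the next subsection, avg is assumed to count each article once, and wordscore is left out because its formula is not stated here.

```python
from collections import namedtuple

# Hypothetical input record: one row per occurrence of a word in an article.
# rating is assumed to be an integer on the 100-point scale.
Occurrence = namedtuple("Occurrence", "word article_id author rating")

def word_statistics(word, occurrences):
    """Return tot, pos, neu, neg, pwr, nwr, avg, art and aut for one word.

    wordscore is omitted on purpose: the text does not give its formula.
    """
    rows = [o for o in occurrences if o.word == word]
    tot = len(rows)
    if tot == 0:
        return None
    pos = sum(1 for o in rows if o.rating >= 75)   # positive articles
    neg = sum(1 for o in rows if o.rating <= 49)   # negative articles
    neu = tot - pos - neg                          # everything in between
    # Assumption: avg counts each article once, however often the word appears.
    ratings = {o.article_id: o.rating for o in rows}
    return {
        "tot": tot, "pos": pos, "neu": neu, "neg": neg,
        "pwr": pos / tot,
        "nwr": neg / tot,
        "avg": sum(ratings.values()) / len(ratings),
        "art": len(ratings),
        "aut": len({o.author for o in rows}),
    }

# Made-up sample data, for illustration only.
sample = [
    Occurrence("tenor", 1, "author_a", 87),
    Occurrence("tenor", 1, "author_a", 87),
    Occurrence("tenor", 2, "author_b", 60),
    Occurrence("tenor", 3, "author_a", 40),
]
stats = word_statistics("tenor", sample)
```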

Setting boundaries for positive and negative reviews
Pitchfork has a helpful “ratings key” on their website that assigns a meaning to each range of ratings on their 10-point scale. For example, scores between 7.0 and 7.4 get “Not brilliant, but nice enough,” while scores between 7.5 and 7.9 get “Very good.” Since Pitchfork considers 4.0 to 4.9 “Just below average,” I set the boundaries here:

score 0 to 49 - a negative review (873 articles fall in this category)
score 50 to 74 - a neutral review (2362 articles)
score 75 to 100 - a positive review (2339 articles)

(To make things simpler in the database, I multiply all ratings by 10 to get a simple 100-point scale with no decimal points.)
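
The boundaries above can be written as a small function. This is only a sketch in Python; the function names are made up for the example and are not taken from the database scripts.

```python
def to_100_point(rating):
    """Convert a Pitchfork rating such as 7.4 to the 100-point scale (74)."""
    return round(rating * 10)

def review_category(score):
    """Classify a 100-point score using the boundaries listed above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 49:
        return "negative"
    if score <= 74:
        return "neutral"
    return "positive"
```

For example, review_category(to_100_point(7.4)) gives "neutral", while a 7.5 lands in "positive".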

The code to generate and view statistics
The statistics are stored in a separate table and are generated by this script:
make_statistics.pl

To view the words along with their statistics, I made this cgi web frontend: words.pl
For each word, it prints a line that includes some of the statistics associated with that word, and also provides links to view details for that word, view a concordance (KWIC), find other words that appear in the same sentence (coex), and find other words that appear in the same phrase (close). A line of output from this script looks like this:

  1. continually -> tot:67 pos:48 neu:15 neg:4 pwr:0.72 nwr:0.06 avg:76.45 art:66 aut:31 -> continually KWIC coex close

Go to the words cgi to see the results of this script on the live database. The top line of the page is a form where you can select criteria for viewing and sorting the words; you can sort words by positivity (pwr), negativity (nwr), or total number of occurrences (tot).
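
The ratios in the sample output line can be reproduced from the raw counts: pwr is 48 / 67 and nwr is 4 / 67. The Python sketch below rebuilds the statistics part of that line; the function name is hypothetical, and rounding to two decimal places is an assumption based on how the sample line looks.

```python
def format_stats_line(word, tot, pos, neu, neg, avg, art, aut):
    """Format one word's statistics in the style of the sample line."""
    pwr = pos / tot   # Positive Word Ratio
    nwr = neg / tot   # Negative Word Ratio
    return (
        f"{word} -> tot:{tot} pos:{pos} neu:{neu} neg:{neg} "
        f"pwr:{pwr:.2f} nwr:{nwr:.2f} avg:{avg:.2f} art:{art} aut:{aut}"
    )

# Counts taken from the "continually" sample line.
line = format_stats_line("continually", 67, 48, 15, 4, 76.45, 66, 31)
print(line)
```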

Also, I made another cgi to view detailed statistics for a given word: word.pl
You can see this cgi by clicking one of the word links on the right side of the words cgi above.
Here are some examples of the output of this cgi:
details for primal ; violin ; clouds

Understanding word usage

To tell how the reviewers were using a word, I needed to see it in context. Many words can have multiple meanings (“clouds”, “saw”), or could refer to a number of different instruments (“tenor”, “strings”). A list of the sentences that a particular word appears in is called a concordance for that word, and concordances are sometimes referred to as KWICs (Keyword In Context).

Here's an example of a few concordance lines for the word “tenor”. The format is:

Artist: Title of album [Record label; rating nn/100] Author, Date of publication
Here is the sentence where word appears. (paragraph number)

  1. Menomena: I Am the Fun Blame Monster [Muuuhahaha!; rating 87/100] Joe Tangari, 2003-10-22
    The surprises are packed tightly into every song: guitars crash in only to anticipate silence, pianos weave through minefields of modular percussion, and Knopf's pleasing tenor runs a gauntlet of fading and processing to deliver what are fundamentally very basic melodies. (p.4)
  2. Superchunk: Cup of Sand [Merge; rating 84/100] Eric Carr, 2003-10-10
    Despite it all, two discs' worth of jangly pop and McCaughan's exuberant tenor can defeat even the most steadfast listener (it really is a lot to take in at once), and as mentioned, not all of these tracks bear out the same level of quality. (p.6)
  3. Cyann & Ben: Spring [Gooom; rating 84/100] Nick Sylvester, 2003-10-03
    Atop this mesmeric mist hovers a soft male tenor with a female partner wading in and out of accompaniment. (p.2)

The cgi that I wrote to generate concordances on the fly is: concordance.pl
You can view some concordances using these links:
concordances for clouds ; tenor ; accordion

I also added the very useful ability to generate a double concordance; passing it two different words will return only the sentences where both of those words appear. (This feature isn't exactly perfect, since it tends to misreport the total occurrences at the top, but the actual concordance is always correct.)
Try doing a double concordance: guitar and clouds
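
The idea behind a single or double concordance can be sketched in a few lines of Python. This is not concordance.pl: the sentence splitter and tokenizer here are deliberately naive assumptions, and the function names are made up for the example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    """Lower-case word tokens (letters, digits and apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def concordance(text, *words):
    """Return the sentences that contain every one of the given words."""
    wanted = {w.lower() for w in words}
    return [s for s in split_sentences(text) if wanted <= set(tokens(s))]

# Made-up sample text, for illustration only.
text = ("The guitar drifts like clouds. A soft male tenor hovers. "
        "Clouds gather over the guitar solo! No strings here.")
single = concordance(text, "tenor")
double = concordance(text, "guitar", "clouds")
```

Passing one word gives an ordinary concordance; passing two keeps only the sentences where both appear.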

Classifying the words

I spent a long time using my words interface to try to get a sense of what the Pitchfork critics were saying. Since my goal was to create lists of words that I could use to write some music, I knew I'd need to start classifying words into various categories at some point.

I ended up storing the names of the word categories in a table called “class”, and words get tied to those categories via a joining table called “word_class”. To help classify words, I made a cgi interface called edit_class.pl so I didn't have to go through and edit the entries in the word_class table by hand.
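
A joining table of this kind might look like the sketch below. Only the table names “class” and “word_class” come from the description above; the “word” table, every column name, the use of SQLite, and the sample row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (                      -- hypothetical table of words
    id   INTEGER PRIMARY KEY,
    word TEXT UNIQUE NOT NULL
);
CREATE TABLE "class" (                   -- names of the word categories
    id          INTEGER PRIMARY KEY,
    name        TEXT UNIQUE NOT NULL,
    description TEXT
);
CREATE TABLE word_class (                -- joining table: word <-> class
    word_id  INTEGER NOT NULL REFERENCES word(id),
    class_id INTEGER NOT NULL REFERENCES "class"(id),
    PRIMARY KEY (word_id, class_id)
);
""")

# Illustrative rows only; not taken from the real database.
conn.execute("INSERT INTO word (word) VALUES (?)", ("tenor",))
conn.execute('INSERT INTO "class" (name, description) VALUES (?, ?)',
             ("vocal", "words referencing the vocals or vocal style"))
conn.execute("""
    INSERT INTO word_class (word_id, class_id)
    SELECT w.id, c.id FROM word w, "class" c
    WHERE w.word = ? AND c.name = ?
""", ("tenor", "vocal"))

def classes_for(conn, word):
    """Return the names of all classes a word has been assigned to."""
    rows = conn.execute("""
        SELECT c.name
        FROM word w
        JOIN word_class wc ON wc.word_id = w.id
        JOIN "class" c ON c.id = wc.class_id
        WHERE w.word = ?
        ORDER BY c.name
    """, (word,))
    return [r[0] for r in rows]
```

Because the link lives in its own table, one word can belong to several classes without duplicating the word or the class.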

I couldn't go through and classify all 100,000 words, so I came up with a method for deciding which words to classify:

  1. Start at the top of the positive words and classify them until the words seem to be losing their positive connotations. (Almost all of the first 300 positive words have a "positive feel" to them.)
  2. Classify a proportionate number of negative words (based on the ratio of positive to negative articles).
  3. Classify all words that appear 1000 times or more.
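
Steps 2 and 3 can be sketched as follows. This is one possible reading, not a record of the exact procedure: it assumes "proportionate" means scaling the number of positive words by the ratio of negative to positive articles, and both function names are hypothetical.

```python
def negative_quota(n_positive_words, positive_articles, negative_articles):
    """Number of negative words to classify, scaled by the article ratio."""
    return round(n_positive_words * negative_articles / positive_articles)

def frequent_words(totals, threshold=1000):
    """Words whose total number of occurrences (tot) meets the threshold."""
    return sorted(w for w, tot in totals.items() if tot >= threshold)

# 2339 positive and 873 negative articles, as counted earlier.
quota = negative_quota(369, 2339, 873)
```

With 369 positive words this reading gives a quota of 138, which is close to the 139 negative words that were ultimately classified.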

To classify a word, I usually read some of the concordance for that word to figure out its meaning and context in the reviews before adding it to a particular class. I ended up classifying 369 positive words, 139 negative words, and 254 frequently-appearing words. I created classes as I needed them, ending up with the following categories in the database:

name of class   # of words   description of this class
none            347          this word corresponds to no class in particular (too vague or common)
good            41           positive aesthetic value judgment words
bad             51           negative aesthetic value judgment words
genre           30           references to musical genres
mood            47           the music's mood or the music's effect on the listener's mood
transition      12           descriptions of the transitions in the music
dynamics        9            descriptions of musical dynamics
instrument      56           instruments and words describing the way that certain instruments are being played
vocal           17           words referencing the vocals or vocal style
structure       34           words describing compositional structure
sound           15           words directly describing the sound of the recording or of the instruments
metaphor        13           metaphors that don't fit into other categories
complex         67           words which fall into multiple categories depending on context or are noteworthy but hard to categorize
consumerism     9            words which reference buying, advertising, and business
intelligence    12           words describing the intelligence of the artists or listeners

To view the classes and perform operations on the words in a given class, I made a cgi called class.pl.
You can view the classes yourself if you like. By default, this cgi shows you all of the classified words at once, but you can use the dropdown menu at the top of the page to view only positive words, only negative words, or only frequently-used words.

Finding important words through word relationships

To complete my search for important words, I wanted to move past single words and start looking at phrases and whole sentences. Thanks to the last step, I now had a list of positive and frequently-used instrument words, but what adjectives were likely to describe those instruments? The word “guitar” is found in the database 4885 times, but what other words is it likely to appear with? And how do the results change if I also look at words that appear alongside similar words like “guitars”?

What I really needed was a cgi script that finds all occurrences of a word or group of words and then counts the words that appear in the same sentence or phrase. I called this concept coexistence, and I wrote two different scripts for this purpose:

coexistence.pl takes a word or group of words, finds all of the sentences in the database that contain those words, and counts up the other words in those sentences. View sample output from coexistence.pl.

close_coex.pl does the same thing, except it only counts words within a variable-length "phrase" around the given word. (The form has two input fields for this purpose: one is the number of words before the given word that the script should look at, and the other is the number of words after the given word.) View sample output from close_coex.pl.
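
To make the two kinds of counting concrete, here is a minimal Python sketch. It is not the code of coexistence.pl or close_coex.pl. It assumes a sentence matches when it contains any word of the group (so “guitar” and “guitars” can be pooled), uses a naive tokenizer, and does no stop-word filtering; all names are made up for the example.

```python
import re
from collections import Counter

def tokens(sentence):
    """Lower-case word tokens (letters, digits and apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def coexistence(sentences, targets):
    """Count the other words in every sentence containing a target word."""
    targets = {t.lower() for t in targets}
    counts = Counter()
    for sentence in sentences:
        toks = tokens(sentence)
        if targets & set(toks):
            counts.update(t for t in toks if t not in targets)
    return counts

def close_coex(sentences, targets, before, after):
    """Count words within `before`/`after` positions of each target word.

    Simplification: if two target words sit close together, a neighbour
    that falls in both windows is counted twice.
    """
    targets = {t.lower() for t in targets}
    counts = Counter()
    for sentence in sentences:
        toks = tokens(sentence)
        for i, tok in enumerate(toks):
            if tok in targets:
                window = toks[max(0, i - before):i] + toks[i + 1:i + 1 + after]
                counts.update(w for w in window if w not in targets)
    return counts

# Made-up sample sentences, for illustration only.
sentences = [
    "The distorted guitar solo soars.",
    "Acoustic guitars and a distorted bass.",
    "No strings here.",
]
whole = coexistence(sentences, ["guitar", "guitars"])
near = close_coex(sentences, ["guitar", "guitars"], before=1, after=1)
```

In this toy data "distorted" is counted twice by the whole-sentence version but only once by the one-word window, which shows how the phrase length changes the results.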

I used these coexistence scripts a lot when I was putting together the final word lists. The scripts take a long time to run, so there was a lot of waiting involved.

Here is a table of many of the words that I found by looking at word relationships using the above scripts. These words are used both in positive and negative contexts, so I consulted their wordscores to see how each word fared in the database.

instrument    words likely to appear in phrases with this instrument word
guitar        acoustic bass electric solo chords distorted steel lead strumming feedback noise backwards effects repetitive melodies chiming plaintive
bass          deep upright groove distorted throbbing thumping melodic synth rumbling heavy fuzz drone bassline
drums         pounding brushed steel hand programmed crashing distorted crisp rolling booming tinny
vocals        backing his female lead breathy harmony distorted ethereal layered fragile hushed nasal distant spoken plaintive
lyrics        "good" words: clever poetic melodies simple witty pretty
              "bad" words: vague nonsensical meaningless
piano         electric chords ballad melody solo simple flourishes plaintive
electronics   analog experimental piercing gurgling primitive glitchy abstract burbling lo-fi warm
strings       plucked swelling sweeping sampled bowed eerie swirling harmonies
