1. overview and goals ; index ; 3. making sense of the data

2. building a database of music reviews

  1. Finding a source for music reviews
  2. Designing a reviews database
  3. Parsing the text and feeding the database

Finding a source for music reviews

In order to build my database of music reviews, I had to find a place where I could get a lot of reviews in a digital format. Figuring that the easiest way would be to download reviews from the Internet in an automated fashion, I started looking at websites for various online and printed music publications. After a bit of deliberation, I ended up choosing the Pitchfork online music magazine (
pitchforkmedia.com) as the basis of my reviews database.

I originally found Pitchfork because the site was likely to show up near the top of a Google search for any given album. I ended up returning to the site frequently because of its easy-to-navigate reviews history—every review ever published in Pitchfork is available on their website, which encourages lengthly browsing sessions.

Pitchfork is a full-time shop that offers new reviews and music news daily, and their website is one of the most popular music review sites on the Internet. They review a balanced combination of well-known and obscure artists in many genres, but they're mostly centered around rock/pop music and have a moderately strong indie/anti-corporate slant. Their politics are not clearly stated on their site, but the majority of the artists that get favorable reviews from Pitchfork are on independent labels, and this study unveiled a tendency for Pitchfork reviewers to use words that refer to corporations and capitalism when describing albums that they don't like.

I used the unix command-line utility wget to pull the record reviews from Pitchfork's website. At the time I did this (October 24 2003), Pitchfork had posted 5575 articles with an average of 520 words a piece. Each article carries the artist's name, album name, record label, author, and a rating on a scale of 100. (If you want more statistics like these, here are some detailed database statistics.)

Designing a reviews database

The software and relational design
For the database software, it was important to me to use an SQL database that is compatible with the
Perl DBI/DBD modular database interface. This way, I can change database software without having to rewrite large sections of my code. I ended up choosing MySQL, an open-source and free database package that is available for many operating systems.

I tried to stick to proper relational database design when making this database. I wanted to give this project some weight on the computer science side of things, and I was also attracted to the speed and flexibility of relational databases. I hadn't done anything exactly like this before, so I spent a lot of time researching relational database theory and went through several different versions of the design before deciding on this one.

Representing an article
The following diagram shows the way that an article is represented in the database. Each rectangle represents a table in the database, and the pointer arrows relate tables together using their fields' ID numbers (hence the "relational" in "relational database"). Fields with "index" in the name are indexes of other fields, which means that more space is taken up in the database in order to make queries on the indexed fields faster.

As you can see, I added a "publication" table in case I ever wanted to add reviews from places besides Pitchfork. The "rating" table holds the rating of an article on a scale from 1 to 100. To make the database properly relational and to avoid the unnecessary duplication of data, articles don't point directly to bands; a given article is a piece of writing about an album, which is in turn recorded by a certain band.

The text of an article is stored in the "paragraph" table, and each paragraph refers back to the article from whence it came. The paragraph also contains a "position", which is the paragraph's order in the article's sequence of paragraphs (the first paragraph has position 1, the second has position 2, and so on).

Representing words and sentences
In order to be able to generate statistics based on word count, I had to parse each paragraph into words and put those words in a "word" table. Since a given word can appear in any number of paragraphs and a paragraph can contain any number of words, these two elements have a many-to-many relationship which needs to be represented by a third table that joins words to paragraphs (the "word_paragraph" table).

Similarly, words and sentences have a many-to-many relationship to each other that requires the "word_sentence" table. That table is useful for generating concordances for words and for finding semantic links between words that are likely to appear in the same sentence as each other. Sentences themselves are inhabitants of particular paragraphs, so each sentence has a link back to a paragraph id.

I didn't want the contents of quotes to interfere with the words of the reviewers, so the "word_paragraph" table doesn't include words that are found inside quotes. Those words were to be stored separately in a "word_quote" table to be used later in case I needed to generate statistics about quotes. (I didn't end up doing anything with them.)

Here is a diagram that illustrates the way that words and sentences are represented in the database. It also shows the aforementioned many-to-many relationships between words and paragraphs and between words and sentences.

Parsing the text and feeding the database

Here are the scripts I wrote to build the database, parse the reviews, and feed the database:

Reviews.pm
This is a module for handling database interactions. The code is kind of long and it's needed in every script, so I made a module out of it. I'm glad I did it, because it makes the database queries look a lot cleaner in the other scripts.

pitchfork_review_parser.pl
This takes a Pitchfork review file as its input, extracts the important components of the review, and makes an entry for that review in the database. It does everything in Figure 2-1 above. The review files come straight from the Pitchfork website and are in HTML format.

make_words_and_paragraphs.pl
After running the pitchfork_review_parser script on every review, this script needs to be run. It goes through the database and parses words out from paragraphs, creating most of the tables in Figure 2-2 above. (It doesn't create the sentence or word_sentence tables, which get created by the next script.)

make_sentences.pl
This script generates the sentence and word_sentence tables. To perform the surprisingly difficult task of parsing sentences from paragraphs, I used a Perl module from CPAN called Lingua::EN::Sentence. It isn't perfect, but it works in the vast majority of cases.

sql_database.txt
When fed into MySQL, this file creates all of the tables in the database. If you're a database techie, it's especially interesting to see the various datatypes that I decided to use for each column. There are things in here that I don't explain until the next chapter, but everything in here gets explained eventually.

1: overview and goals ; index ; 3: making sense of the data