1. overview and goals ; index ; 3. making sense of the data
I originally found Pitchfork because the site was likely to show up near the top of a Google search for any given album. I ended up returning to the site frequently because of its easy-to-navigate reviews history—every review ever published in Pitchfork is available on their website, which encourages lengthly browsing sessions.
Pitchfork is a full-time shop that offers new reviews and music news daily, and their website is one of the most popular music review sites on the Internet. They review a balanced combination of well-known and obscure artists in many genres, but they're mostly centered around rock/pop music and have a moderately strong indie/anti-corporate slant. Their politics are not clearly stated on their site, but the majority of the artists that get favorable reviews from Pitchfork are on independent labels, and this study unveiled a tendency for Pitchfork reviewers to use words that refer to corporations and capitalism when describing albums that they don't like.
I used the unix command-line utility wget to pull the
record reviews from Pitchfork's website. At the time I did this
(October 24 2003), Pitchfork had posted 5575 articles with an average
of 520 words a piece. Each article carries the artist's name, album
name, record label, author, and a rating on a scale of 100. (If you
want more statistics like these, here are some detailed database statistics.)
Designing a reviews database
The software and relational design
For the database software, it was important to me to use an SQL
database that is compatible with the
Perl DBI/DBD modular database
interface. This way, I can change database software without having to
rewrite large sections of my code. I ended up choosing MySQL, an open-source and free
database package that is available for many operating systems.
I tried to stick to proper relational database design when making this database. I wanted to give this project some weight on the computer science side of things, and I was also attracted to the speed and flexibility of relational databases. I hadn't done anything exactly like this before, so I spent a lot of time researching relational database theory and went through several different versions of the design before deciding on this one.
As you can see, I added a "publication" table in case I ever wanted to add reviews from places besides Pitchfork. The "rating" table holds the rating of an article on a scale from 1 to 100. To make the database properly relational and to avoid the unnecessary duplication of data, articles don't point directly to bands; a given article is a piece of writing about an album, which is in turn recorded by a certain band.
The text of an article is stored in the "paragraph" table, and each paragraph refers back to the article from whence it came. The paragraph also contains a "position", which is the paragraph's order in the article's sequence of paragraphs (the first paragraph has position 1, the second has position 2, and so on).
Similarly, words and sentences have a many-to-many relationship to each other that requires the "word_sentence" table. That table is useful for generating concordances for words and for finding semantic links between words that are likely to appear in the same sentence as each other. Sentences themselves are inhabitants of particular paragraphs, so each sentence has a link back to a paragraph id.
I didn't want the contents of quotes to interfere with the words of the reviewers, so the "word_paragraph" table doesn't include words that are found inside quotes. Those words were to be stored separately in a "word_quote" table to be used later in case I needed to generate statistics about quotes. (I didn't end up doing anything with them.)
Here is a diagram that illustrates the way that words and sentences are represented in the database. It also shows the aforementioned many-to-many relationships between words and paragraphs and between words and sentences.
1: overview and goals ;
index ;
3: making sense of the data
Parsing the text and feeding the database
Here are the scripts I wrote to build the database, parse the reviews,
and feed the database: Reviews.pm
This is a module for handling database interactions. The code is
kind of long and it's needed in every script, so I made a module out of
it. I'm glad I did it, because it makes the database queries look a lot
cleaner in the other scripts. pitchfork_review_parser.pl
This takes a Pitchfork review file as its input, extracts the important
components of the review, and makes an entry for that review in the
database. It does everything in Figure 2-1 above. The review files come
straight from the Pitchfork website and are in HTML format. make_words_and_paragraphs.pl
After running the pitchfork_review_parser script on every review, this
script needs to be run. It goes through the database and parses words
out from paragraphs, creating most of the tables in Figure 2-2 above.
(It doesn't create the sentence or word_sentence tables, which get
created by the next script.) make_sentences.pl
This script generates the sentence and word_sentence tables. To perform
the surprisingly difficult task of parsing sentences from paragraphs,
I used a Perl module from CPAN called
Lingua::EN::Sentence. It isn't perfect, but it works in the vast majority
of cases. sql_database.txt
When fed into MySQL, this file creates all of the tables in the database.
If you're a database techie, it's especially interesting to see the
various datatypes that I decided to use for each column. There are
things in here that I don't explain until the next chapter, but
everything in here gets explained eventually.