
50 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW

Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.

I've had two interesting realizations from the Retrosheet Baseball data vs. Hourly Weather information mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is that the hard part of this process is actually the stupidest part... there are four steps in doing an interesting visualization of open data. In order of execution, as well as of decreasing difficulty and decreasing stupidity:

  • Bring the data out from behind its bureaucratic barriers
  • Unlock it into a universal format
  • Process and digest the data
  • Actually explore, visualize and share the data

The hardest and least justifiable steps are the first two, a problem we have to fix. [Edit: this is why I'm starting infochimp.org]

Here's a longer description of how I did the baseball games / weather data mashup.

Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.

To suck in the Hourly Weather Data files, you have to click through a 4-screen web form process to prepare a query. Although it sends the final form submission as a POST query, the backend script does accept a GET URL (you know, where the data is sent in the URL itself, form.pl?param=val&param2=val&submit=yay, instead of in the body of the HTTP request). There's an excellent POST-to-GET bookmarklet that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a whirl for screenscraping webpages or just trying to understand what's going on behind the curtain.
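
If you'd rather script the trick than bookmarklet it, the whole thing is just URL-encoding. A minimal Python sketch, where the script URL and parameter names are made-up stand-ins for the real weather form:

    # Build a GET URL carrying the same parameters the form would have POSTed.
    # The script URL and parameter names below are hypothetical placeholders.
    from urllib.parse import urlencode

    form_action = 'http://www.example.gov/cgi-bin/hourly-weather.pl'   # stand-in URL
    params = {
        'station': '722540',        # stand-in field names and values
        'begin':   '19650401',
        'end':     '19651001',
        'submit':  'yay',
    }
    get_url = form_action + '?' + urlencode(params)
    print(get_url)    # paste into the browser, or feed to a downloader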

Now I needed to know what queries to generate. First, the location of each major league baseball stadium: Brian Foy posted a Google Earth index of Major League Stadiums, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. Its simple routines just pull in the XML file and create a data structure (hashes and arrays of hashes) that mirrors the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?
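
For the curious, here's roughly what that step looks like in Python instead. This is a sketch only: the file name, the KML namespace and the Placemark layout are assumptions about the stadium file, not pulled from my actual script.

    # Parse a Google Earth (KML) file of stadium Placemarks into {name: (lat, lng)}.
    # Namespace URI and file name are assumptions.
    import xml.etree.ElementTree as ET

    NS = {'kml': 'http://earth.google.com/kml/2.0'}

    def load_stadiums(path='mlb_stadiums.kml'):
        stadiums = {}
        root = ET.parse(path).getroot()
        for pm in root.findall('.//kml:Placemark', NS):
            name   = pm.findtext('kml:name', namespaces=NS)
            coords = pm.findtext('.//kml:coordinates', namespaces=NS)
            if name and coords:
                lng, lat = coords.strip().split(',')[:2]   # KML stores lon,lat[,alt]
                stadiums[name.strip()] = (float(lat), float(lng))
        return stadiums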

Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:

USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456

To break this apart, you just specify an 'unpack' format string. 'A' means an (8-bit) ASCII character; 'x' means a junk character:

A6    xA5   xA29                          xA2xA2xA2xA4  xxA6    xA7     xA6

The result is an array holding each interesting (non-'x') field. The Perl code snippet:
    # Flat-file (fixed-width) record format
    my $fmt    = "A6x    A5x   A29x                          A2xA2xA2xA4xx  A6x  A7x   A6";
    my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
    # Pull in each line of the station directory (from STDIN or files on the command line)
    while (my $line = <>) {
        chomp $line;
        next if length($line) < 79;        # skip the header and any short lines
        # Unpack flat record into a list of field values
        my @flat = unpack($fmt, $line);
        # Name the fields and process the raw record
        my %station;
        @station{@fields} = @flat;
        ...
    }
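
For comparison, the same unpack is nearly as terse in Python with the struct module. This is a sketch rather than the script I actually ran; the widths are read off the ruler line above.

    # Unpack one 79-byte fixed-width station record: 's' fields are kept, 'x' bytes skipped.
    import struct

    FMT    = '6s x 5s x 29s x 2s x 2s x 2s x 4s 2x 6s x 7s x 6s'   # 79 bytes total
    FIELDS = ['id_USAF', 'id_WBAN', 'name', 'region', 'country', 'state',
              'callsign', 'lat', 'lng', 'elev']

    def parse_station(line):
        raw = struct.unpack(FMT, line[:79].encode('latin-1'))   # 1 byte per character
        return dict(zip(FIELDS, (field.decode().strip() for field in raw)))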

I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, we have Hourly data since ~1945 and Daily data since ~1900).

Then I score each station by (Proximity and Amount-of-Data), and select the five best stations for each stadium.
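
The scoring itself is nothing fancy. Here's the flavor of it in Python -- the weighting and the record layout are placeholders, not the numbers I actually used:

    # Rank candidate stations for one stadium by closeness and years of coverage,
    # keeping the five best. The 2.0 weighting is a made-up placeholder.
    from math import radians, sin, cos, asin, sqrt

    def dist_km(lat1, lng1, lat2, lng2):
        """Great-circle (haversine) distance in kilometers."""
        lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
        a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2)**2
        return 6371 * 2 * asin(sqrt(a))

    def best_stations(stadium, stations, n=5):
        """stadium: dict with 'lat','lng'; stations: dicts with 'lat','lng','years_of_data'."""
        def score(stn):
            d = dist_km(stadium['lat'], stadium['lng'], stn['lat'], stn['lng'])
            return d - 2.0 * stn['years_of_data']    # lower score = closer, longer record
        return sorted(stations, key=score)[:n]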

Now, I could of course use Perl to generate the POST requests using the HTTP modules, but it was simpler to spit out an HTML file with a big matrix of URLs for each station, for a subset of years, then mindlessly control-click on a dozen links at a time and answer each form. (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)
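
Generating the linkdump page is just string-bashing. Something along these lines, where the data shape and file name are illustrative rather than my exact script:

    # Write one table row per station, one cell per (label, GET-url) pair to control-click.
    def write_linkdump(urls_by_station, path='ParkWeatherGetterDirectory.html'):
        """urls_by_station: {station_id: [(label, url), ...]} (assumed shape)."""
        with open(path, 'w') as out:
            out.write('<table>\n')
            for station_id, links in sorted(urls_by_station.items()):
                cells = ['<a href="%s">%s</a>' % (url, label) for label, url in links]
                out.write('<tr><th>%s</th><td>%s</td></tr>\n'
                          % (station_id, '</td><td>'.join(cells)))
            out.write('</table>\n')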

I also use Perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit it back out as XML.
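
In Python terms the cleanup looks something like the following -- assuming the dump has a &lt;row&gt;&lt;field_name&gt;value&lt;/field_name&gt;...&lt;/row&gt; shape, which is a loose recollection rather than gospel, and made-up output tag names:

    # Fold each row's child-element text into attributes on a new, flatter element tree.
    import xml.etree.ElementTree as ET

    def tidy_dump(in_path='dump.xml', out_path='tidy.xml'):
        src = ET.parse(in_path).getroot()
        out_root = ET.Element('games')                   # output tag names are made up
        for row in src.iter('row'):
            rec = ET.SubElement(out_root, 'game')
            for field in row:
                rec.set(field.tag, (field.text or '').strip())   # content -> attribute
        ET.ElementTree(out_root).write(out_path, encoding='utf-8', xml_declaration=True)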

The Python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance), then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose Python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and its CSV module is a tad nicer.
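
A condensed sketch of that pass (field names and sentinel values are illustrative; \N is what LOAD DATA INFILE reads as NULL):

    # Turn in-band sentinels like '99999' into real NULLs and write a CSV for MySQL.
    import csv

    SENTINELS = {'99999', '999.9', '9999.9'}        # assumed in-band NULL markers

    def clean(value):
        value = str(value).strip()
        return r'\N' if (not value or value in SENTINELS) else value

    def export_csv(records, path='hourly_weather.csv'):
        """records: iterable of dicts with the fields below (names assumed)."""
        fields = ['station_id', 'obs_time', 'temp', 'dewpoint', 'pressure', 'cloud_cover']
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            for rec in records:
                writer.writerow([clean(rec.get(k, '')) for k in fields])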

The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved of any efficiency concerns) it's easy to look forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure over a full baseball season?", "How effective are pitchers against gametime temperature, stratified by age of pitcher or inning?", or "How does batting average on fly balls vs. ground balls vary with % of total cloud cover?"

It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and earnings of Harley-Davidson vs. average household income, unemployment, and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, attendance at baseball games, and sales of fast food vs. Consumer Satisfaction Index, national suicide rate, and persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and the Consumer Price Index? Amazon.com sales rank, # of mentions on Technorati blogs, and # of mentions in mainstream media vs. time?

The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert it into a universal, semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged, freely available information.
