Hourly Weather data for each Retrosheet game
I noticed some suspect entries for game conditions in the event files and realized I could not only fix them but also add a pretty useful dimension to the Retrosheet collection. The National Climatic Data Center (NCDC) makes available "Global Hourly Surface Data" -- several dozen physical and observational characterizations of the current weather, taken hourly. This data goes back to the forties and sometimes to the start of the century.
Please enjoy this preliminary dataset giving the hourly weather data for each game in Fenway since 1957: http://vizsage.com/apps/baseball/results/weather/
(open the WeatherData-BOS07.* file of your choice) I don't have all the data in hand yet, but I thought I'd get your thoughts and see if anyone would like to help with some of the drudge work.
I'm excited about doing some fun things with the data, like seeing knuckleball effectiveness vs. humidity or elderly pitchers vs. temperature. Combined with the MLB Gameday pitch trajectory info you could do physics "experiments": show the break distance of all curveballs vs. atmospheric pressure.
Email me back if you're interested or with comments.
----------------------- DATA FIELDS AVAILABLE -----------------------
The fields I've spit out are:
- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag: from the cwgame output.
- temp (deg C): The temperature of the air in degrees Celsius.
- press_atmos (hPa): The atmospheric pressure at the observation point.
- press_sealvl (hPa): The air pressure relative to Mean Sea Level (MSL).
- press_altim (hPa): The pressure value to which an aircraft altimeter is set so that it will indicate the altitude relative to mean sea level of an aircraft on the ground at the location for which the value was determined.
- press_chg_3hr_del (hPa): The absolute value of the quantity of change in atmospheric pressure measured at the beginning and end of a three hour period.
- press_chg_3hr_obs: The code that denotes the characteristics of an ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of three hours.
- wind_dir (deg): The angle, measured in a clockwise direction, between true north and the direction from which the wind is blowing.
- wind_obs: The code that denotes the character of the WIND-OBSERVATION.
- wind_speed (m/s): The rate of horizontal travel of air past a fixed point.
- wind_gust_speed (m/s): The rate of speed of a wind gust.
- cloud_cover_low (frac): The code that represents the fraction of the celestial dome covered by all low clouds present. If no low clouds are present, the code denotes the fraction covered by all middle level clouds present.
- vis_dist (m): The horizontal distance at which an object can be seen and identified.
- sunshine_time (min): The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1: The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2: The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3: The code that denotes a specific type of weather observed manually.
- groundcond: The code that denotes a type of ground condition.
- precip_hist_contin (bool): The code that denotes whether precipitation is continuous (true) or intermittent (false).
- precip_lq1_depth (mm): The depth of LIQUID-PRECIPITATION that is measured at the time of an observation.
- precip_lq1_period (hours): The quantity of time over which the LIQUID-PRECIPITATION was measured.
---------- WHAT I DID ----------
I used Brian Foy's Google Earth index of Major League Stadiums:
http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml
and the NCDC ISH-HISTORY file (gives locations for each weather station)
ftp://ftp.ncdc.noaa.gov/pub/data/inventories/
to find the closest station with continuous data. (Turns out I could
have saved a ton of trouble by just using the nearest airport -- in
almost every case it was the best match.)
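The station matching can be sketched roughly as follows. This is only an illustration, not the script I actually ran: the coordinates are approximate, the station records and their field names (name, lat, lon, first_year, last_year) are made up for the example, and "continuous data" is reduced to a simple check that the station's coverage spans the years wanted.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (degrees)."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(stadium, stations, start_year, end_year):
    """Closest station whose coverage spans start_year..end_year, or None."""
    lat, lon = stadium
    covered = [s for s in stations
               if s["first_year"] <= start_year and s["last_year"] >= end_year]
    return min(covered,
               key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]),
               default=None)

# Approximate coordinates, for illustration only.
FENWAY = (42.3467, -71.0972)

# Hypothetical station records; a real run would parse ISH-HISTORY instead.
STATIONS = [
    {"name": "downtown (hypothetical)", "lat": 42.35, "lon": -71.08,
     "first_year": 1995, "last_year": 2007},
    {"name": "Logan airport (approx.)", "lat": 42.3656, "lon": -71.0096,
     "first_year": 1943, "last_year": 2007},
    {"name": "Worcester (approx.)", "lat": 42.2673, "lon": -71.8757,
     "first_year": 1948, "last_year": 2007},
]

best = nearest_station(FENWAY, STATIONS, 1957, 2007)
print(best["name"] if best else "no station covers that period")
```

The made-up downtown station is closer to the park but is skipped because its coverage starts too late, which is the same trade-off that pushed me toward airports.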
Then I pulled down data sets from
http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505
(If you're interested in replicating any of this, I have a script that sends a GET request to help automate the weather data collection.)
The last step is to match games to stadiums and stadiums to locations, and then game dates and times to hourly observations.
I could be clever and subtle and use the start time and game duration to grab only the hours of gameplay, but instead I just pull in the records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm for night games. I suppose I'll fix it to see if a game overhangs midnight and get the post-12am data for those only.
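Here is a minimal sketch of that fixed-window selection. It assumes each observation is a dict with a "when" datetime already converted to the stadium's local time; that key name and the function name are invented for the example, and games that run past midnight are not handled.

```python
from datetime import date, datetime, time

# Fixed windows from the text: day games 10:00am-11:59pm, night games 5:00pm-11:59pm.
DAY_WINDOW = (time(10, 0), time(23, 59))
NIGHT_WINDOW = (time(17, 0), time(23, 59))

def observations_for_game(game_date, is_day_game, observations):
    """Return the observations on game_date that fall inside the game's window.

    Assumes obs["when"] is a naive datetime in local stadium time.
    """
    start, end = DAY_WINDOW if is_day_game else NIGHT_WINDOW
    return [obs for obs in observations
            if obs["when"].date() == game_date
            and start <= obs["when"].time() <= end]

# Made-up hourly readings for one day, on the hour, with a dummy temperature.
game_day = date(2007, 7, 4)
hourly = [{"when": datetime(2007, 7, 4, hour, 0), "temp": 20.0}
          for hour in range(24)]

day_obs = observations_for_game(game_day, True, hourly)
night_obs = observations_for_game(game_day, False, hourly)
print(len(day_obs), len(night_obs))
```

Using the start time and game duration instead would only mean replacing the two fixed windows with a per-game (start, end) pair.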
----------------------- WHAT YOU CAN DO TO HELP -----------------------
- Geolocation for the rest of the stadiums.
- Inspect the data for consistency and correctness.
- If you have access to a computer at a .edu or .k12.us, or fancy GIS data, help me grab the rest of the weather files.
Email me if you'd like to help.