Friday, October 26, 2007

Hourly Weather data for each Retrosheet game

I noticed some suspect entries for game conditions in the eventfiles and realized I could not only fix it but add a pretty useful dimension to the retrosheet collection. The National Climate Data Center makes available "Global Hourly Surface Data" -- several dozen physical and observational characterizations of the current weather, taken hourly. This data goes back to the forties and sometimes to the start of the century.

Please enjoy this preliminary dataset giving the hourly weather data for each game in Fenway since 1957: http://vizsage.com/apps/baseball/results/weather/

(open the WeatherData-BOS07.* file of your choice) I don't have all the data in hand yet, but I thought I'd get your thoughts and see if anyone would like to help with some of the drudge work.

I'm excited about doing some fun things with the data, like see knuckleball effectiveness vs. humidity or elderly pitchers vs. temperature. Combined with the MLB gameday pitch trajectory info you could do physics "experiments": show the break distance of all curveballs vs. atmospheric pressure.

Email me back if you're interested or with comments.

-----------------------
DATA FIELDS AVAILABLE
-----------------------


The fields I've spit out are

-- game_ID, gamedate, gamenum_in_day, start_time, daygame_flag from
the cwgame output.
- temp deg C
The temperature of the air in degrees Celsius.
- press_atmos HPa
The atmospheric pressure at the observation point.
- press_sealvl HPa

The air pressure relative to Mean Sea Level (MSL).
- press_altim HPa
The pressure value to which an aircraft altimeter is set so that it
will indicate the altitude relative to mean sea level of an aircraft
on the ground at the location for which the value was determined.
- press_chg_3hr_del HPa
The absolute value of the quantity of change in atmospheric pressure
measured at the beginning and end of a three hour period.
- press_chg_3hr_obs --

The code that denotes the characteristics of an
ATMOSPHERIC-PRESSURE-CHANGE that occurs over a period of
three hours.
- wind_dir deg
The angle, measured in a clockwise direction, between true north and
the direction from which the wind is blowing.
- wind_obs --
The code that denotes the character of the WIND-OBSERVATION.
- wind_speed m/s

The rate of horizontal travel of air past a fixed point.
- wind_gust_speed m/s
The rate of speed of a wind gust.
- cloud_cover_low (frac)
The code that represents the fraction of the celestial dome covered
by all low clouds present. If no low clouds are present; the code
denotes the fraction covered by all middle level clouds present.
- vis_dist m
The horizontal distance at which an object can be seen and identified.

- sunshine_time min
The quantity of time sunshine occurred over the reporting period.
- wea_pr_m_obs_1 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_2 --
The code that denotes a specific type of weather observed manually.
- wea_pr_m_obs_3 --
The code that denotes a specific type of weather observed manually.
- groundcond --

The code that denotes a type of Ground condition
- precip_hist_contin bool
The code that denotes whether precipitation is continuous (true) or
intermittent (false).
- precip_lq1_depth mm
The depth of LIQUID-PRECIPITATION that is measured at the time of an
observation. Unit:Millimeters
- precip_lq1_period hours
The quantity of time over which the LIQUID-PRECIPITATION was measured.

---------- WHAT I DID ----------

I used Brian Foy's Google Earth index of Major League Stadiums:
http://www252.pair.com/comdog/google_earth/major_league_baseball_stadiums.kml and the NCDC ISH-HISTORY file (gives locations for each weather station) ftp://ftp.ncdc.noaa.gov/pub/data/inventories/
to find the closest station with continuous data. (Turns out I could have saved a ton of trouble by just using the nearest airport -- in almost every case it was the best match.)

Then I pulled down data sets from http://cdo.ncdc.noaa.gov/pls/plclimprod/poemain.accessrouter?datasetabbv=DS3505 (If you're interested in replicating any of this I have a script that sends a GET url to help automate the weather data collection.) The last step is to match games with stadiums with locations, and dates and times with hourly observations.

I could be clever and subtle and use the start time and game duration to grab only the hours of gameplay, but instead I just pull in the records from 10:00am to 11:59pm for day games, and 5:00pm to 11:59pm for night games. I suppose I'll fix it to see if a game overhangs midnight and get the post-12am data for those only.

----------------------- WHAT YOU CAN DO TO HELP -----------------------

Geolocation for the rest of the stadiums

Inspect the data for consistency and correctness

If you have access to a computer at a .edu or .k12.us, or fancy GIS data, help me grab the rest of the weather files.

Email me if you'd like to help.

Labels: , , , , , , , , , , , ,

Retrosheet Eventfile Inconsistencies II

I've found a few more inconsistencies and minor inaccuracies in the retrosheet event files and game logs.

I made a diff (applied using the 'patch' tool) to mechanically recreate these corrections: http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff

I pulled these out by whipping up a few simple scripts (one-liners, mostly) that extracts all unique values for each event file field. For example, the only values for the "info,pitches" field are 'count, 'none' and 'pitches' -- just as promised in the documentation. The "info,temp" field, however, has not only normal temperatures ("78", or "104", or "0" for [unknown]) but also spurious values of '670' and '700' (wrong), '8/7' (ill-formed) and '' (differs with the format documentation).

I'll posting all the dubious entries (event files version 2007 Sep 23) I find at http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html as comments.

==================== Incorrect Data ====================

In 1993MIL.EVA:
info,start,spieb001,"Bim| Spiers",1,9,4
should be
info,start,spieb001,"Bill Spiers",1,9,4

These temperatures need fixing:
1988MON.EVN,info,temp,670
1988MON.EVN,info,temp,700
1964NYA.EVA,info,temp,8/7

I looked at a few suspiciously short games (< 60 minutes):
This should be 1:58, according to the NYT box score:

http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C\
8585F9
1958BOS.EVA,info,timeofgame,58
These two are correct:
1971BAL.EVA,info,timeofgame,48 BAL197107300 -- Game called due to rain
1976BOS.EVA,info,timeofgame,57 BOS197609100 -- Game called due to rain
Another thing to look at would be suspicious game length/number of
outs ratio, but I haven't done this yet.

I also checked a few games with attendance below 1000, but these seem
to be very cold or rescheduled days. I'll taka a peak sometime soon at
"game attendance less than two and a half standard deviations from
that year's average attendance" to see what sticks out. (I also
peeked at 2.5+ above -- those look like bandwagon game)

==================== Badly Formatted ====================

These are probably correct but just ill-formatted:
1959CHN.EVN,info,timeofgame,0158
2001PIT.EVN,info,attendance, 34915
1962BOS.EVA,info,daynight,day,
1966ATL.EVN,info,howscored,"park"
1966HOU.EVN,info,howscored,"park"
1970CHA.EVA:data,er,roung101,4#
1958PIT.EVN:data,er,wills102,1y

In these files, the "howscored" field is spelled "howentered":
1990BOS.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990HOU.EVN,info,howentered,game
1990HOU.EVN,info,howentered,game
1990LAN.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990PIT.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SLN.EVN,info,howentered,game
1990TEX.EVA,info,howentered,game
1990TEX.EVA,info,howentered,game

There are no "info,edittime" records -- is this purposeful?

==================== Inconsistent with Documentation ====================

In the 2003TBA.EVA file, the umpires are given by name and not by ID.

These are supposed to use 0 as the unknown value but in a few places
use a blank.
1990NYA.EVA,info,temp,
1978ATL.EVN,info,attendance,
1978NYA.EVA,info,attendance,
1979SDN.EVN,info,attendance,
2000PIT.EVN:info,windspeed,

There are some "info,ump[...],(None)" fields, and there are some
"info,ump[...]," fields. Does one indicate "unknown" and the other
indicate "none"? Or is this a formatting inconsistency?

These files have a bunch of "info,windspeed,unknown" fields (the dox
say "An unknown windspeed is indicated by -1."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN
1970ATL.EVN 1970HOU.EVN
These files have an "info,temp,unknown" field (the dox say "An unknown
temp is indicated by 0."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN 1970ATL.EVN
1970HOU.EVN 1990NYA.EVA

These lines have trailing spaces, which is harmless but still
shouldn't be there:
1958CHA.EVA:info,save,
1957BOS.EVA:com,"xwas a lot of action. Had this game been played
today, it no doubt"
1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
1958CLE.EVA:com,"$ Strong wind to left; cool"
1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
1958NYA.EVA:com,"$Total attendance: 13323"
1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"

Here are all the well-formed windspeed values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
23 24
25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87
What are the units on these? If this is in MPH, 39 is Gale force
("Difficult to walk against wind. Twigs and small branches blown off
trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64
is ("Trees uprooted, structural damage likely").

Here are games with windspeeds over 40:
id,CHA197408270|windspeed,67
id,MIN198008190|windspeed,87
id,TOR198208030|windspeed,68
id,CHN198307042|windspeed,74
id,TOR198307270|windspeed,87
id,LAN199006050|windspeed,78
id,DET199506160|windspeed,87
id,CLE199609141|windspeed,69
id,COL199606150|windspeed,59
id,DET199704300|windspeed,66
id,TEX200104220|windspeed,40
id,SLN200610010|windspeed,60

The SLN200610010 event file gives a wind speed of 60mph (from baseball-reference and ESPN), but a) that's crazy and b) the weather report from that day doesn't confirm it:

http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req\ _city=NA&req_state=NA&req_statename=NA Which gives 83F, 9mph SSW wind, clear

See also my next message, about getting weather data for each game.

The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other value is the wind speed" but I think it should be "WindSpeed: -1 Unknown other value is the wind speed in miles per hour".

Labels: , , , , , , , , , , ,

The Asdrubal Carrera Hall of Fame

Inspired by one of Tim McCarver's flights of fancy during the ALCS, I present The Asdrubal Carrera Hall of Fame, open to anyone in unique possession of a particular first name among Major League baseball players: LIST.

You may be familiar with Honus Wagner, Eppa Rixey, Boog Powell or Yogi Berra. But have you heard the storied diamond exploits of Firpo Mayberry, Zoilo Versalles, Pi Schwert or Bevo LeBourveau? OK, then how about Mysterious Walker, The Only Nolan, or Phenomenal Smith?

For some dinnertime fun over the holidays, discuss the relative merits of Urban Shocker, Twink Twining, Pussy Tebeau, Bris Lord, Boob Fowler, Crazy Schmit, Creepy Crespi, Cuddles Marshall, Vinegar Bend Mizell, and Buttercup Dickerson. (Unfortunately, 12 other players keep Rusty Kuntz off this list.)

Other stunningly yclept combatants include Ambiorix Burgos, Alamazoo Jennings, Welcome Gaston, Chicken Hawks, Sixto Lezcano, Wheezer Dell, Yam Yaryan, Yo-Yo Davalillo, Admiral Schlei, Boss Schmidt, Brick Smith, Brickyard Kennedy, Broadway Jones, Cannonball Titcomb, Baby Doll Jacobson, Sweetbreads Bailey, Zaza Harvey, Bubbles Hargrave, Pickles Dillhoefer, Double Joe Dwyer, Cowboy Jones, Coot Veal, Mul Holland, Live Oak Taylor, Skyrocket Smith, Kaiser Wilhelm, Kewpie Pennington, Possum Whitted, Snooks Dowd, and Mox McQuery.

See the list for links to each player's Baseball Reference page. Nerds may additionally view the generating mySQL query here.

Labels: , , , , , , , , ,

Wednesday, October 17, 2007

Retrosheet Eventfile Inconsistencies

Here are a few inconsistent records in the retrosheet.org event files of 2007 Sep 23. I'm using chadwick and not the retrosheet DOS utils, but I think I've source all these to the original event files. Weird Attendance in gamelog GL1941.TXT:
  WS1194107220 (WS1 vs DET) has '1500 e' as its attendance
Weird Start Time in eventfiles: Many daynight records lack an AM or PM. I assume the time mapping of times are as follows:
   daynight  start_time   24hr Time
   D or N    0            Unknown
    D        1000..1259   1000h to 1259h
    D        100..459     1300h to 1659h
    N        500..1150    1700h to 1359h
In that case, here are some weird start times reported by cwgame:
  - Negative start time:
      2003 D 0  -195 SEA 2003 04 15        SEA200304150    info,starttime,-2:05PM   info,daynight,day
  - No daynight flag:
      1998 D 0   506 LAN 1998 08 30        LAN199808300    info,starttime,5:06      -- no daynight --
  - Plainly inconsistent daynight flag:
      1985 D 1   605 CIN 1985 06 21        CIN198506211    info,starttime,6:05PM    info,daynight,day
      1960 N 0   135 BOS 1960 04 19        BOS196004190    info,starttime,1:35PM    info,daynight,night
  - Second half of a double header, listed as a day game despite 5pm or later start:
      1966 D 2   507 BAL 1966 10 02        BAL196610022    info,starttime,5:07PM    info,daynight,day
      2001 D 2   500 PHI 2001 05 27        PHI200105272    info,starttime,5:00PM    info,daynight,day
      2001 D 2   519 PIT 2001 06 03        PIT200106032    info,starttime,5:19PM    info,daynight,day
      2001 D 2   625 MIN 2001 05 26        MIN200105262    info,starttime,6:25PM    info,daynight,day
      2001 D 2   719 CHA 2001 09 04        CHA200109042    info,starttime,7:19PM    info,daynight,day
      2001 D 2   738 CHN 2001 08 20        CHN200108202    info,starttime,7:38PM    info,daynight,day
      2001 D 2   752 PIT 2001 09 03        PIT200109032    info,starttime,7:52PM    info,daynight,day
      2001 D 2   753 SLN 2001 08 03        SLN200108032    info,starttime,7:53PM    info,daynight,day
  - Start times that appear to be after midnight (this could be correct):
      1996 N 1    35 CIN 1996 06 25        CIN199606251    info,starttime,0:35      info,daynight,night
      1998 N 0   105 LAN 1998 06 13        LAN199806130    info,starttime,1:05      info,daynight,night
      1966 N 2  1207 BAL 1966 06 08        BAL196606082    info,starttime,12:07AM   info,daynight,night
 
These eventfile games have more than one "info,daynight" record
  ATL197004150    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197004160    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197005260    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006191    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006192    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006200    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197006210    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007031    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007032    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197007050    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009220    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009230    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009240    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009250    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009260    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  ATL197009270    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197006220    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008031    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008032    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197008040    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009010    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009110    info,starttime,0:00PM   info,daynight,day       info,daynight,night
  HOU197009130    info,starttime,0:00PM   info,daynight,day       info,daynight,night
This eventfile game is missing an "info,daynight" record:
  LAN199808300    info,starttime,5:06
File Structure in eventfile 2001HOU.EVN:
  2001HOU.EVN lacks a trailing newline (unix commands hate this).
Here are the unix commands I used to dump all that info. Sorry for the one-linerism.
# How many have a negative starttime?
grep 'info,starttime,-' *.EV*

# How many have missing or extra "info,daynight" fields?
# -- pull out the info, daynight and starttime records in order
# -- slurp the whole file as one giant string with internal linebreaks;
# -- split each stretch following an id,XXXX record into one line
# -- dump lines that have none or more than one daynight record
  cat *.EV* | egrep '^(id,|info,daynight|info,starttime)' | \
    perl -e '$_ = join(" ",<>); s/[\r\n]+/!!!/g; @games= (split /id,/, $_);
      shift @games;
      for $game (@games) {
          $game =~ s/!!!/\t/g; print "$game\n" if (($game !~ m/daynight/) || ($game =~ m/daynight.*daynight/));
      }'

# How many have a start_time and daynight_flag that disagree?
# -- use cwgame to pull off the gameID,start_time,daynight_flag records;
#    put it into a temporary file    
# -- Use a big stupid regex to find
#    . start_time that is >  500 and marked day
#    . start_time that is <  500 and marked night 
#    . start_time that is > 1200 and marked night 
#    . start_time that is <  100 
#    . start_time that is negative
( for ((year=1957;$year<=2006;year++)) ; do \
     for teamfile in ${year}*.[Ee][Vv]* ; do \
     cwgame -y $year -f '0-0,4-4,6-6' $teamfile 2>/dev/null ; \
     done; \
  done ) > /tmp/starttimeIDs.txt
cat /tmp/starttimeIDs.txt | \
  perl -ne '(m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",(12\d\d|[1234]\d\d|\d\d|[1-9]|-\d+),"(N)"/ ||
    m/"(\w\w\w)(\d\d\d\d)(\d\d)(\d\d)(\d)",((?:5|6|7)\d\d|.*-.*|\d\d|[1-9]),"(D)"/)    &&
    printf "%s %s %5d %s %s %s %s\n", $7, $5, $6, $1, $2, $3, $4;' | sort

Labels: , , , , , , , ,