Retrosheet Eventfile Inconsistencies II
I've found a few more inconsistencies and minor inaccuracies in the retrosheet event files and game logs.
I made a diff (applied using the 'patch' tool) to mechanically recreate these corrections: http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff
I pulled these out by whipping up a few simple scripts (one-liners, mostly) that extract all unique values for each event file field. For example, the only values for the "info,pitches" field are 'count', 'none' and 'pitches' -- just as promised in the documentation. The "info,temp" field, however, has not only normal temperatures ("78", or "104", or "0" for [unknown]) but also spurious values of '670' and '700' (wrong), '8/7' (ill-formed) and '' (differs from the format documentation).
I'll be posting all the dubious entries I find (event files version 2007 Sep 23) as comments at http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html
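The unique-values check described above can be sketched in a few lines of Python. This is not the original one-liner, only an illustration of the idea; the function name `unique_info_values` is made up for this example, and it assumes every record of interest has the form `info,<field>,<value>`.

```python
from collections import defaultdict

def unique_info_values(lines):
    """Map each info field name to the set of distinct values seen for it."""
    values = defaultdict(set)
    for line in lines:
        # Split into at most three parts so a value containing commas stays whole.
        parts = line.rstrip("\r\n").split(",", 2)
        if len(parts) == 3 and parts[0] == "info":
            values[parts[1]].add(parts[2])
    return dict(values)

# A few records in event-file style; the temp values are the ones quoted above,
# the comment record is made up.
sample = [
    'com,"made-up comment, not from a real file"',
    "info,pitches,count",
    "info,temp,78",
    "info,temp,670",
    "info,temp,",
    "info,temp,78",
]

for field, seen in sorted(unique_info_values(sample).items()):
    print(field, sorted(seen))
```

To run it over real files, pass an open file object (or several chained together) instead of `sample`; any value that shows up only once or twice across all seasons is worth a second look.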
==================== Incorrect Data ====================
In 1993MIL.EVA:
  info,start,spieb001,"Bim| Spiers",1,9,4
should be
  info,start,spieb001,"Bill Spiers",1,9,4

These temperatures need fixing:
  1988MON.EVN,info,temp,670
  1988MON.EVN,info,temp,700
  1964NYA.EVA,info,temp,8/7

I looked at a few suspiciously short games (< 60 minutes).

This should be 1:58, according to the NYT box score:
http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C8585F9
  1958BOS.EVA,info,timeofgame,58

These two are correct:
  1971BAL.EVA,info,timeofgame,48   BAL197107300 -- Game called due to rain
  1976BOS.EVA,info,timeofgame,57   BOS197609100 -- Game called due to rain

Another thing to look at would be a suspicious ratio of game length to number of outs, but I haven't done this yet.

I also checked a few games with attendance below 1000, but these seem to be very cold or rescheduled days. I'll take a peek sometime soon at games whose attendance is more than two and a half standard deviations below that year's average to see what sticks out. (I also peeked at games 2.5+ standard deviations above -- those look like bandwagon games.)
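The attendance check mentioned above (games more than 2.5 standard deviations from that year's average) could look roughly like this. It is a sketch only: the function name, the `(game_id, year, attendance)` tuple layout, and the sample figures are all assumptions made for illustration, not data from the event files.

```python
from collections import defaultdict
from statistics import mean, stdev

def attendance_outliers(games, threshold=2.5):
    """Return (game_id, z_score) for games far from their year's mean attendance.

    games: iterable of (game_id, year, attendance) tuples.
    """
    by_year = defaultdict(list)
    for game_id, year, attendance in games:
        by_year[year].append((game_id, attendance))

    flagged = []
    for year, rows in by_year.items():
        counts = [attendance for _, attendance in rows]
        if len(counts) < 2:
            continue  # a standard deviation needs at least two games
        mu, sigma = mean(counts), stdev(counts)
        if sigma == 0:
            continue  # every game identical, so nothing stands out
        for game_id, attendance in rows:
            z = (attendance - mu) / sigma
            if abs(z) >= threshold:
                flagged.append((game_id, round(z, 2)))
    return flagged

# Made-up attendance figures and placeholder game IDs, purely for illustration.
sample = [(f"GAME{i:02d}", 1971, 20000) for i in range(1, 21)]
sample.append(("GAME21", 1971, 500))
print(attendance_outliers(sample))
```

A negative z-score marks an unusually empty ballpark and a positive one an unusually full one, so the same pass covers both the "below" and the "above" checks. Grouping by year only is a simplification; grouping by year and home team would probably be fairer, since ballpark capacities differ.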
==================== Badly Formatted ====================
These are probably correct but just ill-formatted:
  1959CHN.EVN,info,timeofgame,0158
  2001PIT.EVN,info,attendance, 34915
  1962BOS.EVA,info,daynight,day,
  1966ATL.EVN,info,howscored,"park"
  1966HOU.EVN,info,howscored,"park"
  1970CHA.EVA:data,er,roung101,4#
  1958PIT.EVN:data,er,wills102,1y

In these files, the "howscored" field is spelled "howentered":
  1990BOS.EVA,info,howentered,game
  1990DET.EVA,info,howentered,game
  1990DET.EVA,info,howentered,game
  1990DET.EVA,info,howentered,game
  1990DET.EVA,info,howentered,game
  1990HOU.EVN,info,howentered,game
  1990HOU.EVN,info,howentered,game
  1990LAN.EVN,info,howentered,game
  1990MON.EVN,info,howentered,game
  1990MON.EVN,info,howentered,game
  1990PIT.EVN,info,howentered,game
  1990SFN.EVN,info,howentered,game
  1990SFN.EVN,info,howentered,game
  1990SLN.EVN,info,howentered,game
  1990TEX.EVA,info,howentered,game
  1990TEX.EVA,info,howentered,game

There are no "info,edittime" records -- is this purposeful?
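Formatting slips like the ones listed above are easy to catch mechanically. The sketch below covers only the cases seen in this section: stray whitespace, quoted values, an extra trailing field, non-integer numbers, and the "howentered" spelling. The function name and the rule set are assumptions for illustration; they are not taken from the Retrosheet documentation.

```python
import re

# A plain non-negative integer with no leading zeros, signs, or stray characters.
PLAIN_INT = re.compile(r"0|[1-9][0-9]*")

def format_problems(record):
    """Return a list of formatting complaints for one raw event-file record."""
    parts = record.split(",")
    problems = []
    if parts[0] == "info" and len(parts) >= 3:
        field, value, extra = parts[1], parts[2], parts[3:]
        if field == "howentered":
            problems.append('field is spelled "howentered", expected "howscored"')
        if extra:
            problems.append("extra field after value")
        if value != value.strip():
            problems.append("whitespace around value")
        if len(value) >= 2 and value[0] == value[-1] == '"':
            problems.append("value is quoted")
        # Assumption: these two fields should always hold plain integers.
        if field in ("timeofgame", "attendance") and not PLAIN_INT.fullmatch(value.strip()):
            problems.append("not a plain integer")
    elif parts[:2] == ["data", "er"] and len(parts) == 4:
        if not PLAIN_INT.fullmatch(parts[3]):
            problems.append("earned runs not a plain integer")
    return problems

# Records quoted in this section (file-name prefixes removed), plus one clean record.
for record in [
    "info,timeofgame,0158",
    "info,attendance, 34915",
    "info,daynight,day,",
    'info,howscored,"park"',
    "data,er,roung101,4#",
    "info,howentered,game",
    "info,temp,78",
]:
    print(repr(record), format_problems(record))
```

An empty list means the record passed every check. Note that a blank attendance value is also reported as "not a plain integer", which matches the documentation's use of 0 for unknown.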
==================== Inconsistent with Documentation ====================
In the 2003TBA.EVA file, the umpires are given by name and not by ID.

These fields are supposed to use a numeric unknown value (0, or -1 for windspeed) but in a few places use a blank:
  1990NYA.EVA,info,temp,
  1978ATL.EVN,info,attendance,
  1978NYA.EVA,info,attendance,
  1979SDN.EVN,info,attendance,
  2000PIT.EVN:info,windspeed,

There are some "info,ump[...],(None)" fields, and there are some "info,ump[...]," fields. Does one indicate "unknown" and the other indicate "none"? Or is this a formatting inconsistency?

These files have a bunch of "info,windspeed,unknown" fields (the docs say "An unknown windspeed is indicated by -1."):
  1969ATL.EVN
  1969HOU.EVN
  1969MON.EVN
  1969PIT.EVN
  1969SDN.EVN
  1970ATL.EVN
  1970HOU.EVN

These files have an "info,temp,unknown" field (the docs say "An unknown temp is indicated by 0."):
  1969ATL.EVN
  1969HOU.EVN
  1969MON.EVN
  1969PIT.EVN
  1969SDN.EVN
  1970ATL.EVN
  1970HOU.EVN
  1990NYA.EVA

These lines have trailing spaces, which is harmless but still shouldn't be there:
  1958CHA.EVA:info,save,
  1957BOS.EVA:com,"xwas a lot of action. Had this game been played today, it no doubt"
  1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
  1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
  1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
  1958CLE.EVA:com,"$ Strong wind to left; cool"
  1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
  1958NYA.EVA:com,"$Total attendance: 13323"
  1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
  1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"

Here are all the well-formed windspeed values:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87

What are the units on these? If this is in MPH, 39 is Gale force ("Difficult to walk against wind. Twigs and small branches blown off trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64 is Violent Storm on the Beaufort scale.

Here are the games with windspeeds of 40 or more:
  id,CHA197408270|windspeed,67
  id,MIN198008190|windspeed,87
  id,TOR198208030|windspeed,68
  id,CHN198307042|windspeed,74
  id,TOR198307270|windspeed,87
  id,LAN199006050|windspeed,78
  id,DET199506160|windspeed,87
  id,CLE199609141|windspeed,69
  id,COL199606150|windspeed,59
  id,DET199704300|windspeed,66
  id,TEX200104220|windspeed,40
  id,SLN200610010|windspeed,60
The SLN200610010 event file gives a wind speed of 60mph (from baseball-reference and ESPN), but a) that's crazy and b) the weather report from that day doesn't confirm it:
http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA
which gives 83F, 9mph SSW wind, clear.
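The windspeed checks above boil down to sorting each raw value into a handful of buckets. Here is a minimal sketch of that: the bucket names and the function name are made up for this example, the cutoff of 40 simply mirrors the list of games above, and the units are assumed to be miles per hour.

```python
def classify_windspeed(value, suspicious_at=40):
    """Sort a raw "info,windspeed" value into a rough quality bucket."""
    if value == "-1":
        return "unknown (documented)"  # the documented marker for an unknown windspeed
    stripped = value.strip()
    if stripped == "" or stripped.lower() == "unknown":
        return "unknown (undocumented form)"
    if not (value.isascii() and value.isdigit()):
        return "ill-formed"
    return "suspicious" if int(value) >= suspicious_at else "ok"

# Values of the kinds discussed above.
for raw in ["9", "40", "87", "-1", "", "unknown", "8/7"]:
    print(repr(raw), "->", classify_windspeed(raw))
```

Anything that lands in "suspicious" still needs a human look, as with SLN200610010, where the recorded 60 disagrees with the weather report for that day.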
See also my next message, about getting weather data for each game.
The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other value is the wind speed" but I think it should be "WindSpeed: -1 Unknown, other value is the wind speed in miles per hour".
Labels: baseball, bug, consistency, data, error, format, mash-up, mashup, mining, retrosheet, retrosheet.org, weather