Retrosheet Eventfile Inconsistencies II

I've found a few more inconsistencies and minor inaccuracies in the Retrosheet event files and game logs.

I made a diff (applied using the 'patch' tool) to mechanically recreate these corrections: http://vizsage.com/apps/baseball/results/rseventfiles_20070923_patch.diff

I pulled these out by whipping up a few simple scripts (one-liners, mostly) that extract all unique values for each event file field. For example, the only values for the "info,pitches" field are 'count', 'none' and 'pitches' -- just as promised in the documentation. The "info,temp" field, however, has not only normal temperatures ("78", or "104", or "0" for [unknown]) but also spurious values of '670' and '700' (wrong), '8/7' (ill-formed) and '' (differs from the format documentation).
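The original one-liners aren't reproduced here, but the idea can be sketched in Python. This is only an illustration: the function name and the sample lines are made up, and it assumes info records have the shape "info,<field>,<value>".

```python
import csv
from collections import defaultdict

def unique_info_values(lines):
    """Map each info field name to the set of values seen for it."""
    values = defaultdict(set)
    # csv.reader copes with the quoted commas that show up in com records
    for row in csv.reader(lines):
        # Assumed record shape: info,<field>,<value>
        if len(row) >= 3 and row[0] == "info":
            values[row[1]].add(row[2])
    return values

# Made-up sample lines, not taken from a real event file
sample = [
    "info,pitches,pitches",
    "info,temp,78",
    "info,temp,670",
    "info,temp,",
    "play,1,0,spieb001,??,,S7",
]
for field, seen in sorted(unique_info_values(sample).items()):
    print(field, sorted(seen))
```

To run it over a real event file, pass an open file object instead of the sample list; oddballs like '8/7' then show up simply by reading down each field's value list.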

I'll be posting all the dubious entries I find (event files version 2007 Sep 23) as comments at http://vizsage.com/blog/2007/10/retrosheet-eventfile-inconsistencies.html.

==================== Incorrect Data ====================

In 1993MIL.EVA:
info,start,spieb001,"Bim| Spiers",1,9,4
should be
info,start,spieb001,"Bill Spiers",1,9,4

These temperatures need fixing:
1988MON.EVN,info,temp,670
1988MON.EVN,info,temp,700
1964NYA.EVA,info,temp,8/7

I looked at a few suspiciously short games (< 60 minutes):
According to the NYT box score
(http://select.nytimes.com/gst/abstract.html?res=FB0614F73D59107B93C4A8178FD85F4C8585F9),
this one should be 1:58:
1958BOS.EVA,info,timeofgame,58
These two are correct:
1971BAL.EVA,info,timeofgame,48 BAL197107300 -- Game called due to rain
1976BOS.EVA,info,timeofgame,57 BOS197609100 -- Game called due to rain
Another thing to look at would be suspicious game length/number of
outs ratio, but I haven't done this yet.

I also checked a few games with attendance below 1000, but these seem
to be very cold or rescheduled days. I'll take a peek sometime soon at
"game attendance more than two and a half standard deviations below
that year's average attendance" to see what sticks out. (I also
peeked at 2.5+ above -- those look like bandwagon games.)
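The standard-deviation check could look something like the sketch below. It is an illustration under stated assumptions, not the script used for this post: it assumes you already have (game id, attendance) pairs, that the year sits in characters 4-7 of the game id (as in BOS197609100), and that an attendance of 0 means unknown and should be skipped. The function name is made up.

```python
from statistics import mean, pstdev

def attendance_outliers(games, threshold=2.5):
    """Return (game_id, attendance, z) for games whose attendance is more
    than `threshold` standard deviations from that year's mean."""
    by_year = {}
    for game_id, attendance in games:
        if attendance <= 0:
            continue  # assumption: 0 marks an unknown attendance
        year = game_id[3:7]  # assumption: ids look like BOS197609100
        by_year.setdefault(year, []).append((game_id, attendance))

    flagged = []
    for rows in by_year.values():
        counts = [a for _, a in rows]
        mu, sigma = mean(counts), pstdev(counts)
        if sigma == 0:
            continue  # every game identical; nothing can stick out
        for game_id, a in rows:
            z = (a - mu) / sigma
            if abs(z) > threshold:
                flagged.append((game_id, a, round(z, 2)))
    return flagged
```

A negative z is a suspiciously empty park and a positive z is a packed one, so both tails come out of the same pass. This groups by year across all teams; grouping by team and year would probably be a fairer comparison.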

==================== Badly Formatted ====================

These are probably correct but just ill-formatted:
1959CHN.EVN,info,timeofgame,0158
2001PIT.EVN,info,attendance, 34915
1962BOS.EVA,info,daynight,day,
1966ATL.EVN,info,howscored,"park"
1966HOU.EVN,info,howscored,"park"
1970CHA.EVA:data,er,roung101,4#
1958PIT.EVN:data,er,wills102,1y

In these files, the "howscored" field is spelled "howentered":
1990BOS.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990DET.EVA,info,howentered,game
1990HOU.EVN,info,howentered,game
1990HOU.EVN,info,howentered,game
1990LAN.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990MON.EVN,info,howentered,game
1990PIT.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SFN.EVN,info,howentered,game
1990SLN.EVN,info,howentered,game
1990TEX.EVA,info,howentered,game
1990TEX.EVA,info,howentered,game

There are no "info,edittime" records -- is this purposeful?

==================== Inconsistent with Documentation ====================

In the 2003TBA.EVA file, the umpires are given by name and not by ID.

These are supposed to use 0 as the unknown value but in a few places
use a blank.
1990NYA.EVA,info,temp,
1978ATL.EVN,info,attendance,
1978NYA.EVA,info,attendance,
1979SDN.EVN,info,attendance,
2000PIT.EVN:info,windspeed,

There are some "info,ump[...],(None)" fields, and there are some
"info,ump[...]," fields. Does one indicate "unknown" and the other
indicate "none"? Or is this a formatting inconsistency?

These files have a bunch of "info,windspeed,unknown" fields (the dox
say "An unknown windspeed is indicated by -1."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN
1970ATL.EVN 1970HOU.EVN
These files have an "info,temp,unknown" field (the dox say "An unknown
temp is indicated by 0."):
1969ATL.EVN 1969HOU.EVN 1969MON.EVN 1969PIT.EVN 1969SDN.EVN 1970ATL.EVN
1970HOU.EVN 1990NYA.EVA
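Blank and 'unknown' values like these are easy to flag mechanically. The sketch below is one hedged way to do it; the function name is made up, and the set of fields treated as integer-valued (temp, windspeed, attendance, timeofgame) is my assumption from the examples above rather than a list taken from the documentation.

```python
import csv
import re

# Assumption: these info fields are meant to hold plain integers.
NUMERIC_FIELDS = {"temp", "windspeed", "attendance", "timeofgame"}
INTEGER = re.compile(r"-?\d+")

def audit_numeric_info(lines):
    """Return (field, value, complaint) for numeric info fields
    whose value is not a plain integer."""
    problems = []
    for row in csv.reader(lines):
        if len(row) < 3 or row[0] != "info" or row[1] not in NUMERIC_FIELDS:
            continue
        field, value = row[1], row[2]
        if value == "":
            problems.append((field, value, "blank"))
        elif value.lower() == "unknown":
            problems.append((field, value, "literal 'unknown'"))
        elif not INTEGER.fullmatch(value):
            # catches things like '8/7' or a value with a leading space
            problems.append((field, value, "not an integer"))
    return problems
```

It accepts -1 and 0, so the documented sentinels pass. It does not catch a zero-padded value like 0158 or an out-of-range one like 670; those need a separate range check.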

These lines have trailing spaces, which is harmless but still
shouldn't be there:
1958CHA.EVA:info,save,
1957BOS.EVA:com,"xwas a lot of action. Had this game been played
today, it no doubt"
1957BRO.EVN:com,"$In addition to 12,559 paid, 6000 knothole,"
1957CLE.EVA:com,"xCC4 changed E9/F.2-3;BX2(9)# to 9/F.2-3(E9)#"
1957MLN.EVN:com,"xCC4 per film, TSN 26 is DP"
1958CLE.EVA:com,"$ Strong wind to left; cool"
1958KC1.EVA:com,"xScoresheet scores DP as 142. I Checked with newspaper"
1958NYA.EVA:com,"$Total attendance: 13323"
1958SFN.EVN:com,"$paper box and Cin s/s has Cepeda and Sauer reversed"
1958SFN.EVN:com,"$paper box has stats that match SF s/s not Cin s/s"
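Trailing whitespace is invisible in a listing like the one above, so here is a small sketch of how such lines could be found. The function name is hypothetical; it only strips the line ending and then checks for leftover spaces or tabs.

```python
def trailing_whitespace(lines):
    """Yield (line_number, line) for lines ending in spaces or tabs."""
    for number, line in enumerate(lines, start=1):
        body = line.rstrip("\r\n")  # drop the line ending, keep everything else
        if body != body.rstrip(" \t"):
            yield number, body

# Made-up sample: only the first line has a trailing space
sample = ["info,save, \n", "info,save,\n"]
for number, body in trailing_whitespace(sample):
    print(number, repr(body))
```

Printing with repr() makes the stray spaces visible inside the quotes.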

Here are all the well-formed windspeed values:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 35 36 37 38 40 59 60 66 67 68 69 74 78 87
What are the units on these? If this is in MPH, 39 is Gale force
("Difficult to walk against wind. Twigs and small branches blown off
trees."), 55 is Storm ("Trees uprooted, structural damage likely") and 64
is Violent Storm on the Beaufort scale (widespread damage).

Here are games with windspeeds of 40 or more:
id,CHA197408270|windspeed,67
id,MIN198008190|windspeed,87
id,TOR198208030|windspeed,68
id,CHN198307042|windspeed,74
id,TOR198307270|windspeed,87
id,LAN199006050|windspeed,78
id,DET199506160|windspeed,87
id,CLE199609141|windspeed,69
id,COL199606150|windspeed,59
id,DET199704300|windspeed,66
id,TEX200104220|windspeed,40
id,SLN200610010|windspeed,60
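A list like this can be rebuilt by remembering the most recent "id" record while scanning for "info,windspeed" records. The sketch below is only an illustration with a made-up function name; it assumes each game starts with an "id,<gameid>" line and it uses a limit of 40 inclusive, since the list above includes a game at exactly 40.

```python
import re

def high_wind_games(lines, limit=40):
    """Return (game_id, windspeed) for games whose windspeed is >= limit."""
    current_id = None
    hits = []
    for line in lines:
        parts = line.strip().split(",")
        if parts[0] == "id" and len(parts) >= 2:
            current_id = parts[1]  # assumption: an id record opens each game
        elif parts[:2] == ["info", "windspeed"] and len(parts) >= 3:
            value = parts[2].strip()
            # skip blanks and 'unknown'; only well-formed integers count
            if re.fullmatch(r"-?\d+", value) and int(value) >= limit:
                hits.append((current_id, int(value)))
    return hits
```

Values that are blank or 'unknown' are skipped here rather than reported, so this complements a format check instead of replacing it.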

The SLN200610010 event file gives a wind speed of 60mph (from baseball-reference and ESPN), but a) that's crazy and b) the weather report from that day doesn't confirm it:

http://www.wunderground.com/history/airport/KSTL/2006/10/1/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA

That report gives 83F, 9 mph SSW wind, clear.

See also my next message, about getting weather data for each game.

The BGAME.exe documentation says "WindSpeed: 0 Unknown, 1 Known, other value is the wind speed" but I think it should be "WindSpeed: -1 Unknown, other value is the wind speed in miles per hour".
