Sunday, January 27, 2008

Rails Lessons Learned the Hard Way

Things I've learned the hard way in Rails:
  • Views render before layouts, not the other way round. Set an instance variable in app/views/monkeys/show.html.erb and it will be defined in app/views/layouts/monkeys.html.erb, but not vice versa.
    • set instance vars in view
      @foo_val = find_foo_val
    • pass variables to partials using
      <%= render :partial => "root/license", :locals => { :foo => @foo_val } -%>
    • use the instance var freely in the layout; it will take the value defined in the view
  • Dump an object for bobo debugging through the console or log:
    $stderr.puts tag_list.to_yaml
  • In a migration, if you define a unique index on an attribute, set :unique => true on both the attribute AND the index. Remember that the index only enforces uniqueness at the database level -- it doesn't give you a Rails validation (see the model sketch after this list):
    
       create_table  :monkeys do |t|
         # set :unique here
         t.string :name, :default => "", :null => false, :unique => true
       end
       # if you have :unique here
       add_index :monkeys, [:name], :name => :name, :unique => true
    
  • If you scaffold a User or other object with private data, MAKE SURE you strip out fields you don't want a user setting or viewing:
    • Set attr_accessible, which controls data coming *in* -- prevents someone setting an attribute by stuffing in a form value.
    • In each view (.html.erb &c) and render method (to_xml), strip out fields you don't want anyone to see using the :only => [:ok_to_see, :this_too] parameter.
    • Set filter_parameter_logging, which controls what goes into your logs. (Logs should of course be outside the public purview, but 'Defense in Depth' is ever our creed.)
     Using the restful-authentication generator as an example:
    • In the model, whitelist fields the user is allowed to set (this excludes things like confirmation code or usergroup):
      attr_accessible :login, :email, :password, :password_confirmation
    • In the controller file, whitelist only the fields you wish to xml serialize:
      format.xml { render :xml => @user.to_xml(:only => [:first_name, :last_name]) }
    • Obviously, in show.html.erb and edit.html.erb, strip out fields that shouldn't be seen.
    • In the controller (ApplicationController is the usual spot), blacklist fields from the logs:
      filter_parameter_logging :password, :salt, "activation-code"
  • I won't even tell you how often this happens to me: If you edit or install code in a plugin, restart the server.
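
To make the uniqueness and mass-assignment points concrete, here's a hedged sketch of what the matching model might look like (the Monkey model is just the running example above; the validations are the standard Rails ones, not anything the migration generates for you):

    class Monkey < ActiveRecord::Base
      # The unique index enforces uniqueness at the database level; add the
      # validation too, so Rails reports a friendly error instead of letting
      # the save blow up with a SQL exception.
      validates_presence_of   :name
      validates_uniqueness_of :name

      # Whitelist what mass assignment (e.g. Monkey.new(params[:monkey])) may set.
      attr_accessible :name
    end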


Parsing Names with Honorifics

In Railscast #16, Ryan Bates goes over Virtual Attributes in Rails, using the standard example of storing first and last names but getting/setting full names. He uses the following simple snippet:


def full_name=(name)
  split = name.split(' ', 2)
  self.first_name = split.first
  self.last_name = split.last
end

That's fine for explanation, given that the focus was on virtual attributes. However, that snippet will fail on names like "Franklin Delano Roosevelt" (last name of "Delano Roosevelt"). Here's a method which our 32nd President will like better:


def clean(n, re = /\s+|[^[:alpha:],\-]/)
  return n.gsub(re, ' ').strip
end

# Returns [first_name, last_name] (or '' if there isn't any).
# Leading/trailing spaces ignored.
def first_last_from_name(n) 
    parts    = clean(n).split(' ')
    [parts.slice(0..-2).join(' '), parts.last]
end

names = [
    "Bill! Merkin,PhD.",
    "Jim               Thurston Howell III   ",
    "Charo", 
    "Heywood Jablowmie",
    "Sergei Rodriguez-Ivanoviv",
    "Polly Romanesq. ",
    "   ", 
    "",
    ]
p names.map { |n| first_last_from_name n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], ["", "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], ["", nil], ["", nil]]

A regex is more extensible, and makes more sense for Perl refugees like me.


# Returns [first_name, last_name] (or nil if there isn't any).
# Leading/trailing spaces ignored.
def first_last_from_name_re(n)
    n = clean(n)
    (n =~ / /) ? n.scan(/(.*)\s+(\S+)$/).first : [nil, n]
end

p names.map { |n| first_last_from_name_re n }
# => [["Bill", "Merkin,PhD"], ["Jim Thurston Howell", "III"], [nil, "Charo"], ["Heywood", "Jablowmie"], ["Sergei", "Rodriguez-Ivanoviv"], ["Polly", "Romanesq"], [nil, ""], [nil, ""]]

However, as someone who can't check in at the automatic kiosks in airports because -- no joke -- the credit card thinks my last name is "IV", I like this version better.


# Returns [first_name, last_name, appendix] 
# (first name and appendix are nil if there isn't any).
# Leading/trailing spaces ignored.
# 
def first_last_appendix_from_name_re(n, appendix_re = nil)
    n = clean(n)
    appendix_re ||= %q((I|II|III|IV|(?:jr|sr|m\.?d|esq|Ph\.?D)\.?))
    if (n !~ / /) then
        [nil, n, nil]           # with no spaces return n as last name
    else
        n.scan(
          /\A(.*?)\s+           # everything up to the last name
           (\S+?)               # last name is last stretch of non-whitespace
           (?:                  # But! there may be an appendix.  Look for an optional group
             (?:,\s*|\s+)       #   that is set off by a comma or spaces
             #{appendix_re}     #   and that matches any of our standard honorifics.
             )?                 # but if not, don't worry about it.
           \Z/ix).first         # scan gives array of arrays; \A..\Z guarantees exactly one match
    end
end

p names.map { |n| first_last_appendix_from_name_re n }
# => [["Bill", "Merkin", "PhD"], ["Jim Thurston", "Howell", "III"], [nil, "Charo", nil], ["Heywood", "Jablowmie", nil], ["Sergei", "Rodriguez-Ivanoviv", nil], ["Polly", "Romanesq", nil], [nil, "", nil], [nil, "", nil]]

All three versions might make Japanese (and other "FamilyName GivenNames" cultures) sad.
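
If you need to handle family-name-first input, one option is to let the caller say which order the name arrives in. A hedged sketch (the helper name and flag are made up; it reuses clean from above):

# Hypothetical variant: treat "Yamada Taro" as family-name-first when asked.
# Reuses clean() defined above.
def given_family_from_name(n, family_name_first = false)
    parts = clean(n).split(' ')
    return ['', parts.last] if parts.size < 2
    if family_name_first
        [parts[1..-1].join(' '), parts.first]   # [given name(s), family name]
    else
        [parts[0..-2].join(' '), parts.last]
    end
end

p given_family_from_name("Yamada Taro", true)
# => ["Taro", "Yamada"]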


Tuesday, January 22, 2008

Copyright Disputes are usually Failures of Imagination

Hasbro is trying to shut down Scrabulous, a successful online Scrabble game -- perhaps the most successful Facebook app to date.

On the one hand, I think that Hasbro is completely within their rights: it's a clear infringement.

On the other hand, it's a departure from form (they've for a long time licensed gray-market implementations), and a failure of imagination that doesn't account for important subtleties in software engineering and social networks.

On the software engineering end, all of the interesting computer Scrabble implementations I know of were created independently and *then* brought into the fold, to both parties' mutual benefit. Hasbro is a board game company: it doesn't, and shouldn't, employ the brilliant independent software engineers who create new entries in the scrabble ecosystem. The other thing to note is that Scrabulous solves some difficult problems in a way no previous product has.

Here's a brief history of the important scrabble programs I know of. The first ones let you play against a computer; this requires a powerful artificial intelligence (AI) engine and an unobtrusive interface. (The hard part is the AI; note that Scrabulous was written in Flash, a very constrained programming environment.) Maven was the first Scrabble program that played at an expert level (at one time it was the best scrabble player in the world). Though developed independently, it was purchased by Hasbro (or their licensee) and adopted as the AI agent in the official Hasbro Scrabble software. The official software doesn't appear to have been updated in some years, it was Windows-only, and the official scrabble site has no link to it. ACbot, another early implementation, was independently developed by James A Cherry and could play at a low-expert level. A current offering is Quackle, a free scrabble robot developed by a student at MIT. Its AI engine is extremely strong (also one of the best players in the world) and its front end, while /quite/ rough, is usable and works on Windows/OSX/Linux. All of these programs were written outside Hasbro's aegis. They were developed by experts in computer artificial intelligence and game theory and are far superior to anything that was or could be developed in-house by a board game company.

Another approach lets you play against a person using the network in real time. One of the first was MarlDOoM -- a primitive (text only, pre-web technology) free online scrabble bulletin board. It was developed by John Chew, who at the time was simply a scrabble enthusiast but is now on the official Nat'l Scrabble Association's dictionary committee and the webmaster for their site -- I believe that implementing MarlDOoM helped bring this about. There are modern programs and websites that are officially licensed and let you compete remotely. However, their price or subscription fee exceeds the cost of the physical version, and they require that *both* parties pay for the game, which the physical version does not.

A third approach is 'scrabble by mail' -- one move every day or so, with as much or as little time commitment and deliberation as you care to devote. If there are licensed products that allow this I'm not aware of them.

In all, here's what you'd like a compelling software version of a board game to offer:

  1. Play from any computer, anywhere; simple to acquire, install and use.
  2. Reasonable price compared to the physical game
  3. Skill level:
    • Enjoyable for an expert player
    • Enjoyable for a casual player
    • A casual player and a strong player may enjoy a game where their focus is on socializing and not gameplay
  4. Time commitment:
    • Play for 10 minutes at a time -- a quick diversion.
    • Play for an hour at a time -- a leisure activity
    • Play without having to meet at the same time
  5. Social play:
    • Play remotely against a friend, in real time (complete a game at one sitting)
    • Play remotely in a "Chess by Mail" context: make a move every day or so, when you have time.
  6. Competitive play:
    • Compete remotely against a skill-matched stranger, in real time or move-a-day
    • Track durable competitive rankings
    • Tie those ratings to a reputation system to prevent gaming the rating mechanism.

None of the licensed programs or sites, as far as I know, cost less than the one-time-only, one-person-plays price of a physical Hasbro scrabble. Scrabulous is free, requires only a browser, and is available from any computer anywhere. It provides a simple experience that my computer-incompetent mom can enjoy. (As far as she knows, facebook *is* a scrabble program.)

Scrabulous is the first solution that enables me (an intermediate tournament-level player) to play remotely against any of my casual-level friends -- friends who would never pay for, or seek out, or regularly visit, a scrabble-only site. My friend Jen lives in Shanghai -- no previous approach that I'm aware of lets me play on my lunch break against her on her lunch break. None let me *easily* discover when a casual friend is on: all require that you go to their sandbox when you want to play, and that all the people you'd like to compete with patronize the same sandbox. None of them let me jump in / jump out for a quick 10-minute timewaster. Since Scrabulous/Facebook is part of a compelling portal, it's natural to check in and meet friends; it understands my social network; and the play-by-turns feature lets me scale the time commitment and schedule.

No previous approach effectively prevents a cheater from manipulating his rating. However, in Facebook you are a person: you have friends, you have a name, you are part of a community. It's still feasible to be a troll or a sock-puppet or to use any of the other strategies to game or disrupt a community rating, but there are barriers and consequences for doing so.

If Hasbro shuts down -- rather than licenses -- Scrabulous, it will be a business failure. They should be ecstatic that people are integrating scrabble into their social lives, and should see a modest halo effect in board game sales. The revenue stream from Scrabulous' share of Facebook advertising is, I believe, quite significant -- enough for Hasbro and Scrabulous to both enjoy while keeping the game free.

More importantly, Social Network research consistently highlights the importance of "Network Effects" in technology adoption (http://en.wikipedia.org/wiki/Metcalfe%27s_law). There are many, many social games on Facebook, and if Scrabulous is taken down the large body of casual users will move to another entry in this niche. After all, these games are only interesting if your friends also play. Any Hasbro implementation must not only match the quality of Scrabulous' implementation, but must build a network of friends who select it for their social gameplay arena -- and they must build that network against the ill-will that will accrue from shutting Scrabulous down.

Creating a software program (and more importantly a community) like the one Scrabulous has is HARD: look at all of the previous attempts that have failed to get millions of people to play online. It's hard because there are subtle and serious software engineering challenges, and it's hard because there are subtle and serious community-building challenges. If Hasbro shuts down the Scrabulous guys, there's no reason to think Hasbro will be able to reproduce that success.


Wednesday, January 16, 2008

The power of a good visualization

I just found a program called Grand Perspective that presents your disk usage as an interactive treemap (see pic on right). Helping web nerds save hard drive space isn't finding hidden heart defects or keeping planes in the air, but I was struck by how well this program demonstrates the power of intelligent data exploration tools. Here are the Tufte criteria for information presentation:
Documentary · Comparative · Causal · Explanatory · Quantified · Multivariate · Exploratory · Skeptical
Each box is a file, and each top-level directory takes a continuous rectangular portion of the view. Scanning a 350GB disk with a /lot/ of tiny files (5+ million for just the far top left corner, the MLB gameday dataset) took < 5 minutes. You may highlight any box in a segment and navigate "down" to make that segment fill the screen, and may choose to color files by location, depth, name or extension (exploratory, multivariate).
The giant orange box in the top left was 15GB of pure junk -- apparently a CGI script generating some page I was screenscraping went crazy and sent me 15GB of junk data, the same line repeated almost a billion times. I had /no/ idea it was sitting there. That dataset was supposed to be huge, so I had never drilled into the directory beyond my standard du -sc | sort -n on the containing directory. The picture, however, showed at a glance what a table of numbers dramatically failed to: that the directory consumed twice as much space as it should. The simple metaphor of diskspace=area and the whole-disk view (explanatory, documentary) highlighted something important I'd never noticed.
The giant cluster in the bottom right corner is a huge (~51GB) collection of video ephemera I only kinda cared about. I planned, someday, to sort it -- but for that effort and 51GB of usage, it was clearly not worth it. By enforcing comparisons, the data display made me reconsider the value vs. resource consumption of that project and make a sounder decision. In all, I freed up almost 100GB and put a few bucks in the developer's tip jar. Joe Bob says check it out. (Similar programs exist for Linux (Baobab) and Windows (WinDirStat) too.)


Monday, January 14, 2008

The 2007 Feltron Annual Report

The 2007 Feltron Annual Report is available now. In a series of elegant infographics, see the ambit of places he walked to in Brooklyn and Manhattan, review how many albums Mr. Feltron bought in the year (12 CDs, 1 LP and 98 download tracks), and how often he visited bars in October (6 times; he made 57 total bar visits in the year, down 39% from 2006). My print copy is on the way. (last year's report). Metadata is the new Eyeballs (which is the old Interaction).


Tuesday, January 8, 2008

More things I wish someone else would write

More random software ideas:

  • Google search, restricted to find bug reports only. You'd crawl usenet, sourceforge/google code, debian etc. build farms, open issue trackers, mailing list archives and blogs; extract things in 'pre' tags; and look for repeated stanzas (these indicate where the bug report was pasted in).
  • An NTP server along the lines of the procrastinator's clock, one that would dither the time (by stretching or delaying each second) so it runs up to a set amount fast, and never slow (rough sketch below). You'd have to be careful with rsync, server logs, kerberos/cookie session stores/other authentication... or maybe just apply it at the app level and let the machines' clocks use real NTP themselves.
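
Here's a back-of-the-napkin Ruby sketch of the dithering idea (not an actual NTP server -- the class and numbers below are invented just to illustrate "always fast, never slow, never backwards"):

    # Report a time somewhere between "now" and "now + max_fast" seconds:
    # always fast, never slow, and never running backwards.
    class DitheredClock
      def initialize(max_fast = 600)            # run up to ten minutes fast
        @max_fast = max_fast
        @last     = Time.now
      end

      def now
        real     = Time.now
        dithered = real + rand * @max_fast      # some random amount fast
        @last    = [dithered, @last, real].max  # monotonic, never behind real time
      end
    end

    clock = DitheredClock.new(300)
    puts clock.now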


Monday, January 7, 2008

Reference Cards

Here are some pretty reference cards I made a while back:

  • Scale Landmarks: What's something you're familiar with that is about 10 nm big? How do the speed of continental drift, a raindrop, a champion sprinter and an SR-71A Blackbird compare? What is the range between the least massive (electron) and most massive (universe) objects science can describe?
  • Periodic Table
  • Periodic Table, Flat -- material properties laid out as a flat table rather than in Mendeleev's arrangement.
  • Mechanical, Geometric and Material properties of Screws, Bolts and Fasteners - probably the most useful among these, this gives thread geometry, decimal inch/screw/metric equivalents, mechanical strengths, torque ratings and more. Super handy for machining or general shop use.
  • A similar table for AN Hardware (milspec fasteners used in airplanes, racecars and hot rods).
  • A flat table of decimal equivalents: decimal and fractional inch, metric, and standard (US) screw sizes.
  • ASCII Chart - easily look up hex, octal and ASCII values, plus the symbol font/latin font/DOS font glyphs, for each character.


Sunday, January 6, 2008

Owning my Metadata

Dear Lazyweb,

I'd like someone to invent a 'Metadata reclaimer': a program to screenscrape all my amazon ratings, flickr tags, facebook posts, etc.

I try, as far as possible, to only use apps that let me keep ownership of my metadata. As our friend pud has remarked, all successful internet enterprises share the same business model: either

  • People pay to Enter Data into your Database (eBay, Google AdWords, Flickr, Second Life, World of Warcraft, IMDB pro, Craigslist), or less defensible,
  • People Enter Data into Your Database For Free while Other People Pay to Get it Out (rapidshare, iTunes Music Store, Pud's Internal Memos; with youtube, myspace, epinions etc viewers pay with the tenuous currency of their ad brain).

There's nothing wrong with that; all these companies levelled their playing field in some fundamental and important way. (Well, nothing wrong unless you're the loathsome gracenote.com (formerly cddb), who turned an open community-generated resource into a closed database, without even the courtesy of a copy to fork from.)

But it's fair to ask that I be able to export my copy of the data I've added to their business asset, and to do so easily.

Sites that play well with others:
  • my del.icio.us tags and bookmarks
  • my bloglines/google reader feeds
  • my librarything.com everything
  • my last.fm history
  • my iTunes playcounts, tags and ratings: mostly, I think?
  • Firefox bookmarks and history

Sites with an 'I gave up my metadata and all I got was this stupid webpage' policy:

  • facebook posts, friends, photos, everything
  • flickr tags &c
  • amazon recommendations
  • Google Calendar: mostly no (at least, the last time I tried to sync my address books it was a Giant Pain in the Ass: nothing was durably id'd and recurring events were semantically incorrect -- yes, I'd love to have 96 separate entries for my Grandmother's birthday!)
  • eBay bids, purchases, ratings
  • Blogger: Blogs, yes if you remote host your site. However, you can't even /list/ the blogger comments you've made, let alone export them.
  • I believe Myspace's engineers can't even spell XML
(I could be wrong about any of these except the last one).

I'm picturing something with a plugin architecture -- the main app handles the screenscraping, authentication, form submission, web crawling and file export details; the plugin supplies URL wildcards and regexp's the data back into semantic structure. With XML export, a motivated plugin author or well-itched user could supply a decent XSLT stylesheet to represent that metadata in a useful local fashion (and with helpful links back to the main site). It would be useful to have plugins (trivial) and stylesheets (no more or less so) even for sites like Last.fm and Library Thing that Do The Right Thing by granting transparent access to your metadata.
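
To make the shape concrete, here's a hedged Ruby sketch of the plugin idea (every class name, URL and regexp below is invented; as far as I know nothing like this exists yet):

    require 'open-uri'
    require 'yaml'

    # Core app: handles fetching and export; a plugin only knows its own site.
    class MetadataReclaimer
      PLUGINS = []
      def self.register(plugin)
        PLUGINS << plugin
      end
      def self.run!
        PLUGINS.each do |plugin|
          records = plugin.urls.map { |url| plugin.parse(open(url).read) }.flatten
          File.open("#{plugin.name}.yaml", 'w') { |f| f.write(records.to_yaml) }
        end
      end
    end

    # A plugin supplies the URLs to walk and regexp's the markup back into structure.
    class AmazonRatingsPlugin
      def name
        'amazon_ratings'
      end
      def urls
        ['http://example.com/my-ratings?page=1']  # placeholder; really you'd page through
      end
      def parse(html)
        # made-up pattern: pull out [title, star-rating] pairs
        html.scan(%r{<span class="title">(.*?)</span>.*?(\d) stars}m).
             map { |title, stars| { :title => title, :rating => stars.to_i } }
      end
    end

    MetadataReclaimer.register(AmazonRatingsPlugin.new)
    MetadataReclaimer.run!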

Much of this may exist in some form or another; for example the Aperture/iPhoto plugin will apparently sync your flickr and iPhoto tags, and embed the result into the app database. But going from XML => app is more flexible -- and possibly easier -- than the other way 'round.

I one off'ed this a while back for my Amazon ratings, but I just saw where I'd gone from ~350 to ~650 'things rated' since then. I'm hoping the LazyWeb has solved my problem, since I'm not sure where I put those scripts. (Ironic, considering my previous post.)


Saturday, January 5, 2008

50 years of Baseball Play-by-play data mashed with 50 years of hourly weather data FTW

Note: I found this sitting in my drafts folder, unpublished. It actually dates from October.

I've had two interesting realizations from the Retrosheet Baseball data vs. Hourly Weather information mashup I've implemented. The first is how my two favorite scripting languages (Python and Perl) compare. The second is how the hard parts of this process are actually the stupidest parts... there are four steps in doing an interesting visualization of open data. In order of steps as well as decreasing difficulty and decreasing stupidity:

  • Bring the data from behind its bureaucratic barriers
  • Unlock it into a universal format
  • Process and digest the data
  • Actually explore, visualize and share the data

The hardest and least justifiable steps are the first two, a problem we have to fix. [Edit: this is why I'm starting infochimp.org]

Here's a longer description of how I did the baseball games / weather data mashup.

Several significant parts of this project were written in Perl, for its superior text handling and for the ease of XML::Simple (which I love); several other parts were done in Python, for its more gracious object-orientation.

To suck in the Hourly Weather Data files, you have to click through a 4-screen web form to prepare a query. Although the final form submission is sent as a POST request, the backend script does accept a GET url (you know, where the data is sent in the URL as form.pl?param=val&param2=val&submit=yay instead of in the request body). There's an excellent POST to GET bookmarklet that will take any webpage form and make the parameters appear in the URL. No guarantees that the backend script will accept this, but it's always worth a twirl for screenscraping webpages or just trying to understand what's going on behind the curtain.
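
Once you have a GET-able URL, fetching it from a script is trivial. A rough Ruby sketch -- the host and parameter names below are placeholders, not the real weather form's fields:

    require 'open-uri'
    require 'cgi'

    # Placeholder parameters; substitute whatever the real form actually posts.
    params = { 'station' => '722540', 'year' => '1975', 'submit' => 'yay' }
    query  = params.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join('&')
    url    = "http://example.com/cgi-bin/form.pl?#{query}"

    # Save the response to disk for the later parsing steps.
    open(url) do |remote|
      File.open('station_722540_1975.txt', 'w') { |f| f.write(remote.read) }
    end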

Now I need to know what queries to generate. First I needed the location of each major league baseball stadium: Brian Foy posted a Google Earth index of Major League Stadiums, a structured XML file with latitude, longitude and other information. I used the Perl XML::Simple package to bring in this file. These simple routines just pull in the XML files and create a data structure (hashes and arrays of hashes) that mirror the XML tree. The stream-based (SAX) parsers are burlier and more efficient, but for this one-off script, who cares?

Next I needed the locations of all the weather stations. Perl and Python both have excellent flat-file capabilities. The global weather station directory is held in a flat file (meaning that each field is a fixed number of characters that line up in columns). Here's the column header, a sample entry, and numbers showing the width of each field:

USAF   NCDC  STATION NAME                  CTRY  ST CALL  LAT    LON     ELEV*10
010010 99999 JAN MAYEN                     NO JN    ENJA  +70933 -008667 +00090
123456 12345 12345678901234567890123456789 12 12 12 1234  123123 1234123 123456

To break this apart, you just specify an 'unpack' format string. 'A' means an (8-bit) ASCII character; 'x' means a junk character:

A6    xA5   xA29                          xA2xA2xA2xA4  xxA6    xA7     xA6
The result is an array holding each interesting (non-'x') field. The Perl code snippet:
    # Flat file format
    my $fmt    = "A6x    A5x   A29x                          A2xA2xA2xA4xx  A6x  A7x   A6";
    my @fields = qw{id_USAF id_WBAN name region country state callsign lat lng elev};
    # Pull in each line of the station directory ($station_file holds its path)
    open(my $station_fh, '<', $station_file) or die "Can't open $station_file: $!";
    for my $line (<$station_fh>) {
        next if length($line) < 79; chomp $line;
        # Unpack flat record
        my @flat = unpack($fmt, $line);
        # Process raw record 
        ...
    }

I also grabbed the station files for Daily weather reports, since that data goes back much farther (generally, there's Hourly data since ~1945 and Daily data since ~1900).

Then I score each station by (Proximity and Amount-of-Data), and select the five best stations for each stadium.

Now, I could of course use Perl to generate the POST requests using the HTTP modules, but it was simpler to spit out an HTML file with a big matrix of URLs for each station (for a subset of years) and then mindlessly control-click on a dozen links at a time and answer each form. (You can see the linkdump file here: http://vizsage.com/apps/baseball/results/weather/ParkWeatherGetterDirectory.html)

I also use Perl to clean up the XML generated by the MySQL Query Browser -- which returns a flat XML file with all fields as content, not attributes. I just suck the file in with XML::Simple, walk down the resultant hash to create a saner (and semantic) data structure, then spit it back out as XML.

The Python parts are not terribly interesting. I pull in the flat file, clean up a few data fields and convert in-band NULLs into actual NULLs (they use 99999 to represent a null value in a 5-digit field, for instance), then export the data as a CSV file (for a MySQL LOAD DATA INFILE query). I chose Python for this part because I find its object model cleaner -- it's easier to toss structured records around -- and the CSV module is a tad nicer.

The idea I find most interesting is that we're starting to get enough rich data on the web to make these cross-domain data mashups easy and fun -- I did all this in less than a week. With the effortless XML handling and text processing of modern scripting languages (and relieved from any efficiency concerns) it's easy to see forward to a future where we'll have all these datasets sitting at our fingertips. This data set lets you examine ideas such as "How does the break distance of curveballs change with atmospheric temperature and pressure for a full baseball season?" "Effectiveness of pitchers against gametime temperature, stratified by age of pitcher or inning?" "Batting average on fly balls vs. ground balls against % of total cloud cover?". It's easy to come up with a variety of other "This Rich Dataset vs. That Rich Dataset" opportunities. Stock price and Earnings of Harley-Davidson vs. average household income, unemployment and percent of the population that has reached retirement age? Year-by-year movie attendance at comedies compared to dramas, Attendance at Baseball Games, and Sales of Fast Food vs. Consumer Satisfaction Index, national Suicide Rate, and Persons treated for mental health/substance abuse? Presidential approval rating vs. gasoline prices and Consumer Price Index? Amazon.com sales rank, # mentions on Technorati blogs and # of mentions in mainstream media vs. time?

The hard part is actually the stupidest part: to unlock the data from behind bureaucratic barriers (the first script I described), then to convert into a universal semantically rich data format (the second set of scripts I described). Once one person has unlocked this data, however, it's there for the whole world to enjoy, and tools will evolve to capitalize on this bounty of rich, semantically tagged and freely available information.


Thursday, January 3, 2008

Time Machine is neat-o, but I want a Time and Space Machine for my files

I've long wished for a versioned home directory, but the svn-ish approaches seem too heavyweight, and it's nice to have a live copy and not an opaque DB-ball. The right answer is the stunningly elegant Time Machine. The idea isn't actually new, and it can be approximated across platforms and remotely using standard Unix tools. Now you just need a landing spot.

Some remote backup solutions have come to the fore lately. $Zero/year gets you 2GB remote backup from Mozy. $60/year buys unlimited backup space at Mozy, where 'unlimited' means the ~40 GB/month you'll see by leaving your pipe fully saturated 24/7. A price between free and $100/year gets you the more intriguing CrashPlan. These are slick and easy. If you don't know what ssh is, you want one of these. If you do have ssh and you don't set your mom's computer up with the free Mozy account then you're a bad person.

For the uber-uber-nerds, use what I'm using: an $85/year bluehost account. You may think of it as a 600GB ssh'able rsync'able remote backup host that, by the way, can also act as a webserver. You can install svn (as I have) to do versioning over svn+ssh. Two years of bluehost costs the same as a 500GB hard drive+cheap enclosure, and they regularly increase your diskspace allowance at the same monthly price. [Note: the preceding bluehost link will give me a kickback if you sign up through it. Hit bluehost.com directly if that rubs you the wrong way.]

As described below, all my various project shards get sync'ed back to my desktop PC. The desktop then pushes my files out to a bluehost account via rsync, versioned with an rsync-as-poor-man's-time machine script. This gives me a live, versioned backup, accessible to me from anywhere by ssh, on Bluehosts' offsite, secure, RAID-UPS-and-diesel generator protected colocation, and at the end of a fat pipe. After about a week or so for the initial ~50GB backup to roll in, daily incrementals will take an hour or two each day (bandwidth choked to 25 kBps). When I leave town for a week next month I'll start pushing my music collection into its own unversioned directory.
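
For the curious, the "poor man's time machine" is nothing fancier than rsync's --link-dest option: each run lands in a dated directory, and unchanged files are hard-linked against the previous snapshot so they cost no extra space. A stripped-down Ruby sketch of such a wrapper, with made-up paths and host:

    #!/usr/bin/env ruby
    # Daily snapshot: rsync into backups/YYYY-MM-DD on the remote host,
    # hard-linking unchanged files against backups/latest, then repoint
    # 'latest' at the snapshot we just made.
    src   = "#{ENV['HOME']}/now/"
    host  = 'me@myhost.example.com'
    stamp = Time.now.strftime('%Y-%m-%d')

    system('rsync', '-az', '--delete',
           '--bwlimit=25',              # stay polite: ~25 kB/s
           '--link-dest=../latest',     # relative to the destination directory
           src, "#{host}:backups/#{stamp}/") or abort('rsync failed')

    system('ssh', host, "ln -sfn #{stamp} backups/latest") or abort('relink failed')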

This is only for the stuff I've created or can't replace: not for system software and not for music/movies/media (apart from my iTunes.xml, inbox, and various bookmarks/pref's/stickiesDB/MySQLCachedQueries folders). Unlike most people, I don't worry too much about backing up my system software. Maybe I'm damaged from my Windows upbringing, but just reinstall from scratch if your OS gets hosed (I keep install disks and images around). The nascent defect may be present in the restore; the accumulated OS cruft certainly is. You're already kinda screwed; better to take a certain two days and finish with a clean system than to fight a flaky restore and then spend those two days again. Yes, you are allowed to come point and laugh if this happens to me.

Beyond the backup, I have various levels of defense-in-depth for my data -- data that is created, changes daily and is essential for my economic well-being has four or more levels of redundancy. Data that is intransigently huge but can be sourced elsewhere has no redundancy. (There's no reason to back up my processed wikipedia dump, for instance: only the scripts that process it.) Right now my fileverse is spread across

  • Desktop computer ~ 1TB
  • Flivo (my homebrew Tivo computer) ~0.5TB
  • School account ~3GB
  • sourceforge account
  • Four webservers holding sites I operate or caretake ~ 1-5GB each
  • GMail account ~4GB
  • Flickr account ~6GB
  • iPhone, Google Calendar, Yahoo Address book, Plaxo
  • blogger/twitter/facebook/etc

This is organized as

  • All the stuff I'm currently messing with is in a 'now' folder. This is what sync's to my school account, and every time I go on a trip I burn a DVD of the now folder to toss in my backpack. (You never know when you'll want a file, and it enforces an occasional hardcopy backup).
  • Within the now folder, the stuff I develop for work is versioned with svn. I house a private repository on my bluehost account and connect over svn+ssh. I'm reasonably good about checking in every few hours or when I shift conceptual gears.
  • Most of the 'now' folder I keep in sloppy sync with the school account. ('sloppy' because I sync when I think of it and not through a cron script as I should).
  • Each year I move everything that isn't under current work out of the now folder and into an 'archive/YEAR' folder; there it sits and changes almost never.
  • GMail holds all my mail (sync'ed with IMAP)
  • Flickr holds an incomplete and poorly correlated segment of my photos (they own my metadata, and yes that bugs me).
  • iPhone sync handles the address book; gcal is still quite difficult.
  • The rest of the little metacontent is trapped, meh.
  • Each webserver's content is replicated to the desktop. For small changes I'll sometimes diddle the file on the live server and then sync back later (tsk tsk); for heavier work I proceed locally and then deploy.

The usage breaks down like this:

== Work space -- changes ~ daily ==
  2   GB         vizsage project software       Desktop, svn, school, vizsage, bkup
  9   GB         infochimp site & working data  Desktop, svn, school, infochimp, bkup
== Work resources -- changes ~ weekly ==
100   GB         'huge' datasets                Desktop, infochimp
 60   GB         live local DB of datasets      Desktop
 30   GB         infochimp website DB           infochimp, bkup
== Slowly changing -- changes ~ semiyearly ==
  3   GB         Other projects, docs, stuff    Desktop, bkup
  4   GB         Archive - doesn't change       Desktop, bkup
                 (~ 300 MB / year for 11 years) 
  3   GB         Library (prefs,caches,etc)     Desktop
== Metacontent -- changes daily-weekly ==
 12   GB         Photos                         Desktop, Flickr
  3   GB         Mail                           Desktop, GMail, bkup
  ~   MB         iPhone/Addr Book/Calendar      Desktop, iPhone, GCal, Yahoo AddrBk
  ~   MB         this blogger blog              vizsage, Blogger
== Websites ==
  1   GB         website1                       Desktop, website1, bkup 
  2   GB         website2                       Desktop, website2, bkup
  6   GB         website3                       Desktop, website3, bkup
== Media ==
many  GB         music                          Desktop, some on iPhone, some on DVD
many  GB         recorded tv shows, movies, etc Desktop, data DVD
== System Software ==
some  GB         OS & installed programs        on each machine

I'd like to live in a world where I wouldn't have to worry about how these are partitioned across machines. Changes made to 'website1', say, or to a project in 'now', would lazily propagate to each interested shard as well as to the remote time-machineish versioned backup. At any time I could force an immediate sync -- whether to deploy a change, to repair a mistake, or to satiate an OCD twinge -- if I don't want to wait for automatic synchronization.

I'm actually pretty close to having this out of a MacGyvered patchwork of rsync, svn, Time Machine, IMAP/Aperture+Flickr and distributed file systems, all enforced by cron. I'm planning to soon waste a weekend buttoning up my sync scripts, getting everything to run daily, and being superattentive in case I screw it up.

But it sure would be nifty to augment (a cross-platform) Time Machine into a Time and Space Machine. I'd see an overview of my distributed fileverse (versioned in time, distributed in shards according to how I use it), and I could delegate various live realizations, svn/diff-versioned backups or hard-link-versioned backups to each local or remote instance. No single machine would necessarily hold the entire fileverse: note that a few things up there don't propagate back to my main desktop. And hopefully the whole thing would have polished Apple Fit And Finish instead of Mad Max Homebrew Itworksithinkihope.
