
The power of a good visualization

I just found a program called GrandPerspective that presents your disk usage as an interactive treemap (see pic on right). Helping web nerds save hard drive space isn't finding hidden heart defects or keeping planes in the air, but I was struck by how well this program demonstrates the power of intelligent data exploration tools. Here are the Tufte criteria for information presentation:
Documentary · Comparative · Causal · Explanatory · Quantified · Multivariate · Exploratory · Skeptical
Each box is a file, and each top-level directory takes a contiguous rectangular portion of the view. Scanning a 350GB disk with a /lot/ of tiny files (5+ million for just the far top left corner, the MLB gameday dataset) took under 5 minutes. You can highlight any box in a segment and navigate "down" to make that segment fill the screen, and you can choose to color files by location, depth, name or extension (exploratory, multivariate).
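To make the diskspace=area idea concrete, here is a minimal sketch of a "slice-and-dice" treemap layout in Python. It is not GrandPerspective's actual layout algorithm (I don't know what it uses internally); the `layout` and `total_size` functions, the nested-dict tree format and the sample sizes are all made up for illustration. Each directory's rectangle is split among its children in proportion to their total size, alternating the split direction at each level, so every file ends up with a box whose area is proportional to its size.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical tree format: an int is a file size, a dict is a directory.
Tree = Union[int, Dict[str, "Tree"]]
Rect = Tuple[float, float, float, float]  # x, y, width, height


def total_size(node: Tree) -> int:
    """Sum the sizes of all files under a node."""
    if isinstance(node, int):
        return node
    return sum(total_size(child) for child in node.values())


def layout(node: Tree, rect: Rect, path: str = "",
           vertical: bool = True) -> List[Tuple[str, Rect]]:
    """Return one (path, rectangle) pair per file.

    The rectangle is divided among the children in proportion to their
    size; the split direction flips at every directory level.
    """
    if isinstance(node, int):
        return [(path, rect)]
    total = total_size(node)
    if total == 0:
        return []  # an empty directory gets no area
    x, y, w, h = rect
    boxes: List[Tuple[str, Rect]] = []
    offset = 0.0  # fraction of the rectangle already handed out
    for name, child in node.items():
        share = total_size(child) / total
        if vertical:
            child_rect = (x + offset * w, y, share * w, h)
        else:
            child_rect = (x, y + offset * h, w, share * h)
        boxes.extend(layout(child, child_rect, f"{path}/{name}", not vertical))
        offset += share
    return boxes


# Made-up sizes (in GB) loosely echoing the disk described in this post.
tree = {"data": {"junk.log": 15, "games": 5}, "video": 60, "misc": 20}
for file_path, (x, y, w, h) in layout(tree, (0.0, 0.0, 100.0, 100.0)):
    print(f"{file_path}: area={w * h:.0f}")
```

In this toy tree one oversized file takes three quarters of its directory's rectangle, which is the kind of thing a per-directory total hides.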
The giant orange box in the top left was 15GB of pure junk -- apparently a CGI script generating some page I was screenscraping went crazy and sent me 15GB of junk data, the same line repeated over and over. I had /no/ idea it was sitting there. That dataset was supposed to be huge, so I had never drilled into the directory beyond my standard du -sc | sort -n on the containing directory. The picture, however, showed at a glance what a table of numbers dramatically failed to show: that the directory consumed twice as much space as it should. The simple metaphor of diskspace=area and the whole-disk view (explanatory, documentary) highlighted something important I'd never noticed.
The giant cluster in the bottom right corner is a huge (~51GB) collection of video ephemera I only kinda cared about. I planned, someday, to sort them -- but given that effort and 51GB of disk usage, it was clearly not worth it. By enforcing comparisons (comparative), the data display made me reconsider the value vs. resource consumption of that project and make a sounder decision.
In all, I freed up almost 100GB and put a few bucks in the author's tip jar. Joe Bob says Check it out. (Similar programs exist for Linux (Baobab) and Windows (WinDirStat) too.)


It is liberating to have a platform that gives you analytics and control into what is really going on.

Just like infochimps gives people a powerful platform to find and utilize data.

I wanted to introduce myself and learn more about your plans to scale infochimps in the cloud.

I work at RightScale and have seen some amazing use cases of our cloud management platform, which I think may be worth infochimps' time. With RightScale you have detailed analytics and controls over your cloud environments while being able to automate your deployments for autoscaling/descaling.

We already do work with huge data processing clients like Eli Lilly, Harvard Medical, and 30,000 other users.

I would like to get in touch with you and learn more about your plans to scale infochimps.

I don't mean to spam you, I just wanted to get in touch. Thanks, Jordan Evans
You can reach me at jordan.evans@rightscale.com
