Leveraging the Bittorrent Underground for semantic data and media
I just ran across a pretty interesting site called coverbrowser.com, which uses a variety of image APIs to pull in comic book, game, book, music, movie and other cover art. (Read the technical details here).
It reminded me of an idea I had a while back but which I will never get around to implementing. Maybe you will, or for all I know someone's already been doing it for years. (Sidenote: I've had some people express interest in this, and have worked out some parts of it, but just don't have the time to complete it right now. If you'd like to help develop it, get in touch.)
Many of the movie and music torrents on the, ahem, "Unauthorized Evaluation Copy" bittorrent sites contain hi-res scans of their cover art, and all of the major bittorrent sites maintain topic-specific RSS feeds.
As long as the torrent indexes the files individually (and not as an opaque .zip or .rar), and most do, you can target specific files within the torrent. I don't know whether you could chop all the large, copyright-problematic files that you don't want out of the torrent, or whether you'd have to hack Azureus or another bittorrent client (instructing it to get only *.{png,gif,jpg,jpeg,bmp,tif,tiff} or what have you). Either way, you would then only be spending the bandwidth required to grab the images and not the accompanying multi-megabyte files, and you would only be getting the material to which you presumably have fair use rights.
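As a rough illustration of the file-targeting step, here is a minimal Python sketch that picks the image files out of a torrent's file listing by extension. It assumes the listing has already been read from the torrent metadata as (path, size) pairs; the function name, the extension set, and the example listing are all hypothetical, and handing the selected indices to an actual client is left out.

```python
from pathlib import PurePosixPath

# Extensions treated as cover art; an assumed list based on the glob above.
IMAGE_EXTENSIONS = {".png", ".gif", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def select_image_files(files):
    """Return (index, path, size) for each entry with an image extension.

    `files` is a list of (path, size_in_bytes) pairs, in torrent order.
    """
    selected = []
    for index, (path, size) in enumerate(files):
        # Compare case-insensitively so "cover.JPG" also matches.
        if PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS:
            selected.append((index, path, size))
    return selected


# Hypothetical file listing, as it might appear in a multi-file torrent.
example_files = [
    ("Some Album/01 - Track One.mp3", 8_500_000),
    ("Some Album/02 - Track Two.mp3", 9_200_000),
    ("Some Album/cover.JPG", 450_000),
    ("Some Album/back.png", 380_000),
    ("Some Album/info.nfo", 2_000),
]

wanted = select_image_files(example_files)
wanted_bytes = sum(size for _, _, size in wanted)
total_bytes = sum(size for _, size in example_files)
print(f"{len(wanted)} of {len(example_files)} files, "
      f"{wanted_bytes} of {total_bytes} bytes")
```

One caveat: BitTorrent transfers fixed-size pieces that can span file boundaries, so a client fetching only these files will usually pull in slightly more than the images' own byte count.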
So you'd set up a daemon process that would
- watch the Movies and Music RSS feeds from any or all of the sites,
- identify albums whose cover art you lack,
- pull in the torrent,
- but download only the cover art,
- and perhaps also process any of the accompanying semantic data.
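The first two steps of that list could be sketched roughly as follows in Python, using only the standard library. This is a sketch under assumptions: the feed is plain RSS 2.0 with `title` and `link` elements per item, "cover art you lack" is approximated by a set of lowercased titles you already have, and the feed contents, URLs, and function name are made up for illustration. Fetching the torrent and downloading the images are not shown.

```python
import xml.etree.ElementTree as ET


def new_torrent_items(rss_xml, known_titles):
    """Yield (title, link) for feed items whose title is not already known.

    `known_titles` is a set of lowercased titles we already have art for.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title and title.lower() not in known_titles:
            yield title, link


# Hypothetical feed contents; a real daemon would fetch this periodically.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Music</title>
  <item>
    <title>Artist A - Album One</title>
    <link>http://example.com/1.torrent</link>
  </item>
  <item>
    <title>Artist B - Album Two</title>
    <link>http://example.com/2.torrent</link>
  </item>
</channel></rss>"""

have_art = {"artist a - album one"}
pending = list(new_torrent_items(SAMPLE_FEED, have_art))
for title, link in pending:
    print(f"need cover art for {title!r}: {link}")
```

Matching on the raw title is crude; release names in feeds tend to carry extra tags, so a real version would likely need some normalization before the lookup.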
I think this would lead to a large stream of incoming cover art for music and other media files, complete with a reasonable amount of semantic information.
There's probably a lot of other crowdsourced semantic data flowing through the underground that such a torrenting robot could collect, if someone actually built one. (And yes, I feel yucky using "crowdsourced" and "semantic data" in the same sentence.)
Labels: art, bittorrent, cover, data, dataset, feed, legal, legitimate, movie, music, piracy, rss, semantic, underground
If you are going to get yourself a seedbox, and know a bit about Linux, you might as well do it yourself. Check out my tutorial on how to easily make one yourself for under $6/mo.
Posted by Seedbox | September 10, 2010 at 8:17 PM