Dobrica Pavlinušić's random unstructured stuff
Exhibit facet browsing: Revision 3
We have few mp3 players which no longer work, but are still under warranty. So idea was to pick another device (which will hopefully work longer). However, on-line shops leave a lot to be desired if you want to just do quick filtering of data.

As a very fortunate incident, I stumbled upon "Exhibit"<http://simile.mit.edu/exhibit/> from "SMILE"<http://simile.mit.edu/> project at MIT which brought us such nice tools as "Timeline"<http://simile.mit.edu/timeline/> and "Potluck"<http://simile.mit.edu/potluck/>.

So, I scraped web, converted it to CSV and tried to do something with it. In the process I again re-visited the problem of semi-structured data: while data is separated in columns, one column has generic description, player name and all characteristics in it.

So, what did I do? Well, I started with CPAN and few hours later I had a "script which is rather good in parsing semi-structured CSV files"<http://svn.rot13.org/index.cgi/simile/view/links/csv2js.pl>. It supports following:

* guess CSV delimiter on it's own (using "`Text::CSV::Separator`"<http://search.cpan.org/~enell/Text-CSV-Separator/>)
* recognize 10 Kb and similar sizes and normalize them (using "`Number::Bytes::Human`"<http://search.cpan.org/~ferreira/Number-Bytes-Human/>)
* splitting of comma (`,`) separated values within single field
* strip common prefix from all values in one column
* group values and produce additional properties in data
* generate specified number of groups for numeric data, useful for price ranges
* produce JSON output for Exhibit using "`JSON::Syck`"<http://search.cpan.org/~audreyt/YAML-Syck/>

"
So how does it look?"<http://blog.rot13.org/demo/links/links.html>

In the end, it is very similar to the way "Dabble DB"<http://www.dabbledb.com/> parses your input. But, I never actually had any luck importing data into Dabble DB, so this one works better for me `:-)`

This will probably evolve to universal munger from CSV to arbitrary hash structure. What would be good name? `Text::CSV::Mungler`?

This is a first post in series of posts which will cover one hack a week on my blog. This will (hopefully) force me to write at least one post a week on one side, and provide some historic trace about my work for later.