May 07, 2009 Archives

Thu May 7 09:40:13 CDT 2009

Searching with Solr

Quite a while back I started playing around with Solr, the document indexing and search server from the Apache Lucene project. It is very nifty and simple to use. My first project was to index the Apache HTTPD documentation. This was a big learning experience: until that point I had never done anything with XML or XSLT. The first challenge was to write an XSLT stylesheet that would take an HTTPD XML document and transform it into the format Solr requires for indexing. That was the only challenge, and looking back now I can see what a terrible mess I made of something that is really quite simple. The results of that initial encounter with Solr can be seen at http://arreyder.com/docsearch/ . It is by no means perfect and is in great need of updating, but it is still quite useful.

My next challenge with Solr was issued by noodl, aka Vincent Bray. He had a client (http://www.grantandcutler.com/) that was using a MySQL database to search a catalog of books. For whatever reason, the results were not satisfactory. With a copy of the database I went to work, and in very little time I had a proof of concept in place showing how well Solr could work and how easy it was to pull off. Noodl ran with it, and the resulting solution satisfied his client.

My latest challenge is indexing all of the Apache Software Foundation's public mailing lists, a perfect task for Solr. The issue here is volume: LOTS of email to be indexed, with every possible violation of the mbox format you could ever hope for. My single-threaded approach to document indexing had to be adapted if I wanted the indexing to complete in my lifetime. My efforts on this task are ongoing. Simple searches work great, but I'm taking this one to the next level using Solr's faceting. I'll publish a link to the results when I have something more interesting to look at.

In the meantime, if you have some documents that you'd like to make searchable, have a look at http://lucene.apache.org/solr/ . Hit me up if you have any questions.
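For anyone wondering what "the format Solr requires for indexing" looks like, here is a rough sketch for the mailing-list case. It is not the code I actually run. It only shows the general shape: parse one email message and wrap a few of its fields in a Solr XML update message (`<add><doc><field name="...">`). The field names `id`, `subject`, `from` and `body` are assumptions made up for this example and would have to match whatever your own Solr schema defines.

```python
import email
import xml.etree.ElementTree as ET
from email.message import Message


def message_to_fields(msg: Message) -> dict:
    """Pull a few headers and the first plain-text body out of one message."""
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            # Real list mail is messy, so never fail on bad bytes.
            body = payload.decode("utf-8", errors="replace")
            break
    # Field names here are hypothetical; use the ones from your schema.
    return {
        "id": str(msg.get("Message-ID", "")),
        "subject": str(msg.get("Subject", "")),
        "from": str(msg.get("From", "")),
        "body": body,
    }


def to_solr_add_xml(docs: list) -> str:
    """Build a Solr XML update message from a list of field dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


# A tiny made-up message standing in for one entry from an mbox file.
raw = (
    "Message-ID: <1@example.org>\n"
    "From: someone@example.org\n"
    "Subject: Hello & welcome\n"
    "\n"
    "First post.\n"
)
xml_out = to_solr_add_xml([message_to_fields(email.message_from_string(raw))])
print(xml_out)
```

The resulting XML is what gets sent to Solr's update handler, followed by a commit. Batching many `<doc>` elements into one `<add>`, and running several such batches in parallel, is one way to get past a single-threaded indexer.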

Posted by arreyder