Posted on April 28, 2006
In addition to crawling web sites, RDig can now index local documents. Just give it one or more file:/ URLs pointing to the directories to index, optionally define some filename inclusion/exclusion patterns, and there you go.
Document locations can be rewritten to make it easier to link to them from a web-based search frontend. To rewrite all file:/base/* URIs to http://www.mydomain.com/virtual_dir/, you say
cfg.index.rewrite_uri = lambda do |uri|
  # map the local base directory to the public virtual directory
  uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  # serve the document over HTTP from the public host
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end
in your RDig config file.
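To see what such a rule does to a single location, it can be tried outside RDig with nothing but Ruby's standard URI library. This is only a sketch: the constant name REWRITE_URI and the sample path are made up for illustration, and it assigns a new path instead of calling gsub! so that it does not depend on the path string being mutable.

```ruby
require 'uri'

# Stand-alone copy of the rewrite rule (REWRITE_URI is a hypothetical name).
# Like the config lambda, it changes the URI object it is given.
REWRITE_URI = lambda do |uri|
  uri.path = uri.path.sub(%r{\A/base/}, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end

# Hypothetical local document below the indexed base directory
uri = URI.parse('file:/base/docs/manual.pdf')
REWRITE_URI.call(uri)

puts uri.scheme
puts uri.host
puts uri.path
```

Note that the rule switches scheme and host for every URI it sees, whether or not the path started with /base/, so it fits best when everything you index lives below that directory.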
There's also a new feature for PDF content extraction: titles are now extracted from the PDF metadata with the help of the pdfinfo utility.
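Roughly, getting a title out of pdfinfo means reading its "Key: value" output and picking the Title line. The following is a sketch of that idea, not RDig's actual code; the method names pdf_title and pdf_title_for and the sample output are assumptions made up for this example.

```ruby
require 'open3'

# Pick the Title field out of pdfinfo-style "Key:   value" output.
# Returns nil if there is no Title line or the title is blank.
def pdf_title(pdfinfo_output)
  line = pdfinfo_output.each_line.find { |l| l.start_with?('Title:') }
  return nil unless line
  title = line.sub(/\ATitle:/, '').strip
  title.empty? ? nil : title
end

# Run pdfinfo on a file and return its title (needs pdfinfo on the PATH).
# Not called here, so the example runs without any PDF at hand.
def pdf_title_for(path)
  output, status = Open3.capture2('pdfinfo', path)
  status.success? ? pdf_title(output) : nil
end

# Illustrative sample only, not captured from a real pdfinfo run
SAMPLE_PDFINFO = <<~OUT
  Title:          Site Search Handbook
  Author:         Jane Doe
  Pages:          12
OUT

puts pdf_title(SAMPLE_PDFINFO)
```

Falling back to nil leaves it to the caller to decide what to show for untitled PDFs, for example the file name.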
Have fun!
Tagged with: search |
Posted on April 20, 2006
With this release, RDig, my Ferret-based all-in-one site search solution ;-), can index PDF and MS Word files, too.
RDig delegates all the hard work to the pdftotext and wvHtml command line utilities, so you need to have the xpdf-utils and wv packages installed to use this feature.
It should also be easier now to plug in custom content extractors, as they are auto-discovered and used for the content types they declare they can handle.
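One common way to get this kind of auto-discovery in Ruby is to let every extractor subclass register itself when it is defined and to look extractors up by content type. The sketch below illustrates that pattern only; the module, class and method names (Extractors, Base, handles, lookup, PlainText) are invented for the example and are not RDig's real API.

```ruby
module Extractors
  # Base class: every subclass registers itself as soon as it is defined.
  class Base
    @registry = []

    class << self
      attr_reader :registry

      # Ruby calls this hook whenever a subclass is created.
      def inherited(subclass)
        super
        Base.registry << subclass
      end

      # Subclasses declare the content types they can handle.
      def handles(*types)
        @content_types = types
      end

      def content_types
        @content_types || []
      end

      # Return an extractor instance for the content type, or nil.
      def lookup(content_type)
        klass = Base.registry.find { |k| k.content_types.include?(content_type) }
        klass && klass.new
      end
    end
  end

  # A custom extractor only needs to subclass Base and declare its types.
  class PlainText < Base
    handles 'text/plain'

    def extract(raw)
      { title: raw.lines.first.to_s.strip, content: raw }
    end
  end
end

extractor = Extractors::Base.lookup('text/plain')
puts extractor.extract("Hello\nWorld")[:title]
```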
As always, any feedback is very welcome.
RDig documentation
Tagged with: search |
Posted on March 25, 2006
RDig aims to be an easy-to-use tool for building and searching a full text index of the contents of a web site. It consists of an HTTP crawler and facilities to extract textual content from HTML pages, which is then indexed using the great Ferret full text search engine.
I initially wrote this to implement the site search feature of a website where most of the content is static HTML pages generated by a CMS, and some dynamic features of the site are implemented in Rails.
Basically, RDig takes a start URL and a number of host names to limit the crawling to, and then starts crawling the site. It comes with an executable that can be used to regularly rebuild the index, e.g. triggered by cron.
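Limiting a crawl to a list of host names comes down to a check like the one below before a link is followed. This is a sketch with Ruby's standard URI library, not RDig's own implementation; the method name crawlable? and the example hosts are assumptions.

```ruby
require 'uri'

# True if the URL is an HTTP(S) link on one of the allowed hosts.
# Links that cannot be parsed or use another scheme are rejected.
def crawlable?(url, allowed_hosts)
  uri = URI.parse(url)
  return false unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  allowed_hosts.include?(uri.host.to_s.downcase)
rescue URI::InvalidURIError
  false
end

hosts = ['www.example.com'] # hypothetical host list

puts crawlable?('http://www.example.com/about.html', hosts)
puts crawlable?('http://other.example.org/', hosts)
puts crawlable?('mailto:me@example.com', hosts)
```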
Searches are then executed with a simple
RDig.searcher.search('your query string here')
from within the web app.
Have a look at the RDocs for further information. Installation should be as simple as
gem install rdig
once the gem has propagated through RubyForge’s mirrors. This is the first piece of software I have released as a gem, so please notify me of any problems you encounter.
Tagged with: search |