RDig 0.3.5

Posted on February 26, 2008

RDig is a tiny web and file system crawler built on top of the Ferret search engine. It’s one of my less active side projects and, from what I can tell, doesn’t have a very large user base. However, there are some people out there who actually use it, and some of them even tell me so and suggest new features from time to time :-)

Limit crawling depth

You can now configure a maximum crawling depth so that RDig only indexes pages up to this level. For example, setting cfg.crawler.max_depth = 1 will make RDig index only the configured start pages and the pages they link to directly. You get the picture, I guess.

This option is especially useful if restricting RDig to a pre-defined number of hosts is not an option for your use case, but you still don’t intend to have it crawl the whole web.
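To make the depth rule concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It is not RDig's actual implementation: the `crawl` method and the in-memory `links` hash (standing in for real HTTP fetches) are assumptions made up for illustration.

```ruby
require 'set'

# Breadth-first crawl over a link graph, given as a hash that maps each
# page to the pages it links to. Start pages are at depth 0, and links are
# not followed from pages that already sit at max_depth (nil = no limit).
def crawl(links, start_pages, max_depth)
  seen  = Set.new(start_pages)
  queue = start_pages.map { |page| [page, 0] }
  order = []

  until queue.empty?
    page, depth = queue.shift
    order << page
    next if max_depth && depth >= max_depth

    links.fetch(page, []).each do |target|
      next if seen.include?(target)

      seen << target
      queue << [target, depth + 1]
    end
  end

  order
end

# Hypothetical site: start -> a, b; a -> c; c -> d.
links = {
  'start' => %w[a b],
  'a'     => %w[c],
  'b'     => [],
  'c'     => %w[d],
  'd'     => []
}

p crawl(links, ['start'], 1) # start pages plus directly linked pages
```

With a depth of 1 only `start`, `a` and `b` are visited; `c` and `d` lie further away and are never fetched.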

HTTP proxy auth support

If you are behind a proxy and have to use HTTP Basic Authentication with it to get through, you can specify proxy url, user name and password:

cfg.crawler.http_proxy = "http://yourproxy:8080"
cfg.crawler.http_proxy_user = "username"
cfg.crawler.http_proxy_pass = "secret"
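
For reference, this is roughly how such settings map onto Ruby's standard Net::HTTP library. It is a sketch of the standard library's proxy parameters, not necessarily how RDig wires them up internally; the target host `www.example.com` is a placeholder.

```ruby
require 'net/http'
require 'uri'

# Values mirroring the three settings above.
proxy_url  = 'http://yourproxy:8080'
proxy_user = 'username'
proxy_pass = 'secret'

proxy = URI.parse(proxy_url)

# Net::HTTP.new takes the proxy address, port, user and password after the
# target host and port. No connection is opened until a request is made.
http = Net::HTTP.new('www.example.com', 80,
                     proxy.host, proxy.port, proxy_user, proxy_pass)

puts http.proxy?
puts http.proxy_address
puts http.proxy_port
```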

Under the hood

I put some work into refactoring parts of RDig in order to make integration with acts_as_ferret easier. I’ll write more about that in another post.

Get it!

RDig is available as a gem via Rubyforge.

RDig 0.3.3

Posted on October 23, 2006

Sorry for the messed-up 0.3.2 release. I hope this one goes better…

This release not only brings full Ferret 0.10.x compatibility, it also features hpricot, the Fast, Enjoyable HTML Parser for Ruby. Thanks to hpricot, indexing of HTML files with RDig just got an order of magnitude faster.

As always, please try it out and feel free to contact me if something breaks.

RDig 0.3.2 - Ferret compatibility release

Posted on October 09, 2006

Quite late, but finally RDig is compatible with recent Ferret releases (that is, the 0.10.x series).

I plan to announce a new version of RDig featuring the cool hpricot HTML parser soon. Watch this space for news :-)

Btw, don’t miss the very interesting interview with Dave Balmain!

RDig

Posted on September 10, 2006

RDig is a standalone full text indexer and crawler, similar to HtDig. It can crawl a website via HTTP or index a set of documents on a local file system.

It’s implemented in Ruby and uses Ferret for the indexing. RDig is available as a gem, so you can use

gem install rdig

to install it.

Rubyforge Project page

Another Rails app launched

Posted on May 25, 2006

Recently we at webit! relaunched the website of Evangelisch-Lutherisches Landesjugendpfarramt Sachsens. Besides lots of static, CMS-generated content there are several Rails powered areas:

  • an online shop including a backend with order and delivery tracking
  • a database of play texts
  • the site search, powered by RDig.

RDig 0.3.0

Posted on April 28, 2006

In addition to crawling web sites, RDig can now index local documents. Just give it one or more file:/ URLs pointing to the directories to index, optionally define some filename inclusion/exclusion patterns, and there you go.

Document locations can be rewritten to ease linking to them in a web based search frontend. To rewrite all file:/base/* URIs to http://www.mydomain.com/virtual_dir/, you say

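# Called with the URI of each document; the URI object is modified in place.
cfg.index.rewrite_uri = lambda do |uri|
  # file:/base/... becomes /virtual_dir/...
  uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  # Serve the documents via HTTP from the public host.
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end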

in your RDig config file.
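
To see what the rewrite rule above does to a single location, you can try it outside of RDig with Ruby's standard URI library. This is only a sketch: the lambda is held in a local variable instead of the RDig config, and the sample path is made up.

```ruby
require 'uri'

# The rewrite rule from above, held in a local variable instead of being
# assigned to cfg.index.rewrite_uri.
rewrite_uri = lambda do |uri|
  uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end

# Hypothetical document location below /base/.
uri = URI.parse('file:/base/manuals/intro.pdf')
rewrite_uri.call(uri)

puts uri.scheme
puts uri.host
puts uri.path
```

After the call, the URI should carry the http scheme, the www.mydomain.com host and a path starting with /virtual_dir/.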

Also there’s a new feature for PDF content extraction: titles are now extracted from PDF meta data with the help of the pdfinfo utility.
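
As a rough illustration of the idea (not RDig's actual extractor code), the title can be picked out of the "Key: value" lines that pdfinfo prints. The `pdf_title` helper and the sample output below are made up for this sketch.

```ruby
# Extracts the Title field from text in the "Key:   value" layout that
# pdfinfo prints. Returns nil when there is no title line or it is blank.
def pdf_title(pdfinfo_output)
  line = pdfinfo_output.each_line.find { |l| l.start_with?('Title:') }
  return nil unless line

  title = line.sub(/\ATitle:/, '').strip
  title.empty? ? nil : title
end

# Made-up sample in the style of pdfinfo's output.
sample = <<~OUT
  Title:          RDig User Guide
  Author:         Jane Doe
  Pages:          12
OUT

puts pdf_title(sample)
```

In a real extractor the input string would come from running pdfinfo on the file, with a fallback (for example the file name) when no title is set.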

Have fun!

RDig 0.2.1

Posted on April 20, 2006

With this release, RDig, my Ferret-based all-in-one site search solution ;-), can index PDF and MS Word files, too.

RDig delegates all the hard work to the pdftotext and wvHtml command line utilities, so you need to have the xpdf-utils and wv packages installed to use this feature.

It should also be easier now to plug in custom content extractors, as they are auto-discovered and used for the content types they declare they can handle.
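
The general pattern behind this kind of auto-discovery can be sketched with Ruby's `inherited` hook. The class and method names below are hypothetical and are not RDig's real extractor API; see the RDig documentation for that.

```ruby
# Base class that remembers every subclass as soon as it is defined.
class ContentExtractor
  def self.extractors
    @extractors ||= []
  end

  # Ruby calls this hook whenever a class inherits from ContentExtractor.
  def self.inherited(subclass)
    super
    ContentExtractor.extractors << subclass
  end

  # Returns an instance of the first extractor that declares it can
  # handle the given content type, or nil if there is none.
  def self.for(content_type)
    klass = ContentExtractor.extractors.find { |e| e.handles?(content_type) }
    klass && klass.new
  end
end

# A custom extractor only has to subclass and declare its content types.
class PlainTextExtractor < ContentExtractor
  def self.handles?(content_type)
    content_type == 'text/plain'
  end

  def extract(raw)
    raw.strip
  end
end

extractor = ContentExtractor.for('text/plain')
puts extractor.extract("  hello world \n")
```

Nothing has to be registered by hand: defining the subclass is enough for the lookup to find it.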

As always, any feedback is very welcome.

RDig documentation

Announcing RDig

Posted on March 25, 2006

RDig aims to be an easy to use tool for building and searching a full text index of the contents of a web site. It consists of an HTTP crawler and facilities to extract textual content from HTML pages, which then will be indexed using the great Ferret full text search engine.

I initially wrote this to implement the site search feature of a website where most of the contents are static HTML pages generated by a CMS, and some dynamic features of the site are implemented in Rails.

Basically, RDig takes a start URL and a number of host names to limit the crawling to, and then starts crawling the site. It comes with an executable that can be used to regularly rebuild the index, e.g. triggered by cron.
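
The host restriction boils down to a check like the following before a link is followed. This is a minimal sketch using Ruby's standard URI library; the `crawlable?` helper and the example hosts are assumptions, not RDig's internal code.

```ruby
require 'uri'

# True if the URL parses and its host is one of the allowed host names.
# Unparseable URLs and URLs without a host (e.g. mailto:) are rejected.
def crawlable?(url, include_hosts)
  host = URI.parse(url).host
  !host.nil? && include_hosts.include?(host.downcase)
rescue URI::InvalidURIError
  false
end

allowed = ['www.example.com']

puts crawlable?('http://www.example.com/about.html', allowed)
puts crawlable?('http://other.example.org/', allowed)
```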

Searches are then executed with a simple

RDig.searcher.search('your query string here')

from within the web app.

Have a look at the RDocs for further information. Installation should be as simple as

gem install rdig

once the gem has propagated through Rubyforge’s mirrors. This is the first piece of software I have released as a gem, so please notify me of any problems you encounter.