Posted on February 26, 2008
RDig is a tiny web and file system crawler built on top of the Ferret search engine. It’s one of my less active side projects and, from what I can tell, doesn’t have a very large user base. However, there are some people out there who actually use it, and some of those people even tell me so and suggest new features from time to time :-)
Limit crawling depth
You can now configure a maximum crawling depth to restrict RDig to only index pages up to this level. For example, setting config.crawler.max_depth = 1 will make RDig only index the configured start pages, and pages the start pages directly link to. You get the picture I guess.
This option is especially useful if restricting RDig to a pre-defined number of hosts is not an option for your use case, but you still don’t intend to have it crawl the whole web.
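To make the meaning of the depth limit concrete, here is a minimal sketch of a depth-limited breadth-first crawl over a small in-memory link graph. This is not RDig's crawler code: the link graph and the crawl method are made up for illustration. Start pages count as depth 0, so a maximum depth of 1 covers the start pages plus the pages they link to directly.

```ruby
# Hypothetical link graph: page => pages it links to.
links = {
  'start' => ['a', 'b'],
  'a'     => ['c'],
  'b'     => ['a', 'd'],
  'c'     => ['e'],
  'd'     => [],
  'e'     => []
}

# Breadth-first traversal that does not follow links beyond max_depth.
# Returns a hash of visited page => depth at which it was found.
def crawl(start_pages, max_depth, links)
  visited = {}
  queue = start_pages.map { |page| [page, 0] }
  until queue.empty?
    page, depth = queue.shift
    next if visited.key?(page)      # already indexed
    visited[page] = depth
    next if depth >= max_depth      # index this page, but go no deeper
    links.fetch(page, []).each { |target| queue << [target, depth + 1] }
  end
  visited
end

p crawl(['start'], 1, links).keys
```

With a maximum depth of 1, pages 'c', 'd' and 'e' are never reached, because they are two or more links away from the start page.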
HTTP proxy auth support
If you are behind a proxy and have to use HTTP Basic Authentication to get through it, you can specify the proxy URL, user name and password:
cfg.crawler.http_proxy = "http://yourproxy:8080"
cfg.crawler.http_proxy_user = "username"
cfg.crawler.http_proxy_pass = "secret"
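For readers wondering where such settings typically end up, here is a rough sketch using Ruby's standard Net::HTTP, which accepts proxy address, port, user and password when a connection object is created. Whether RDig passes the values along in exactly this way is an assumption; the target host www.example.com is a placeholder.

```ruby
require 'net/http'
require 'uri'

# Same values as in the configuration above.
proxy = URI.parse('http://yourproxy:8080')

# Net::HTTP.new takes the proxy address, port, user and password after
# the target host and port. Nothing is sent until a request is made.
http = Net::HTTP.new('www.example.com', 80,
                     proxy.host, proxy.port, 'username', 'secret')

puts http.proxy?         # whether requests will go through a proxy
puts http.proxy_address
puts http.proxy_port
```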
Under the hood
I put some work into refactoring parts of RDig in order to make integration with acts_as_ferret easier. I’ll write more about that in another post.
Get it!
RDig is available as a gem via Rubyforge.
Tagged with: rdig |
Posted on October 23, 2006
Sorry for the messed up 0.3.2 release, I hope this one goes better…
This release not only brings full Ferret 0.10.x compatibility, it also features hpricot, the Fast, Enjoyable HTML Parser for Ruby. Thanks to hpricot, indexing of HTML files with RDig just got an order of magnitude faster.
As always, please try it out and feel free to contact me if something breaks.
Posted on October 09, 2006
Quite late, but finally RDig is compatible with recent Ferret releases (that is, the 0.10.x series).
I plan to announce a new version of RDig featuring the cool hpricot HTML parser soon. Watch this space for news :-)
Btw, don’t miss the very interesting interview with Dave Balmain!
Posted on September 10, 2006
RDig is a standalone full text indexer and crawler, similar to HtDig. It can crawl a website via HTTP or index a set of documents on a local file system.
It’s implemented in Ruby and uses Ferret for the indexing. RDig is available as a gem, so you can use
gem install rdig
to install it.
Rubyforge Project page
Filed under: projects |
Posted on May 25, 2006
Recently we at webit! relaunched the website of Evangelisch-Lutherisches Landesjugendpfarramt Sachsens. Besides lots of static, CMS-generated content there are several Rails powered areas:
- an online shop including a backend with order and delivery tracking
- a database of play texts
- the site search, powered by RDig.
Posted on April 28, 2006
In addition to crawling web sites, RDig can now index local documents. Just give it one or more file:/ URLs pointing to the directories to index, optionally define some filename inclusion/exclusion patterns, and there you go.
Document locations can be rewritten to ease linking to them in a web based search frontend. To rewrite all file:/base/* URIs to http://www.mydomain.com/virtual_dir/, you say
cfg.index.rewrite_uri = lambda do |uri|
  uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end
in your RDig config file.
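To see what the rewrite rule does, the same lambda can be tried outside of RDig on a plain URI object. This is only a sketch: the file name is made up, and it assumes RDig hands the lambda a URI object and uses it as modified in place, which is what the rule above implies.

```ruby
require 'uri'

# Same rewrite rule as above, kept in a local variable for experimenting.
rewrite_uri = lambda do |uri|
  uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  uri.scheme = 'http'
  uri.host = 'www.mydomain.com'
end

uri = URI.parse('file:/base/docs/manual.pdf')  # hypothetical document
rewrite_uri.call(uri)                          # modifies uri in place

puts uri.scheme
puts uri.host
puts uri.path
```

Note that gsub! leaves the path untouched when the pattern does not match, so documents outside /base/ keep their original path while scheme and host are still replaced.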
Also there’s a new feature for PDF content extraction: titles are now extracted from PDF meta data with the help of the pdfinfo utility.
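As an illustration of the idea, the sketch below pulls the title out of pdfinfo-style output, which lists metadata as "Key: value" lines. It is not RDig's implementation: the extract_title helper and the sample output are invented for this example, and a real run would capture the output of the pdfinfo command instead of using a fixed string.

```ruby
# Illustrative pdfinfo-style output; real output depends on the document.
sample_output = <<~TEXT
  Title:          RDig User Guide
  Author:         Jane Doe
  Pages:          12
TEXT

# Returns the value of the "Title:" line, or nil if it is missing or empty.
def extract_title(pdfinfo_output)
  line = pdfinfo_output.each_line.find { |l| l.start_with?('Title:') }
  return nil unless line
  title = line.sub(/\ATitle:\s*/, '').strip
  title.empty? ? nil : title
end

puts extract_title(sample_output)
```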
Have fun!
Posted on April 20, 2006
With this release, RDig, my Ferret-based all-in-one site search solution ;-) , can index PDF and MS Word files, too.
RDig delegates all the hard work to the pdftotext and wvHtml command line utilities, so you need to have the xpdf-utils and wv packages installed to use this feature.
It should also be easier now to plug in custom content extractors: they are auto-discovered and used for the content types they declare they can handle.
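One common way to get this kind of auto-discovery in Ruby is the inherited hook, sketched below. All class and method names here (ContentExtractor, handles, extractor_for and the two subclasses) are hypothetical and chosen for illustration; RDig's real extractor classes may be organized differently.

```ruby
# Hypothetical base class: every subclass registers itself when defined.
class ContentExtractor
  @extractors = []

  class << self
    attr_reader :extractors

    # Ruby calls this hook whenever a subclass is defined.
    def inherited(subclass)
      super
      ContentExtractor.extractors << subclass
    end

    # Subclasses declare the content types they can handle.
    def handles(*types)
      @content_types = types
    end

    def content_types
      @content_types || []
    end

    # Finds the first registered extractor for the given content type.
    def extractor_for(content_type)
      ContentExtractor.extractors.find do |extractor|
        extractor.content_types.include?(content_type)
      end
    end
  end
end

class PdfExtractor < ContentExtractor
  handles 'application/pdf'
end

class WordExtractor < ContentExtractor
  handles 'application/msword'
end

p ContentExtractor.extractor_for('application/pdf')
```

Adding support for another format then only requires defining one more subclass; no central list has to be edited.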
As always, any feedback is very welcome.
RDig documentation
Posted on March 25, 2006
RDig aims to be an easy to use tool for building and searching a full text index of the contents of a web site. It consists of an HTTP crawler and facilities to extract textual content from HTML pages, which then will be indexed using the great Ferret full text search engine.
I initially wrote this to implement the site search feature of a website where most of the content consists of static HTML pages generated by a CMS, and some dynamic features of the site are implemented in Rails.
Basically RDig takes a start url, a number of host names to limit the crawling to, and then starts crawling the site. It comes with an executable that can be used to regularly rebuild the index, e.g. triggered by cron.
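The host restriction can be pictured as a simple check applied to every link the crawler finds. The following is a minimal sketch, not RDig's code; the crawlable? helper and the example hosts are assumptions made for illustration.

```ruby
require 'uri'

# Returns true if the link's host is one of the allowed host names.
# Links without a host, or links that cannot be parsed, are rejected.
def crawlable?(link, allowed_hosts)
  host = URI.parse(link).host
  !host.nil? && allowed_hosts.include?(host.downcase)
rescue URI::InvalidURIError
  false
end

allowed = ['www.example.com']
puts crawlable?('http://www.example.com/about.html', allowed)
puts crawlable?('http://other.example.org/', allowed)
```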
Searches are then executed with a simple
RDig.searcher.search('your query string here')
from within the web app.
Have a look at the RDocs for further information. Installation should be as simple as
gem install rdig
once the gem has propagated through Rubyforge’s mirrors. This is the first piece of software I have released as a gem, so please notify me of any problems you encounter.