
RDig 0.3.5

Posted on February 26, 2008

RDig is a tiny web and file system crawler built on top of the Ferret search engine. It’s one of my less active side projects and, from what I can tell, doesn’t have a very large user base. However, there are some people out there who actually use it, and some of them even tell me so and suggest new features from time to time :-)

Limit crawling depth

You can now configure a maximum crawling depth so that RDig only indexes pages up to that level. For example, setting config.crawler.max_depth = 1 makes RDig index only the configured start pages and the pages they link to directly. You get the picture, I guess.

This option is especially useful if restricting RDig to a pre-defined number of hosts is not an option for your use case, but you still don’t intend to have it crawl the whole web.
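To make the depth rule concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It is not RDig's actual implementation: the LINKS hash is a made-up in-memory link graph standing in for real HTTP fetches, and the crawl method is a hypothetical helper.

```ruby
require 'set'

# Hypothetical link graph (an assumption for illustration only):
# each page maps to the pages it links to.
LINKS = {
  'start' => ['a', 'b'],
  'a'     => ['c'],
  'b'     => ['start', 'c'],
  'c'     => ['d'],
  'd'     => []
}.freeze

# Breadth-first crawl that never follows links beyond max_depth.
# Start pages are at depth 0, the pages they link to at depth 1, and so on.
def crawl(start_pages, max_depth, links = LINKS)
  seen  = Set.new(start_pages)
  queue = start_pages.map { |page| [page, 0] }
  order = []

  until queue.empty?
    page, depth = queue.shift
    order << page                  # "index" the page
    next if depth >= max_depth     # do not follow links any deeper

    links.fetch(page, []).each do |target|
      next if seen.include?(target)
      seen << target
      queue << [target, depth + 1]
    end
  end
  order
end

p crawl(['start'], 1)  # => ["start", "a", "b"]
```

With a maximum depth of 1, only the start page and the two pages it links to directly are visited; page c sits one level deeper and is skipped.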

HTTP proxy auth support

If you are behind a proxy and have to use HTTP Basic Authentication to get through it, you can specify the proxy URL, user name and password:

cfg.crawler.http_proxy = "http://yourproxy:8080"
cfg.crawler.http_proxy_user = "username"
cfg.crawler.http_proxy_pass = "secret"
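For the curious, settings like these roughly correspond to the proxy arguments of Ruby's standard Net::HTTP. The sketch below only illustrates that mapping and is not taken from RDig's code; proxied_http is a hypothetical helper name, and the host names are placeholders.

```ruby
require 'net/http'
require 'uri'

# Build a Net::HTTP object that talks to target_host through an
# authenticating proxy. No connection is opened until a request is made.
def proxied_http(target_host, target_port, proxy_url, proxy_user, proxy_pass)
  proxy = URI.parse(proxy_url)
  Net::HTTP.new(target_host, target_port,
                proxy.host, proxy.port, proxy_user, proxy_pass)
end

http = proxied_http('www.example.com', 80,
                    'http://yourproxy:8080', 'username', 'secret')
puts http.proxy_address  # host part of the proxy URL
puts http.proxy_port     # port part of the proxy URL
```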

Under the hood

I put some work into refactoring parts of RDig in order to make integration with acts_as_ferret easier. I’ll write more about that in another post.

Get it!

RDig is available as a gem via Rubyforge.

Regexps on steroids with Ruby 1.8.x

Posted on January 27, 2008

Ruby 1.9 comes with a powerful new regular expression engine called Oniguruma. It sports better handling of UTF-8 encoded content, plus goodies like positive and negative look-behind or named matches. Here’s a good overview of these and some more of the new features of Oniguruma.

There are two ways to get Oniguruma into a pre-1.9 Ruby: You can patch the Ruby source tree with Oniguruma and build your own Ruby, or use the Oniguruma gem, which makes it fairly easy to use the new style regular expressions in any Ruby 1.8.x project. Here’s how:

$ wget http://www.geocities.jp/kosako3/oniguruma/archive/onig-4.7.1.tar.gz
$ tar xzf onig-4.7.1.tar.gz
$ cd onig-4.7.1
$ ./configure --prefix=/usr
$ make
$ sudo make install
$ sudo gem install oniguruma

Note the prefix argument in the call to configure - it should point to the location of your current ruby installation. So if your ruby executable is located in /usr/bin, you’ll have to use /usr here as shown above.

If everything went well so far, try it out in irb:

require 'rubygems'
require 'oniguruma'
reg = Oniguruma::ORegexp.new '(?<before>.*)(a)(?<after>.*)'
match = reg.match( 'terraforming' )
puts match[0]         # => terraforming
puts match[:before]   # => terr
puts match[:after]    # => forming

The downside of not having Oniguruma patched into a self-compiled version of Ruby is that something like

'terraforming' =~ /(?<before>.*)(a)(?<after>.*)/

won’t work because it will be handled by your Ruby version’s built-in regexp engine.
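On Ruby 1.9 or later no gem is needed, because the literal syntax is handled by the built-in engine. A small sketch of named groups and look-behind on such a Ruby follows; the constant and method names are made up for this example.

```ruby
# Named groups in a plain regexp literal (Ruby 1.9 or later).
NAMED = /(?<before>.*)(a)(?<after>.*)/

# Returns the text before and after the letter "a", or nil without a match.
def split_around_a(text)
  match = NAMED.match(text)
  match && [match[:before], match[:after]]
end

# Positive look-behind: digits directly preceded by a dollar sign.
PRICES = /(?<=\$)\d+/
# Negative look-behind: whole numbers NOT preceded by a dollar sign.
NON_PRICES = /(?<!\$)\b\d+/

p split_around_a('terraforming')               # => ["terr", "forming"]
p 'costs $42 for 7 items'.scan(PRICES)         # => ["42"]
p 'costs $42 for 7 items'.scan(NON_PRICES)     # => ["7"]
```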

acts_as_ferret 0.4.3

Posted on November 18, 2007

Long time since the last release (not counting the short-lived 0.4.2 …), and I guess most people already use trunk anyway, but for the faint of heart, here’s the new stable version of your favourite Rails fulltext search plugin.

As always, get it via svn from svn://projects.jkraemer.net/acts_as_ferret/tags/stable/acts_as_ferret. More installation information can be found on the acts_as_ferret Trac site.

No big news feature-wise; I already wrote about the more important features when I added them to trunk.

Going through the timeline looking for cool features I hadn’t written about yet, I found several smaller things worth mentioning:

Dynamic document specific boosts

This comes in handy if you want to have search results automatically ranked by a criterion that is different for each record, e.g. the popularity of an article in your shop:


class Article
  acts_as_ferret :boost => :popularity
  def popularity
    # return dynamic boost value for this document
  end
end
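What the boost method returns is up to you. As a purely hypothetical sketch, the class below derives a boost from a sales counter, assuming (as in Lucene-style engines) that 1.0 means "no boost". The sales_count attribute and the logarithmic formula are inventions for this example, and the acts_as_ferret call is only mentioned in a comment so the snippet runs without Rails.

```ruby
class Article
  attr_reader :title, :sales_count

  def initialize(title, sales_count)
    @title = title
    @sales_count = sales_count
  end

  # In a Rails model, `acts_as_ferret :boost => :popularity` would go here.

  # Hypothetical boost: 1.0 for an article that never sold, growing
  # slowly (logarithmically) with the number of sales.
  def popularity
    1.0 + Math.log10(1 + sales_count)
  end
end

puts Article.new('Unsold gadget', 0).popularity  # => 1.0
```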

You may also apply the dynamic boost to a specific field (or even different boosts to different fields), so it only is applied when a hit occurs in the boosted field. This way you can choose at query time if you want to have the boosting applied or not. Just query either the boosted fields, or the normal ones:


class Article
  acts_as_ferret :fields => {
                   :title         => {},
                   :boosted_title => { :boost => :rating }
                 }
  def rating
    # return rating of this article
  end

  # value for the boosted title field
  def boosted_title
    title
  end
end

New and better start/stop scripts

The DRb server now has a unified start/stop script, and it ships with scripts for using it as a Windows system service. Thanks to Peter Jones and Herryanto Siatono for contributing these.

Also, the acts_as_ferret gem now comes with an installer that will install the server script and sample config into your Rails project:


$ gem install acts_as_ferret
$ rails test
$ cd test/
$ aaf_install
$ script/ferret_server -e production start

And your DRb server is up and running. Easy, isn’t it?

No more :remote => true

Last but not least, aaf now is a bit more clever and goes into remote mode automatically if the DRb server is configured for the current environment. If for whatever reason you don’t want that, use :remote => false.

Railsconf Europe Roundup

Posted on September 20, 2007

With more than 3 days full of Rails and fun and meeting cool people from all around the world behind me, here’s what I took home from Railsconf Europe:

Sun and ThoughtWorks are really pushing JRuby

JRuby has been covered in 4 talks by speakers from both Sun and ThoughtWorks, who also were among the diamond sponsors of the conference.

It really looks like JRuby is ready for real life usage in J2EE environments, effectively bringing Rails into the so-called enterprise through the backdoor. Or, as Ola Bini put it,

JRuby is just another Java library.

Pretty impressed, I immediately tried this stuff out - and it really works :-) Time to ditch Grails and have some fun with my favourite language instead.

Selenium

Till Vollmer’s presentation of the browser-based testing tool Selenium was quite impressive. I definitely have to try this out.

Webistrano

At Rejectconf, which was organized by the Berlin Rails User Group, Jonathan Weiss gave a short introduction to Webistrano, a web-based frontend to Capistrano. It’s a great way to get designers and editors without a local Ruby/Capistrano installation involved in the development process.

Report #12

Besides telling us that it’s time to put the party hats and James Dean jackets aside for a while, and instead get some real work done (and of course have fun) with Rails, DHH introduced report #12, which basically is a list of well-tested patches that have been approved by at least 3 other people. This formalization of the peer-review process is a really good thing as it takes some of the work load away from the core team and makes it easier for people to contribute.

Rails developers are happier

We all knew this, but it’s good that even Sun’s Craig R. McClanahan noted in his inspiring talk that he never saw as many friendly and happy people at JavaOne as he saw at Railsconf Europe. He also admitted that going back to Java after having worked with Rails is not that pleasant. Just to remind you - he’s working at the company whose ticker symbol is JAVA.

All in all it was a great event, and even with Dr Nic speaking at the same time my own talk was pretty well attended.

Raise your hand please if you're using Ferret

Posted on September 09, 2007

Having worked on acts_as_ferret for more than a year and a half now, I’m really interested in why and how people are using it in their applications. Also, a list of Ferret-powered projects will be a good starting point for people who are still looking for a search solution for their Rails app.

So if you’re using Ferret, please drop me a line, comment here, or even add your application to the Powered by Ferret page over at the Ferret project’s Trac. Ideally you would also post some facts like index size, your production environment or even performance numbers. Personally I’d also like to know if you’re using acts_as_ferret, or, if that’s not the case, why you decided to go with pure Ferret.

I’ll also mention some sites that make use of Ferret in my talk at Railsconf Europe next week. So if you’re looking for some publicity, what are you waiting for?

Faster indexing with acts_as_ferret

Posted on September 02, 2007

Does your application operate on large chunks of records that are indexed by acts_as_ferret? If the answer is yes, then this is for you:

By combining two brand new features of acts_as_ferret you may now speed up batch operations like this: First, disable acts_as_ferret indexing for the model class in question. Then do your updates, but be sure to remember the primary keys of the modified records. After that, re-enable acts_as_ferret and index all modified or created records at once:

Model.disable_ferret
# create or modify records here, collect ids in id_array
Model.enable_ferret
Model.bulk_index(id_array)

You may also use the block syntax to have aaf be re-enabled automatically:

id_array = []
Model.disable_ferret do
  # create or modify records here, collect ids in id_array
end
Model.bulk_index(id_array)
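If you wonder how a block form like this can guarantee that indexing is switched back on, here is a sketch of the general pattern using a stand-in class. FakeModel and batch_update are hypothetical and do not show aaf's real internals; in particular, re-enabling inside an ensure clause is an assumption about how one would write it.

```ruby
# Stand-in for a model class; it only tracks an on/off flag and the ids
# passed to bulk_index.
class FakeModel
  @ferret_enabled = true
  @indexed_ids = []

  class << self
    attr_reader :indexed_ids

    def ferret_enabled?
      @ferret_enabled
    end

    def enable_ferret
      @ferret_enabled = true
    end

    # Without a block: just switch indexing off.
    # With a block: switch it off, run the block, and switch it back on
    # even if the block raises.
    def disable_ferret
      @ferret_enabled = false
      return unless block_given?
      begin
        yield
      ensure
        enable_ferret
      end
    end

    def bulk_index(ids)
      @indexed_ids.concat(ids)
    end
  end
end

# Collect the ids touched inside the block, then index them in one go.
def batch_update(model, ids)
  touched = []
  model.disable_ferret do
    ids.each { |id| touched << id }  # create or modify records here
  end
  model.bulk_index(touched)
  touched
end

batch_update(FakeModel, [1, 2, 3])
p FakeModel.indexed_ids  # => [1, 2, 3]
```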

Pagination goodness

Posted on August 27, 2007

Finally, implementing pagination for your acts_as_ferret search results is as easy as it should be (at least if you’re using aaf trunk; everybody else will have to wait for the soon-to-be-released 0.4.2):


@results = Model.find_with_ferret params[:query], :page => params[:page], 
                                                  :per_page => 10

Acts_as_ferret’s SearchResults class now gives you all you need to implement a helper rendering your pagination links:


@results.page           # => current page
@results.page_count     # => total number of pages
@results.previous_page  # => index of previous page or nil if on the first page
@results.next_page      # => index of next page or nil if on the last page
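A helper built on these four methods could look roughly like the sketch below. FakeResults is a stand-in for the real SearchResults class (assuming 1-based page numbers), and pagination_links is a hypothetical helper name; the block supplies the URL for a given page so the sketch stays independent of Rails routing.

```ruby
# Stand-in for acts_as_ferret's SearchResults, limited to the four
# pagination methods. Page numbers are assumed to start at 1.
FakeResults = Struct.new(:page, :page_count) do
  def previous_page
    page > 1 ? page - 1 : nil
  end

  def next_page
    page < page_count ? page + 1 : nil
  end
end

# Renders "previous / current / next" links. The block maps a page
# number to a URL.
def pagination_links(results)
  links = []
  if results.previous_page
    links << %(<a href="#{yield(results.previous_page)}">&laquo; Previous</a>)
  end
  links << "Page #{results.page} of #{results.page_count}"
  if results.next_page
    links << %(<a href="#{yield(results.next_page)}">Next &raquo;</a>)
  end
  links.join(' | ')
end

puts pagination_links(FakeResults.new(1, 3)) { |n| "/search?page=#{n}" }
# prints: Page 1 of 3 | <a href="/search?page=2">Next &raquo;</a>
```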

Best of all, this even works when you combine your query with ActiveRecord conditions.

Hint: really lazy people install the will_paginate plugin and use its will_paginate helper method to get their pagination links for free!

Soap4r / openSSL woes

Posted on July 27, 2007

While testing a SOAP service that’s only reachable via HTTPS, I had kind of a hard time before I got soap4r 1.5.7 to connect successfully.

It always bailed out with an SSLError when trying to fetch the wsdl file from the server:

at depth 0 - 20: unable to get local issuer certificate
/usr/lib/ruby/gems/1.8/gems/httpclient-2.1.0/lib/httpclient.rb:950:in connect: certificate verify failed (OpenSSL::SSL::SSLError)

The certificate in question is shown as valid and correct in major browsers; I’m still not sure why OpenSSL behaves like that.

Getting around this problem on the Soap4r side turned out to be a bit tricky, so here’s what I did. Maybe it saves somebody else some hours:

After hunting around in Soap4r’s code I found out that a set of properties gets loaded from a file named soap/property, which is looked for in $:. When connecting via HTTP, Soap4r takes all props starting with client.protocol.http from this file and hands them through to the HTTP client lib.

That knowledge combined with those hints made me put

client.protocol.http.ssl_config.verify_mode=OpenSSL::SSL::VERIFY_NONE

into lib/soap/property, et voilà - OpenSSL still grumbles about the certificate but at least doesn’t throw errors at me any more.
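Keep in mind that VERIFY_NONE switches certificate checking off entirely. Verification error 20 generally means OpenSSL could not find the issuer's certificate in its own trust store, so an alternative worth trying is to keep verification on and point the client at a CA bundle that contains the issuer. The sketch below shows that idea with Ruby's standard Net::HTTP rather than Soap4r; verified_https is a hypothetical helper, and the host and bundle path are placeholders.

```ruby
require 'net/http'
require 'openssl'

# Build an HTTPS client that verifies the server certificate against an
# explicitly given CA bundle. Nothing is read or connected until the
# first request is made.
def verified_https(host, ca_file, port = 443)
  http = Net::HTTP.new(host, port)
  http.use_ssl     = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER
  http.ca_file     = ca_file
  http
end

# Placeholder values; substitute the real host and CA bundle.
http = verified_https('soap.example.com', '/path/to/ca-bundle.pem')
puts http.verify_mode == OpenSSL::SSL::VERIFY_PEER  # => true
```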

acts_as_ferret 0.3.1

Posted on January 21, 2007

With several minor fixes and some small extensions like the various ways to conditionally disable automatic index updates I would not call this a big release.

However it’s the first release that is available as a RubyGem for system-wide installation, too. After installing the gem all you need to do is add the line

require 'acts_as_ferret'

to your environment.rb.

The API docs for the latest release can now be found at RubyForge, too. I’m still unsure whether I should move the subversion repo over to RubyForge as well. With the move I’d lose the cool Trac/subversion integration features, but on the other hand, Trac has a real spam problem anyway. Maybe I’ll look for an alternative, more spam-proof wiki/bug reporting platform to replace Trac entirely. Ideas, anybody?

acts_as_ferret covered in a book!

Posted on December 19, 2006

Christian Hellsten and Jarkko Laine feature Ferret and acts_as_ferret in their book Beginning Ruby on Rails E-Commerce.

As a part of their example project, an online book store, they show how to realize a full text search with acts_as_ferret, from the acts_as_ferret statement in the model up to the controller action and corresponding view. They even cover how to index content from related objects - book authors in this case - through an indexed instance method.

Integration test your file uploads

Posted on December 13, 2006

Surprisingly (given Rails’ generally good testing support), it is not possible to do multipart post requests in integration tests.

At least until now ;-) What is needed to allow for multipart post requests is a method that builds the correctly encoded body from the given parameters, including correct MIME boundaries and headers for the different parts. For easy reuse I put this into a small plugin.
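To give an idea of what such a method has to do, here is a minimal sketch of a multipart/form-data body builder. It is not the plugin's actual code: multipart_body and the UploadedFile struct are hypothetical names, and the sketch assumes text payloads with compatible encodings.

```ruby
require 'securerandom'

# Hypothetical value object describing a file to upload.
UploadedFile = Struct.new(:filename, :content_type, :data)

# Builds a multipart/form-data body from a params hash. Returns the body
# and the matching Content-Type header value (which carries the boundary).
def multipart_body(params, boundary = SecureRandom.hex(16))
  parts = params.map do |name, value|
    if value.is_a?(UploadedFile)
      "--#{boundary}\r\n" \
      "Content-Disposition: form-data; name=\"#{name}\"; " \
      "filename=\"#{value.filename}\"\r\n" \
      "Content-Type: #{value.content_type}\r\n\r\n" \
      "#{value.data}\r\n"
    else
      "--#{boundary}\r\n" \
      "Content-Disposition: form-data; name=\"#{name}\"\r\n\r\n" \
      "#{value}\r\n"
    end
  end
  # The closing delimiter is the boundary followed by two more dashes.
  [parts.join + "--#{boundary}--\r\n",
   "multipart/form-data; boundary=#{boundary}"]
end

file = UploadedFile.new('notes.txt', 'text/plain', 'hello')
body, content_type = multipart_body({ 'title' => 'My notes', 'upload' => file })
puts content_type
puts body
```

The returned content type has to be sent as the request's Content-Type header, since the receiving side needs the boundary to split the body into its parts again.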

Installation

script/plugin install svn://projects.jkraemer.net/plugins/multipart_integration_test

read on for usage notes…

RDig 0.3.3

Posted on October 23, 2006

Sorry for the messed up 0.3.2 release, I hope this one goes better…

This release not only brings full Ferret 0.10.x compatibility, it also features hpricot, the Fast, Enjoyable HTML Parser for Ruby. Thanks to hpricot, indexing of HTML files with RDig just got an order of magnitude faster.

As always, please try it out and feel free to contact me if something breaks.

RDig 0.3.2 - Ferret compatibility release

Posted on October 09, 2006

Quite late, but finally RDig is compatible with recent Ferret releases (that is, the 0.10.x series).

I plan to announce a new version of RDig featuring the cool hpricot html parser soon. Watch this space for news :-)

Btw, don’t miss the very interesting interview with Dave Balmain!

Ferret accepting donations

Posted on September 23, 2006

David Balmain, the creator of Ferret, recently announced that he is accepting donations to help with the future development of Ferret.

Starting with a complete port of Lucene to Ruby last fall, David has done an amazing job by implementing the whole beast in C during the last months (70000 lines of code, that is). Not to mention the great support he’s giving people on the Ferret mailing list. This guy really doesn’t seem to sleep ;-)

So, if you make any money from Ferret, or just think “Man, that’s cool!”, don’t hesitate to hit the Make a Donation button over there!

RDig

Posted on September 10, 2006

RDig is a standalone full text indexer and crawler, similar to HtDig. It can crawl a website via HTTP or index a set of documents on a local file system.

It’s implemented in Ruby and uses Ferret for the indexing. RDig is available as a gem, so you can use

gem install rdig

to install it.

Rubyforge Project page