Lessons Learned: Keeping a webservice consistent
Here are the slides of the talk I held on the 3rd of February 2011 at the Rug::B Berlin.
Rails Windows Installer
Just a quick repost of this article: http://www.rubyinside.com/rails-installer-ruby-and-rails-on-windows-in-a-single-install-4201.html , and the link to the Ruby Windows Installer .
Although I personally don’t develop on Windows anymore, I started like many others with the Ruby One-Click installer, and I was quite thankful at the time that it just worked – let’s hope the rails installer provides the same ease of getting started, only with even more of a complete setup.
War Story: Finding bottlenecks in Ruby code with ruby-prof
For my job at Travel IQ , I’ve recently had the pleasure of trying to find out why our so-called cluster job suddenly started to run extremely slowly.
This job is a rake task that runs each night on our production machines – it aggregates the different hotel descriptions of the hotel providers we’ve got partnerships with (like HRS or booking.com) and tries to find out which different descriptions actually mean the same real hotel.
The algorithms used in the task are, I dare say, pretty sophisticated and do a really good job of sifting through a lot of descriptions. The praise here goes to the original author, Dr. Florian Odronitz – a more in-depth look at the clustering process might be another post some day. Due to the amount of data, the task never really ran fast as such – a couple of hours each night, but that was fast enough.
Recently, however, it would run long enough to be still running when the next job started – 24 hours later. The last job we’ve kept alive to finish ran for three weeks X(
It looked like someone had to do something. Unfortunately, that someone was me. Nobody had touched the code in months, and the number of descriptions didn’t change significantly either (that only really happens when we add or remove a partnership). Looking through the code didn’t produce any immediately obvious bottleneck.
My first guess was ActiveRecord – usually, when massaging data, the slowest parts of the process tend to be doing more queries than necessary (which is one of the downsides of using a high-level ORM system like AR – it’s easy to forget that you’re working with a DB under the object-oriented code).
Watching the queries while the task ran didn’t show any queries during the slow parts of the task, though. Dang.
Time to break out the big guns, then: Profiling the task .
Running that on my local machine produced a nice long and interesting profile that, unfortunately, didn’t look like anything was out of order. Only lots of calls to methods I was expecting to be called, and the most time doing actual work was spent down in Geokit’s distance calculation method. Since closeness to each other is the number one criterion for grouping different hotels together, this is expected – we basically need to get an accurate distance for each description against each other description. We also didn’t update the gem – so I still was at a loss.
A colleague put me to it: I had been comparing apples to oranges – I had run the profile locally, not with the production data. ’Doh !
Very, very carefully enabling profiling on a production server and running it there produced a different picture. Unfortunately, I don’t have the profile saved in a blogpost-friendly form. But the culprit was quite obvious: For about 5000 comparisons between hotels, Array#index was called about 15000 times. And that again called Array#=== 50 million times. Ouch. The whole measured part of the code spent 75% of its time calling that method.
The actual code calling the index method wasn’t even directly connected with the clustering:
on_blacklist = blacklist ? !!blacklist[1].index([cluster_element.send(blacklist[0]), candidate.send(blacklist[0])].sort) : false on_whitelist = whitelist ? !!whitelist[1].index([cluster_element.send(whitelist[0]), candidate.send(whitelist[0])].sort) : false
That might be a little obtuse for the uninitiated. What it basically does is using the current cluster_element (a hotel description) to create a signature (an Array instance), and then using Array#index to see if the element is in the blacklist and/or whitelist array.
The problem here is that the blacklist array is currently about 10000 elements long – and that was exactly what was different between my local machine and the production data. And this was what had changed over time – the black/whitelists are used to manually override clustering results, and our product team had been busy weeding out the last erronous clusters.
This is such a classic mistake – since the above code is in an inner loop of comparisons, the amount of work to be done goes up non-linearly with each added black/whitelist entry.
The solution was also simple. I just changed the black/whitelist arrays to be Hashes, and did the lookup by seeing if the signature is a key in the Hash. Generally, this is a good habit to get into – when you do lookups, use the data structure that was designed for it !
After that, the time used for one cluster work packet dropped from 18 minutes to 30 seconds. Case closed. Write smug blog post.
One final note: People looking at profile data for the first time should not be discouraged at the mass of data a profiling usually provides. While people writing about analyzing profiles usually gloss over it (me included in this post), the process is, at least for me, not very straightforward. It involves a lot of going back and forth, trying to make sense of method called (you often see methods being called you never use directly), and looking up what the terminology used in the reports means. And you need to be quite familiar with the code to have the mental map that lets you see anything that doesn’t fit into it.
Setting the program name with $0
Here’s a quick tip that might be useful: In Ruby, you can set the title of the current Ruby process by assigning a string to $0. At Travel IQ, we’ve found this especially useful for long-running rake tasks that clean up and massage our data at night. Some of those rake tasks were running very long – and we were scratching our heads as to what they might be doing.
So we simply set the program name in every rake task and its dependencies, and the next time around, a look at htop revealed where we’ve been stuck.
One caveat, though: It seems that on Ubuntu machines, the maximum length of the program name is determined when the process is started, i.e. setting it later to something longer truncates the new value to the length of the first: http://www.ruby-forum.com/topic/187083
Weird shit with Ruby's float formatting
>> 1.upto 50 do |x| f = ('1.' + ('1' * x) + '5' ); puts "%.#{x}f % #{f}: " + "%.#{x}f" % f.to_f end
%.1f % 1.15: 1.1
%.2f % 1.115: 1.11
%.3f % 1.1115: 1.111
%.4f % 1.11115: 1.1112
%.5f % 1.111115: 1.11112
%.6f % 1.1111115: 1.111112
%.7f % 1.11111115: 1.1111111
%.8f % 1.111111115: 1.11111111
%.9f % 1.1111111115: 1.111111112
%.10f % 1.11111111115: 1.1111111112
%.11f % 1.111111111115: 1.11111111112
%.12f % 1.1111111111115: 1.111111111111
%.13f % 1.11111111111115: 1.1111111111112
%.14f % 1.111111111111115: 1.11111111111111
%.15f % 1.1111111111111115: 1.111111111111112
%.16f % 1.11111111111111115: 1.1111111111111112
%.17f % 1.111111111111111115: 1.11111111111111116
%.18f % 1.1111111111111111115: 1.111111111111111160
%.19f % 1.11111111111111111115: 1.1111111111111111605
%.20f % 1.111111111111111111115: 1.11111111111111116045
%.21f % 1.1111111111111111111115: 1.111111111111111160454
%.22f % 1.11111111111111111111115: 1.1111111111111111604544
%.23f % 1.111111111111111111111115: 1.11111111111111116045436
%.24f % 1.1111111111111111111111115: 1.111111111111111160454357
%.25f % 1.11111111111111111111111115: 1.1111111111111111604543567
%.26f % 1.111111111111111111111111115: 1.11111111111111116045435665
%.27f % 1.1111111111111111111111111115: 1.111111111111111160454356650
%.28f % 1.11111111111111111111111111115: 1.1111111111111111604543566500
%.29f % 1.111111111111111111111111111115: 1.11111111111111116045435665001
%.30f % 1.1111111111111111111111111111115: 1.111111111111111160454356650007
%.31f % 1.11111111111111111111111111111115: 1.1111111111111111604543566500070
%.32f % 1.111111111111111111111111111111115: 1.11111111111111116045435665000696
%.33f % 1.1111111111111111111111111111111115: 1.111111111111111160454356650006957
%.34f % 1.11111111111111111111111111111111115: 1.1111111111111111604543566500069574
%.35f % 1.111111111111111111111111111111111115: 1.11111111111111116045435665000695735
%.36f % 1.1111111111111111111111111111111111115: 1.111111111111111160454356650006957352
%.37f % 1.11111111111111111111111111111111111115: 1.1111111111111111604543566500069573522
%.38f % 1.111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216
%.39f % 1.1111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161
%.40f % 1.11111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614
%.41f % 1.111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216141
%.42f % 1.1111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407
%.43f % 1.11111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614075
%.44f % 1.111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747
%.45f % 1.1111111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407471
%.46f % 1.11111111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614074707
%.47f % 1.111111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747070
%.48f % 1.1111111111111111111111111111111111111111111111115: 1.111111111111111160454356650006957352161407470703
%.49f % 1.11111111111111111111111111111111111111111111111115: 1.1111111111111111604543566500069573521614074707031
%.50f % 1.111111111111111111111111111111111111111111111111115: 1.11111111111111116045435665000695735216140747070312
ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux], MBARI 0×6770, Ruby Enterprise Edition 2009.10
Thanks to Gerriet Riessen: http://gist.github.com/631981
Encoding error while parsing JSON
You know, given how much JSON is loved by the Ruby and other communities, I sometimes think the tools are not quite as professional yet as I would like.
Case in point: An error I debugged today for Travel IQ, which occured occasionally in our API (which is public – become an affiliate now and earn us yourself some money !).
That’s a Rails 2.3.5 app, with the json and json_pure gems at 1.4.6 .
The helpful exception was:
JSON::ParserError: 345: unexpected token at '{"id":1920944,"name":"Homestead Seattle-Bellevue","street":"3700 132nd Ave Se","zip":"98006","city":"Bellevue"
...
The rest of the json fills some pages. Thanks, JSON gem, for this helpful message !
JSONLint also gave me this error:
syntax error, unexpected TINVALID at line 2 Parsing failed
At least it reformats the JSON and gave me the line.
With trial and error, I could narrow it down to the hotel description text:
>> puts "Homestead Seattle \x1A Bellevue" Homestead Seattle Bellevue
Alright, that’s an escaped ASCII character, the decimal value is 26, referenced as “Substitute” – one of the nonprintable ASCII characters.
Here is a very nice post explaining (for a different example, the famous non-breaking-space) different ways of escaping.
Back to my problem:
>> JSON.load("Das Homestead Seattle \032 Bellevue befindet sich in Bellevue, Wash.".to_json)
JSON::ParserError: 705: unexpected token at '"Das Homestead Seattle \u001a Bellevue befindet sich in Bellevue, Wash."'
from /usr/local/lib/ruby/gems/1.8/gems/json-1.4.6/lib/json/common.rb:146:in `parse'
from /usr/local/lib/ruby/gems/1.8/gems/json-1.4.6/lib/json/common.rb:146:in `parse'
Now that’s just weak. The to_json method produces JSON that can’t be parsed by JSON.parse ? I think the to_json method originates somewhere in Rails, but still…
Oh well, in this case it worked by replacing the to_json call with JSON.dump .
Phusion Passenger and bundler error
I’ve been promising myself that I would try to share my coding experiences, especially bugs with nonobvious solutions, with the rest of the world. After all, how many times was I saved by googling some error message I couldn’t make heads or tails of, and someone already had an answer…
So hear this: Using Phusion Passenger but not trying to use bundler, you find this error message when starting Passenger:
"no such file to load -- bundler"
The solution for me: Check for a Gemfile in the Rails root. Apparantly Passenger tries to be helpful and tries to load bundler, which fails unless you’ve got it installed (which in my case, I couldn’t). If you don’t use bundler anyway, throw the file out.