monogreen - Tank you very much !

War Story: Finding bottlenecks in Ruby code with ruby-prof

For my job at Travel IQ , I’ve recently had the pleasure of trying to find out why our so-called cluster job suddenly started to run extremely slowly.

This job is a rake task that runs each night on our production machines – it aggregates the different hotel descriptions of the hotel providers we’ve got partnerships with (like HRS or booking.com) and tries to find out which different descriptions actually mean the same real hotel.

The algorithms used in the task are, I dare say, pretty sophisticated and do a really good job of sifting through a lot of descriptions. The praise here goes to the original author, Dr. Florian Odronitz – a more in-depth look at the clustering process might be another post some day. Due to the amount of data, the task never really ran fast as such – a couple of hours each night, but that was fast enough.

Recently, however, it would run long enough to be still running when the next job started – 24 hours later. The last job we’ve kept alive to finish ran for three weeks X(

It looked like someone had to do something. Unfortunately, that someone was me. Nobody had touched the code in months, and the number of descriptions didn’t change significantly either (that only really happens when we add or remove a partnership). Looking through the code didn’t produce any immediately obvious bottleneck.

My first guess was ActiveRecord – usually, when massaging data, the slowest parts of the process tend to be doing more queries than necessary (which is one of the downsides of using a high-level ORM system like AR – it’s easy to forget that you’re working with a DB under the object-oriented code).

Watching the queries while the task ran didn’t show any queries during the slow parts of the task, though. Dang.

Time to break out the big guns, then: Profiling the task .
Running that on my local machine produced a nice long and interesting profile that, unfortunately, didn’t look like anything was out of order. Only lots of calls to methods I was expecting to be called, and the most time doing actual work was spent down in Geokit’s distance calculation method. Since closeness to each other is the number one criterion for grouping different hotels together, this is expected – we basically need to get an accurate distance for each description against each other description. We also didn’t update the gem – so I still was at a loss.

A colleague put me to it: I had been comparing apples to oranges – I had run the profile locally, not with the production data. ’Doh !

Very, very carefully enabling profiling on a production server and running it there produced a different picture. Unfortunately, I don’t have the profile saved in a blogpost-friendly form. But the culprit was quite obvious: For about 5000 comparisons between hotels, Array#index was called about 15000 times. And that again called Array#=== 50 million times. Ouch. The whole measured part of the code spent 75% of its time calling that method.

The actual code calling the index method wasn’t even directly connected with the clustering:

on_blacklist = blacklist ? !!blacklist[1].index([cluster_element.send(blacklist[0]), candidate.send(blacklist[0])].sort) : false
on_whitelist = whitelist ? !!whitelist[1].index([cluster_element.send(whitelist[0]), candidate.send(whitelist[0])].sort) : false

That might be a little obtuse for the uninitiated. What it basically does is using the current cluster_element (a hotel description) to create a signature (an Array instance), and then using Array#index to see if the element is in the blacklist and/or whitelist array.

The problem here is that the blacklist array is currently about 10000 elements long – and that was exactly what was different between my local machine and the production data. And this was what had changed over time – the black/whitelists are used to manually override clustering results, and our product team had been busy weeding out the last erronous clusters.

This is such a classic mistake – since the above code is in an inner loop of comparisons, the amount of work to be done goes up non-linearly with each added black/whitelist entry.

The solution was also simple. I just changed the black/whitelist arrays to be Hashes, and did the lookup by seeing if the signature is a key in the Hash. Generally, this is a good habit to get into – when you do lookups, use the data structure that was designed for it !

After that, the time used for one cluster work packet dropped from 18 minutes to 30 seconds. Case closed. Write smug blog post.

One final note: People looking at profile data for the first time should not be discouraged at the mass of data a profiling usually provides. While people writing about analyzing profiles usually gloss over it (me included in this post), the process is, at least for me, not very straightforward. It involves a lot of going back and forth, trying to make sense of method called (you often see methods being called you never use directly), and looking up what the terminology used in the reports means. And you need to be quite familiar with the code to have the mental map that lets you see anything that doesn’t fit into it.

Setting the program name with $0

Here’s a quick tip that might be useful: In Ruby, you can set the title of the current Ruby process by assigning a string to $0. At Travel IQ, we’ve found this especially useful for long-running rake tasks that clean up and massage our data at night. Some of those rake tasks were running very long – and we were scratching our heads as to what they might be doing.

So we simply set the program name in every rake task and its dependencies, and the next time around, a look at htop revealed where we’ve been stuck.

One caveat, though: It seems that on Ubuntu machines, the maximum length of the program name is determined when the process is started, i.e. setting it later to something longer truncates the new value to the length of the first: http://www.ruby-forum.com/topic/187083