About autocompleters and page caching
At Travel IQ, we’ve recently had to spend some time revising how the autocompleter for our cities and regions search form works, and we came across something that surprised us.
For the impatient, here’s the moral of the story up front: caching can be quite harmful to performance when it’s applied to the wrong thing.
Sounds like an obvious truism (“Something is bad when done wrong”), but let me elaborate:
We’ve always used page caching for the backend part of the autocompleters. The SQL that actually retrieves the results had to join across a couple of tables, needed partial matching of strings, and had an elaborate ORDER BY clause, so it wasn’t very fast. No biggie: most of the requests would be the same anyway (big cities like Berlin), so we’d just cache the results as files on disk and be done with it. Repeated requests would be as fast as can be, since the responses are just files delivered by the webserver straight from the disk. Rails makes this dead easy.
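In case the mechanism isn’t familiar: with page caching, the first request for a URL is rendered by the application and the output is saved as a static file, so later requests for the same URL never reach the application at all. Here is a rough sketch of that idea in plain Ruby. The `PageCache` class and the one-file-per-search-term naming are hypothetical illustrations, not Rails’ actual implementation.

```ruby
require "fileutils"
require "tmpdir"
require "uri"

# Hypothetical illustration of file-based page caching: every distinct
# search term ends up as its own file on disk.
class PageCache
  attr_reader :writes

  def initialize(root)
    @root = root
    @writes = 0
  end

  # Returns the cached body if a file for this term exists. Otherwise it
  # renders the body via the block, writes it to disk and returns it.
  # (In production the webserver would serve existing files directly.)
  def fetch(term)
    path = File.join(@root, "#{URI.encode_www_form_component(term)}.json")
    return File.read(path) if File.exist?(path)

    body = yield
    FileUtils.mkdir_p(@root)
    File.write(path, body)
    @writes += 1
    body
  end
end

Dir.mktmpdir do |dir|
  cache = PageCache.new(dir)
  cache.fetch("berlin") { '["Berlin"]' } # miss: render + file write
  cache.fetch("berlin") { '["Berlin"]' } # hit: read from disk
  cache.fetch("husum")  { '["Husum"]' }  # another miss, another file
  puts "files written: #{cache.writes}"
end
```

The detail that matters for the rest of the story: every search term that hasn’t been seen before costs a render *and* a file write.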
That worked well until we started airing our TV spot for www.hotelauskunft.de. Shortly after each airing, we noticed that the autocompleter slowed waaay down; at peak times it was basically unusable. Autocompleters need to be fast, otherwise they’re useless.
How could that be? Wasn’t everything page cached? Our Apache monitoring showed high IO wait percentages: something was using the disk a whole lot.
Armed with an irb console, I poked around the Apache access logs and grepped out all the strings people had entered into the hotel autocompleter over the course of two days.
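The post doesn’t say exactly how the search strings were pulled out of the logs. A minimal sketch of one way to do it, assuming Apache’s common/combined log format and a hypothetical autocompleter path `/hotels/autocomplete` with a hypothetical `term` parameter (the real URL isn’t given here). It groups the entered strings by the date in the log line, giving a hash of arrays like the `hits` used in the irb session below, where the days are simply keyed `'1'` and `'2'`.

```ruby
require "uri"

# Hypothetical path and parameter name; adjust to the real autocompleter URL.
AUTOCOMPLETE_PATH = "/hotels/autocomplete"
PARAM = "term"

# Captures the date and the request target of a common/combined log line, e.g.
# 1.2.3.4 - - [10/Oct/2008:13:55:36 +0200] "GET /hotels/autocomplete?term=ber HTTP/1.1" 200 512
LOG_LINE = %r{\[(\d{2}/\w{3}/\d{4}):[^\]]*\] "GET (\S+) HTTP/[\d.]+"}

# Returns { "10/Oct/2008" => ["Berlin", ...], ... } with one entry per request.
def autocomplete_hits(lines)
  hits = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    m = LOG_LINE.match(line)
    next unless m

    day, target = m[1], m[2]
    path, query = target.split("?", 2)
    next unless path == AUTOCOMPLETE_PATH && query

    begin
      term = URI.decode_www_form(query).to_h[PARAM]
    rescue ArgumentError
      next # skip malformed query strings
    end
    hits[day] << term if term && !term.empty?
  end
  hits
end

# Usage: hits = autocomplete_hits(File.foreach("access.log"))
```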
irb(main):021:0> hits['1'].size
=> 10175
irb(main):022:0> hits['2'].size
=> 12972
Hmm, how often do people search for the same thing?
irb(main):025:0> hits['1'].uniq.size
=> 5177
irb(main):026:0> hits['2'].uniq.size
=> 6163
Waitaminute: about half of the requests on any given day are unique?
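Those two numbers already put a ceiling on what the page cache could achieve. Starting from an empty cache, every distinct string misses once (a render plus a file write) and only the repeats can hit. A small sketch of that arithmetic, using the counts from the irb session above; `max_hit_ratio` is just an illustrative helper name.

```ruby
# Best-case hit ratio for a page cache that starts out empty: every distinct
# search string misses (and writes a file) exactly once, every repeat hits.
def max_hit_ratio(total_requests, distinct_requests)
  (total_requests - distinct_requests).fdiv(total_requests)
end

# Counts from the irb session: day => [all requests, distinct requests].
{ "1" => [10_175, 5_177], "2" => [12_972, 6_163] }.each do |day, (total, distinct)|
  printf("day %s: at most %.1f%% cache hits, %d new files written\n",
         day, 100 * max_hit_ratio(total, distinct), distinct)
end
```

So even in the best case, roughly every other autocompleter request still went to the database and, on top of that, wrote a new file to disk.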
irb(main):030:0> (hits['1'] + hits['2']).size
=> 23147
irb(main):029:0> (hits['1'] & hits['2']).size
=> 1455
Oh, and only about 1,500 distinct search strings show up on both days, out of more than 23,000 requests. Damn, it turns out our assumption that people always search for the same thing wasn’t quite right…
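One subtlety when reading that last number: Ruby’s `Array#&` returns the *distinct* elements present in both arrays, so 1455 counts search strings, not requests. A sketch with made-up toy data contrasting that with the number of day-2 requests whose string had already been requested on day 1, which is roughly what a cache filled on day 1 could have answered the next day.

```ruby
require "set"

# Toy data for illustration only, not the real logs.
hits = {
  "1" => %w[berlin berlin hamburg kleinkleckersdorf],
  "2" => %w[berlin berlin berlin bremen hambrug],
}

# Array#& removes duplicates: distinct strings requested on both days.
shared_strings = hits["1"] & hits["2"]

# Day-2 requests whose string was already requested (and cached) on day 1.
seen_on_day1 = hits["1"].to_set
reusable_requests = hits["2"].count { |term| seen_on_day1.include?(term) }

p shared_strings     # distinct shared strings
p reusable_requests  # day-2 requests a day-1 cache could serve
```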
Counting and sorting the unique requests revealed that while, yes, many requests are repeated a lot, there is a massively long tail of small towns, regions, and typos that makes up about 75% of our daily autocompleter requests (unfortunately, I didn’t save all the numbers, so I can’t show them here).
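A sketch of that counting-and-sorting step, using `Enumerable#tally` (Ruby 2.7+; on older Rubies `group_by(&:itself)` followed by counting does the same job). Where exactly the “long tail” starts is an assumption here: the hypothetical `max_count` parameter treats strings requested at most that many times in a day as tail.

```ruby
# Most frequent search strings of a day, ties broken alphabetically.
def top_terms(requests, n = 10)
  requests.tally.sort_by { |term, count| [-count, term] }.first(n)
end

# Share of a day's requests whose string was requested at most max_count
# times that day (the "long tail"); max_count is an arbitrary cut-off.
def long_tail_share(requests, max_count: 1)
  counts = requests.tally
  tail_requests = counts.values.select { |n| n <= max_count }.sum
  tail_requests.fdiv(requests.size)
end

# Toy data for illustration only.
day = %w[berlin berlin berlin hamburg hamburg husum wyk bärlin]
p top_terms(day, 2)
p long_tail_share(day)
```

Requests in the tail are exactly the ones a page cache turns into a disk write that is never read again.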
So that’s why we had high IO and an unresponsive autocompleter – it was trying to write bajillions of new files to the disk, and then they weren’t even used more than once…
Our solution to the problem started with a simple script that reproduced what we saw during peak times. All it did was hit the autocompleter URL with random strings, and voilà, the autocompleter became unusable almost immediately.
Then we simply removed the page caching and ran the script again. The autocompleter slowed down when a lot of requests were hitting it, but it no longer became completely unresponsive.
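The script itself isn’t part of this post. A minimal sketch of the same idea, assuming a hypothetical `TARGET_URL` environment variable pointing at the autocompleter and a hypothetical `term` query parameter: a few threads fire requests with random strings, so practically every request misses any page cache, and the response times are reported at the end.

```ruby
require "net/http"
require "uri"

# Random lowercase search string of 3 to 8 letters; almost every one is
# new, which is the worst case for a page cache.
def random_term(rng = Random.new)
  Array.new(rng.rand(3..8)) { ("a".ord + rng.rand(26)).chr }.join
end

# Builds the request URI; the "term" parameter name is an assumption.
def autocomplete_uri(base, term)
  uri = URI(base)
  uri.query = URI.encode_www_form("term" => term)
  uri
end

# Each of `threads` threads sends `requests` GETs; returns durations in seconds.
def hammer(base, threads: 10, requests: 100)
  workers = Array.new(threads) do
    Thread.new do
      rng = Random.new
      Array.new(requests) do
        started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        Net::HTTP.get_response(autocomplete_uri(base, random_term(rng)))
        Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      end
    end
  end
  workers.flat_map(&:value)
end

# Only runs against a server when TARGET_URL is set, e.g.
# TARGET_URL=http://localhost:3000/hotels/autocomplete ruby hammer.rb
if ENV["TARGET_URL"]
  times = hammer(ENV["TARGET_URL"]).sort
  printf("median %.3fs, p95 %.3fs, max %.3fs\n",
         times[times.size / 2], times[(times.size * 0.95).floor], times.last)
end
```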
After that, the course was clear. We took another long, hard look at the SQL and managed to remove the joins. Query times dropped to a third of their original value. Now we actually had trouble slowing down the autocompleter with the load-testing script: it couldn’t make requests fast enough anymore to make a serious dent in the response times.
What I’ve taken away from the whole thing is this: the next time you think about page caching something, make damn sure the cached pages will actually be used by a significant number of requests. Otherwise, you’re just hurting overall performance.