Scraping a crowd-funding platform for fun and (non-)profit, part 1

Hello again dear readers (I now actually have a couple, because I’m sending the articles to friends and forcing them to read them. Hi Oana!).
Today we are going to do another somewhat pointless project. Slightly less useless than last time, I hope!

I’ve been looking at a crowdfunding website, which collects donations to user-created good causes (“projects” or “campaigns”). I’ll omit the name and URL here – it’s not that the info is private, but it seems uncouth to point at someone specifically (it’s a smallish company). They don’t talk about it much, but according to the Terms and Conditions, they are financing themselves by taking a percentage (3-8%) of donations to projects.

So I’m curious how much money they are actually moving – knowing nothing else about their finances or business model, does it seem feasible that they are cashflow-positive?

And looking at the website, they have a paginated list of projects (of reasonable size, 8 pages of 8 projects each, i.e. up to 64 projects currently), which I will assume are all that are currently active.

On the list page, they show a countdown à la “1234€ to go”. The total budget of the project is only shown on the detail page; let’s say, in our example, that would be 1500€. So that project, currently, would have 1500 – 1234 = 266€ pledged to it, earning the platform about 266 * 0.03 ~ 8€, enough for some tasty Falafel for two!
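
In code, the per-project arithmetic boils down to something like this (the 3% is just the low end of the commission range from the Terms and Conditions, so an assumption on my part):

campaign_goal    = 1500 # € total budget, from the detail page
remaining_amount = 1234 # € still "to go", from the list page
commission_rate  = 0.03 # assuming the low end of the 3-8% range

pledged  = campaign_goal - remaining_amount  # => 266
earnings = (pledged * commission_rate).round # => 8, i.e. Falafel money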

At first, we could scrape all projects and see how much money in total has been pledged so far via the platform. This of course will miss any previous projects that have been completed (if there are any), but there is no way of knowing, so everything found out here is a lower bound. It’s mostly a finger exercise, anyway…

We could do an additional step and record a scraping run, come back every couple of days, and then compare the results to see what the “velocity” of donations is, i.e. do they handle a lot of donations per day? This is of course more involved, as we’d need to record each scraping run and/or the results somewhere (a database etc.). Let’s shelve that for now. Baby steps.
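
For the record, the low-tech version of that wouldn’t need much more than appending each run’s summary to a file. A sketch of the idea, nothing that exists in the repository:

# Idea for later, not implemented: append each run's summary to a CSV,
# so that runs taken a few days apart can be compared.
require 'csv'
require 'time'

def record_run(count, total, remaining)
  CSV.open('runs.csv', 'a') do |csv|
    csv << [Time.now.utc.iso8601, count, total, remaining]
  end
end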

But I also want to take an opportunity here to do something I think will be quite easy for me to do in Ruby (I have been scraping websites and consuming APIs with Ruby extensively in a past life), and see how it’ll work in other languages.

So let’s dump some ideas on what this ought to look like:

  • It should be a command-line tool, run via a single (bash) script, i.e. ./scrape_crowdfunder. It’ll write detailed debugging info to stdout when given -v as an option (note to self: look at libraries for command-line option parsing; there’s a small sketch after this list), but otherwise will just output errors or in the best case "X campaigns, Y€ total, Z€ remaining, A€ earned".
  • The URL of where to start parsing will be given via env variable CROWDFUNDER_PROJECTS_URL so I can keep this out of the repository and protect the innocent.
  • There should be a sort of test harness which takes as input saved HTML pages for the projects index page, and one or two detail pages. A test script runs the tool and checks the output on stdout for the expected result. This is an extreme example of “integration” testing (is there a better name for this?), which allows us to swap in different language implementations. We’ll probably have to add another env variable or something to make a switch somewhere to use the canned pages instead of the real ones – see the rough sketch of such a runner after this list.
  • I’ll write the tests in Ruby, because I’m most comfortable with it, and it’s well at home on the command line, and its dynamic nature works well with testing. Performance is not really a concern here.
  • Since we want the same test to be used on different implementations, let’s make it all one big repository. In a bigger project, we’d probably want a “main” project that pulls in the different implementations as libraries, but let’s stay lowtech for now.
  • Which means we can set up the tests like this: under the main folder, have the test code and the test data, plus one folder per language implementation:
    project
      testcode1.rb
      testcode2.rb
      test-data
      ruby
        scrape_crowdfunder
      javascript
        scrape_crowdfunder

    etc.
    Then we can also run the tests via a bash script, and just give the path of the implementation under test as a parameter:
    ./test ./ruby/scrape_crowdfunder
  • We might, eventually, also look into using different HTTP clients and parallelism there (I just stumbled upon httpx – another one in a long line of Ruby HTTP clients). We’ll need to tread lightly here though, as we don’t want to hammer the site repeatedly. My impression is that there isn’t much traffic going on there, so we’d probably cause visible traffic spikes if we went all-out, and that would just be impolite.
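
For the option-parsing note above, Ruby’s built-in OptionParser would probably be enough. A minimal sketch of the command-line skeleton I have in mind – the messages and defaults are placeholders, nothing here is final:

#!/usr/bin/env ruby
# Sketch of the command-line skeleton: a -v flag plus the env variable.
require 'optparse'

options = { verbose: false }
OptionParser.new do |opts|
  opts.banner = 'Usage: scrape_crowdfunder [-v]'
  opts.on('-v', '--verbose', 'Write detailed debugging info to stdout') do
    options[:verbose] = true
  end
end.parse!

index_url = ENV['CROWDFUNDER_PROJECTS_URL']
abort 'Please set CROWDFUNDER_PROJECTS_URL' if index_url.nil? || index_url.empty?

puts "Scraping #{index_url} ..." if options[:verbose]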

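And for the test harness, a rough sketch of what the Ruby test runner could look like. How the scraper gets pointed at the canned pages in test-data is still open, so the extra env variable here is purely hypothetical:

#!/usr/bin/env ruby
# Rough sketch of the integration-style runner: ./test ./ruby/scrape_crowdfunder
scraper = ARGV.fetch(0) { abort 'Usage: ./test <path-to-scrape_crowdfunder>' }

# Hypothetical switch to make the implementation read the saved HTML pages
# from test-data instead of hitting the real site.
ENV['CROWDFUNDER_FIXTURES_DIR'] = File.expand_path('test-data', __dir__)

output = `#{scraper}`

if output =~ /\d+ campaigns, \d+€ total, \d+€ remaining, \d+€ earned/
  puts 'PASS'
else
  puts "FAIL - got: #{output.inspect}"
  exit 1
end
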
So now we have an idea where we’re going – let’s first set up the Ruby stuff and just hack something out, and then clean it up while writing the tests. I usually do tasks in this order:

  1. Just think about the whole project and write down lotsa notes
  2. Hack around to try out the rough edges
  3. Start writing tests once I know what structure the code is going to take, and then go all TDD and write code and tests in lockstep

I’m suspicious of people who claim to be really doing TDD by writing the tests before anything else. Maybe this works if you are adding a routine extension to an existing project, but if you are going green-field or doing some complex new feature? It seems like an invitation to a lot of rewriting :/

Of course, just now I’m also writing down these words here for the blog post, which I don’t usually do when working for a job. I wonder if I should? It would make everything a lot slower, but would provide seamless documentation over the long term. Also, it seems that writing things down clears up your thoughts quite a bit… Something to ponder in another post?

Let’s just get going:

$ mkdir crowdfunder_scraper
$ cd crowdfunder_scraper/
$ git init

I’m copying a .gitignore from another project, and adding a Gemfile – we’ll at least need an HTTP library, and I hate Ruby’s built-in Net::HTTP library with a passion:

$ bundle init

At this point I remember that I want to make this code public, so I create a GitHub repo and retroactively link my local one to it:

$ git remote add origin git@github.com:MGPalmer/crowdfunder_scraper.git

Now you can follow along or look at the last state here: https://github.com/MGPalmer/crowdfunder_scraper

$ mkdir ruby
$ cd ruby
$ touch scrape_crowdfunder
$ chmod +x scrape_crowdfunder

Let’s skip some more boring details and just note the results. After some back and forth, here we are:

Gemfile:

# frozen_string_literal: true

source "https://rubygems.org"

git_source(:github) {|repo_name| "https://github.com/#{repo_name}" }

gem "httpx"
gem "nokogiri"

scrape_crowdfunder:

#!/usr/bin/env ruby

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'

puts "Hello World"

Aannnnnnd:

$ ./scrape_crowdfunder
Hello World

Yaaaay we got a working Ruby script with Bundler.

At this point I go into irb, load up the gems from the Gemfile, and fiddle around.

$ irb
2.4.2 :001 > require 'rubygems'; require 'bundler/setup'; require 'httpx'; require 'nokogiri'
2.4.2 :003 > page = Nokogiri.parse(HTTPX.get("https://example.org").to_s).css("a.clicky")
etc.

(fiddle fiddle fiddle)

Yay, got it working!

Here’s the current code, left intentionally ugly:

Check it out on GitHub: https://github.com/MGPalmer/crowdfunder_scraper/commit/1784bd1b18db4230b1b099f1c943ea5d7b883413


#!/usr/bin/env ruby
# frozen_string_literal: true

require 'rubygems'
require 'bundler/setup'
require 'httpx'
require 'nokogiri'
require 'pp'

index_url  = ENV['CROWDFUNDER_PROJECTS_URL']
index_html = HTTPX.get(index_url).to_s
index      = Nokogiri.parse(index_html)

def get_n_parse(url)
  res = HTTPX.get(url)
  unless res.status == 200
    puts "AAAAAAAAAA HTTP error for #{url} - #{res.status}"
    return nil
  end
  Nokogiri.parse(res.body.to_s)
end

def parse_detail_page_urls(page)
  page.css('.campaign-details a.campaign-link').map { |a| a[:href] }
end

pages = [index]
page_urls = index.css('ul.pagination a.page-link').map { |a| a[:href] }
pp(page_urls)
page_urls.each do |page_url|
  pages << get_n_parse(page_url)
end

pages.compact!

detail_urls = []
pages.each do |page|
  detail_urls += parse_detail_page_urls(page)
end

pp(detail_urls)

campaigns = detail_urls.map do |detail_url|
  puts detail_url
  page = get_n_parse(detail_url)
  next unless page

  campaign_goal    = Integer(page.css('h5.campaign-goal').text.gsub(/€|,/, ''))
  remaining_amount = Integer(page.css('p.remaining-amount').inner_html.gsub(',', '').scan(/€(\d+)?\s/m).flatten.first)
  {
    url: detail_url,
    campaign_goal: campaign_goal,
    remaining_amount: remaining_amount
  }
end.compact

pp(campaigns)

count     = campaigns.size
total     = campaigns.inject(0) { |t, n| t + n[:campaign_goal] }
remaining = campaigns.inject(0) { |t, n| t + n[:remaining_amount] }

puts "#{count} campaigns, #{total}€ total, #{remaining}€ remaining, #{total - remaining}€ earned"

Some notes:

  • It was a little tricky getting the chain of links right: first the index page, then each pagination page, then each detail page
  • Stumbled hard over one 404 page – httpx will happily give you an empty “” body for that :/ This needs to be tested, i.e. the tests should include cases where each of the HTTP calls returns an error, and check that the script doesn’t choke on them.
  • The markup is a bit of a bitch for the amounts (total and pledged) – had to use some regexps which are a little more complex than I’m really comfortable with. This needs to be tested thoroughly so we can refactor it later.
  • Should’ve added verbose mode and a flag for it first; I ended up throwing puts and pp around a lot.
  • Adding and using a debugger would also have helped a lot, but I didn’t want to slow down for that…
  • The script runs for quite a while – it has to do a couple of dozen HTTP calls, and when the code fails in one of the later ones, it’s a real PITA to have to re-run everything (a simple on-disk cache, sketched after these notes, might help here).
  • I’ve moved repeated code into methods, but of course nothing is properly organized.
  • But note how everything happens in discrete steps, and collects data from the previous step, making it easy to inspect the data at each point, and only in the end summing up the derived information we actually want.
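
Regarding the re-run pain: one low-tech idea (not in the repository, just noting it for later) would be to cache each fetched page on disk and reuse it on the next run. Something like this, assuming the script’s existing httpx require:

# Idea only, not in the repo: cache fetched HTML on disk so a failed run
# doesn't force re-downloading every page on the next attempt.
require 'digest'
require 'fileutils'

CACHE_DIR = File.expand_path('tmp/cache', __dir__)

def fetch_with_cache(url)
  FileUtils.mkdir_p(CACHE_DIR)
  path = File.join(CACHE_DIR, Digest::SHA1.hexdigest(url))
  return File.read(path) if File.exist?(path)

  res = HTTPX.get(url)
  return nil unless res.status == 200

  html = res.body.to_s
  File.write(path, html)
  html
end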

But we want the numbers now! After running, and omitting all the debugging output:

57 campaigns, 2829964€ total, 2818168€ remaining, 11796€ earned

At, let’s say, 3% commission, this means the current projects are earning the platform about 11796 * 0.03 ~ 353€.

This buys you a lot of Falafel, but it’s not much to run a company on :/ But of course everybody has to start small, and again, we really don’t have all the facts here. But hey, our code works, even though the numbers it produces might be meaningless. Ready for a career in business intelligence 😉

Tune in next time when we clean up this mess, and add tests!
