- content_for :page_title, "How to write a scraper"
%h3= yield :page_title

%h4 What are scrapers?
%p
  To display and send alerts for planning applications, PlanningAlerts needs to download applications from as many councils as possible. As the vast majority of councils don't supply the data in a reusable,
  machine-readable format, we need to write
  = link_to "web scrapers", "https://en.wikipedia.org/wiki/Web_scraping"
  for each local government authority.
%p
  These scrapers fetch the data from council web pages and present it in a structured format so we can load it into
  the PlanningAlerts database.
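%p
  For example, a Ruby scraper might fetch a council's list of applications and pull the
  structured parts out of the HTML using the #{link_to "mechanize", "https://github.com/sparklemotion/mechanize"}
  gem. This is only a sketch; the council URL and CSS selectors below are invented for illustration:
%code.long-lines
  %pre
    :preserve
      require 'mechanize'

      agent = Mechanize.new
      # Fetch a (hypothetical) council page that lists development applications
      page = agent.get('https://www.example.gov.au/development-applications')
      # Pull the structured bits out of the HTML
      page.search('.application').each do |application|
        puts application.at('.council-reference').text.strip
      end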

%h4 How can I help?
%p
  If you have some computer programming experience, you should be able to work out how to prepare a scraper for
  PlanningAlerts. All PlanningAlerts scrapers are hosted on our #{link_to "morph.io", "https://morph.io/"}
  scraping platform, which takes care of all the boring bits of scraping for you (well, most of the boring bits!).
%p
  The next thing to do is decide which council to scrape. Once you've picked one, look it up on
  our
  = link_to "crowd-sourced list of councils", "https://docs.google.com/spreadsheet/ccc?key=0AmvYMal8CGUsdG1tM0lEWUctR194eGN6bUh0VGFfc1E"
  and have a look at their published planning applications. Quickly double-check that the council isn't #{link_to "covered already", authorities_path}.
%p
  Some systems for displaying development applications on council websites are widely used. For most of those
  we have already developed scrapers that are capable of scraping many authorities using the same system.
  Check whether the council you want to scrape uses one of these systems:
  #{link_to "Masterview", "https://github.com/planningalerts-scrapers/masterview_scraper"},
  #{link_to "Civica", "https://github.com/planningalerts-scrapers/civica_scraper"},
  #{link_to "Icon", "https://github.com/planningalerts-scrapers/icon_scraper"},
  #{link_to "ATDIS", "https://github.com/planningalerts-scrapers/multiple_atdis"},
  #{link_to "Horizon", "https://github.com/planningalerts-scrapers/horizon_xml"},
  #{link_to "Technology One", "https://github.com/planningalerts-scrapers/technology_one_scraper"} or
  #{link_to "Epathway", "https://github.com/planningalerts-scrapers/epathway_scraper"}.
  We don't yet have good documentation on how to recognise these different systems, but that's
  something we want to create. Or maybe you can help?

%h4 An introduction to scraping with morph.io
%p
  With morph.io, you can choose to write your scraper in Ruby, Python, PHP or Perl, so there's a good chance
  you're already familiar with one of the available languages. Since all of the code is hosted on GitHub, you're
  probably also already familiar with how to share and collaborate on your scraper code.
%p
  morph.io provides great conveniences like taking care of saving your data, running your scraper regularly, and
  emailing you when there's a problem.
%p
  You can find out more in the #{link_to "morph.io documentation", "https://morph.io/documentation"}.

%h4 Now it's time to scrape
%p
  Make sure you have a #{link_to "GitHub account", "https://github.com/join"}, then you can use it to sign in to morph.io and
  = link_to "create a new scraper", "https://morph.io/scrapers/new"
  that downloads and saves the following information:

%p The following fields are <span class="highlight">required</span>. All development applications should have these bits of information.
%table.scraper_fields
  %tbody
    %tr
      %th.field field
      %th.example Example value
      %th Description
    %tr
      %td.field council_reference
      %td.example TA/00323/2012
      %td
        %p
          The ID that the council has given the planning application. This must also be the
          unique key for this data set.
    %tr
      %td.field address
      %td.example 1 Sowerby St, Goulburn, NSW
      %td
        %p
          The physical address that this application relates to. This will be geocoded, so it doesn't need to be in a specific format, but obviously the more explicit it is, the more likely it is to be successfully geocoded. If the original address did not include the state (e.g. "QLD") at the end, then add it.
    %tr
      %td.field description
      %td.example Ground floor alterations to rear and first floor addition
      %td
        %p
          A text description of what the planning application seeks to carry out.
    %tr
      %td.field info_url
      %td.example https://foo.gov.au/app?key=527230
      %td
        %p
          A URL that provides more information about the planning application.
        %p
          This should be a persistent URL that is, preferably, specific to this particular application. In many cases councils force
          users to click through a licence agreement to access planning applications. In that case, be careful about which URL you provide. Test the link in a browser that hasn't established a session with the council's site, to make sure PlanningAlerts users can follow it without being presented with an error.
    %tr
      %td.field date_scraped
      %td.example 2012-08-01
      %td
        %p
          The date that your scraper is collecting this data (i.e. now). Should be in
          = link_to "ISO 8601", "https://en.wikipedia.org/wiki/ISO_8601"
          format.
        %p
          Use the following Ruby code:
          %code Date.today.to_s
%p
  Note that there used to be a required field called "comment_url".
  It is no longer used, though you might still see it referenced in older scrapers.
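%p
  Putting the required fields together, the heart of a Ruby scraper might look something like
  this, using the example values from the table above and the scraperwiki gem to save the record:
%code.long-lines
  %pre
    :preserve
      require 'scraperwiki'
      require 'date'

      # One record per planning application, containing all the required fields
      record = {
        'council_reference' => 'TA/00323/2012',
        'address' => '1 Sowerby St, Goulburn, NSW',
        'description' => 'Ground floor alterations to rear and first floor addition',
        'info_url' => 'https://foo.gov.au/app?key=527230',
        'date_scraped' => Date.today.to_s
      }
      # council_reference is the unique key, so re-running the scraper updates
      # existing records rather than creating duplicates
      ScraperWiki.save_sqlite(['council_reference'], record)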
%p The following fields are optional because not every planning authority provides them. Please do include them if the data is available.

%table.scraper_fields
  %tbody
    %tr
      %th.field field
      %th.example Example value
      %th Description
    %tr
      %td.field date_received
      %td.example 2012-06-23
      %td
        %p
          The date this application was received by council. Should be in
          = link_to "ISO 8601", "https://en.wikipedia.org/wiki/ISO_8601"
          format.
    %tr
      %td.field on_notice_from
      %td.example 2012-08-01
      %td
        %p
          The date from when public submissions can be made about this application. Should be in
          = link_to "ISO 8601", "https://en.wikipedia.org/wiki/ISO_8601"
          format.
    %tr
      %td.field on_notice_to
      %td.example 2012-08-14
      %td
        %p
          The date until when public submissions can be made about this application. Should be in
          = link_to "ISO 8601", "https://en.wikipedia.org/wiki/ISO_8601"
          format.
    %tr
      %td.field comment_email
      %td.example foo@bar.com
      %td
        %p
          Only set this in
          %strong extremely unusual
          situations. It allows comments on each application in a single
          planning authority to go to a different email address. For 99.9% of authorities this
          should never be set, as a single email address is used for all comments. Currently it
          is only used for the SA Planning Portal, where comments are ideally sent back to the
          originating local council so that state government staff don't have to redirect them by hand.
    %tr
      %td.field comment_authority
      %td.example Acme Council
      %td
        %p
          Only set this in
          %strong extremely unusual
          situations. It gives the name associated with the comment_email address.
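%p
  Optional fields go in the same record as the required ones. As a sketch, assuming your
  scraper has parsed an on_notice_to date (which may be nil when the council doesn't publish it):
%code.long-lines
  %pre
    :preserve
      # Only include an optional field when the council actually provides the data
      record['on_notice_to'] = on_notice_to.to_s if on_notice_to
      ScraperWiki.save_sqlite(['council_reference'], record)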

%h4 Versioning application data
%p
  It's important that scrapers collect the latest, most up-to-date information. If the
  information about an application changes (because, for instance, a council updates the
  wording or corrects a mistake), your scraper should pick up the new version.

%p
  For that reason, it's good practice for your scraper to look back over a reasonable window of
  time (one month is good) and scrape all applications that might have changed in that
  period. That way you're most likely to catch any changes. Often it's not possible to simply
  get a list of applications that recently changed. Instead you have to scrape, say, a list
  of applications that were recently received as well as applications that have recently been
  determined (whether they were approved or not).
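%p
  Often that means searching over a date range. As a sketch (the URL and parameter names
  here are invented; every council's site is different), looking back one month might look like:
%code.long-lines
  %pre
    :preserve
      require 'date'

      # Search over everything lodged in the last month so we also catch updates
      from = (Date.today - 30).to_s
      to = Date.today.to_s
      url = 'https://www.example.gov.au/applications?lodged_from=' + from + '&lodged_to=' + to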

%p
  When you save an updated version of an application make sure you use the council_reference
  field as the unique id. That way you don't end up with multiple versions of the same record.
  If you're writing your scraper in Ruby that will look something like:

%code.long-lines
  %pre
    :preserve
      ScraperWiki.save_sqlite(['council_reference'], record)

%p
  When the main PlanningAlerts system reads the latest application data from your
  scraper on morph.io, it automatically keeps track of the changes that occur
  on individual applications. That way nothing truly gets
  overwritten; there is always a history of which fields changed and when. At the moment
  this information is recorded in the database but isn't yet exposed to users in
  the main application or in the data published through the API.

%p
  If you get stuck, have a look at
  = link_to "the scrapers already written for PlanningAlerts", "https://morph.io/planningalerts-scrapers"
  and
  = link_to "post on the morph.io forum", "https://help.morph.io/"
  if you have any questions.

%h4 Scheduling the scraper
%p
  Set the scraper to run once per day. This can be done on morph.io on the settings page of the scraper.

%h4 Finishing up
%p
  Once you've finished your scraper and it's successfully downloading planning applications,
  = link_to "contact us", documentation_contact_path
  and we'll fork it into the
  = link_to "planningalerts-scrapers organization", "https://morph.io/planningalerts-scrapers"
  and import it into PlanningAlerts.
%p
  The last thing to do is look up on Wikipedia how many people live in the council area you've just covered, so you
  can pat yourself on the back knowing that you've just helped tens of thousands of people get PlanningAlerts.