GitHub - joatca/events-list

This code scrapes event calendars from multiple websites and generates a Markdown summary intended for a Hugo site. The current configuration covers sites in London, Ontario, Canada, but in principle it can be adapted to any site.

Site Configuration

The important file is the driver file `event-list-config.rb`.

Each source describes how to scrape a particular site. Apart from descriptive information it contains several possible finders:

| Name | Purpose |
| --- | --- |
| `main` | the section of the site containing a list of events |
| `events` | how to find each event within `main` |
| `link` | within each event, how to find a web link to the event |
| `title` | within each event, how to find the text title |
| `date` | within each event, how to find the date of the event |
| `time` | within each event, how to find the time of the event |
| `datetime` | alternative to `date` and `time` if the date and time are found together |
| `filters` | within each event, an element that may cause the event to be filtered out |

Each finder has at least one of:

| Name | Purpose |
| --- | --- |
| `css` | a Nokogiri-compatible CSS specifier to find the HTML element or group of elements |
| `proc` | a Ruby lambda or `Proc` object that the found element is passed to in order to extract text |
| `if` | only for filters, a Ruby lambda or `Proc` object that should return a truthy value if the event should be skipped and a falsey value otherwise |
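As a rough illustration of the `if` rule, the sketch below treats a filter as a hash holding a lambda and skips an event when any filter returns a truthy value. The helper name `skip_event?`, the "cancelled" rule, and the use of a plain String in place of the found element are assumptions made for this example, not taken from the repository.

```ruby
# Hypothetical filter: skip events whose text mentions "cancelled".
# A plain String stands in here for the text of the found element.
cancelled_filter = {
  if: ->(text) { text.downcase.include?("cancelled") }
}

# Hypothetical helper: true when any filter's `if` returns a truthy value.
def skip_event?(filters, text)
  filters.any? { |f| f[:if].call(text) }
end

filters = [cancelled_filter]
skip_event?(filters, "CANCELLED: Spring Concert") # truthy, so the event is skipped
skip_event?(filters, "Spring Concert")            # falsey, so the event is kept
skip_event?([], "Spring Concert")                 # no filters, so the event is kept
```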

For each finder, if `css:` exists then the current HTML document is searched for that CSS. For `main` this is the entire document, for `events` it is the result from `main`, and for everything else it is the subdocument for the current event. If `proc` exists then the found subdocument is passed to the `proc` and whatever it returns is the result. `proc` exists to do arbitrary processing and munging of the raw data from the website.

Note that the output of `css` is always a Nokogiri object, so if you need plain text then the absolute minimal `proc` is `->(x) { x.text }`.
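The way `css` and `proc` combine can be sketched roughly as follows. This is not the repository's code: `apply_finder` is a hypothetical helper, `FakeNode` is a stand-in that mimics only the `css` and `text` methods of a Nokogiri node so the example runs without the gem, and the event title is invented.

```ruby
# Stand-in for a Nokogiri node (assumption for this sketch only);
# it maps a CSS selector string directly to a canned child node.
FakeNode = Struct.new(:text, :children) do
  def css(selector)
    children.fetch(selector)
  end
end

# Hypothetical helper: search with `css` if present, then hand the
# result to `proc` if present; otherwise return what was found.
def apply_finder(doc, finder)
  found = finder[:css] ? doc.css(finder[:css]) : doc
  finder[:proc] ? finder[:proc].call(found) : found
end

event = FakeNode.new("", { "a.num h3" => FakeNode.new("Gallery Tour", {}) })

# With a proc the result is plain text; without one it is still a node.
title = apply_finder(event, { css: "a.num h3", proc: ->(x) { x.text } })
node  = apply_finder(event, { css: "a.num h3" })
```

With a real Nokogiri document in place of `FakeNode`, the same two-step shape would apply: search first, then post-process.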

For each source the code proceeds as follows:

- find `main`
- within `main`, find and loop over each of the `events`
  - if any `filters` exist and any of them returns a truthy value, skip this event
  - fetch `link` and `title`
  - if only `date` is given, pass it to Chronic's `Chronic.parse` to parse it into a timestamp; if both `date` and `time` are given, they are joined with a space and passed to `Chronic.parse`; otherwise, if `datetime` is given, pass its contents to `Chronic.parse`. It is an error to have neither `date` nor `datetime`, but `time` is optional.
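The date and time rules in the last step can be sketched as a small function that picks the text to hand to `Chronic.parse`. The name `timestamp_text` and the sample strings are assumptions for illustration; the call to Chronic itself is left out so the example runs without the gem.

```ruby
# Hypothetical helper: choose the text that would be passed to Chronic.parse.
def timestamp_text(date: nil, time: nil, datetime: nil)
  if date && time
    "#{date} #{time}"   # both given: join with a space
  elsif date
    date                # date alone is enough; time is optional
  elsif datetime
    datetime            # combined date and time found together
  else
    raise ArgumentError, "a source needs either date or datetime"
  end
end

both     = timestamp_text(date: "March 7", time: "2:00 pm")
only_day = timestamp_text(date: "March 7")
combined = timestamp_text(datetime: "March 7 2:00 pm")
```

Each returned string is what would then go to `Chronic.parse` to become a timestamp.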

For example consider the finders for Museum London:

```ruby
  finders: {
    main: {
      css: "div.event"
    },
    events: {
      css: "div.col-xs-9"
    },
    link: {
      css: "a.num",
      proc: ->(x) { x.attribute("href").value }
    },
    title: {
      css: "a.num h3",
      proc: ->(x) { x.text }
    },
    datetime: {
      css: "span.event-date",
      proc: ->(x) { x.text.gsub(/^\S+ "/, '') }
    }
  }
```

The code fetches `div.event`, then loops through each `div.col-xs-9` within it. For each one, the event link is extracted by finding `a.num` and calling `.attribute("href").value` on it, and the event title by finding `a.num h3` and calling `.text`. The date and time are extracted together by finding `span.event-date` and calling `.text.gsub(/^\S+ "/, '')` to strip off the leading word and a quote (`Chronic.parse` doesn't support day names). The resulting text is then passed to `Chronic.parse`.
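To see what the `datetime` substitution does in isolation, here is a minimal sketch applied to plain strings. The input strings are invented for illustration; the real text of `span.event-date` on the Museum London site may look different.

```ruby
# Same substitution as the datetime proc, applied to a String
# instead of a Nokogiri element's text.
strip_day = ->(text) { text.gsub(/^\S+ "/, '') }

# Hypothetical input: a leading day name, a space, then a quote.
stripped  = strip_day.call('Saturday "March 7 2:00 pm')

# Text that does not start with a word, a space and a quote is left alone.
unchanged = strip_day.call('March 7 2:00 pm')
```

Here `stripped` and `unchanged` both end up as `March 7 2:00 pm`, since the pattern only removes a leading word when it is followed by a space and a double quote.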
