Scraping czap.cz members so you can filter available psychotherapists by any criteria you wish:
I wanted to filter a list of Czech psychotherapists by criteria other than those available on the registry website. For example, the registry lets you filter by location, but only down to the level of region. As there are 700+ therapists in Prague alone, that's not very useful.
I don't think it's particularly useful to monitor changes in the registry, but I used git scraping nevertheless, because why not:
- History of changes
- Feed of changes (aka RSS)
The scraper uses my favorite Scrapy framework.
So far I scrape only a few fields. If you want to build on top of the data and you're missing something, let me know in the issues. However, since I won't have time to add the fields, you'd better edit the code and add them yourself.
The scraper first downloads the whole registry with a single request. The data is encoded not as JSON, but as a non-standard JavaScript object literal. To parse it efficiently, the scraper uses Node.js to safely evaluate the JavaScript and convert it to JSON. The parsed result is cached so it stays around for at least a day.
That data contains some info about members. It is structured, but the structure is very cryptic and needs to be reverse-engineered. If you're the kind of person who is into such things, feel free to add fields there.
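To give a rough idea of the Node.js step, here is a minimal sketch, not the scraper's actual code. The names `NODE_SCRIPT`, `js_literal_to_data` and `cached` are made up for illustration, the use of Node's `vm` module is an assumption, and so is the file-based cache. It also assumes `node` is on your `PATH`. Note that `vm` isolates the evaluated code from the script's own variables, but it is not a real security sandbox.

```python
import json
import subprocess
import time
from pathlib import Path

# Hypothetical Node.js helper: reads a JavaScript object literal from stdin,
# evaluates it in a fresh context, and prints the result as JSON.
NODE_SCRIPT = """
const vm = require('vm');
let src = '';
process.stdin.setEncoding('utf8');
process.stdin.on('data', chunk => { src += chunk; });
process.stdin.on('end', () => {
  const value = vm.runInNewContext('(' + src + ')', {}, { timeout: 1000 });
  process.stdout.write(JSON.stringify(value));
});
"""


def js_literal_to_data(js_source, node="node"):
    """Convert a JavaScript object literal to Python data via Node.js."""
    result = subprocess.run(
        [node, "-e", NODE_SCRIPT],
        input=js_source,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if Node.js exits with an error
    )
    return json.loads(result.stdout)


def cached(path, max_age, produce):
    """Return JSON data from `path` if it is younger than `max_age` seconds,
    otherwise call `produce()`, store its result in `path`, and return it."""
    path = Path(path)
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return json.loads(path.read_text(encoding="utf-8"))
    data = produce()
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With something like this, `cached("registry.json", 24 * 60 * 60, lambda: js_literal_to_data(downloaded_text))` would re-parse the download at most once a day, where `downloaded_text` stands for the body of the registry response.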
If you prefer good old HTML scraping, the scraper also makes requests to all individual member profile pages. There you can use Scrapy selectors to add fields to the data.
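Adding a field from a profile page could look roughly like the sketch below. The function name `parse_member`, the CSS selectors, and the field names are all hypothetical, since the real markup of the profile pages and the scraper's existing callback may differ. Only `response.url`, `response.css()`, `.get()` and `.getall()` are standard Scrapy.

```python
def parse_member(response):
    """Hypothetical Scrapy callback for a single member profile page."""

    def clean(text):
        # Collapse runs of whitespace; keep None when nothing was matched.
        return " ".join(text.split()) if text else None

    # Assumed selector: e-mail addresses are linked with mailto: URLs.
    mailto_links = response.css('a[href^="mailto:"]::attr(href)').getall()

    yield {
        "url": response.url,
        # Assumed selector: the member's name is in the page's <h1>.
        "name": clean(response.css("h1::text").get()),
        "emails": [href.split(":", 1)[1] for href in mailto_links],
    }
```

In a spider this would be a method (with `self` as the first parameter), passed as the `callback` of the request for each profile page.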