Oliver Beckstein edited this page Oct 13, 2021 · 20 revisions

The search functionality is provided by Algolia and is known as Algolia DocSearch.

Documentation

Hosted search

We are using the hosted search option where Algolia runs the docsearch-scraper.

Specific issues

docsearch-scraper

It is also possible to run the scraper yourself and serve the resulting index. Running the scraper locally is also the recommended way to debug. If we do this, here are links to get started:

Relevant issues

For details, look through the comments on these issues:

  • add search box #73
  • restrict DocSearch to relevant parts of the site #77
  • sitemapindex #79

Configuration

To change the configuration, make a PR against https://github.com/algolia/docsearch-configs/blob/master/configs/mdanalysis.json

The syntax is explained at https://docsearch.algolia.com/docs/config-file/

Selectors

For content to be indexed, it must match one of the selectors:

  • levels (lvl0 to lvl4) are mapped to heading tags (h1 to h5)
  • text is mapped to p, li, and similar tags
  • examine the generated documentation with the Firefox Web Developer Tools or similar to see which CSS selectors apply to the content that should be indexed

Example selectors

"selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li"
  },

mdanalysis.json

Snapshot of mdanalysis.json

{
  "index_name": "mdanalysis",
  "sitemap_urls": [
    "https://www.mdanalysis.org/sitemapindex.xml"
  ],
  "start_urls": [
    "https://docs.mdanalysis.org",
    "https://userguide.mdanalysis.org",
    "https://www.mdanalysis.org"
  ],
  "stop_urls": [
    "https://www.mdanalysis.org/.*?//.*?",
    "https://www.mdanalysis.org/blog",
    "https://www.mdanalysis.org/mdanalysis",
    "https://www.mdanalysis.org/docs",
    "https://docs.mdanalysis.org/stable/.*",
    "https://docs.mdanalysis.org/.*index.html$",
    "https://userguide.mdanalysis.org/stable/.*",
    "https://userguide.mdanalysis.org/.*-dev.*/.*",
    "https://www.mdanalysis.org/.*index.html$",
    "\\/_"
  ],
  "selectors": {
    "lvl0": "[itemprop='articleBody'] > .section h1, .page h1, .post h1, .body > .section h1",
    "lvl1": "[itemprop='articleBody'] > .section h2, .page h2, .post h2, .body > .section h2",
    "lvl2": "[itemprop='articleBody'] > .section h3, .page h3, .post h3, .body > .section h3",
    "lvl3": "[itemprop='articleBody'] > .section h4, .page h4, .post h4, .body > .section h4",
    "lvl4": "[itemprop='articleBody'] > .section h5, .page h5, .post h5, .body > .section h5",
    "text": "[itemprop='articleBody'] > .section p, .page p, .post p, .body > .section p, [itemprop='articleBody'] > .section li, .page li, .post li, .body > .section li, [itemprop='articleBody'] > .section dt, .body > .section dt"
  },
  "conversation_id": [
    "569445928"
  ],
  "nb_hits": 18529
}
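To see why a given page is skipped, it can help to test its URL against the stop_urls patterns from the snapshot above. The following is a minimal sketch, not the scraper's own code: it assumes that each stop_urls entry is treated as a Python regular expression and matched anywhere in the URL (search semantics). The function name matching_stop_patterns is hypothetical.

```python
import re

# Patterns copied from the "stop_urls" list in the snapshot above.
# The JSON string "\\/_" decodes to the regex \/_ (a slash followed by an underscore).
STOP_URLS = [
    r"https://www.mdanalysis.org/.*?//.*?",
    r"https://www.mdanalysis.org/blog",
    r"https://www.mdanalysis.org/mdanalysis",
    r"https://www.mdanalysis.org/docs",
    r"https://docs.mdanalysis.org/stable/.*",
    r"https://docs.mdanalysis.org/.*index.html$",
    r"https://userguide.mdanalysis.org/stable/.*",
    r"https://userguide.mdanalysis.org/.*-dev.*/.*",
    r"https://www.mdanalysis.org/.*index.html$",
    r"\/_",
]


def matching_stop_patterns(url, patterns=STOP_URLS):
    """Return every pattern that matches url.

    Assumption: patterns are regular expressions applied with search
    semantics; the real scraper may differ in detail.
    """
    return [pattern for pattern in patterns if re.search(pattern, url)]
```

An empty list from matching_stop_patterns(url) suggests that no stop pattern applies to the URL. A non-empty list names the patterns to look at when a page is unexpectedly ignored.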

Working with sitemaps
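As a starting point for inspecting a sitemap, the sketch below lists the URLs contained in a sitemap or sitemap index document. It assumes the file follows the standard sitemaps.org schema; the function name sitemap_locs and the example.org URLs are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return the text of all <loc> elements in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap index, for illustration only.
EXAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.org/sitemap.xml</loc></sitemap>
  <sitemap><loc>https://docs.example.org/sitemap.xml</loc></sitemap>
</sitemapindex>
"""

print(sitemap_locs(EXAMPLE_INDEX))
```

For a sitemap index, the returned URLs point to child sitemaps, which can be downloaded and passed through the same function to obtain the page URLs.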

When making a PR

Please:

Debugging search (v2)

Run a local version of the scraper that has index submission to Algolia disabled (to avoid running into the limits of the free plan). For example, install https://github.com/orbeckst/docsearch-scraper/tree/dryrun

Have the config file handy (e.g., by cloning https://github.com/algolia/docsearch-configs).

Run the scraper and check the output

./docsearch run ../docsearch-configs/configs/mdanalysis.json 2>&1 | tee RUN.log
less RUN.log

Example output

> DocSearch: https://www.mdanalysis.org 0 records)
> Ignored: from start url https://userguide.mdanalysis.org/stable/index.html
> Ignored: from start url https://docs.mdanalysis.org/stable/index.html
> DocSearch: https://www.mdanalysis.org/pages/privacy/ 12 records)
> DocSearch: https://www.mdanalysis.org/pages/used-by/ 30 records)
...
...
> DocSearch: https://www.mdanalysis.org/2015/12/15/The_benefit_of_social_coding/ 6 records)
> DocSearch: https://www.mdanalysis.org/distopia/search.html 0 records)
> Ignored from sitemap: https://www.mdanalysis.org/distopia/genindex.html
> Ignored from sitemap: https://www.mdanalysis.org/distopia/index.html
> DocSearch: https://www.mdanalysis.org/distopia/api/vector_triple.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/helper_functions.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/api/distopia.html 0 records)
> DocSearch: https://www.mdanalysis.org/distopia/building_distopia.html 0 records)

Interpretation of results

  • lines with N records where N > 0: this is the desired outcome and shows that the scraper collected data records for the index
  • lines with 0 records: the selectors do not seem to match any elements on the page, so nothing was scraped
  • Ignored: from start url: the scraper reached the URL by following links from a start_url, but the URL matched a stop_url
  • Ignored from sitemap: the scraper found the URL in the sitemap (which is good!), but the URL matched a stop_url
  • Missing pages (e.g., nothing on the User Guide): check the sitemap file!
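The categories above can be tallied from RUN.log with a short script. This is a minimal sketch that assumes the log lines look exactly like the example output shown earlier; if the real log contains terminal colour codes or other decoration, the regular expressions would need adjusting. The function name summarize is hypothetical.

```python
import re

# Line formats taken from the example output above.
RECORDS_RE = re.compile(r"> DocSearch: (\S+) (\d+) records\)")
IGNORED_START_RE = re.compile(r"> Ignored: from start url (\S+)")
IGNORED_SITEMAP_RE = re.compile(r"> Ignored from sitemap: (\S+)")


def summarize(lines):
    """Group scraper log lines by outcome; unrecognised lines are skipped."""
    summary = {
        "indexed": [],          # (url, record count) with count > 0
        "empty": [],            # pages scraped with 0 records
        "ignored_start": [],    # ignored while crawling from a start url
        "ignored_sitemap": [],  # ignored while reading the sitemap
    }
    for line in lines:
        match = RECORDS_RE.search(line)
        if match:
            url, count = match.group(1), int(match.group(2))
            if count > 0:
                summary["indexed"].append((url, count))
            else:
                summary["empty"].append(url)
            continue
        match = IGNORED_START_RE.search(line)
        if match:
            summary["ignored_start"].append(match.group(1))
            continue
        match = IGNORED_SITEMAP_RE.search(line)
        if match:
            summary["ignored_sitemap"].append(match.group(1))
    return summary
```

For example, calling summarize(open("RUN.log")) and printing the "empty" entry lists the pages whose selectors matched nothing.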
