
Caching

Jeremy Echols edited this page Apr 16, 2020 · 5 revisions

RAIS Caching

info.json responses

We've implemented a simple LRU cache for info.json responses, which holds 10,000 entries by default. The cached data is extremely small, making this a very efficient cache: the data saved is under 50 bytes per info response.

Of course, the info.json data is very easy to generate, so the value of caching may seem questionable, but it can reduce file IO when traffic is heavy.
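
To illustrate the idea, here is a minimal LRU cache sketch in Python. It is not RAIS's actual implementation; the class name, the stored fields, and the file names are assumptions made for the example. At under 50 bytes per entry, 10,000 entries come to less than 500 KB of cached data, not counting the overhead of the data structure itself.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        # A hit makes this entry the most recently used.
        self._entries.move_to_end(key)
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        # Evict from the least recently used end once over capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def __len__(self):
        return len(self._entries)


# Hypothetical usage: cache the few fields an info.json response needs.
cache = LRUCache(capacity=2)
cache.put("a.jp2", {"width": 6000, "height": 4000})
cache.put("b.jp2", {"width": 3000, "height": 2000})
cache.get("a.jp2")          # touch a.jp2 so b.jp2 becomes the oldest entry
cache.put("c.jp2", {"width": 800, "height": 600})
print(cache.get("b.jp2"))   # None: b.jp2 was evicted
```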

Image responses

The server can optionally cache generated tiles under specific circumstances, but doesn't inherently cache the other images such as thumbnails. Tiles which are requested at a width and height of 1024 or below, in JPG format, can be cached by setting TileCacheLen in /etc/rais.toml, or the RAIS_TILECACHELEN environment variable. This is disabled by default.

Setting the tile cache length to anything greater than zero will enable the cache.
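
The conditions above can be written out as a small predicate. This is only a sketch of the rule as described on this page, not code taken from RAIS; the function names are hypothetical, and the default of zero reflects the cache being disabled unless configured.

```python
import os

MAX_CACHED_TILE_DIMENSION = 1024  # limit described above


def tile_cache_len_from_env(environ):
    """Read the tile cache length from a mapping such as os.environ."""
    # Assumption: an unset variable means the default of 0 (cache disabled).
    return int(environ.get("RAIS_TILECACHELEN", "0"))


def tile_is_cacheable(width, height, fmt, tile_cache_len):
    """Return True when a tile request meets the caching conditions above."""
    if tile_cache_len <= 0:
        return False  # the tile cache is disabled
    if fmt.lower() != "jpg":
        return False  # only JPG tiles are cached
    return (width <= MAX_CACHED_TILE_DIMENSION
            and height <= MAX_CACHED_TILE_DIMENSION)


cache_len = tile_cache_len_from_env(os.environ)
print(tile_is_cacheable(512, 512, "jpg", cache_len))
```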

Tile caching is generally only recommended for systems with a small number of images, or for systems that expect a lot of traffic to hit a small subset of the collection, as might happen when there are a few featured images. On our Historic Oregon Newspapers site, we have a 1,000-item cache (just in case of a large influx of traffic to a particular newspaper), and it is typically hit on only 2% of all requests.

If you don't have a lot of extra RAM and your collection usage is fairly random, it's best to avoid a cache. But if you have some extra RAM, it can be valuable to create a small tile cache even on large collections just to better handle an unexpected influx of traffic to a small number of images, such as you might expect if part of your collection gets featured in an online exhibit.

Thumbnails

For resize requests such as thumbnails, caching is very beneficial, but for now RAIS doesn't try to accommodate this. For our needs, Apache handles this well enough, and if we needed something more powerful, we'd probably look at a dedicated cache system like Varnish.

Note that RAIS returns a valid Last-Modified header based on the last time the JP2 file changed, which a cache can use to determine if RAIS should be hit.
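
As a rough sketch of how a cache might use that header, the Python below formats a file modification time as an HTTP date and compares two Last-Modified values. The function names are hypothetical, and a real cache would normally send the stored value back in an If-Modified-Since request header and let the origin answer with 304 Not Modified rather than compare dates itself.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date_from_mtime(mtime_seconds):
    """Format a file modification time (seconds since the epoch) as an HTTP date."""
    moment = datetime.fromtimestamp(int(mtime_seconds), tz=timezone.utc)
    return format_datetime(moment, usegmt=True)


def cached_copy_is_current(cached_last_modified, origin_last_modified):
    """True if the origin has not changed since the cached copy was stored."""
    cached = parsedate_to_datetime(cached_last_modified)
    origin = parsedate_to_datetime(origin_last_modified)
    return origin <= cached


stored = "Thu, 16 Apr 2020 12:00:00 GMT"
print(cached_copy_is_current(stored, "Thu, 16 Apr 2020 12:00:00 GMT"))  # True
print(cached_copy_is_current(stored, "Fri, 17 Apr 2020 08:30:00 GMT"))  # False
```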

Quick refresher: a IIIF URL looks like this:

{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

For thumbnail requests, the "region" is typically "full", and "size" is typically going to be some small width followed by a comma and an empty height. For instance:

{scheme}://{server}{/prefix}/{identifier}/full/1000,/{rotation}/{quality}.{format}
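
Filling in that template might look like the following Python sketch. The base URL and identifier are made up for illustration, and the defaults for rotation, quality, and format are assumptions; adjust them to whatever your IIIF server advertises.

```python
from urllib.parse import quote


def thumbnail_url(base, identifier, width, rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF URL for a full-region resize with a fixed width."""
    # Percent-encode the whole identifier, slashes included, so it stays
    # a single path segment.
    encoded_id = quote(identifier, safe="")
    return f"{base}/{encoded_id}/full/{width},/{rotation}/{quality}.{fmt}"


# Hypothetical server, prefix, and identifier.
print(thumbnail_url("https://example.org/images/iiif", "batch/0001.jp2", 250))
# https://example.org/images/iiif/batch%2F0001.jp2/full/250,/0/default.jpg
```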

It should be fairly easy to cache a URL like this in a dedicated cache application, though we haven't actually done this ourselves.

Apache has a very simple cache module, mod_disk_cache (renamed mod_cache_disk in Apache 2.4). However, it caches by URL prefix, so it is all IIIF URLs or none. We got around this by giving our thumbnail requests a different prefix from the rest of the IIIF requests. Once that was in place, we configured a simple mod_disk_cache in Apache:

# Cache thumbnails (and only thumbnails)
CacheRoot /var/cache/httpd/mod_disk_cache
CacheEnable disk /images/resize

# Allow a total of 4096 content directories at two levels so we never have
# more than 64 directories in any other directory.  If we cache a million
# thumbnails, we'll still only end up with about 250 files per content
# directory.
CacheDirLength 1
CacheDirLevels 2

# Change !RAIS_HOST! below to serve tiles and thumbnails from RAIS
AllowEncodedSlashes NoDecode
ProxyPassMatch ^/images/resize/([^/]*)/full/([0-6][0-9][0-9],.*jpg)$ http://!RAIS_HOST!:12415/images/iiif/$1/full/$2 nocanon
ProxyPassMatch ^/images/iiif/(.*(jpg|info\.json))$ http://!RAIS_HOST!:12415/images/iiif/$1 nocanon

This setup splits thumbnail requests (three-digit widths up to 699 pixels) from tile requests, letting us cache thumbnails on disk for a much longer time than RAIS would store any tile in memory.
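
To see which requests the thumbnail rule picks up, the first ProxyPassMatch pattern can be tried out in Python. This is only a sketch: it reuses the regular expression from the configuration above, but the helper name and sample paths are hypothetical, and Apache's own matching may differ in details such as how encoded characters are handled.

```python
import re

# Same pattern as the first ProxyPassMatch line above.
THUMBNAIL_ROUTE = re.compile(
    r"^/images/resize/([^/]*)/full/([0-6][0-9][0-9],.*jpg)$"
)


def proxied_path(request_path):
    """Return the RAIS path a thumbnail request maps to, or None if it is not matched."""
    match = THUMBNAIL_ROUTE.match(request_path)
    if match is None:
        return None
    identifier, rest = match.groups()
    return f"/images/iiif/{identifier}/full/{rest}"


for path in (
    "/images/resize/page1.jp2/full/250,/0/default.jpg",   # matched and cached
    "/images/resize/page1.jp2/full/700,/0/default.jpg",   # too wide: no match
    "/images/resize/page1.jp2/full/99,/0/default.jpg",    # two digits: no match
):
    print(path, "->", proxied_path(path))
```

Note that the pattern requires exactly three digits before the comma, so a width below 100 would only match if it were zero-padded.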

This won't be the smartest cache, but it will help when search results pages are used on large collections. We strongly advise running the htcacheclean tool alongside the Apache cache directives, and it's probably worth reading the Apache caching guide.
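
The directory arithmetic in the configuration comments can be checked with a few lines of Python. The figure of 64 possible names per directory character is taken from those comments, and the function names are hypothetical.

```python
NAMES_PER_CHARACTER = 64  # from the "64 directories" comment in the config


def content_directories(dir_length, dir_levels):
    """Number of leaf directories for given CacheDirLength and CacheDirLevels."""
    return (NAMES_PER_CHARACTER ** dir_length) ** dir_levels


def entries_per_directory(cached_items, dir_length, dir_levels):
    """Average number of cached items per leaf directory."""
    return cached_items / content_directories(dir_length, dir_levels)


print(content_directories(1, 2))                      # 4096
print(round(entries_per_directory(1_000_000, 1, 2)))  # 244
```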

S3 Plugin

If the S3 plugin is in use (see Plugins for details), RAIS will cache images in order to avoid huge latencies. The S3 plugin allows you to specify the location where S3 files will be cached. You should ensure this cache is cleaned regularly to avoid filling up the disk. RAIS can have its cache purged while running via the admin endpoint (see Administration), but that is meant to spare you from manually purging the cache when files change in your S3 storage; it won't necessarily keep disk usage down if you have huge numbers of files which typically need to live for a long time.

It is our goal to make this more automated in the future, but for now the easiest solution is probably to simply delete the oldest files on a regular basis.
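
One way to do that is a small script run from cron or a similar scheduler. The sketch below is not part of RAIS: the function name is hypothetical, it treats "oldest" as "not modified within a given age", and the cache directory must be whatever location you configured for the S3 plugin.

```python
import os
import time


def purge_old_files(cache_dir, max_age_seconds, now=None):
    """Delete regular files under cache_dir not modified within max_age_seconds."""
    now = time.time() if now is None else now
    removed = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    removed.append(path)
            except FileNotFoundError:
                # Another process removed the file first; nothing to do.
                pass
    return removed


# Hypothetical usage: drop cached files untouched for 30 days.
# purge_old_files("/path/to/your/s3/cache", 30 * 24 * 60 * 60)
```

This uses modification time rather than last access, so a frequently viewed image is still removed once it is old enough; RAIS would then need to fetch it from S3 again on the next request.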

There is no way to avoid the cache through any kind of configuration. If you want to prevent caching S3 images, you'd need to build your own plugin. It is highly recommended you don't do this, however. Even when S3 is fast and in the same availability zone as our EC2 instance, it can take half a second to grab the image and process it. Remember that this isn't a per-person cost; it's per image tile. A user who pans and zooms even just a ten-megapixel image can easily request 50 or more tiles. Not only would skipping the cache be far slower than normal, it could also cause S3 costs to skyrocket.
