Skip to content

Notes on caching queries in persistent #1605

@parsonsmatt

Description

@parsonsmatt

Query Caching in Persistent

Goal:

  1. Get query caching in persistent operations.
  2. ... Such that the cache can be busted by writes
  3. ... On specific tables
  4. ... But not accounting for other threads or other servers doing this

Or, we want to have a similar feature to ActiveRecord caching.

Persistent Background

The StatementCache

persistent is, for the most part, implemented as a ReaderT backend, where backend in our case is SqlBackend.
A SqlBackend includes a connStmtMap - a StatementCache which closes over an IORef (Map Text Statement).

The Text contains the query that we are issuing.
That means we can probably hook into this in order to modify our own cache and check whether or not we want to proceed.
Let's trace this and see how get works with the statement cache.

Implementing get

The function get is defined on PersistStoreRead - the actual definition is reproduced here:

class
    ( Show (BackendKey backend)
    , Read (BackendKey backend)
    , Eq (BackendKey backend)
    , Ord (BackendKey backend)
    , PersistCore backend
    , PersistField (BackendKey backend)
    , A.ToJSON (BackendKey backend)
    , A.FromJSON (BackendKey backend)
    ) =>
    PersistStoreRead backend
    where
    get
        :: forall record m
         . (MonadIO m, PersistRecordBackend record backend)
        => Key record -> ReaderT backend m (Maybe record)

Note that - unfortunately - we define this as polymorphic on the backend, not the m, meaning we need to make a newtype CachedSqlBackend = CachedSqlBackend SqlBackend to use override this method directly.

The implementation for SqlBackend is here:

instance PersistStoreRead SqlBackend where
    get k = do
        mEs <- getMany [k]
        return $ Map.lookup k mEs

    -- inspired by Database.Persist.Sql.Orphan.PersistQuery.selectSourceRes
    getMany [] = return Map.empty
    getMany ks@(k : _) = do
        conn <- ask
        let
            t = entityDef . dummyFromKey $ k
        let
            cols = commaSeparated . Foldable.toList . keyAndEntityColumnNames t
        let
            wher = whereStmtForKeys conn ks
        let
            sql =
                T.concat
                    [ "SELECT "
                    , cols conn
                    , " FROM "
                    , connEscapeTableName conn t
                    , " WHERE "
                    , wher
                    ]
        let
            parse vals =
                case parseEntityValues t vals of
                    Left s ->
                        liftIO $
                            throwIO $
                                PersistMarshalError ("getBy: " <> s)
                    Right row -> return row
        withRawQuery sql (Foldable.foldMap keyToValues ks) $ do
            es <- CL.mapM parse .| CL.consume
            return $ Map.fromList $ fmap (\e -> (entityKey e, entityVal e)) es

So we end up calling withRawQuery, which calls rawQueryRes, which does two things of note:

  1. getStmtConn conn sql
  2. stmtQuery stmt vals

getStmtConn looks up the statement in the SqlBackend's StatementCache.
If the statement is present, it returns it.
Otherwise, it does connPrepare conn sql to create an initial Statement, and then adds some extra logic to prevent the Statement from being used after finalized.
Finally, we call hookGetStatement, a user-customizable hook of type SqlBackend -> Text -> Statement -> IO Statement.

So what is connPrepare doing?
Fort postgresql, we define it here, as prepare':

prepare' :: PG.Connection -> Text -> IO Statement
prepare' conn sql = do
    let
        query = PG.Query (T.encodeUtf8 sql)
    return
        Statement
            { stmtFinalize = return ()
            , stmtReset = return ()
            , stmtExecute = execute' conn query
            , stmtQuery = withStmt' conn query
            }

So "finalize" and "reset" don't do anything.
This is OK because withStmt' doesn't actually create a prepared statement, it executes the query one-off.
Which, y'know, maybe isn't ideal, but hey, whatever.

Trouble

The Text that is stored includes ? placeholders, so we will need to keep track of the [PersistValue] that each Text is called with.
ie, get (OrganizationKey 1) and get (OrganizationKey 2) are going to have the same query text:

SELECT * 
FROM organizations
WHERE id = ?

This means we probably want a type like HashMap Text (HashMap [PersistValue] [[PersistValue]]) - ie, a mapping of Text queries, to a mapping of query arguments, to the results of that query.

Summary of Available Touchpoints

  1. We can make a newtype over SqlBackend and define separate instances of PersistStoreRead etc.
    1. But this doesn't help us much in a PersistentOperation m => ... world, because the backend is often tied to the record, and we'd have to change PersistentOperationSql to be compatible with SqlBackend and not equal to SqlBackend. Maybe not a huge deal, but doesn't feel promising + lots of work.
  2. We can modify a SqlBackend directly to enhance it with this caching ability.
    1. We have connPrepare which creates Statements, and hookGetStatement which is able to alter a Statement before returning it.
    2. Statement itself can be overriden at the stmtQuery point.

So,

-- called when we actually execute the statement
stmtQuery :: Statement -> forall m. MonadIO m => [PersistValue] -> Acquire (ConduitM () [PersistValue] m ())

-- called when we prepare the statement on first execution
connPrepare :: SqlBackend -> Text -> IO Statement

-- called when we retrieve the statement prior to execution
hookGetStatement :: SqlBackendHooks -> SqlBackend -> Text -> Statement -> IO Statement

A Strategy for Reading Cache

I think we can have hookGetStatement install a cache lookup in stmtQuery.
We have the Text of the query that will be keyed on in hookGetStatement, and stmtQuery receives the [PersistValue] of the arguments to the query.

If we are closing over some Cache type like Map Text (Map [PersistValue] [[PersistValue]]), then we can do a lookup on that Map with the Text query, and then the stmtQuery can do a lookup on the Map with the [PersistValue] arguments.
If that exists, we return it, if not, we actually run the query.

This level of references and mutability means we would probably do best with an StmContainers.Map Text (StmContainers.Map [PersistValue] [[PersistValue]]).
hookGetStatement will look up the mutable Map, then provide the mutable Map to stmtQuery, which can look things up on actual execution.

Writing Cache

Now, actually populating the cache may be more difficult.
stmtQuery returns a Acquire (ConduitM () [PersistValue] m ()).
An Acquire is a type that can be used to provide somewhat manual free/release - for example, see rawQuery.

In rawQueryRes, we do mkAcquire (log >> getStmtConn conn sql) stmtReset.
So stmtReset :: Statement -> IO () is called, but without any further information.
If we want to use this as a point to write to our cache (feels reasonable), we'll need to smuggle that information in somehow.
Can we modify Statement and stmtQuery to save rows as they come through?

We can fmap over an Acquire, allowing us to write ConduitM () [PersistValue] m () -> ConduitT () [PersistValue] m ().

I think we can have something like,

modifyAcquireToSaveRows 
    :: ([PersistValue] -> [PersistValue] -> IO ()) 
    -- ^ for @args@, save current @row@
    -> ([PersistValue] -> IO ())
    -- ^ trigger that cache is complete for these arguments
    -> Statement 
    -> Statement
modifyAcquireToSaveRows saveRow signalArgsComplete statement = 
    statement
        { stmtQuery = \args -> do
            src <- stmtQuery statement args
            pure 
                ( src 
                .| saveRowC args
                )
        }
  where
    saveRowC args = do
        mrow <- await
        case mrow of
            Nothing -> do
                -- query is over, signal we can save for this 
                signalArgsComplete args
            Just row -> do
                -- query is ongoing
                saveRow args row
                yield row

Then, to write cache,

modifyStatementToSaveCache
    :: Chan [PersistValue]
    -- ^ The accumulated channel of persist values
    -> IO QueryFullyCompleted
    -- ^ Action to return whether query completed
    -> ([[PersistValue]] -> IO ())
    -- ^ Action to write accumulated values to cache
    -> Statement
    -> Statement
modifyStatementToSaveCache values didQueryComplete writeRowsToCache statement =
    statement
        { stmtReset = do
            compleat <- didQueryComplete
            case compleat of
                QueryUnfinished -> do
                    --  don't want to save here, so flush cache
                    _ <- getChanContents values
                    pure ()
                QueryFinished -> do
                    rows <- getChanContents values
                    writeRowsToCache rows
            stmtReset statement
        }

This leaves a lot of implementation details unsaid.

cachedQueries implementation

I think, with the above, we can write a function like this:

cachedQueries :: (PersistentReadOperationSql m) => m a -> m a
cachedQueries action = do
    stuff <- unsafeLiftPersistentReadOperation mkCachingImplements
    localSqlBackend 
        (\sqlBackend -> addRequisiteHooks stuff sqlBackend)
        action

This will modify the local SqlBackend with caching inside of the block.

Cache Busting

We definitely want a way to say "don't use cache in this lexical scope."

cachedQueries do
    get $ OrganizationKey 10 -- database fetch
    get $ OrganizationKey 10 -- cache hit
    noCaching do
        get $ OrganizationKey 10 -- database fetch

With the above design, cachedQueries is closing over that local state, and it is not made available to the inner action.
So noCaching has no ability to read or modify this - the side channels have all been completely encapsulated.

cachedQueries could provide a handle of sorts, but this breaks the lexical nature of it.
We want to call arbitrary queries, and those arbitrary queries shouldn't have to know if they have a cache or not.

However... SqlBackend does have a connVault, a Vault which allows us to store arbitrary values.
We could store the caching stuff in the Vault, along with a flag indicating whether or not the cache should be used.
Hooks, etc could consult this entry in the Vault.

Hm.

Should noCaching force just a database lookup, or also an update to cache?

Consider:

cachedQueries do
    get $ OrganizationKey 10 -- database fetch

    -- somewhere else, someone modifies the organization in db
    noCaching do
        get $ OrganizationKey 10 -- database fetch

    get $ OrganizationKey 10 -- cache hit -- but from where?

I think noCaching should probably always hit the database, but also always update the cache.

Don't cache writes!!

We definitely don't want our mechanism here to cache write queries.
Many of them go through stmtExecute, but some will go through stmtQuery.
So we need to have checking that the query text is not an INSERT or UPDATE or DELETE.

Cache Busting on Writes

The above design has a mapping Query -> Arguments -> Results.
But this is no good. Consider:

cachedQueries do
    get $ OrganizationKey 10 -- database fetch

    update (OrganizationKey 10) [OrganizationName =. "blah"]

    get $ OrganizationKey 10 -- cache hit, known stale value

This will cause bugs.
So we want a way to know which tables have been modified by an update query, and then invalidate just those entries in the cache.

One option is to just invalidate the entire cache on every update/insert.
This works and is very easy to implement.

Another, more difficult, option is to only invalidate cached queries that read from tables that have been modified.
To do this, we'd need to parse the query text for tables, and then scan each cached query text for a mention of the table.
If that table is mentioned, we clear the cache for that query.

This may also be somewhat slow to operate.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions