Notes on caching queries in persistent

# Query Caching in Persistent

## Goal:

1. Get query caching in `persistent` operations.
2. ... Such that the cache can be busted by writes
3. ... On specific tables
4. ... But not accounting for other threads or other servers doing this

Or, we want to have a similar feature to [ActiveRecord caching](https://www.honeybadger.io/blog/rails-activerecord-caching/).

## Persistent Background

### The `StatementCache`

`persistent` is, for the most part, implemented as a `ReaderT backend`, where `backend` in our case is `SqlBackend`.
A `SqlBackend` includes a [`connStmtMap`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/SqlBackend/Internal.hs?plain=1#L74-L76) - a [`StatementCache`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/SqlBackend/StatementCache.hs?plain=1#L3-L10) which closes over an `IORef (Map Text Statement)`.

The `Text` contains the query that we are issuing.
That means we can probably hook into this in order to modify our own cache and check whether or not we want to proceed.
Let's trace this and see how `get` works with the statement cache.

### Implementing `get`

The function `get` is defined on `PersistStoreRead` - the [actual definition](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Class/PersistStore.hs?plain=1#L162-L196) is reproduced here:

```haskell
class
    ( Show (BackendKey backend)
    , Read (BackendKey backend)
    , Eq (BackendKey backend)
    , Ord (BackendKey backend)
    , PersistCore backend
    , PersistField (BackendKey backend)
    , A.ToJSON (BackendKey backend)
    , A.FromJSON (BackendKey backend)
    ) =>
    PersistStoreRead backend
    where
    get
        :: forall record m
         . (MonadIO m, PersistRecordBackend record backend)
        => Key record -> ReaderT backend m (Maybe record)
```

Note that - unfortunately - we define this as polymorphic on the *backend*, not the `m`, meaning we need to make a `newtype CachedSqlBackend = CachedSqlBackend SqlBackend` to use override this method directly.

The implementation for `SqlBackend` is [here](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Orphan/PersistStore.hs?plain=1#L422-L425):

```haskell
instance PersistStoreRead SqlBackend where
    get k = do
        mEs <- getMany [k]
        return $ Map.lookup k mEs

    -- inspired by Database.Persist.Sql.Orphan.PersistQuery.selectSourceRes
    getMany [] = return Map.empty
    getMany ks@(k : _) = do
        conn <- ask
        let
            t = entityDef . dummyFromKey $ k
        let
            cols = commaSeparated . Foldable.toList . keyAndEntityColumnNames t
        let
            wher = whereStmtForKeys conn ks
        let
            sql =
                T.concat
                    [ "SELECT "
                    , cols conn
                    , " FROM "
                    , connEscapeTableName conn t
                    , " WHERE "
                    , wher
                    ]
        let
            parse vals =
                case parseEntityValues t vals of
                    Left s ->
                        liftIO $
                            throwIO $
                                PersistMarshalError ("getBy: " <> s)
                    Right row -> return row
        withRawQuery sql (Foldable.foldMap keyToValues ks) $ do
            es <- CL.mapM parse .| CL.consume
            return $ Map.fromList $ fmap (\e -> (entityKey e, entityVal e)) es
```

So we end up calling [`withRawQuery`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Orphan/PersistStore.hs?plain=1#L56-L64), which calls [`rawQueryRes`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Raw.hs?plain=1#L33-L48), which does two things of note:

1. `getStmtConn conn sql`
2. `stmtQuery stmt vals`

[`getStmtConn`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Raw.hs?plain=1#L86-L121) looks up the statement in the `SqlBackend`'s `StatementCache`.
If the statement is present, it returns it.
Otherwise, it does `connPrepare conn sql` to create an initial `Statement`, and then adds some extra logic to prevent the `Statement` from being used after finalized.
*Finally*, we call [`hookGetStatement`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/SqlBackend/Internal.hs?plain=1#L145-L147), a user-customizable hook of type `SqlBackend -> Text -> Statement -> IO Statement`.

So what is `connPrepare` doing?
Fort postgresql, we [define it here](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent-postgresql/Database/Persist/Postgresql.hs?plain=1#L505), as [`prepare'`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent-postgresql/Database/Persist/Postgresql.hs?plain=1#L532-L542):

```haskell
prepare' :: PG.Connection -> Text -> IO Statement
prepare' conn sql = do
    let
        query = PG.Query (T.encodeUtf8 sql)
    return
        Statement
            { stmtFinalize = return ()
            , stmtReset = return ()
            , stmtExecute = execute' conn query
            , stmtQuery = withStmt' conn query
            }
```

So "finalize" and "reset" don't do anything.
This is OK because [`withStmt'`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent-postgresql/Database/Persist/Postgresql.hs?plain=1#L615-L683) doesn't actually create a prepared statement, it executes the query one-off.
Which, y'know, maybe isn't ideal, but hey, whatever.

## Trouble

The `Text` that is stored includes `?` placeholders, so we will need to keep track of the `[PersistValue]` that each `Text` is called with.
ie, `get (OrganizationKey 1)` and `get (OrganizationKey 2)` are going to have the same query text:

```sql
SELECT * 
FROM organizations
WHERE id = ?
```

This means we probably want a type like `HashMap Text (HashMap [PersistValue] [[PersistValue]])` - ie, a mapping of Text queries, to a mapping of query arguments, to the results of that query.

# Summary of Available Touchpoints

1. We can make a `newtype` over `SqlBackend` and define separate instances of `PersistStoreRead` etc. 
    1. But this doesn't help us much in a `PersistentOperation m => ...` world, because the backend is often tied to the record, and we'd have to change `PersistentOperationSql` to be *compatible* with `SqlBackend` and not equal to `SqlBackend`. Maybe not a huge deal, but doesn't feel promising + lots of work.
2. We can *modify* a `SqlBackend` directly to enhance it with this caching ability.
    1. We  have `connPrepare` which creates `Statement`s, and `hookGetStatement` which is able to alter a `Statement` before returning it.
    2. `Statement` itself can be overriden at the [`stmtQuery` point](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/SqlBackend/Internal/Statement.hs?plain=1#L16-L20).

So, 

```haskell
-- called when we actually execute the statement
stmtQuery :: Statement -> forall m. MonadIO m => [PersistValue] -> Acquire (ConduitM () [PersistValue] m ())

-- called when we prepare the statement on first execution
connPrepare :: SqlBackend -> Text -> IO Statement

-- called when we retrieve the statement prior to execution
hookGetStatement :: SqlBackendHooks -> SqlBackend -> Text -> Statement -> IO Statement
```

# A Strategy for Reading Cache

I think we can have `hookGetStatement` install a cache lookup in `stmtQuery`. 
We have the `Text` of the query that will be keyed on in `hookGetStatement`, and `stmtQuery` receives the `[PersistValue]` of the arguments to the query.

If we are closing over some `Cache` type like `Map Text (Map [PersistValue] [[PersistValue]])`, then we can do a `lookup` on that `Map` with the `Text` query, and then the `stmtQuery` can do a `lookup` on the `Map` with the `[PersistValue]` arguments.
If that exists, we return it, if not, we actually run the query.

This level of references and mutability means we would probably do best with an `StmContainers.Map Text (StmContainers.Map [PersistValue] [[PersistValue]])`.
`hookGetStatement` will look up the mutable `Map`, then provide the mutable `Map` to `stmtQuery`, which can look things up on actual execution.

# Writing Cache

Now, actually populating the cache may be more difficult.
`stmtQuery` returns a `Acquire (ConduitM () [PersistValue] m ())`.
An [`Acquire`](https://github.com/snoyberg/conduit/blob/6b98f070fea09a3bf0a5d0897a2e27e3aa91c8fe/resourcet/Data/Acquire/Internal.hs?plain=1#L39-L52) is a type that can be used to provide somewhat manual free/release - for example, see [`rawQuery`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Raw.hs?plain=1#L22-L31).

In [`rawQueryRes`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/Sql/Raw.hs?plain=1#L46-L48), we do `mkAcquire (log >> getStmtConn conn sql) stmtReset`.
So `stmtReset :: Statement -> IO ()` is called, but without any further information.
If we want to use this as a point to write to our cache (feels reasonable), we'll need to smuggle that information in somehow.
Can we modify `Statement` and `stmtQuery` to save rows as they come through?

We can `fmap` over an `Acquire`, allowing us to write `ConduitM () [PersistValue] m () -> ConduitT () [PersistValue] m ()`.

I think we can have something like,

```haskell
modifyAcquireToSaveRows 
    :: ([PersistValue] -> [PersistValue] -> IO ()) 
    -- ^ for @args@, save current @row@
    -> ([PersistValue] -> IO ())
    -- ^ trigger that cache is complete for these arguments
    -> Statement 
    -> Statement
modifyAcquireToSaveRows saveRow signalArgsComplete statement = 
    statement
        { stmtQuery = \args -> do
            src <- stmtQuery statement args
            pure 
                ( src 
                .| saveRowC args
                )
        }
  where
    saveRowC args = do
        mrow <- await
        case mrow of
            Nothing -> do
                -- query is over, signal we can save for this 
                signalArgsComplete args
            Just row -> do
                -- query is ongoing
                saveRow args row
                yield row
```

Then, to write cache,

```haskell
modifyStatementToSaveCache
    :: Chan [PersistValue]
    -- ^ The accumulated channel of persist values
    -> IO QueryFullyCompleted
    -- ^ Action to return whether query completed
    -> ([[PersistValue]] -> IO ())
    -- ^ Action to write accumulated values to cache
    -> Statement
    -> Statement
modifyStatementToSaveCache values didQueryComplete writeRowsToCache statement =
    statement
        { stmtReset = do
            compleat <- didQueryComplete
            case compleat of
                QueryUnfinished -> do
                    --  don't want to save here, so flush cache
                    _ <- getChanContents values
                    pure ()
                QueryFinished -> do
                    rows <- getChanContents values
                    writeRowsToCache rows
            stmtReset statement
        }
```

This leaves a lot of implementation details unsaid.

# `cachedQueries` implementation

I think, with the above, we can write a function like this:

```haskell
cachedQueries :: (PersistentReadOperationSql m) => m a -> m a
cachedQueries action = do
    stuff <- unsafeLiftPersistentReadOperation mkCachingImplements
    localSqlBackend 
        (\sqlBackend -> addRequisiteHooks stuff sqlBackend)
        action
```

This will modify the local `SqlBackend` with caching inside of the block.

# Cache Busting

We definitely want a way to say "don't use cache in this lexical scope."

```haskell
cachedQueries do
    get $ OrganizationKey 10 -- database fetch
    get $ OrganizationKey 10 -- cache hit
    noCaching do
        get $ OrganizationKey 10 -- database fetch
```

With the above design, `cachedQueries` is closing over that local state, and it is not made available to the inner action.
So `noCaching` has no ability to read or modify this - the side channels have all been completely encapsulated.

`cachedQueries` could provide a handle of sorts, but this breaks the lexical nature of it.
We want to call arbitrary queries, and those arbitrary queries shouldn't have to know if they have a cache or not.

However... `SqlBackend` does have a [`connVault`](https://github.com/yesodweb/persistent/blob/cd97e4cbf41835514be5a62a71937db0050cf34b/persistent/Database/Persist/SqlBackend/Internal.hs?plain=1#L137-L139), a [`Vault`](https://www.stackage.org/lts-23.27/package/vault-0.3.1.5) which allows us to store arbitrary values.
We could store the caching stuff in the `Vault`, along with a flag indicating whether or not the cache should be used.
Hooks, etc could consult this entry in the `Vault`.

Hm.

Should `noCaching` force just a database lookup, or also an update to cache?

Consider:

```haskell
cachedQueries do
    get $ OrganizationKey 10 -- database fetch

    -- somewhere else, someone modifies the organization in db
    noCaching do
        get $ OrganizationKey 10 -- database fetch

    get $ OrganizationKey 10 -- cache hit -- but from where?
```

I think `noCaching` should probably always hit the database, but also always update the cache.

# Don't cache writes!!

We definitely don't want our mechanism here to cache write queries.
Many of them go through `stmtExecute`, but some will go through `stmtQuery`.
So we need to have checking that the query text is not an INSERT or UPDATE or DELETE.

# Cache Busting on Writes

The above design has a mapping `Query -> Arguments -> Results`.
But this is no good. Consider:

```haskell
cachedQueries do
    get $ OrganizationKey 10 -- database fetch

    update (OrganizationKey 10) [OrganizationName =. "blah"]

    get $ OrganizationKey 10 -- cache hit, known stale value
```

This will cause bugs.
So we want a way to know which *tables* have been modified by an update query, and then invalidate just those entries in the cache.

One option is to just invalidate the entire cache on every update/insert.
This works and is very easy to implement.

Another, more difficult, option is to only invalidate cached queries that read from tables that have been modified.
To do this, we'd need to parse the query text for tables, and then scan each cached query text for a mention of the table.
If that table is mentioned, we clear the cache for that query.

This may also be somewhat slow to operate.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Notes on caching queries in persistent #1605

Query Caching in Persistent

Goal:

Persistent Background

The `StatementCache`

Implementing `get`

Trouble

Summary of Available Touchpoints

A Strategy for Reading Cache

Writing Cache

`cachedQueries` implementation

Cache Busting

Don't cache writes!!

Cache Busting on Writes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Notes on caching queries in persistent #1605

Description

Query Caching in Persistent

Goal:

Persistent Background

The StatementCache

Implementing get

Trouble

Summary of Available Touchpoints

A Strategy for Reading Cache

Writing Cache

cachedQueries implementation

Cache Busting

Don't cache writes!!

Cache Busting on Writes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

The `StatementCache`

Implementing `get`

`cachedQueries` implementation