Embedding data causing cache failures #975

@dkotter

Describe the bug

We have a handful of Features that can be powered by embeddings generated by an LLM. These embeddings are currently stored in either post meta or term meta and then used to run comparisons.

It's a known issue that this doesn't scale very well, as running these comparisons within WordPress starts to slow down significantly once you have hundreds or thousands of items. We probably haven't done a good enough job of making that limitation known though.
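To make the scaling problem concrete, here is a rough sketch (in Python for brevity, not the plugin's actual PHP code) of a brute-force comparison: every stored embedding is scored against the query, so the work grows linearly with the number of items. The function names and the use of cosine similarity are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, stored, top_k=3):
    """Full scan: score the query against every stored embedding.

    `stored` maps a (hypothetical) item ID to its embedding.
    """
    scored = [
        (item_id, cosine_similarity(query, embedding))
        for item_id, embedding in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With thousands of items and vectors of a thousand or more dimensions each, this loop runs on every comparison request, which is the kind of work a dedicated vector index avoids.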

But another issue that came up recently is that this embedding data can get quite large. The way this currently works is we take the content of an item (say a post) and break it down into smaller chunks. Each chunk is then sent to the LLM to generate embeddings, and all of the resulting embeddings are stored together under a single meta key.

For long content, this data can easily exceed 1MB. In certain situations (such as when running get_posts or get_post_meta), WordPress runs a database query to get all meta for that item and stores it in the cache, with the idea that this makes any subsequent requests for that data faster.
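As a back-of-the-envelope check on the size, the sketch below estimates how large a single meta value gets as the number of chunks grows. It assumes 1,536-dimensional vectors and uses JSON as a stand-in for however the array is actually serialized, so the real numbers will differ; `estimate_meta_bytes` is a made-up helper.

```python
import json
import random


def estimate_meta_bytes(num_chunks, dims=1536, seed=0):
    """Rough size of one meta value holding `num_chunks` embeddings.

    Assumptions: `dims` floats per chunk, full-precision floats,
    JSON text as a proxy for the stored serialization format.
    """
    rng = random.Random(seed)
    chunks = [
        [rng.uniform(-1.0, 1.0) for _ in range(dims)]
        for _ in range(num_chunks)
    ]
    return len(json.dumps(chunks).encode("utf-8"))


for n in (1, 10, 50):
    print(n, "chunks:", estimate_meta_bytes(n), "bytes")
```

Under these assumptions each chunk costs tens of kilobytes, so a post that splits into a few dozen chunks is enough to pass 1MB.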

The problem is that in those situations the embedding data gets pulled into the cache, and it can easily be large enough to overwhelm the cache size limit, which then forces all cached data to be purged. For sites with lots of traffic, this can lead to performance issues, as more requests need to query the database to get the data they need.
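The failure mode can be modeled with a toy cache. This is only a sketch of the behavior described above, not how any particular object-cache backend works: `ToyObjectCache`, its byte budget, and its flush-everything policy are all assumptions, and `get_post_meta` here is a simplified Python stand-in that caches all of a post's meta on first access.

```python
class ToyObjectCache:
    """Toy cache with a total byte budget; flushes everything when exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.entries = {}  # post_id -> {meta_key: bytes}
        self.flushes = 0

    @staticmethod
    def _size(meta):
        return sum(len(value) for value in meta.values())

    def used_bytes(self):
        return sum(self._size(meta) for meta in self.entries.values())

    def set(self, post_id, meta):
        if self.used_bytes() + self._size(meta) > self.max_bytes:
            # Assumed policy: purge all cached data when over budget.
            self.entries.clear()
            self.flushes += 1
        if self._size(meta) <= self.max_bytes:
            self.entries[post_id] = meta


def get_post_meta(cache, db, post_id, key):
    """On a cache miss, load ALL meta for the post, not just `key`."""
    if post_id not in cache.entries:
        cache.set(post_id, dict(db[post_id]))
    meta = cache.entries.get(post_id, db[post_id])
    return meta.get(key)


db = {
    1: {"_edit_lock": b"1700000000:1"},
    2: {"_thumbnail_id": b"42",
        "classifai_openai_embeddings": b"0" * 1_200_000},
}
cache = ToyObjectCache(max_bytes=1_000_000)
get_post_meta(cache, db, 1, "_edit_lock")     # post 1 is now cached
get_post_meta(cache, db, 2, "_thumbnail_id")  # pulls in the embeddings too
print(cache.flushes, list(cache.entries))
```

Note that the second call never asks for the embeddings; requesting an unrelated small key is enough to drag the large value into the cache and evict post 1.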

Approaches

I think there are two approaches we should look at implementing here:

  1. For any Feature that uses embeddings but doesn't currently support storing them in Elasticsearch (Classification and Recommended Content), add that functionality
  2. For sites that don't have access to Elasticsearch, add a new database table to store embeddings instead of using the meta tables

Elasticsearch

Right now, the Smart 404 and Term Cleanup Features can take advantage of Elasticsearch (through ElasticPress) to store and query embeddings. This leads to significant performance improvements on the query side and means we don't need to store the data in the meta tables, fixing the issue described above.

We should look to bring this same functionality to all other Features that use embeddings, as well as adjust the current approach to store only in Elasticsearch (right now, those two existing Features store in both places).

New DB table

In addition to the above, we should look at introducing a new database table designed for this embedding data. This prevents the problem discussed above and also allows us to design the table specifically to handle embeddings, whereas right now the meta tables are set up to handle lots of data types. This will likely lead to better-performing queries, but it will take some experimentation to work out how best to structure it (I would start by looking at https://github.com/Jameswlepage/wpvdb and seeing if there are things there we can use or learn from). We will also need to consider backwards compatibility here, including whether we should migrate existing embedding data from the meta tables to this new table.
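As a starting point for discussion, here is one possible shape for such a table, sketched with Python's built-in sqlite3 so it can be run anywhere. Everything here is hypothetical: the table name, the columns, the one-row-per-chunk layout, and storing vectors as packed 32-bit floats are assumptions, not a decided design and not taken from wpvdb.

```python
import sqlite3
from array import array

# Hypothetical schema: one row per chunk instead of one large meta value.
SCHEMA = """
CREATE TABLE embeddings (
    object_id   INTEGER NOT NULL,
    object_type TEXT    NOT NULL,  -- e.g. 'post' or 'term'
    chunk_index INTEGER NOT NULL,
    embedding   BLOB    NOT NULL,  -- packed 32-bit floats
    PRIMARY KEY (object_type, object_id, chunk_index)
)
"""


def save_chunks(conn, object_type, object_id, chunks):
    """Replace all stored chunks for one object."""
    conn.execute(
        "DELETE FROM embeddings WHERE object_type = ? AND object_id = ?",
        (object_type, object_id),
    )
    conn.executemany(
        "INSERT INTO embeddings VALUES (?, ?, ?, ?)",
        [
            (object_id, object_type, index, array("f", vector).tobytes())
            for index, vector in enumerate(chunks)
        ],
    )
    conn.commit()


def load_chunks(conn, object_type, object_id):
    """Return the object's chunk embeddings in chunk order."""
    rows = conn.execute(
        "SELECT embedding FROM embeddings "
        "WHERE object_type = ? AND object_id = ? ORDER BY chunk_index",
        (object_type, object_id),
    )
    result = []
    for (blob,) in rows:
        vector = array("f")
        vector.frombytes(blob)
        result.append(list(vector))
    return result
```

Because the data lives outside the meta tables, a regular meta lookup for the post never touches it, and packing floats as binary is considerably smaller than a text serialization (at the cost of 32-bit precision).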

I would recommend we tackle the new database table first and the Elasticsearch work second, as I think the table has more applicable use cases.

Steps to Reproduce

  1. Enable a Feature that uses embeddings
  2. Create a long post and trigger embedding generation for it
  3. In your database, check the size of the classifai_openai_embeddings post meta item
  4. If desired, set up an environment that has caching enabled and see how the above impacts that
