Describe the bug
We have a handful of Features that can be powered by embeddings generated by an LLM. These embeddings are currently stored in either post meta or term meta and then used to run comparisons.
It's a known issue that this doesn't scale very well, as running these comparisons within WordPress starts to slow down significantly once you have hundreds or thousands of items. We probably haven't done a good enough job of making that limitation known, though.
But another issue that came up recently is that this embedding data can get quite large. The way this currently works is we take the content of an item (say a post) and break that down into smaller chunks. Each chunk is then sent to the LLM to generate an embedding, and all of those embeddings are then stored together under a single meta key.
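As a rough illustration of that pattern (not the plugin's actual code; the chunking and embedding helpers here are hypothetical, though `classifai_openai_embeddings` is the real meta key):

```php
<?php
// Illustrative sketch of the current storage pattern. The helper
// functions are hypothetical stand-ins for the real chunking and
// LLM API code.
function store_post_embeddings( int $post_id ): void {
	$content = get_post_field( 'post_content', $post_id );

	// Break the content into smaller chunks (hypothetical helper).
	$chunks = my_chunk_content( $content );

	$embeddings = [];
	foreach ( $chunks as $chunk ) {
		// Each chunk is sent to the LLM for an embedding vector
		// (hypothetical helper wrapping the API call).
		$embeddings[] = my_generate_embedding( $chunk );
	}

	// All chunk embeddings are stored together under a single meta
	// key, so a long post produces one very large serialized row.
	update_post_meta( $post_id, 'classifai_openai_embeddings', $embeddings );
}
```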
For long content, this data can easily get over 1MB. WordPress has built-in functionality where, in certain situations (like when running get_posts or get_post_meta), it will run a database query to fetch all meta for that item and store it in the cache, with the idea that subsequent requests for this data will be faster.
The problem is that in those situations, this embedding data gets pulled into the cache, and it can easily be large enough to overwhelm the cache size limit, which then forces all cached data to be purged. For sites with lots of traffic, this can lead to performance issues as more requests need to make database queries to get the data they need.
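To make the caching behavior concrete, here is a minimal sketch of how core primes the meta cache; the post ID and the `_thumbnail_id` lookup are just illustrative:

```php
<?php
// Core WordPress behavior (not plugin code): any meta lookup for a
// post primes the full meta cache for that post.
$post_id  = 123; // Illustrative: a post with large embedding meta.
$thumb_id = get_post_meta( $post_id, '_thumbnail_id', true );

// Internally, get_metadata() calls update_meta_cache( 'post', array( $post_id ) )
// on a cache miss, which selects EVERY row from wp_postmeta for that
// post and writes them all to the object cache, including a
// multi-megabyte classifai_openai_embeddings value if one exists.
```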
Approaches
I think there are two approaches we should look at implementing here:
- For any Feature that uses embeddings but doesn't currently support storing them in Elasticsearch, add that functionality (Classification and Recommended Content)
- For sites that don't have access to Elasticsearch, add a new database table to store embeddings instead of using the meta tables
Elasticsearch
Right now, the Smart 404 and Term Cleanup Features can take advantage of Elasticsearch (through ElasticPress) to store and query embeddings. This leads to significant performance improvements on the query side and means we don't need to store the data in the meta tables, fixing the issue described above.
We should look to bring this same functionality to all other Features that use embeddings, as well as adjust the current approach to store only in Elasticsearch (right now, those two existing Features store in both places).
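For reference, Elasticsearch's dense_vector field type is the natural fit for this data. A minimal sketch, assuming ElasticPress's ep_post_mapping filter and an assumed field name and dimension count (1536 here, matching common OpenAI embedding models; not the plugin's actual schema):

```php
<?php
// Hedged sketch: add a dense_vector field to the ElasticPress post
// mapping so embeddings live in Elasticsearch instead of post meta.
// The `embedding` field name and 1536 dims are assumptions.
add_filter(
	'ep_post_mapping',
	function ( array $mapping ): array {
		$mapping['mappings']['properties']['embedding'] = [
			'type' => 'dense_vector',
			'dims' => 1536, // Must match the embedding model's output size.
		];
		return $mapping;
	}
);
```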
New DB table
In addition to the above, we should look at introducing a new database table designed specifically for this embedding data. This prevents the problem discussed above and also allows us to tailor the table to embeddings, whereas the meta tables have to handle lots of data types. This will likely lead to better-performing queries, but it will take some experimentation to figure out how best to structure the table (I would start by looking at https://github.com/Jameswlepage/wpvdb and seeing if there are things there we can use or learn from). We'll also need to consider backwards compatibility here and whether we should migrate existing embedding data from the meta tables to this new table.
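As a rough starting point for that experimentation, here is a sketch using core's dbDelta; the table name, columns, and vector storage format are all open questions rather than a final design:

```php
<?php
// Rough sketch of a dedicated embeddings table. All names here are
// placeholders, not a decided schema.
function create_embeddings_table(): void {
	global $wpdb;

	require_once ABSPATH . 'wp-admin/includes/upgrade.php';

	$table_name      = $wpdb->prefix . 'classifai_embeddings';
	$charset_collate = $wpdb->get_charset_collate();

	// One row per chunk keeps individual rows small and lets queries
	// pull only the vectors they need, unlike the single meta blob.
	$sql = "CREATE TABLE {$table_name} (
		id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
		object_id BIGINT(20) UNSIGNED NOT NULL,
		object_type VARCHAR(20) NOT NULL DEFAULT 'post',
		chunk_index INT(11) UNSIGNED NOT NULL DEFAULT 0,
		vector LONGTEXT NOT NULL,
		PRIMARY KEY  (id),
		KEY object_lookup (object_type, object_id)
	) {$charset_collate};";

	dbDelta( $sql );
}
```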
I would recommend we tackle this part first and then the Elasticsearch part second, as I think this has more applicable use cases.
Steps to Reproduce
- Enable a Feature that uses embeddings
- Create a long post and trigger embedding generation for that
- View the size of the `classifai_openai_embeddings` post meta item in your database (a query sketch follows below)
- If desired, set up an environment that has caching enabled and see how the above impacts that
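For the third step, a quick way to check the stored size without a database GUI; this sketch assumes the default `wp_postmeta` table and a known post ID:

```php
<?php
// Check the byte size of the embeddings meta for one post.
global $wpdb;

$post_id = 123; // Illustrative: the long post from step 2.

$bytes = $wpdb->get_var(
	$wpdb->prepare(
		"SELECT LENGTH( meta_value ) FROM {$wpdb->postmeta}
		 WHERE post_id = %d AND meta_key = 'classifai_openai_embeddings'",
		$post_id
	)
);

printf( 'Embeddings meta size: %s bytes', number_format_i18n( (float) $bytes ) );
```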
Screenshots, screen recording, code snippet
No response
Environment information
No response
WordPress information
No response
Code of Conduct
- I agree to follow this project's Code of Conduct