docs: ADR for re-processing of documents #1913
Conversation
Reviewer's Guide

This PR introduces an architecture decision record for document re-processing and implements a full data migration subsystem that allows re-processing stored SBOM and advisory documents during schema changes. It adds concrete migrations for extracting SBOM properties and advisory vulnerability scores, refactors migration orchestration and storage initialization, integrates new CLI commands for data migrations, centralizes storage backend setup and enforces async trait requirements, and updates dependencies across multiple crates.

ER diagram for new and updated tables: SBOM properties and advisory vulnerability scores

erDiagram
SBOM {
UUID sbom_id PK
JSON properties
}
ADVISORY_VULNERABILITY_SCORE {
UUID id PK
UUID advisory_id FK
STRING vulnerability_id FK
ENUM type
STRING vector
FLOAT score
ENUM severity
}
ADVISORY_VULNERABILITY {
UUID advisory_id PK
STRING vulnerability_id PK
}
ADVISORY_VULNERABILITY_SCORE ||--|{ ADVISORY_VULNERABILITY : "advisory_id, vulnerability_id"
ADVISORY_VULNERABILITY_SCORE }o--|| SBOM : "(none, but SBOM now has properties)"
ADVISORY_VULNERABILITY_SCORE }o--|| ADVISORY : "advisory_id"
ADVISORY_VULNERABILITY_SCORE }o--|| VULNERABILITY : "vulnerability_id"
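Not part of the PR itself, but to make the ER diagram concrete: a minimal sea-orm-migration sketch of a table shaped like `advisory_vulnerability_score`. Column names follow the diagram; the foreign keys, indexes and PostgreSQL enum types the real migration would need are left out, and the enum-typed columns are simplified to plain strings.

```rust
// Sketch only: column names are taken from the ER diagram above; everything
// else (no foreign keys, enums simplified to strings) is an assumption.
use sea_orm_migration::prelude::*;

#[derive(DeriveMigrationName)]
pub struct Migration;

#[async_trait::async_trait]
impl MigrationTrait for Migration {
    async fn up(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .create_table(
                Table::create()
                    .table(Alias::new("advisory_vulnerability_score"))
                    .col(ColumnDef::new(Alias::new("id")).uuid().not_null().primary_key())
                    .col(ColumnDef::new(Alias::new("advisory_id")).uuid().not_null())
                    .col(ColumnDef::new(Alias::new("vulnerability_id")).string().not_null())
                    // the real migration presumably uses PostgreSQL enums here
                    .col(ColumnDef::new(Alias::new("type")).string().not_null())
                    .col(ColumnDef::new(Alias::new("vector")).string().not_null())
                    .col(ColumnDef::new(Alias::new("score")).double().not_null())
                    .col(ColumnDef::new(Alias::new("severity")).string().not_null())
                    .to_owned(),
            )
            .await
    }

    async fn down(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .drop_table(Table::drop().table(Alias::new("advisory_vulnerability_score")).to_owned())
            .await
    }
}
```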
Class diagram for the new data migration subsystem and document handlers

classDiagram
class Runner {
+String database_url
+Option<String> database_schema
+DispatchBackend storage
+Direction direction
+Vec<String> migrations
+Options options
+run<M: MigratorWithData>()
}
class MigratorWithData {
+data_migrations() Vec<Box<MigrationTraitWithData>>
}
class MigrationTraitWithData {
+up(manager: SchemaDataManager)
+down(manager: SchemaDataManager)
}
class SchemaDataManager {
+SchemaManager manager
+DispatchBackend storage
+Options options
+process<D, N>(name, f)
}
class Handler~D~ {
+call(document: D, model: D::Model, tx)
}
class Document {
+all(tx)
+source(model, storage, tx)
}
class Sbom {
+CycloneDx
+Spdx
+Other
}
class Advisory {
+Cve
+Csaf
+Osv
+Other
}
Runner --> MigratorWithData
MigratorWithData --> MigrationTraitWithData
MigrationTraitWithData --> SchemaDataManager
SchemaDataManager --> Handler
Handler --> Document
Document <|-- Sbom
Document <|-- Advisory
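To illustrate how these pieces are meant to fit together, here is a free-standing sketch. The type names come from the class diagram, but every signature and body below is a stub invented for illustration, not the crate's actual code.

```rust
// Stub definitions mirroring the class diagram above; NOT the crate's real types.
use std::future::Future;

type MigrationError = Box<dyn std::error::Error + Send + Sync>;

/// Stand-in for `SchemaDataManager`: wraps the schema manager plus the
/// document storage backend.
pub struct SchemaDataManager;

impl SchemaDataManager {
    /// Walk every stored document of type `D`, load and re-parse it from the
    /// storage backend, and hand it to the handler.
    pub async fn process<D, F, Fut>(&self, _name: &str, _handler: F) -> Result<(), MigrationError>
    where
        F: Fn(D) -> Fut,
        Fut: Future<Output = Result<(), MigrationError>>,
    {
        // the real implementation would iterate the stored documents here
        Ok(())
    }
}

/// Stand-in for `MigrationTraitWithData`: like sea-orm's `MigrationTrait`,
/// but with access to stored documents via the data manager.
pub trait MigrationTraitWithData {
    async fn up(&self, manager: &SchemaDataManager) -> Result<(), MigrationError>;
    async fn down(&self, manager: &SchemaDataManager) -> Result<(), MigrationError>;
}

/// Placeholder for a parsed SBOM document (CycloneDX/SPDX).
pub struct ParsedSbom;

/// Example data migration: re-extract the new `properties` JSON for each SBOM.
pub struct ExtractSbomProperties;

impl MigrationTraitWithData for ExtractSbomProperties {
    async fn up(&self, manager: &SchemaDataManager) -> Result<(), MigrationError> {
        manager
            .process::<ParsedSbom, _, _>("extract SBOM properties", |_sbom| async move {
                // the handler would pull `properties` out of the parsed
                // document and update the corresponding `sbom` row
                Ok::<_, MigrationError>(())
            })
            .await
    }

    async fn down(&self, _manager: &SchemaDataManager) -> Result<(), MigrationError> {
        // undoing the data part is usually handled by the schema `down` step
        Ok(())
    }
}
```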
Class diagram for the new advisory vulnerability score entity

classDiagram
class AdvisoryVulnerabilityScore {
+UUID id
+UUID advisory_id
+String vulnerability_id
+ScoreType type
+String vector
+f64 score
+Severity severity
}
class ScoreType {
+V2_0
+V3_0
+V3_1
+V4_0
}
class Severity {
+None
+Low
+Medium
+High
+Critical
}
AdvisoryVulnerabilityScore --> ScoreType
AdvisoryVulnerabilityScore --> Severity
Codecov Report

❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main    #1913      +/-   ##
==========================================
+ Coverage   68.66%   68.69%   +0.03%
==========================================
  Files         379      395       +16
  Lines       21449    22203      +754
  Branches    21449    22203      +754
==========================================
+ Hits        14727    15252      +525
- Misses       5855     6056      +201
- Partials      867      895       +28

☔ View full report in Codecov by Sentry.
@sourcery-ai review
Hey @ctron - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:

- Invalid trait bound `+ use<>` in return type. (link)
- Invalid trait bound `+ use<>` in return type. (link)

General comments:

- The example migration currently hardcodes `FileSystemBackend::for_test`, so make the storage backend configurable (e.g., allow S3) to support real-world migrations instead of only test scenarios.
- The `retrieve` method signature was updated with a `use<Self>` bound, which looks like a typo—verify that you intended a lifetime bound (for example `+ 'a + Send`) and correct the signature accordingly.
- In `DocumentProcessor::process`, wrapping all errors as `DbErr::Migration(format!(…))` loses structured error context; consider using a custom error type or propagating the original error to make debugging easier.
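On the last point, one common shape for keeping error context (a generic sketch, not what the PR currently does) is a dedicated migration error type that only collapses to `DbErr::Migration` at the sea-orm boundary:

```rust
// Generic sketch of the reviewer's suggestion: keep the original error as a
// typed source instead of formatting everything into DbErr::Migration(String).
use sea_orm::DbErr;

#[derive(Debug, thiserror::Error)]
enum DataMigrationError {
    #[error("database error: {0}")]
    Db(#[from] DbErr),
    #[error("storage error: {0}")]
    Storage(#[from] std::io::Error), // placeholder for the real storage error type
    #[error("failed to parse document: {0}")]
    Parse(String),
}

// Only at the boundary to sea-orm-migration is the error flattened; the full
// chain is still available for logging before that point.
fn into_db_err(err: DataMigrationError) -> DbErr {
    DbErr::Migration(err.to_string())
}
```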
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The example migration currently hardcodes `FileSystemBackend::for_test`, so make the storage backend configurable (e.g., allow S3) to support real‐world migrations instead of only test scenarios.
- The `retrieve` method signature was updated with a `use<Self>` bound, which looks like a typo—verify that you intended a lifetime bound (for example `+ 'a + Send`) and correct the signature accordingly.
- In `DocumentProcessor::process`, wrapping all errors as `DbErr::Migration(format!(…))` loses structured error context; consider using a custom error type or propagating the original error to make debugging easier.
## Individual Comments
### Comment 1
<location> `modules/storage/src/service/s3.rs` </location>
<code_context>
- async fn retrieve<'a>(
+ async fn retrieve(
&self,
StorageKey(key): StorageKey,
- ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>
<issue_to_address>
Invalid trait bound `+ use<>` in return type.
`+ use<>` is invalid Rust syntax and will not compile. Please correct this trait bound.
</issue_to_address>
### Comment 2
<location> `modules/storage/src/service/fs.rs` </location>
<code_context>
- async fn retrieve<'a>(
+ async fn retrieve(
&self,
key: StorageKey,
- ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>
<issue_to_address>
Invalid trait bound `+ use<>` in return type.
This trait bound will cause a compilation error; please update it to valid Rust syntax.
</issue_to_address>
### Comment 3
<location> `docs/adrs/00008-re-process-documents.md:17` </location>
<code_context>
+When making changes to the database structure, we also have a migration process, which takes care of upgrading the
+database structures during an upgrade.
+
+However, in some cases, changing the database structure actually means to extract more information from documents and is
+currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
+all documents affected by this change.
+
</code_context>
<issue_to_address>
Grammatical error: 'means to extract more information from documents and is currently stored' should be 'means extracting more information from documents than is currently stored'.
It should be: 'changing the database structure actually means extracting more information from documents than is currently stored in the database.'
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
However, in some cases, changing the database structure actually means to extract more information from documents and is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
=======
However, in some cases, changing the database structure actually means extracting more information from documents than is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
>>>>>>> REPLACE
</suggested_fix>
Force-pushed from 55a099b to b0866ec
Just some thoughts:
> enabled). The process will migrate schema and data. This might block the startup for a bit. But would be fast and
> simple for small systems.
>
> ### Approach 2
There may be a third way, e.g. only using default_transaction_read_only, which will switch the db to read-only mode ... note this is session based, so maybe this is just a matter of keeping the original conn for normal operations and generating a new conn (without setting default_transaction_read_only) to do any mutations. I think I would also use default_transaction_read_only for blue/green as well, but thought I would mention this.
I think working with read-only transactions makes sense in general. Some operations just being read-only by nature.
We could ask the user to reconfigure Trustify for read-only transactions. And then run the migrations. But that wouldn't be much of a difference to the blue/green approach? We'd just need to ensure we can enable the user to do so.
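For reference, a minimal sketch of the session-based idea, assuming sea-orm's `execute_unprepared` for the raw SQL; not code from the PR:

```rust
// Hypothetical sketch of the session-based idea above.
use sea_orm::{ConnectionTrait, Database, DbErr};

async fn split_connections(database_url: &str) -> Result<(), DbErr> {
    // connection used by the regular application paths; note that with a
    // connection pool the setting would have to be applied to every pooled
    // session (or at role/database level) to be effective across the pool
    let app = Database::connect(database_url).await?;
    app.execute_unprepared("SET default_transaction_read_only = on")
        .await?;

    // separate connection, without the setting, reserved for the migration's writes
    let _migrator = Database::connect(database_url).await?;
    // ... the data migration would perform its mutations on `_migrator`

    Ok(())
}
```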
> ## Consequences
>
> * The migration will block the upgrade process until it is finished
failure during these kinds of migrations will need extra testing ... and maybe specific TX handling (transactional DDL is your friend).
Yes, that's why I started adding tests for migrations (with data) as well.
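As a side note on transactional DDL (a generic sketch, not code from the PR, and the column is invented): PostgreSQL lets the schema change and the data backfill commit or roll back as one unit.

```rust
// Generic sketch: run the DDL and the data backfill in one transaction,
// so a failure rolls back both.
use sea_orm::{ConnectionTrait, DatabaseConnection, DbErr, TransactionTrait};

async fn migrate_in_one_tx(db: &DatabaseConnection) -> Result<(), DbErr> {
    let txn = db.begin().await?;

    // schema change; PostgreSQL DDL is transactional
    txn.execute_unprepared("ALTER TABLE sbom ADD COLUMN IF NOT EXISTS properties jsonb")
        .await?;

    // data backfill in the same transaction
    txn.execute_unprepared("UPDATE sbom SET properties = '{}' WHERE properties IS NULL")
        .await?;

    // either both changes land, or neither does (dropping without commit rolls back)
    txn.commit().await
}
```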
> * 👍 Upgrade process is faster and less complex
> * 👎 Requires some coordination between instances (only one processor at a time, maybe one after the other)
>
> ### Option 3
there may be other options - there are a lot of pg extensions for this sort of thing - e.g. https://github.com/xataio/pgroll
Definitely worth checking out. However, it seems to add a lot of complexity which then needs to be handled by us. And I'm not sure it is worth the effort.
> ### Approach 2
>
> The user uses a green/blue deployment. Switching the application to use green and run migrations against blue. Once
worth mentioning that amazon has blue/green pg deployments as part of their service though I do not have much experience with them
Yes, that was my thought here. I'd expect other PG services to have similar setups available. However, it's up to the user to provide a database. We'd just try to ensure we can work with a model like this.
I don't think there would be a way of doing this automatically. Assuming the Sea ORM migrations drive this, it would be sequential. As part of the data migration most likely there would also be a schema migration. However, if we allow the data migrations to be run up-front, we could bundle such processing. What definitely would work is to run multiple processors (in a single process/pod) in parallel. We could improve on that, but it would add much more complexity. And I'd like to avoid that in the beginning.
Very good idea. However, that also conflicts with the idea of running multiple steps/migrations in parallel. So each document must be upgraded sequentially. However, that process could run in parallel.
I thought about that too. However, I believe that the current preferred idea of having A/B (green/blue) deployments would put that information not on the UI, but on some other process, detached from the UI. Assuming a user is running blue/green: the upgrade would run on blue, but the UI would serve (read-only) from green. So it would not see the state of blue until it's finished.
Force-pushed from b0866ec to dbf13e3
> This would also require to prevent users from creating new documents during that time. Otherwise, we would need to
> re-process documents ingested during the migration time. A way of doing this could be to leverage PostgreSQL's ability
> to switch into read-only mode. Having mutable operations fail with a 503 (Service Unavailable) error. This would also
Just to be clear: users would be unable to ingest new data while an update/migration is in progress? And that could potentially take more than a day? I think that's a bad look.
I believe it's worth the effort to design a system that allows the re-processing of docs ingested during migration. We should aim for "zero downtime".
Maybe you can add another option to the document describing your idea.
Once the code and db schema have been updated, why can't new docs (post update) be ingested while old docs (pre update) are being re-ingested?
Assuming you're adding a new, mandatory field. How would that be populated during schema migration? How would new code, relying on that data, work when that data is not present? Why should docs be re-ingested, and not just missing data be amended?
I'm not suggesting we add a new field. I'm asking why new docs can't be ingested into the new schema with the new code while the old docs are being re-ingested into the new schema with the new code. "Old" docs could be identified by their ingestion date.
That is a use case we need to cover. Adding a new field.
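Tying this back to the 503 idea in the quoted ADR text above: writes rejected by a read-only PostgreSQL session fail with SQLSTATE 25006 (`read_only_sql_transaction`), which the API layer could map to 503. A minimal, hypothetical mapping (the helper name is invented, it is not from the PR):

```rust
use actix_web::http::StatusCode;

// PostgreSQL SQLSTATE for "cannot execute ... in a read-only transaction".
const READ_ONLY_SQL_TRANSACTION: &str = "25006";

// Hypothetical helper: given the SQLSTATE extracted from a database error,
// decide which HTTP status the API should return.
fn status_for_sqlstate(sqlstate: Option<&str>) -> StatusCode {
    match sqlstate {
        // writes rejected while the database is read-only → 503, signalling
        // "temporarily unavailable, retry after the migration"
        Some(READ_ONLY_SQL_TRANSACTION) => StatusCode::SERVICE_UNAVAILABLE,
        _ => StatusCode::INTERNAL_SERVER_ERROR,
    }
}
```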
Force-pushed from dbf13e3 to 0a96d93
Force-pushed from 391b790 to ef04e5d
Force-pushed from bf41417 to ad42a6c
Force-pushed from 20e763c to c6bcac7
Force-pushed from c6bcac7 to 012d4d4
Signed-off-by: Dejan Bosanac <dbosanac@redhat.com>
Signed-off-by: Dejan Bosanac <dbosanac@redhat.com>
Force-pushed from 788f116 to 1e1f530
Signed-off-by: Dejan Bosanac <dbosanac@redhat.com>
Force-pushed from 1e1f530 to 2a3574d
I think this PR is good to go now. It provides the functionality for re-processing documents, and a use case of re-parsing CVSS data is working. It is pretty big, so keeping it in this state requires a lot of work. All further improvements and more testing will be done in the next phases. For example, I'll start working next on querying these new score tables, and we can fix and refine all things related to this functionality in that work.
jcrossley3 left a comment
It's 47 commits and ~3K lines so it's a lot to expect anyone to review it productively.
I mostly worry that this PR will result in updates that take too long from the perspective of the user. Our goal should be minutes, not hours, and certainly not days. We have people using TPA now, and some have lots of data. Do we have any sense how they'll react to the impact[s] of this PR?
Are we sure that this PR is better than just adopting a policy that new code will have to be robust enough to account for "missing data" while re-ingestion is occurring?
@jcrossley3 I understand you completely and sorry for putting you in this position. The idea was that someone else besides @ctron and me would take a peek and see if something stands out.
We are talking about breaking changes of the data model here. Changes which require data that we did not have before. We could avoid this in some situations. In others we can't. If we can avoid them, we should. A case like this should be an exception rather than the norm. Writing code that works with both worlds at the same time (old data model and new data model) on a "per document" basis will be hell. And it adds code that we need to maintain for a while. It will result in unexpected results for the user. Results that will change when reloading, as time (and the migration) progresses. That will make it even harder to understand and test. On the other hand, this approach does allow one to pre-run migrations on a cloned data set. See the impact, and plan a migration. Rather than working on actual production data. Once the migration is tested, the migrated result can also be used as a new clone for the production system again. We provide tools now to prevent changes, setting the database to read-only mode during that time. And to my understanding, this is what users actually prefer. So I think it makes sense to move forward with this and get some user feedback.
… table Update VulnerabilityService, PurlStatus, and VulnerabilitySummary to query CVSS scores from the new unified advisory_vulnerability_score table instead of the legacy cvss3 table. This completes the read-side migration for the score consolidation work started in PR #1913. The ingestion side continues to write to both tables (dual-write) to ensure backwards compatibility. Removal of the dual-write, deprecation of the legacy table, and any appropriate API changes will be addressed in follow-up PRs. Assisted-By: Claude
Preview: https://github.com/ctron/trustify/blob/feature/adr_rescan_1/docs/adrs/00011-re-process-documents.md
Summary by Sourcery
Add an architecture decision record outlining strategies for re-processing ingested documents after schema changes.
Documentation:
Summary by Sourcery
Enable re-processing of stored documents during schema upgrades by introducing an ADR, a data migration framework, new migrations for SBOM properties and advisory scores, CLI commands for data migrations, and enhancements to storage and migrator modules.
New Features:
Enhancements:
Documentation:
Tests: