-
Notifications
You must be signed in to change notification settings - Fork 580
feat: high-availability validator signer #19031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
| /** | ||
| * Find an existing duty record | ||
| */ | ||
| async findDuty(validatorAddress: EthAddress, slot: bigint, dutyType: DutyType): Promise<ValidatorDutyRecord | null> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does return null?
|
I feel like the database interactions are convoluted and have potential race conditions. IIUC we are manually performing quite a complicated check and set operation across multiple calls. Can this not be reduced to about 3 calls and have the DB provide the concurrency guarantees:
|
| it('should initialize database schema successfully', async () => { | ||
| (mockClient.query as jest.Mock).mockResolvedValue({ rows: [], rowCount: 0 }); | ||
|
|
||
| await db.initialize(); | ||
|
|
||
| // Should start transaction | ||
| expect(mockClient.query).toHaveBeenCalledWith('BEGIN'); | ||
|
|
||
| // Should run all schema setup statements (4 statements) | ||
| const calls = (mockClient.query as jest.Mock).mock.calls; | ||
| expect(calls.length).toBeGreaterThanOrEqual(6); // BEGIN + 4 schema statements + version + COMMIT | ||
|
|
||
| // Should commit transaction | ||
| expect(mockClient.query).toHaveBeenCalledWith('COMMIT'); | ||
|
|
||
| // Should release client | ||
| expect(mockClient.release).toHaveBeenCalled(); | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test is too brittle considering how tightly coupled it is to the migration being run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there's a src/db/schema.ts file too which defines tables and constraints. Are both files needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SCHEMA_SETUP in schema.ts` could be removed I guess, and use what's in the migration code to initialize. Would that make sense?
| mockClient = mock<PoolClient>(); | ||
|
|
||
| // Create mock pool | ||
| mockPool = mock<Pool>(); | ||
| mockPool.connect.mockImplementation(() => Promise.resolve(mockClient)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using a mocked Posgres connection, we could instantiate pglite (Postgres compiled to wasm) in a temp folder. We could eliminate all mocked calls this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, was just looking into some form of postgres to use for proper testing bc this mock doesn't provide much confidence. Will check pglite 👍
| */ | ||
| export const CLEANUP_OLD_DUTIES = ` | ||
| DELETE FROM validator_duties | ||
| WHERE started_at < $1; | ||
| `; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this query is using the idx_validator_duties_status index since it's defined on both status and started_at while the query does not specify status so it might require a full table scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this only clean 'completed' duties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was just added to cleanup all duties if needed. Can add one for signed duties only.
Would the way to make this one more efficient be:
DELETE FROM validator_duties
WHERE status IN ('signing', 'signed', 'failed')
AND started_at < $1;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like keeping the SQL queries separate but I'm slightly worried that if we change the query here and forget to update their uses in PostgresSlashingProtectionDatabase we'll only know of the issue at runtime since the unit tests use mocked clients that don't actually check the query syntax.
If we added pglite then we could run the queries in unit tests.
| async recordSuccess(params: RecordSuccessParams): Promise<void> { | ||
| const { validatorAddress, slot, dutyType, signature, nodeId } = params; | ||
|
|
||
| await this.db.updateDutySigned(validatorAddress, slot, dutyType, signature.toString()); | ||
|
|
||
| this.log.info(`Recorded successful signing for duty ${dutyType} at slot ${slot}`, { | ||
| validatorAddress: validatorAddress.toString(), | ||
| nodeId, | ||
| }); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any protection against rogue writes? E.g. some bug in the validator that causes NODE-FOO to call recordSuccess/recordFailure when the lock is actually held by NODE-BAR?
| // Complete first node's signing after a short delay | ||
| await sleep(100); | ||
| await service.recordSuccess({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would happen if the node never completes the signature? Does node 2 recover?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, as mentioned in another comment if a node fails during 'signing' we can't currently recover.
| */ | ||
| async signWithProtection( | ||
| validatorAddress: EthAddress, | ||
| signingRoot: Buffer32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'd rename signingRoot to just message or messageHash to match the rest of the codebase.
| await this.slashingProtection.recordSuccess({ | ||
| validatorAddress, | ||
| slot, | ||
| dutyType, | ||
| signature, | ||
| nodeId: this.config.nodeId, | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't get retried so I think we'll lose signature in case of temporary network issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the third(?) migration script
| /** | ||
| * Query to update a duty to 'signed' status | ||
| */ | ||
| export const UPDATE_DUTY_SIGNED = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These UPDATE/DELETE queries should require the node to that which made the insert shouldn't they?
WHERE node_id = $5 or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use lock_token
| spec: | ||
| initContainers: | ||
| - name: db-migrate | ||
| image: validator-image:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's 'validator-image'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sry didn't realize claude added this guide when I first created the migrations code.
Seems kinda useful, I've updated to correctly reflect how it currently works (which is with aztec migrate-ha-db command)
| spec: | ||
| containers: | ||
| - name: migrate | ||
| image: your-validator-image |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be aztecprotocol/aztec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And wouldn't it use the CLI command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated REAME, it wasn't up-to-date
| * @returns true if the update succeeded, false if token didn't match | ||
| */ | ||
| async recordFailure(params: RecordFailureParams): Promise<boolean> { | ||
| const { validatorAddress, slot, dutyType, error, lockToken } = params; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still feels redundant. Can't we simply delete the record?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm yeah I think you're right. other than logging/observability this is a better solution as it simplifies things even more.
|
|
||
| while (true) { | ||
| // insert if not present, get existing if present | ||
| const { isNew, record } = await this.db.tryInsertOrGetExisting(params); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If node A acquires the lock and then dies, how long is node B going to keep looping? Is it infinite? At most it should be until the end of the slot, but I can't see where it would time out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further down it does:
} else if (record.status === DutyStatus.SIGNING) {
// Another node is currently signing - check for timeout
if (Date.now() - startTime > this.signingTimeoutMs) {
this.log.warn(`Timeout waiting for signing to complete for duty ${dutyType} at slot ${slot}`, {
validatorAddress: validatorAddress.toString(),
timeoutMs: this.signingTimeoutMs,
signingNodeId: record.nodeId,
});
throw new DutyAlreadySignedError(slot, dutyType, 'unknown (timeout)');
}
// .... keep looping
}
Could switch from having a fixed timeout to having a deadline which would be the end of the slot.
Can implement that in the other PR
alexghr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but I'd like to see the HASigner check for invalid configs. The test changes would be nice to have though.
| pgm.sql(` | ||
| INSERT INTO schema_version (version) | ||
| VALUES (${SCHEMA_VERSION}) | ||
| ON CONFLICT (version) DO NOTHING; | ||
| `); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't pg-migrate already create a schema migration table?
| await db.close(); | ||
| }); | ||
|
|
||
| describe('schema setup', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can delete these tests as they're just testing PGlite functions correctly (which shouldn't be our concern)
| it('should handle large block numbers correctly', async () => { | ||
| const largeBlockNumber = 9007199254740991n; | ||
|
|
||
| const result = await db.query<InsertOrGetRow>(INSERT_OR_GET_DUTY, [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these queries should be tested through our PostgresSlashingProtectionDatabase implementation rather than going through the database. Doing it this way would also validate that the positional query parameters in the service are passed correctly
| const row = result.rows[0]; | ||
| return { | ||
| isNew: row.is_new, | ||
| record: this.rowToRecord(row), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this return the lock token?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth adding a comment here specifying that on failing to get the lock the returned 'lock_token' from the db is going to be the empty string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did initially return the lock_token for non-new records. It shouldn't be an issue since this is a trusted setup but also wouldn't that defeat its purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I meant in the 'failed lock' case. It shouldn't return the lock token to Node B if Node A has the lock and the SQL handles this correctly.
| SELECT * FROM inserted | ||
| UNION ALL | ||
| SELECT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks quite complicated but I get how it's meant to work.
| const result = await db.tryInsertOrGetExisting(params); | ||
| expect(result.record.status).toBe(DutyStatus.SIGNING); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an upsert so this doesn't actually assert that the insertion was done correctly. This function could have just inserted the row and this test wouldn't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVM, I saw the tests further down in this test suite.
| // First one should succeed, let it complete signing | ||
| await sleep(50); | ||
| const result = await db.tryInsertOrGetExisting(params1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes node1 the winner. We can't assume the first promise will be the one to win. This should use a Promise.race
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now I added a Promise.allSettled as per other comment I think this one becomes redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kept just one test where 3 nodes all fire at once with Promise.race and we check expected results
| throw new SlashingProtectionError(slot, dutyType, record.messageHash, messageHash); | ||
| } | ||
| throw new DutyAlreadySignedError(slot, dutyType, record.nodeId); | ||
| } else if (record.status === DutyStatus.SIGNING) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other statuses to check? "error" comes to mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah error was removed! Can we add a
else {
const _: never
}
or throw an exception in case a new status gets added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
feels like it should error for unknown status, adding
| private readonly slashingProtection: SlashingProtectionService | undefined; | ||
|
|
||
| constructor( | ||
| db: SlashingProtectionDatabase | undefined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be defined? Otherwise I'd expect the node to use a different Signer implementation.
| } else { | ||
| this.slashingProtection = undefined; | ||
| this.log.info('Validator HA Signer initialized WITHOUT slashing protection'); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is dangerous: I want HA but I have no slashing protection. We should throw in this case IMHO
| * ); | ||
| * ``` | ||
| */ | ||
| export class ValidatorHASigner { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this class shold be changed to only support fully protected signing, no fall through. Then we want a fatory to create the appropriate type at runtime.
spypsy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
responses
Fixes [A-352](https://linear.app/aztec-labs/issue/A-352/add-postgresql-integration-for-recording-signed-duties) [A-353](https://linear.app/aztec-labs/issue/A-353/implement-core-slash-protection-logic) Introduces High-Availability validator signer. The signer uses a Postgres DB that will be used by multiple sequencer nodes, running with the same set of validator keys, in order to ensure that only one payload per duty is signed.
| */ | ||
| export class ValidatorHASigner { | ||
| private readonly log: Logger; | ||
| private readonly slashingProtection: SlashingProtectionService | undefined; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't be undefined can it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this in new PR FYI
| ): Promise<Signature> { | ||
| // If slashing protection is disabled, just sign directly | ||
| if (!this.slashingProtection) { | ||
| this.log.info('Signing without slashing protection enabled', { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this branch is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, removed
Fixes A-352 A-353
Introduces High-Availability validator signer. The signer uses a Postgres DB that will be used by multiple sequencer nodes, running with the same set of validator keys, in order to ensure that only one payload per duty is signed.