Skip to content

Conversation

@spypsy
Copy link
Member

@spypsy spypsy commented Dec 15, 2025

Fixes A-352 A-353

Introduces High-Availability validator signer. The signer uses a Postgres DB that will be used by multiple sequencer nodes, running with the same set of validator keys, in order to ensure that only one payload per duty is signed.

@socket-security
Copy link

socket-security bot commented Dec 15, 2025

/**
* Find an existing duty record
*/
async findDuty(validatorAddress: EthAddress, slot: bigint, dutyType: DutyType): Promise<ValidatorDutyRecord | null> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does return null?

@PhilWindle
Copy link
Collaborator

I feel like the database interactions are convoluted and have potential race conditions. IIUC we are manually performing quite a complicated check and set operation across multiple calls. Can this not be reduced to about 3 calls and have the DB provide the concurrency guarantees:

  1. Insert record if not present, return record if present, We repeatedly call this if record is present and not yet signed.
  2. Update signed record. Called upon success.
  3. Delete failed record. Called upon failure.

Comment on lines 37 to 54
it('should initialize database schema successfully', async () => {
(mockClient.query as jest.Mock).mockResolvedValue({ rows: [], rowCount: 0 });

await db.initialize();

// Should start transaction
expect(mockClient.query).toHaveBeenCalledWith('BEGIN');

// Should run all schema setup statements (4 statements)
const calls = (mockClient.query as jest.Mock).mock.calls;
expect(calls.length).toBeGreaterThanOrEqual(6); // BEGIN + 4 schema statements + version + COMMIT

// Should commit transaction
expect(mockClient.query).toHaveBeenCalledWith('COMMIT');

// Should release client
expect(mockClient.release).toHaveBeenCalled();
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test is too brittle considering how tightly coupled it is to the migration being run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there's a src/db/schema.ts file too which defines tables and constraints. Are both files needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SCHEMA_SETUP in schema.ts` could be removed I guess, and use what's in the migration code to initialize. Would that make sense?

Comment on lines 27 to 31
mockClient = mock<PoolClient>();

// Create mock pool
mockPool = mock<Pool>();
mockPool.connect.mockImplementation(() => Promise.resolve(mockClient));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a mocked Posgres connection, we could instantiate pglite (Postgres compiled to wasm) in a temp folder. We could eliminate all mocked calls this way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, was just looking into some form of postgres to use for proper testing bc this mock doesn't provide much confidence. Will check pglite 👍

Comment on lines 177 to 194
*/
export const CLEANUP_OLD_DUTIES = `
DELETE FROM validator_duties
WHERE started_at < $1;
`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this query is using the idx_validator_duties_status index since it's defined on both status and started_at while the query does not specify status so it might require a full table scan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this only clean 'completed' duties?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was just added to cleanup all duties if needed. Can add one for signed duties only.
Would the way to make this one more efficient be:

DELETE FROM validator_duties
WHERE status IN ('signing', 'signed', 'failed')
  AND started_at < $1;

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like keeping the SQL queries separate but I'm slightly worried that if we change the query here and forget to update their uses in PostgresSlashingProtectionDatabase we'll only know of the issue at runtime since the unit tests use mocked clients that don't actually check the query syntax.

If we added pglite then we could run the queries in unit tests.

Comment on lines 129 to 138
async recordSuccess(params: RecordSuccessParams): Promise<void> {
const { validatorAddress, slot, dutyType, signature, nodeId } = params;

await this.db.updateDutySigned(validatorAddress, slot, dutyType, signature.toString());

this.log.info(`Recorded successful signing for duty ${dutyType} at slot ${slot}`, {
validatorAddress: validatorAddress.toString(),
nodeId,
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any protection against rogue writes? E.g. some bug in the validator that causes NODE-FOO to call recordSuccess/recordFailure when the lock is actually held by NODE-BAR?

Comment on lines +153 to +176
// Complete first node's signing after a short delay
await sleep(100);
await service.recordSuccess({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen if the node never completes the signature? Does node 2 recover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, as mentioned in another comment if a node fails during 'signing' we can't currently recover.

*/
async signWithProtection(
validatorAddress: EthAddress,
signingRoot: Buffer32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I'd rename signingRoot to just message or messageHash to match the rest of the codebase.

Comment on lines 120 to 130
await this.slashingProtection.recordSuccess({
validatorAddress,
slot,
dutyType,
signature,
nodeId: this.config.nodeId,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't get retried so I think we'll lose signature in case of temporary network issues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the third(?) migration script

/**
* Query to update a duty to 'signed' status
*/
export const UPDATE_DUTY_SIGNED = `
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These UPDATE/DELETE queries should require the node to that which made the insert shouldn't they?

WHERE node_id = $5 or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to use lock_token

spec:
initContainers:
- name: db-migrate
image: validator-image:latest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's 'validator-image'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sry didn't realize claude added this guide when I first created the migrations code.
Seems kinda useful, I've updated to correctly reflect how it currently works (which is with aztec migrate-ha-db command)

spec:
containers:
- name: migrate
image: your-validator-image
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be aztecprotocol/aztec?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And wouldn't it use the CLI command?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated REAME, it wasn't up-to-date

* @returns true if the update succeeded, false if token didn't match
*/
async recordFailure(params: RecordFailureParams): Promise<boolean> {
const { validatorAddress, slot, dutyType, error, lockToken } = params;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels redundant. Can't we simply delete the record?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeah I think you're right. other than logging/observability this is a better solution as it simplifies things even more.


while (true) {
// insert if not present, get existing if present
const { isNew, record } = await this.db.tryInsertOrGetExisting(params);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If node A acquires the lock and then dies, how long is node B going to keep looping? Is it infinite? At most it should be until the end of the slot, but I can't see where it would time out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further down it does:

 } else if (record.status === DutyStatus.SIGNING) {
    // Another node is currently signing - check for timeout
    if (Date.now() - startTime > this.signingTimeoutMs) {
      this.log.warn(`Timeout waiting for signing to complete for duty ${dutyType} at slot ${slot}`, {
        validatorAddress: validatorAddress.toString(),
        timeoutMs: this.signingTimeoutMs,
        signingNodeId: record.nodeId,
      });
      throw new DutyAlreadySignedError(slot, dutyType, 'unknown (timeout)');
    }

  // .... keep looping
}

Could switch from having a fixed timeout to having a deadline which would be the end of the slot.
Can implement that in the other PR

@spypsy spypsy requested a review from charlielye as a code owner January 7, 2026 09:34
Copy link
Contributor

@alexghr alexghr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but I'd like to see the HASigner check for invalid configs. The test changes would be nice to have though.

Comment on lines +16 to +21
pgm.sql(`
INSERT INTO schema_version (version)
VALUES (${SCHEMA_VERSION})
ON CONFLICT (version) DO NOTHING;
`);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't pg-migrate already create a schema migration table?

await db.close();
});

describe('schema setup', () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can delete these tests as they're just testing PGlite functions correctly (which shouldn't be our concern)

it('should handle large block numbers correctly', async () => {
const largeBlockNumber = 9007199254740991n;

const result = await db.query<InsertOrGetRow>(INSERT_OR_GET_DUTY, [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these queries should be tested through our PostgresSlashingProtectionDatabase implementation rather than going through the database. Doing it this way would also validate that the positional query parameters in the service are passed correctly

const row = result.rows[0];
return {
isNew: row.is_new,
record: this.rowToRecord(row),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this return the lock token?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth adding a comment here specifying that on failing to get the lock the returned 'lock_token' from the db is going to be the empty string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did initially return the lock_token for non-new records. It shouldn't be an issue since this is a trusted setup but also wouldn't that defeat its purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I meant in the 'failed lock' case. It shouldn't return the lock token to Node B if Node A has the lock and the SQL handles this correctly.

Comment on lines +125 to +127
SELECT * FROM inserted
UNION ALL
SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks quite complicated but I get how it's meant to work.

Comment on lines +330 to +331
const result = await db.tryInsertOrGetExisting(params);
expect(result.record.status).toBe(DutyStatus.SIGNING);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an upsert so this doesn't actually assert that the insertion was done correctly. This function could have just inserted the row and this test wouldn't know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM, I saw the tests further down in this test suite.

Comment on lines 405 to 407
// First one should succeed, let it complete signing
await sleep(50);
const result = await db.tryInsertOrGetExisting(params1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes node1 the winner. We can't assume the first promise will be the one to win. This should use a Promise.race

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now I added a Promise.allSettled as per other comment I think this one becomes redundant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kept just one test where 3 nodes all fire at once with Promise.race and we check expected results

throw new SlashingProtectionError(slot, dutyType, record.messageHash, messageHash);
}
throw new DutyAlreadySignedError(slot, dutyType, record.nodeId);
} else if (record.status === DutyStatus.SIGNING) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other statuses to check? "error" comes to mind

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah error was removed! Can we add a

else {
  const _: never
}

or throw an exception in case a new status gets added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like it should error for unknown status, adding

private readonly slashingProtection: SlashingProtectionService | undefined;

constructor(
db: SlashingProtectionDatabase | undefined,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be defined? Otherwise I'd expect the node to use a different Signer implementation.

Comment on lines 54 to 57
} else {
this.slashingProtection = undefined;
this.log.info('Validator HA Signer initialized WITHOUT slashing protection');
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is dangerous: I want HA but I have no slashing protection. We should throw in this case IMHO

* );
* ```
*/
export class ValidatorHASigner {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this class shold be changed to only support fully protected signing, no fall through. Then we want a fatory to create the appropriate type at runtime.

Copy link
Member Author

@spypsy spypsy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

responses

Fixes [A-352](https://linear.app/aztec-labs/issue/A-352/add-postgresql-integration-for-recording-signed-duties) [A-353](https://linear.app/aztec-labs/issue/A-353/implement-core-slash-protection-logic)

Introduces High-Availability validator signer. The signer uses a Postgres DB that will be used by multiple sequencer nodes, running with the same set of validator keys, in order to ensure that only one payload per duty is signed.
*/
export class ValidatorHASigner {
private readonly log: Logger;
private readonly slashingProtection: SlashingProtectionService | undefined;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be undefined can it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this in new PR FYI

): Promise<Signature> {
// If slashing protection is disabled, just sign directly
if (!this.slashingProtection) {
this.log.info('Signing without slashing protection enabled', {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this branch is possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, removed

@AztecBot AztecBot added this pull request to the merge queue Jan 8, 2026
Merged via the queue into next with commit 9b1bc1e Jan 8, 2026
17 checks passed
@AztecBot AztecBot deleted the spy/ha branch January 8, 2026 10:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants