feat(benchmarks): fix bytecode attack for CALL-like opcodes to work in Osaka #1850

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

jsign wants to merge 6 commits into ethereum:forks/osaka from jsign:jsign-osaka-test_xcall

Collaborator

jsign commented Dec 5, 2025 •

edited

Loading

🗒️ Description

This PR fixes the test_xcall benchmarks to work with Osaka. test_xcall has a CALL-like variant of the bytecode attack exploiting unchunkified code, highly relevant for zkVM worst-cases.

I think the solution should feel quite natural — I’ll add some review comments.

I checked this is doing what we want with two independent checks:

Hack a bit the Geth codebase to inspect how many distinct bytecodes were accessed during the block execution.
Actually running the worst-case block for zkVM execution and inspecting that indeed the witness to be sent to the guest program indeed is blowed up — since bytecode must be part of the input to the guest program.

For 1. and --block-gas-limit 30M, my Geth reports accessing ~260MiB of bytecodes were the upper limit is ~280MiB (30M/2600 — but this isn’t accounting for glue opcodes, intrinsic costs of splitting in txs, etc). So this looks fine.

For 2. and --block-gas-limit 30M, I also confirmed it really blows up the guest program witness, which is prob the best ultimate confirmation this is doing what we wanted. i.e: the json-encoded files have ~524MiB size which makes sense since bytes are hex encoded (with other things).

🔗 Related Issues or PRs

✅ Checklist

All: Ran fast tox checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
```
uvx tox -e static
```
All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.
All: Set appropriate labels for the changes (only maintainers can apply labels).

jsign added 4 commits

December 5, 2025 10:12


          feat(tests): Refactor xcall test to use deploy_max_contract_factory f…

0de5bac

…or contract deployment


          feat(tests): change xcall test to deploy multiple contracts in separa…

d7ff98f

…te transactions


          feat(tests): update test_xcall to partition attack in tx_max_gas_limi…

90d15e1

…t transactions with proper offseting to avoid overlap


          refactor(tests): improvements

3aa28b0

jsign added the A-test-benchmark label

jsign force-pushed the jsign-osaka-test_xcall branch from b757b8e to 3aa28b0 Compare

December 5, 2025 15:35

jsign added 2 commits

December 5, 2025 12:58


          feat(benchmarks): lints

d8c8b05


          feat(bench): add assertion for transaction gas limit cap in test_xcall

5dee307

jsign commented

View reviewed changes

tests/benchmark/compute/instruction/test_system.py

Comment on lines +112 to +114

    
                  initcode, factory_address, factory_caller_address = (

                      _deploy_max_contract_factory(pre, fork)

                  )

Collaborator Author

jsign Dec 5, 2025

I extracted the deployment of the factory contract to a separate function, mainly to avoid such a big method here which hurts readability, but mostly because I think for the upcoming test_extcode_ops I'll fix, that code is duplicated so I can re-use the function and we can delete code.

tests/benchmark/compute/instruction/test_system.py

    
                      _deploy_max_contract_factory(pre, fork)

                  )

                  # Deploy num_contracts via multiple txs (each capped by tx gas limit).

Collaborator Author

jsign Dec 5, 2025

In the setup phase, we can't create all the required bytecodes in a single tx because this is disallowed by Osaka (unfortuantely for us here).

This means we have to split the creation in multiple transactions. This will have an unavoidable hit in filling performance, since we actually used the previous strategy for that reason.

For 60M it takes ~2m in my machine, which maybe isn't as bad -- but still I'm not sure we can workaround this with Osaka limitations.

tests/benchmark/compute/instruction/test_system.py

Comment on lines 118 to 140

    
                      # Rough estimate (rounded down) of contracts per tx based on dominant

                      # cost factor only. E.g., 17M gas limit + 24KiB contracts = ~3 per tx.

                      # The goal is to involve the minimum amount of gas pricing to avoid

                      # complexity and potential brittleness.

                      # If this estimation is incorrect in the future (i.e. tx gas limit cap)

                      # is increased or cost per byte, the post-state check will detect it

                      # and can be adjusted with a more complex formula.

                      num_contracts_per_tx = fork.transaction_gas_limit_cap() // (

                          gas_costs.G_CODE_DEPOSIT_BYTE * max_contract_size

                      )

                      attack_txs = math.ceil(num_contracts / num_contracts_per_tx)

                      contracts_deployment_txs = []

                      for _ in range(attack_txs):

                          contracts_deployment_txs.append(

                              Transaction(

                                  to=factory_caller_address,

                                  gas_limit=fork.transaction_gas_limit_cap(),

                                  gas_price=10**6,

                                  data=Hash(num_contracts_per_tx),

                                  sender=pre.fund_eoa(),

                              )

                          )

Collaborator Author

jsign Dec 5, 2025

I preferred to go heavy in comments for future maintainability.

But basically we do a rough estimation of how many 24KiB contracts we can create per tx, and then do that. Note I do ceil in attack_txs instead of trying to do precise creation, just because it doesn't hurt plus it avoids complexity in this code. Since this is the setup phase and the tx is gas capped, I think it is okay.

The factory contract has a new CALLDATA parameter which is how many contracts to create. Although we might only be interested in this being always the max, I think it is better to leave it generic so we can try reusing it as much as possible. It just changes the contract creation loop count to whatever CALLDATALOAD says.

tests/benchmark/compute/instruction/test_system.py

Comment on lines +142 to +159

    
                  post = {}

                  for i in range(num_contracts):

                      deployed_contract_address = compute_create2_address(

                          address=factory_address,

                          salt=i,

                          initcode=initcode,

                      )

                      post[deployed_contract_address] = Account(nonce=1)

                  attack_call = Bytecode()

                  if opcode == Op.EXTCODECOPY:

                      attack_call = Op.EXTCODECOPY(

                          address=Op.SHA3(32 - 20 - 1, 85), dest_offset=96, size=1000

                      )

                  else:

                      # For the rest of the opcodes, we can use the same generic attack call

                      # since all only minimally need the `address` of the target.

                      attack_call = Op.POP(opcode(address=Op.SHA3(32 - 20 - 1, 85)))

Collaborator Author

jsign Dec 5, 2025

All this code is the same as before -- just that the diff is a bit unfortunate.

Something I noticed is that this has this opcode == Op.EXTCODECOPY which shouldn't be relevant for this method (look at test parametrization). Originally when I created this benchmark, the same test name was used for CALL-like opcodes and also EXTCODESIZE... but I think somebody split them after and maybe this was left here?

I'll leave this as is for now, and when I fix test_extcode_ops I'll try to re-use the logic since I'm pretty sure it was repeated there when it was splitted.

tests/benchmark/compute/instruction/test_system.py

Comment on lines +160 to +171

    
                  attack_code = (

                      # Setup memory for later CREATE2 address generation loop.

                      # 0xFF+[Address(20bytes)]+[seed(32bytes)]+[initcode keccak(32bytes)]

                      Op.MSTORE(0, factory_address)

                      + Op.MSTORE8(32 - 20 - 1, 0xFF)

                      + Op.MSTORE(32, Op.CALLDATALOAD(0))

                      + Op.MSTORE(64, initcode.keccak256())

                      # Main loop

                      + While(

                          body=attack_call + Op.MSTORE(32, Op.ADD(Op.MLOAD(32), 1)),

                      )

                  )

Collaborator Author

jsign Dec 5, 2025

This is the same attack code as before but with a twist. The previous version started with seed=0 and simply ran the loop until it ran out of gas, since we could do all in a singe tx.

Now we have to split the attack in multiple txs, thus I allow reading the seed from CALLDATALOAD(0) so we can create the attack txs in a way that makes sense.

tests/benchmark/compute/instruction/test_system.py

Comment on lines +173 to +180

    
                  if len(attack_code) > max_contract_size:

                      # TODO: A workaround could be to split the opcode code into multiple

                      # contracts and call them in sequence.

                      raise ValueError(

                          f"Code size {len(attack_code)} exceeds maximum "

                          f"code size {max_contract_size}"

                      )

                  attack_address = pre.deploy_contract(code=attack_code)

Collaborator Author

jsign Dec 5, 2025

Unchanged code.

tests/benchmark/compute/instruction/test_system.py

    
                  )

              def _deploy_max_contract_factory(

Collaborator Author

jsign Dec 5, 2025

The way the contract factory was created remains unchanged.

Note that the factory_code already used storage slot 0 as the internal seed. This was quite nice since it means that when we call the factory mutliple times now, it keeps creating the next bytecode in a coherent way from the previous tx.

tests/benchmark/compute/instruction/test_system.py

Comment on lines +71 to +74

    
                  tx_gas_limit_cap = fork.transaction_gas_limit_cap()

                  assert tx_gas_limit_cap is not None, (

                      "This benchmark requires a tx gas limit cap"

                  )

Collaborator Author

jsign Dec 5, 2025

Could we mark this test for Osaka forward only? (Or maybe that already happens since AFAIK in all the benchmark releases only Prague was filled, since I'm not interested we want backwards performance runs. For zkVM also makes sense that previous forks aren't interesting for this attack).

I think we could make it work backwards, it would just blow up a the logic below reg tx preparations for setup and attack. This test is already quite complex so not sure it is worth it.

tests/benchmark/compute/instruction/test_system.py

Comment on lines +187 to +219

    
                      full_txs = attack_gas_limit // tx_gas_limit_cap

                      remainder = attack_gas_limit % tx_gas_limit_cap

                      num_targeted_contracts_per_full_tx = (

                          # Base available gas:

                          # TX_GAS_LIMIT - intrinsic - (out of loop MSTOREs)

                          tx_gas_limit_cap

                          - intrinsic_gas_cost_calc()

                          - gas_costs.G_VERY_LOW * 4

                      ) // loop_cost

                      contract_start_index = 0

                      opcode_txs = []

                      for _ in range(full_txs):

                          opcode_txs.append(

                              Transaction(

                                  to=attack_address,

                                  gas_limit=tx_gas_limit_cap,

                                  gas_price=10**9,

                                  data=Hash(contract_start_index),

                                  sender=pre.fund_eoa(),

                              )

                          )

                          contract_start_index += num_targeted_contracts_per_full_tx

                      if remainder > 0:

                          opcode_txs.append(

                              Transaction(

                                  to=attack_address,

                                  gas_limit=remainder,

                                  gas_price=10**9,

                                  data=Hash(contract_start_index),

                                  sender=pre.fund_eoa(),

                              )

                          )

Collaborator Author

jsign Dec 5, 2025

TL;DR: creating the attack txs to execute in the same block, such that none surpas the transaction gas limit cap. In contract_start_index we keep the offset we pass as CALLDATALOAD for the attack contract that I explained in my previous comment.

Member

jochem-brouwer Dec 6, 2025

Wanted to mention this (did not check other logic): this looks almost similar to the split_transactions logic (

execution-specs/packages/testing/src/execution_testing/specs/benchmark.py

Line 344 in 843178f

remaining_gas if i == num_splits - 1 else gas_limit_cap

) of the BenchmarkTestFiller, however - that split txs would just split the tx (without calldata) up in multiple. Which we cannot do here due to the test logic.
@LouisTsai-Csie we could maybe think of adding cases to the split txs logic here? I feel that inserting calldata is something we will see in other benchmark tests often (for instance as a "pointer" to where a certain tx should pick up where the other tx has exited - we could also do this in the contract code itself, but this means spending gas on this logic, which we want to spend on the target (the scenario/opcode we are benching))

jsign marked this pull request as ready for review

December 5, 2025 16:36

jsign requested review from LouisTsai-Csie and marioevz

December 5, 2025 16:36

codecov bot commented Dec 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.31%. Comparing base (519eb26) to head (5dee307).
⚠️ Report is 14 commits behind head on forks/osaka.

Additional details and impacted files

@@               Coverage Diff                @@
##           forks/osaka    #1850       +/-   ##
================================================
+ Coverage        53.45%   87.31%   +33.86%     
================================================
  Files              743      541      -202     
  Lines            44076    32832    -11244     
  Branches          3891     3015      -876     
================================================
+ Hits             23559    28668     +5109     
+ Misses           20306     3557    -16749     
- Partials           211      607      +396

Flag	Coverage Δ
unittests	`87.31% <ø> (+33.86%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-test-benchmark