| 1 | +--- |
| 2 | +title: Spam or Legit? Analyzing 4byte Selector Collisions |
| 3 | +authors: |
| 4 | + - name: Kaan Uzdogan |
| 5 | + url: https://github.com/kuzdogan |
| 6 | + image_url: https://github.com/kuzdogan.png |
| 7 | +tags: [Sourcify, 4byte, Signatures] |
| 8 | +date: 2025-10-30 |
| 9 | +--- |
| 10 | +import FourbyteSS from "./screenshot.png" |
| 11 | +import samczsunTweet from "./samczsun-tweet.png" |
| 12 | + |
| 13 | +Recently, Sourcify took over openchain.xyz's 4byte signature APIs, as well as the domain itself, from [@samczsun](https://x.com/samczsun), who maintained them. We also built the database, and I wanted to run a quick analysis of the selector collisions to see how many are legit vs. deliberately generated (spam). |
| 14 | + |
| 15 | +<a href="https://x.com/samczsun/status/1980718830260490447" target="_blank" rel="noopener noreferrer" style={{ display: "flex", justifyContent: "center" }}> |
| 16 | + <img src={samczsunTweet} alt="samczsun's tweet" width="500" /> |
| 17 | +</a> |
| 18 | + |
| 19 | +We built the dataset and a service to serve the data. The dataset contains: |
| 20 | +1. Data from openchain's dataset |
| 21 | +2. Data from [4byte.directory](https://4byte.directory) |
| 22 | +3. Signatures from verified contracts in Sourcify. |
| 23 | + |
| 24 | +You can see the database schema in the [database docs](/docs/repository/sourcify-database/#schema) and the service in [services/4byte](https://github.com/argotorg/sourcify/tree/staging/services/4byte) in the repo. As of now we have 4.7 million signatures, of which 1.9 million are not from verified contracts; the rest, the majority, appear in at least one verified contract ([stats](https://api.4byte.sourcify.dev/signature-database/v1/stats)). |
| 25 | + |
| 26 | +While it's possible to submit signatures to the database via the `/import` endpoint, we also add signatures to the database automatically when a contract is verified. The 4byte databases are known to be spam-prone, as function selectors are only 4 bytes and it's trivial to find a collision with an otherwise legit signature. |
| 27 | + |
| 28 | +For example, see the collisions of the ERC20 `transfer(address,uint256)` function under its selector `0xa9059cbb` on our 4byte.sourcify.dev page: https://4byte.sourcify.dev/?q=0xa9059cbb |
| 29 | + |
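To make the selector concrete: it is the first 4 bytes of the Keccak-256 hash of the canonical signature string. The sketch below is a minimal pure-Python Keccak-256, written only for illustration (Python's built-in `hashlib.sha3_256` uses different padding and would give different results). The helper names `keccak256` and `selector` are made up for this example; in practice you would use a maintained library.

```python
MASK64 = (1 << 64) - 1
RATE_BYTES = 136  # Keccak-256 absorbs 1088 bits (136 bytes) per block


def _rol64(value: int, shift: int) -> int:
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def _keccak_f1600(lanes: list[list[int]]) -> list[list[int]]:
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 lane state."""
    lfsr = 1
    for _ in range(24):
        # theta: mix each lane with the parity of two neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to its new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: XOR a round constant (generated by an LFSR) into lane (0, 0)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (original 0x01 padding, not SHA-3's 0x06)."""
    padded = bytearray(data)
    pad_len = RATE_BYTES - (len(data) % RATE_BYTES)
    padded += b"\x01" + b"\x00" * (pad_len - 1)
    padded[-1] |= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), RATE_BYTES):
        block = padded[offset:offset + RATE_BYTES]
        for i in range(RATE_BYTES // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = _keccak_f1600(lanes)
    # The 32-byte digest is the first four lanes of the state
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


def selector(signature: str) -> str:
    """Return the 4-byte selector of a canonical signature as a 0x-prefixed hex string."""
    return "0x" + keccak256(signature.encode("ascii")).hex()[:8]


for sig in ("transfer(address,uint256)", "approve(address,uint256)", "balanceOf(address)"):
    print(selector(sig), sig)
```

Since only 32 bits of the hash are kept, a spammer can brute-force name suffixes until the result matches a target selector, which is why names like `fakeTransfer_4570999670(bytes)` show up under `0xa9059cbb`.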
| 30 | +<a href="https://4byte.sourcify.dev/?q=0xa9059cbb" target="_blank" rel="noopener" style={{ display: "flex", justifyContent: "center" }}> |
| 31 | + <img src={FourbyteSS} alt="4byte.sourcify.dev transfer(address,uint256) collisions" /> |
| 32 | +</a> |
| 33 | + |
| 34 | +Seeing this and having the data, I wanted to do a quick analysis of the selector collisions and see how many are legit vs. deliberately generated (spam). |
| 35 | + |
| 36 | +## Analysis |
| 37 | + |
| 38 | +I ran a simple query to find the signatures that share the same 4byte selector: |
| 39 | + |
| 40 | +<details> |
| 41 | + <summary>Query</summary> |
| 42 | +```sql |
| 43 | +SELECT |
| 44 | + concat('0x', encode(signature_hash_4, 'hex')) AS signature_hash_4, |
| 45 | + COUNT(*) AS num_signatures, |
| 46 | + ARRAY_AGG(signature ORDER BY signature) AS signatures |
| 47 | +FROM public.signatures |
| 48 | +GROUP BY signature_hash_4 |
| 49 | +HAVING COUNT(*) > 1 |
| 50 | +ORDER BY num_signatures DESC; |
| 51 | +``` |
| 52 | +</details> |
| 53 | + |
| 54 | +In the end, we find **2789** 4byte selectors that have more than one signature. Here are the top 5 with the most collisions: |
| 55 | + |
| 56 | +[collisions.csv](./collisions.csv) |
| 57 | + |
| 58 | +[collisions.json](./collisions.json) |
| 59 | + |
| 60 | +| signature_hash_4 | num_signatures | signatures | |
| 61 | +|------------------|----------------|------------| |
| 62 | +| [0x00000000](https://4byte.sourcify.dev/?q=0x00000000) | 61 | `AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256))`<br />`abcei51243fdgjkh(bytes)`<br />`adfepixw()`<br />... | |
| 63 | +| [0x00000001](https://4byte.sourcify.dev/?q=0x00000001) | 15 | `account_info_rotate_tine(uint256)`<br />`exec_606BaXt(bytes[])`<br />`f00000001_bdmvamqo()`<br />... | |
| 64 | +| [0xa9059cbb](https://4byte.sourcify.dev/?q=0xa9059cbb) | 8 | `_____$_$__$___$$$___$$___$__$$(address,uint256)`<br />`fakeTransfer_4570999670(bytes)`<br />`func_2093253501(bytes)`<br />... | |
| 65 | +| [0x095ea7b3](https://4byte.sourcify.dev/?q=0x095ea7b3) | 8 | `__$$$$___$$_$_$$__$_$_$$__$$$$(address,uint256)`<br />`approve(address,uint256)`<br />`as9q06we_7x8z(uint256,address[],address[],uint256)`<br />... | |
| 66 | +| [0x70a08231](https://4byte.sourcify.dev/?q=0x70a08231) | 7 | `$_$$$_$$$$$_$_$____$$$$_$$_$__(address)`<br />`balanceOf(address)`<br />`branch_passphrase_public(uint256,bytes8)`<br />... | |
| 67 | + |
| 68 | +Looking at the top collisions, it might look like a lot. Still, the spamming doesn't seem excessive: spammers generally find a single funny signature and call it a day. Out of the 2789 selectors, 2740 have only 2 signatures (i.e. a single collision) and only 49 have more than 2 signatures. |
| 69 | + |
| 70 | +The interesting question is, how many of these collisions are actually unintended collisions vs. how many are deliberately generated (spam)? |
| 71 | + |
| 72 | +Looking at them one by one would take some time. First, I want to see only the collisions between signatures that appear in a verified contract. I.e., if `f00000001_bdmvamqo()` is not seen in a verified contract, let's assume it's spam. |
| 73 | + |
| 74 | +<details> |
| 75 | + <summary>Query</summary> |
| 76 | +```sql |
| 77 | +SELECT |
| 78 | + concat('0x', encode(s.signature_hash_4, 'hex')) AS signature_hash_4, |
| 79 | + COUNT(*) AS num_signatures, |
| 80 | + ARRAY_AGG(DISTINCT s.signature ORDER BY s.signature) AS signatures |
| 81 | +FROM public.signatures s |
| 82 | +WHERE EXISTS ( |
| 83 | + SELECT 1 |
| 84 | + FROM public.compiled_contracts_signatures ccs |
| 85 | + WHERE ccs.signature_hash_32 = s.signature_hash_32 |
| 86 | +) |
| 87 | +GROUP BY s.signature_hash_4 |
| 88 | +HAVING COUNT(*) > 1 |
| 89 | +ORDER BY num_signatures DESC; |
| 90 | +``` |
| 91 | +</details> |
| 92 | + |
| 93 | +Now we're left with 1023 "verified" collisions: |
| 94 | + |
| 95 | +[collisions_verified.csv](./collisions_verified.csv) |
| 96 | + |
| 97 | +[collisions_verified.json](./collisions_verified.json) |
| 98 | + |
| 99 | +| signature_hash_4 | num_signatures | signatures | |
| 100 | +|------------------|----------------|------------| |
| 101 | +| [0x00000000](https://4byte.sourcify.dev/?q=0x00000000) | 28 | `AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256))`<br />`arb_wcnwzblucpyf()`<br />`batchLock_63efZf()`<br />`buyAndFree22457070633(uint256)`<br />`call_g0oyU7o(address,uint256,bytes32,bytes)`<br />... | |
| 102 | +| ... | ... | *(4 rows with 4-5 signatures skipped)* | |
| 103 | +| [0x415565b0](https://4byte.sourcify.dev/?q=0x415565b0) | 3 | `JunionYoutubeXD_clgqmmkfvuba()`<br />`Sub2JunionOnYouTube_wuatcyecupza()`<br />`transformERC20(address,address,uint256,uint256,(uint32,bytes)[])` | |
| 104 | +| [0x00000002](https://4byte.sourcify.dev/?q=0x00000002) | 3 | `callWithPlaceholders4845164670(address,uint256,bytes32,bytes,(address,bytes,uint64,uint64,uint64)[])`<br />`wipeBlockchain_EkJWPe()`<br />`yoov6(address,address,uint256)` | |
| 105 | +| [0x6c5b47d2](https://4byte.sourcify.dev/?q=0x6c5b47d2) | 3 | `addDegree(uint256,string)`<br />`isBlacklisted5(address)`<br />`RenounceFungibleOwnership()` | |
| 106 | +| [0x9aa7c0e5](https://4byte.sourcify.dev/?q=0x9aa7c0e5) | 3 | `gain_network883718828((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address)`<br />`openTrade((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address)`<br />`TigrisTrade(int8,int56,uint80,bytes15,int88,int16)` | |
| 107 | +| [0x014ed8d2](https://4byte.sourcify.dev/?q=0x014ed8d2) | 2 | `CannotChangePaymentToken()`<br />`ModelRegistered(uint256,address,string,uint256)` | |
| 108 | +| [0x0161a64a](https://4byte.sourcify.dev/?q=0x0161a64a) | 2 | `cleanupExpiredListing(uint256)`<br />`MissingRole(address,bytes32)` | |
| 109 | +| [0x0182a6da](https://4byte.sourcify.dev/?q=0x0182a6da) | 2 | `initiateWalletTransfer(address)`<br />`withdrawStakingAmount(uint256)` | |
| 110 | +| [0x01a754a3](https://4byte.sourcify.dev/?q=0x01a754a3) | 2 | `AutoSwap()`<br />`updateTeamFeeContract(address)` | |
| 111 | +| [0x025313a2](https://4byte.sourcify.dev/?q=0x025313a2) | 2 | `getACLRole5999294130779334338()`<br />`proxyOwner()` | |
| 112 | + |
| 113 | +Now it starts to get interesting. Again, the selectors with many collisions are mostly spam. But among the selectors with 3 or fewer signatures, we see some legitimate collisions. |
| 114 | + |
| 115 | +For example, for [`0x01a754a3`](https://4byte.sourcify.dev/?q=0x01a754a3) we have `AutoSwap()` and `updateTeamFeeContract(address)`. It's really difficult to tell whether this is spam or not. |
| 116 | + |
| 117 | +But for the last row, [`0x025313a2`](https://4byte.sourcify.dev/?q=0x025313a2), we have `getACLRole5999294130779334338()` and `proxyOwner()`. Here the former is clearly spam and the latter is not. |
| 118 | + |
| 119 | +Next, we can actually ask an LLM to filter out the ones that look like spam! Since the dataset is small, I shoved all of it into Claude and asked it to do exactly that. In the end, it gave me a list of **648** collisions that it thinks are legitimate. I peeked at the list and it seems to be mostly accurate: |
| 120 | + |
| 121 | +[legitimate_collisions.csv](./legitimate_collisions.csv) |
| 122 | + |
| 123 | +Here are 10 interesting examples of legitimate unintended collisions: |
| 124 | + |
| 125 | +| signature_hash_4 | num_signatures | signatures | |
| 126 | +|------------------|----------------|------------| |
| 127 | +| [0x04d742dc](https://4byte.sourcify.dev/?q=0x04d742dc) | 2 | `adminResetRank()`<br />`startSale(uint256,uint256,uint256)` | |
| 128 | +| [0x0536f755](https://4byte.sourcify.dev/?q=0x0536f755) | 2 | `FreeMintTokenSent(address,uint256)`<br />`NFTReward(address)` | |
| 129 | +| [0x092338cc](https://4byte.sourcify.dev/?q=0x092338cc) | 2 | `maxPurchasableInOneTx()`<br />`usdcGHSTOracle()` | |
| 130 | +| [0x17915e8d](https://4byte.sourcify.dev/?q=0x17915e8d) | 2 | `getCluster(address)`<br />`getTotalFeeBps()` | |
| 131 | +| [0x2025e52c](https://4byte.sourcify.dev/?q=0x2025e52c) | 2 | `createSaleTokensVault()`<br />`mintWithERC721(uint256)` | |
| 132 | +| [0x22061379](https://4byte.sourcify.dev/?q=0x22061379) | 2 | `getBaseStakeAmountForPlay()`<br />`vaultFees(uint256)` | |
| 133 | +| [0x55fcd027](https://4byte.sourcify.dev/?q=0x55fcd027) | 2 | `DepositAmountTooLow()`<br />`masterLogicAddress()` | |
| 134 | +| [0x667022fd](https://4byte.sourcify.dev/?q=0x667022fd) | 2 | `bought(address)`<br />`iceCreamVan()` | |
| 135 | +| [0x67bf975c](https://4byte.sourcify.dev/?q=0x67bf975c) | 2 | `NotAllowedToRecover()`<br />`RewardThresholdReached(uint256)` | |
| 136 | +| [0x706b8722](https://4byte.sourcify.dev/?q=0x706b8722) | 2 | `pauseAtId()`<br />`USDTBorrowed(address,uint256)` | |
| 137 | + |
| 138 | + |
| 139 | +These are all legitimate functions and events from different smart contracts that happen to share the same 4-byte selector purely by chance. This demonstrates that while 4-byte collisions are rare, they do happen naturally in the wild! |
| 140 | + |
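As a rough sanity check on "rare, but they do happen", the birthday problem gives the expected number of colliding pairs if selectors were uniformly random 32-bit values. This is only a back-of-the-envelope sketch: the inputs are the rounded totals quoted above (4.7 million signatures, about 2.8 million of them seen in verified contracts), the function name is made up, and it counts colliding pairs rather than selectors with more than one signature, so it is not directly comparable to the query results.

```python
def expected_colliding_pairs(num_signatures: int, selector_bits: int = 32) -> float:
    """Expected number of signature pairs sharing a selector, assuming
    selectors are independent and uniform over 2**selector_bits values."""
    pairs = num_signatures * (num_signatures - 1) / 2
    return pairs / 2 ** selector_bits


# Rounded figures from the stats quoted earlier in the post (assumptions, not exact counts)
all_signatures = 4_700_000
verified_signatures = 4_700_000 - 1_900_000

print(f"all signatures:      ~{expected_colliding_pairs(all_signatures):.0f} pairs")
print(f"verified signatures: ~{expected_colliding_pairs(verified_signatures):.0f} pairs")
```

Under these assumptions the estimate comes out at roughly 2,570 pairs for the full dataset and roughly 910 for the verified subset, so chance alone can plausibly produce collisions on the order of the counts found here. It says nothing about which individual collisions are spam.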
| 141 | +At this stage, I just ran this for fun. For now, we only have a list of popular signatures for which we know the "correct" signature ([canonical-signatures.json](https://github.com/argotorg/sourcify/blob/staging/services/4byte/src/utils/canonical-signatures.json)). We filter out the non-canonical ones by default, and the API response has a `filtered` field (filtering can be turned off with `filter=false`, [as in this request](https://api.4byte.sourcify.dev/signature-database/v1/lookup?function=0xa9059cbb&filter=false)). But if the community thinks this is useful, we can do a more thorough analysis and filter out the spam via LLMs. |
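For reference, here is a small sketch of querying the lookup endpoint with and without filtering. The base URL and the `function` and `filter=false` query parameters come from the link above; passing `filter=true` explicitly is an assumption (filtering is the default anyway), the helper names are made up, and the response is returned as raw JSON because its exact shape is not described in this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LOOKUP_URL = "https://api.4byte.sourcify.dev/signature-database/v1/lookup"


def build_lookup_url(selector: str, filter_spam: bool = True) -> str:
    """Build the lookup URL for a function selector.

    `filter=false` is taken from the post; sending `filter=true`
    explicitly is an assumption about the API.
    """
    params = {"function": selector, "filter": "true" if filter_spam else "false"}
    return f"{LOOKUP_URL}?{urlencode(params)}"


def lookup(selector: str, filter_spam: bool = True) -> dict:
    """Fetch the lookup result and return the parsed JSON as-is."""
    with urlopen(build_lookup_url(selector, filter_spam), timeout=10) as response:
        return json.load(response)


# Example (performs a network request, so it is left commented out):
# print(json.dumps(lookup("0xa9059cbb", filter_spam=False), indent=2))
```

Comparing the filtered and unfiltered responses for `0xa9059cbb` is a quick way to see which signatures the canonical list hides.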