Commit 6be6f06

Publish 4byte spam analysis post
1 parent 5537d48 commit 6be6f06

8 files changed

+23883
-0
lines changed

blog/signatures-analysis/collisions.csv

Lines changed: 2790 additions & 0 deletions
Large diffs are not rendered by default.

blog/signatures-analysis/collisions.json

Lines changed: 13948 additions & 0 deletions
Large diffs are not rendered by default.

blog/signatures-analysis/collisions_verified.csv

Lines changed: 1024 additions & 0 deletions
Large diffs are not rendered by default.

blog/signatures-analysis/collisions_verified.json

Lines changed: 5118 additions & 0 deletions
Large diffs are not rendered by default.

blog/signatures-analysis/index.mdx

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
1+
---
2+
title: Spam or Legit? Analyzing 4byte Selector Collisions
3+
authors:
4+
- name: Kaan Uzdogan
5+
url: https://github.com/kuzdogan
6+
image_url: https://github.com/kuzdogan.png
7+
tags: [Sourcify, 4byte, Signatures]
8+
date: 2025-10-30
9+
---
10+
import FourbyteSS from "./screenshot.png"
11+
import samczsunTweet from "./samczsun-tweet.png"
12+
13+
Sourcify recently took over openchain.xyz's 4byte signature APIs, as well as the domain itself, both previously maintained by [@samczsun](https://x.com/samczsun). Having built our own signature database along the way, I wanted to run a quick analysis of the selector collisions and see how many are legit vs. deliberately generated (spam).
14+
15+
<a href="https://x.com/samczsun/status/1980718830260490447" target="_blank" rel="noopener noreferrer" style={{ display: "flex", justifyContent: "center" }}>
16+
<img src={samczsunTweet} alt="samczsun's tweet" width="500" />
17+
</a>
18+
19+
We built the dataset and a service to serve the data. The dataset contains:
20+
1. Data from openchain's dataset
21+
2. Data from [4byte.directory](https://4byte.directory)
22+
3. Signatures from verified contracts in Sourcify.
23+
24+
You can see the database schema in the [database docs](/docs/repository/sourcify-database/#schema) and the service in [services/4byte](https://github.com/argotorg/sourcify/tree/staging/services/4byte) in the repo. As of now we have 4.7 million signatures: 1.9 million of them do not come from verified contracts, and the remaining majority appear in at least one verified contract ([stats](https://api.4byte.sourcify.dev/signature-database/v1/stats)).
25+
26+
While it's possible to submit signatures to the database via the `/import` endpoint, we also add signatures automatically when a contract is verified. 4byte databases are known to be spam-prone: function selectors are only 4 bytes long, so it's trivial to find a collision with an otherwise legit signature.
27+
28+
For example, see the collisions for the ERC20 `transfer(address,uint256)` function under its selector `0xa9059cbb` on our 4byte.sourcify.dev page: https://4byte.sourcify.dev/?q=0xa9059cbb
29+
30+
<a href="https://4byte.sourcify.dev/?q=0xa9059cbb" target="_blank" rel="noopener" style={{ display: "flex", justifyContent: "center" }}>
31+
<img src={FourbyteSS} alt="4byte.sourcify.dev transfer(address,uint256) collisions" />
32+
</a>
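To make the selector derivation concrete: a selector is the first 4 bytes of the Keccak-256 hash of the canonical signature string. Below is a minimal, dependency-free sketch of that computation. It is not the code Sourcify runs, the helper names (`keccak256`, `selector`) are made up for this example, and in practice you would reach for a maintained Keccak library instead. Note that Python's built-in `hashlib.sha3_256` uses different padding and gives different results.

```python
# Keccak-256 as used by the EVM (original Keccak padding 0x01, not SHA-3's 0x06).
MASK = (1 << 64) - 1
RATE = 136  # bytes absorbed per permutation for a 256-bit digest

ROUND_CONSTANTS = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A,
    0x8000000080008000, 0x000000000000808B, 0x0000000080000001,
    0x8000000080008081, 0x8000000000008009, 0x000000000000008A,
    0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089,
    0x8000000000008003, 0x8000000000008002, 0x8000000000000080,
    0x000000000000800A, 0x800000008000000A, 0x8000000080008081,
    0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]
# Rotation offsets indexed as ROTATIONS[x][y].
ROTATIONS = [
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]


def _rol(value: int, shift: int) -> int:
    """Rotate a 64-bit lane left by `shift` bits."""
    return ((value << shift) | (value >> (64 - shift))) & MASK


def _keccak_f(state):
    """Apply the 24-round Keccak-f[1600] permutation to a 5x5 lane state."""
    for round_constant in ROUND_CONSTANTS:
        # theta: mix each column's parity into its neighbours
        c = [state[x][0] ^ state[x][1] ^ state[x][2] ^ state[x][3] ^ state[x][4] for x in range(5)]
        d = [c[(x - 1) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        state = [[state[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho + pi: rotate each lane and move it to its new position
        b = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                b[y][(2 * x + 3 * y) % 5] = _rol(state[x][y], ROTATIONS[x][y])
        # chi: non-linear step along each row
        state = [[b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y]) for y in range(5)] for x in range(5)]
        # iota: break symmetry with the round constant
        state[0][0] ^= round_constant
    return state


def keccak256(data: bytes) -> bytes:
    padded = bytearray(data)
    padded.append(0x01)
    while len(padded) % RATE:
        padded.append(0x00)
    padded[-1] |= 0x80
    state = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), RATE):
        block = padded[offset:offset + RATE]
        for i in range(RATE // 8):
            state[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        state = _keccak_f(state)
    # The 32-byte digest is the first four lanes, little-endian.
    return b"".join(state[i][0].to_bytes(8, "little") for i in range(4))


def selector(signature: str) -> str:
    """First 4 bytes of keccak256(signature), hex-encoded with a 0x prefix."""
    return "0x" + keccak256(signature.encode("ascii"))[:4].hex()


# The post lists 0xa9059cbb as the selector for this signature.
print(selector("transfer(address,uint256)"))
```

With only 2^32 possible selectors, anyone can brute-force arbitrary names until one hashes to a chosen selector, which is why spam is so cheap to produce.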
33+
34+
Seeing this, and having the data at hand, I wanted to do a quick analysis of the selector collisions and see how many are legit vs. deliberately generated (spam).
35+
36+
## Analysis
37+
38+
First, a simple query to find the signatures that share the same 4byte selector:
39+
40+
<details>
41+
<summary>Query</summary>
42+
```sql
SELECT
  concat('0x', encode(signature_hash_4, 'hex')) AS signature_hash_4,
  COUNT(*) AS num_signatures,
  ARRAY_AGG(signature ORDER BY signature) AS signatures
FROM public.signatures
GROUP BY signature_hash_4
HAVING COUNT(*) > 1
ORDER BY num_signatures DESC;
```
52+
</details>
53+
54+
In the end we find **2789** 4byte selectors that have more than one signature. The full results are in the files below, followed by the top 5 selectors with the most collisions:
55+
56+
[collisions.csv](./collisions.csv)
57+
58+
[collisions.json](./collisions.json)
59+
60+
| signature_hash_4 | num_signatures | signatures |
61+
|------------------|----------------|------------|
62+
| [0x00000000](https://4byte.sourcify.dev/?q=0x00000000) | 61 | `AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256))`<br />`abcei51243fdgjkh(bytes)`<br />`adfepixw()`<br />... |
63+
| [0x00000001](https://4byte.sourcify.dev/?q=0x00000001) | 15 | `account_info_rotate_tine(uint256)`<br />`exec_606BaXt(bytes[])`<br />`f00000001_bdmvamqo()`<br />... |
64+
| [0xa9059cbb](https://4byte.sourcify.dev/?q=0xa9059cbb) | 8 | `_____$_$__$___$$$___$$___$__$$(address,uint256)`<br />`fakeTransfer_4570999670(bytes)`<br />`func_2093253501(bytes)`<br />... |
65+
| [0x095ea7b3](https://4byte.sourcify.dev/?q=0x095ea7b3) | 8 | `__$$$$___$$_$_$$__$_$_$$__$$$$(address,uint256)`<br />`approve(address,uint256)`<br />`as9q06we_7x8z(uint256,address[],address[],uint256)`<br />... |
66+
| [0x70a08231](https://4byte.sourcify.dev/?q=0x70a08231) | 7 | `$_$$$_$$$$$_$_$____$$$$_$$_$__(address)`<br />`balanceOf(address)`<br />`branch_passphrase_public(uint256,bytes8)`<br />... |
67+
68+
At first glance the top collisions might look like a lot. Still, the spamming doesn't seem excessive: spammers generally find a single funny signature and call it a day. Out of the 2789 selectors, 2740 have only 2 signatures (i.e. a single collision) and only 49 have more than 2.
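A breakdown like this can be recomputed from the exported CSV. The sketch below is only an illustration: it assumes the file has a `num_signatures` column named after the query's output column, and the helper name `collision_size_histogram` is made up for this example.

```python
import csv
import io
from collections import Counter


def collision_size_histogram(csv_text: str) -> Counter:
    """Count how many selectors have 2, 3, ... signatures.

    Assumes a header row with a `num_signatures` column (an assumption
    based on the query output, not a documented file format).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(int(row["num_signatures"]) for row in reader)


# Tiny sample in the assumed layout, using the counts from the top-5 table.
sample = """signature_hash_4,num_signatures
0x00000000,61
0x00000001,15
0xa9059cbb,8
0x095ea7b3,8
0x70a08231,7
"""

histogram = collision_size_histogram(sample)
more_than_two = sum(count for size, count in histogram.items() if size > 2)
print(histogram)
print(more_than_two)
```

To run it on the real export you would pass the contents of `collisions.csv` instead of `sample`, provided the column name matches.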
69+
70+
The interesting question is: how many of these are genuinely unintended collisions, and how many are deliberately generated (spam)?
71+
72+
Looking at them one by one would take some time. First, I want to see only the collisions among signatures that appear in a verified contract. I.e. if `f00000001_bdmvamqo()` is not seen in a verified contract, let's assume it's spam.
73+
74+
<details>
75+
<summary>Query</summary>
76+
```sql
SELECT
  concat('0x', encode(s.signature_hash_4, 'hex')) AS signature_hash_4,
  COUNT(*) AS num_signatures,
  ARRAY_AGG(DISTINCT s.signature ORDER BY s.signature) AS signatures
FROM public.signatures s
WHERE EXISTS (
  SELECT 1
  FROM public.compiled_contracts_signatures ccs
  WHERE ccs.signature_hash_32 = s.signature_hash_32
)
GROUP BY s.signature_hash_4
HAVING COUNT(*) > 1
ORDER BY num_signatures DESC;
```
91+
</details>
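The two queries differ only in the `EXISTS` filter. For readers who think more easily in code than in SQL, here is a rough in-memory equivalent. It is a sketch on a handful of rows, not how the analysis was run: the function name `find_collisions` is hypothetical, and the "verified" flags in the sample are made up for illustration (only the `0x025313a2` pair is taken from the verified table in this post).

```python
from collections import defaultdict


def find_collisions(rows, verified=None):
    """Group signatures by selector and keep selectors with more than one.

    rows: iterable of (selector, signature) pairs.
    verified: optional set of signatures seen in verified contracts;
              when given, other signatures are skipped (the WHERE EXISTS part).
    """
    groups = defaultdict(set)
    for selector, signature in rows:
        if verified is not None and signature not in verified:
            continue
        groups[selector].add(signature)
    # HAVING COUNT(*) > 1
    collisions = {sel: sorted(sigs) for sel, sigs in groups.items() if len(sigs) > 1}
    # ORDER BY num_signatures DESC
    return dict(sorted(collisions.items(), key=lambda item: len(item[1]), reverse=True))


rows = [
    ("0xa9059cbb", "transfer(address,uint256)"),
    ("0xa9059cbb", "fakeTransfer_4570999670(bytes)"),
    ("0xa9059cbb", "func_2093253501(bytes)"),
    ("0x025313a2", "proxyOwner()"),
    ("0x025313a2", "getACLRole5999294130779334338()"),
    ("0x18160ddd", "totalSupply()"),  # no collision in this sample
]
# Illustrative flags only: which signatures we pretend were seen in verified contracts.
verified = {
    "transfer(address,uint256)",
    "proxyOwner()",
    "getACLRole5999294130779334338()",
    "totalSupply()",
}

all_collisions = find_collisions(rows)
verified_collisions = find_collisions(rows, verified)
print(all_collisions)
print(verified_collisions)
```

In this toy sample the `0xa9059cbb` group disappears once the filter is applied, because only one of its signatures is flagged as verified.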
92+
93+
Now we're left with 1023 "verified" collisions:
94+
95+
[collisions_verified.csv](./collisions_verified.csv)
96+
97+
[collisions_verified.json](./collisions_verified.json)
98+
99+
| signature_hash_4 | num_signatures | signatures |
100+
|------------------|----------------|------------|
101+
| [0x00000000](https://4byte.sourcify.dev/?q=0x00000000) | 28 | `AaANwg8((address,address,address,uint136,uint40,uint40,uint24,uint8,uint256,bytes32,bytes32,uint256))`<br />`arb_wcnwzblucpyf()`<br />`batchLock_63efZf()`<br />`buyAndFree22457070633(uint256)`<br />`call_g0oyU7o(address,uint256,bytes32,bytes)`<br />... |
102+
| ... | ... | *(4 rows with 4-5 signatures skipped)* |
103+
| [0x415565b0](https://4byte.sourcify.dev/?q=0x415565b0) | 3 | `JunionYoutubeXD_clgqmmkfvuba()`<br />`Sub2JunionOnYouTube_wuatcyecupza()`<br />`transformERC20(address,address,uint256,uint256,(uint32,bytes)[])` |
104+
| [0x00000002](https://4byte.sourcify.dev/?q=0x00000002) | 3 | `callWithPlaceholders4845164670(address,uint256,bytes32,bytes,(address,bytes,uint64,uint64,uint64)[])`<br />`wipeBlockchain_EkJWPe()`<br />`yoov6(address,address,uint256)` |
105+
| [0x6c5b47d2](https://4byte.sourcify.dev/?q=0x6c5b47d2) | 3 | `addDegree(uint256,string)`<br />`isBlacklisted5(address)`<br />`RenounceFungibleOwnership()` |
106+
| [0x9aa7c0e5](https://4byte.sourcify.dev/?q=0x9aa7c0e5) | 3 | `gain_network883718828((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address)`<br />`openTrade((address,uint256,uint256,uint256,uint256,uint256,bool,uint256,uint256,uint256),uint8,uint256,uint256,address)`<br />`TigrisTrade(int8,int56,uint80,bytes15,int88,int16)` |
107+
| [0x014ed8d2](https://4byte.sourcify.dev/?q=0x014ed8d2) | 2 | `CannotChangePaymentToken()`<br />`ModelRegistered(uint256,address,string,uint256)` |
108+
| [0x0161a64a](https://4byte.sourcify.dev/?q=0x0161a64a) | 2 | `cleanupExpiredListing(uint256)`<br />`MissingRole(address,bytes32)` |
109+
| [0x0182a6da](https://4byte.sourcify.dev/?q=0x0182a6da) | 2 | `initiateWalletTransfer(address)`<br />`withdrawStakingAmount(uint256)` |
110+
| [0x01a754a3](https://4byte.sourcify.dev/?q=0x01a754a3) | 2 | `AutoSwap()`<br />`updateTeamFeeContract(address)` |
111+
| [0x025313a2](https://4byte.sourcify.dev/?q=0x025313a2) | 2 | `getACLRole5999294130779334338()`<br />`proxyOwner()` |
112+
113+
Now it starts to get interesting. Again, the selectors with many collisions are mostly spam. But among the selectors with 3 or fewer signatures we find some legitimate collisions.
114+
115+
For example, for [`0x01a754a3`](https://4byte.sourcify.dev/?q=0x01a754a3) we have `AutoSwap()` and `updateTeamFeeContract(address)`. It's really difficult to tell whether this is spam or not.
116+
117+
But for the last row, [`0x025313a2`](https://4byte.sourcify.dev/?q=0x025313a2), we have `getACLRole5999294130779334338()` and `proxyOwner()`. Here the former is clearly spam and the latter is not.
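What makes a name like that "clearly spam" can be partly captured by a crude rule. The sketch below is not the method used in this post; it is a hypothetical name-based heuristic (`looks_generated`, with arbitrarily chosen thresholds) that flags long digit runs and names made mostly of `_` and `$`. It will miss plenty of spam, for example `wipeBlockchain_EkJWPe()`.

```python
import re

# Eight or more consecutive digits in a function name is rarely hand-written.
LONG_DIGIT_RUN = re.compile(r"\d{8,}")


def looks_generated(signature: str) -> bool:
    """Rough guess whether a signature's name was brute-forced.

    Thresholds are arbitrary assumptions for illustration only.
    """
    name = signature.split("(", 1)[0]
    if LONG_DIGIT_RUN.search(name):
        return True
    filler = sum(1 for ch in name if ch in "_$")
    return len(name) > 0 and filler / len(name) > 0.5


for sig in [
    "getACLRole5999294130779334338()",
    "proxyOwner()",
    "_____$_$__$___$$$___$$___$__$$(address,uint256)",
    "isBlacklisted5(address)",
    "wipeBlockchain_EkJWPe()",
]:
    print(sig, looks_generated(sig))
```

A rule like this only catches the most mechanical patterns and cannot settle ambiguous pairs such as `AutoSwap()` vs. `updateTeamFeeContract(address)`.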
118+
119+
Next, we can actually ask an LLM to filter out the ones that look like spam! Since the dataset is not that large, I shoved all of it into Claude and asked it to do exactly that. In the end it gave me a list of **648** collisions that it thinks are legitimate. I peeked at the list and it seems to be mostly accurate:
120+
121+
[legitimate_collisions.csv](./legitimate_collisions.csv)
122+
123+
Here are 10 interesting examples of legitimate unintended collisions:
124+
125+
| signature_hash_4 | num_signatures | signatures |
126+
|------------------|----------------|------------|
127+
| [0x04d742dc](https://4byte.sourcify.dev/?q=0x04d742dc) | 2 | `adminResetRank()`<br />`startSale(uint256,uint256,uint256)` |
128+
| [0x0536f755](https://4byte.sourcify.dev/?q=0x0536f755) | 2 | `FreeMintTokenSent(address,uint256)`<br />`NFTReward(address)` |
129+
| [0x092338cc](https://4byte.sourcify.dev/?q=0x092338cc) | 2 | `maxPurchasableInOneTx()`<br />`usdcGHSTOracle()` |
130+
| [0x17915e8d](https://4byte.sourcify.dev/?q=0x17915e8d) | 2 | `getCluster(address)`<br />`getTotalFeeBps()` |
131+
| [0x2025e52c](https://4byte.sourcify.dev/?q=0x2025e52c) | 2 | `createSaleTokensVault()`<br />`mintWithERC721(uint256)` |
132+
| [0x22061379](https://4byte.sourcify.dev/?q=0x22061379) | 2 | `getBaseStakeAmountForPlay()`<br />`vaultFees(uint256)` |
133+
| [0x55fcd027](https://4byte.sourcify.dev/?q=0x55fcd027) | 2 | `DepositAmountTooLow()`<br />`masterLogicAddress()` |
134+
| [0x667022fd](https://4byte.sourcify.dev/?q=0x667022fd) | 2 | `bought(address)`<br />`iceCreamVan()` |
135+
| [0x67bf975c](https://4byte.sourcify.dev/?q=0x67bf975c) | 2 | `NotAllowedToRecover()`<br />`RewardThresholdReached(uint256)` |
136+
| [0x706b8722](https://4byte.sourcify.dev/?q=0x706b8722) | 2 | `pauseAtId()`<br />`USDTBorrowed(address,uint256)` |
137+
138+
139+
These are all legitimate functions and events from different smart contracts that happen to share the same 4-byte selector purely by chance. This demonstrates that while 4-byte collisions are rare, they do happen naturally in the wild!
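A back-of-the-envelope birthday estimate shows why natural collisions are expected at this scale. The sketch below assumes selectors are uniformly distributed over the 2^32 possible values and uses the rounded totals quoted earlier (4.7 million signatures, 1.9 million of them not from verified contracts). It counts expected colliding pairs, which is not exactly the same measure as the number of colliding selectors the queries return.

```python
SELECTOR_SPACE = 2 ** 32  # number of distinct 4-byte selectors


def expected_colliding_pairs(num_signatures: int) -> float:
    """Expected number of signature pairs sharing a selector,
    assuming selectors are uniformly random (birthday approximation)."""
    pairs = num_signatures * (num_signatures - 1) / 2
    return pairs / SELECTOR_SPACE


total_signatures = 4_700_000  # rounded figure from the post
verified_signatures = total_signatures - 1_900_000  # rounded figure from the post

print(round(expected_colliding_pairs(total_signatures)))
print(round(expected_colliding_pairs(verified_signatures)))
```

Both estimates land in the same order of magnitude as the 2789 and 1023 colliding selectors found above. That does not tell us which individual collisions are spam, but it does suggest that a large share of them could arise purely by chance.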
140+
141+
At this stage I just ran this for fun. For now, we only keep a list of popular selectors for which we know the "correct" signature ([canonical-signatures.json](https://github.com/argotorg/sourcify/blob/staging/services/4byte/src/utils/canonical-signatures.json)). We filter out the non-canonical ones by default and expose a `filtered` field; filtering can be turned off, as in [this API response](https://api.4byte.sourcify.dev/signature-database/v1/lookup?function=0xa9059cbb&filter=false). But if the community thinks this is useful, we can do a more thorough analysis and filter out the spam via LLMs.
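If you want to compare filtered and unfiltered results yourself, the lookup URL linked above can be built programmatically. This is a minimal sketch: the endpoint and the `function` and `filter` query parameters are copied from that link, while the helper names are made up and the response is returned as parsed JSON without assuming anything about its shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LOOKUP_URL = "https://api.4byte.sourcify.dev/signature-database/v1/lookup"


def build_lookup_url(selector: str, filter_spam: bool = True) -> str:
    """Build the lookup URL for a function selector.

    Assumes that omitting `filter` leaves the default filtering on,
    as described in the post.
    """
    params = {"function": selector}
    if not filter_spam:
        params["filter"] = "false"
    return f"{LOOKUP_URL}?{urlencode(params)}"


def lookup(selector: str, filter_spam: bool = True):
    """Fetch the lookup result as parsed JSON (needs network access)."""
    with urlopen(build_lookup_url(selector, filter_spam), timeout=10) as response:
        return json.load(response)


print(build_lookup_url("0xa9059cbb"))
print(build_lookup_url("0xa9059cbb", filter_spam=False))
```

Calling `lookup("0xa9059cbb", filter_spam=False)` should then return the unfiltered entries, including the spam ones shown in the screenshot earlier.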
