Replies: 3 comments 1 reply
-
Implementation of this idea would directly enable adding back small GPUs.
-
Won't 24/7 100% utilization of GPUs lead to high power consumption, server overhead, performance issues, etc.?
-
Whilst having mathematical beauty, this solution has a serious adverse effect: high energy consumption. This drives away from the philosophy of devoting most of the compute power to inference. The primary problem is a scenario where a malicious actor can bypass the cPoC and PoC by loading the model fast enough and generating enough nonces within the generation window. Let's consider some alternatives.
In regards to overall network health (avoiding the abuse of serving only small models):
Note: this also has an adverse effect in that some validators might be excluded from the network due to absence of demand. However, if the model range exceeds the network capacity, that should not be an issue. It is also, in fact, a self-regulating mechanism once the block subsidy falls below inference processing revenue.
-
This proposal is closely related to the proposal for multi-model PoC and can be seen as its continuation.
Problem
If we want to serve small models, we need to do PoCs via small models as well (see the multi-model PoC proposal).
But this opens us to an attack vector where a malicious host can deploy a small model quickly enough to participate in both PoC and cPoC with new nodes that were not participating in users' inference. This attack vector was the reason behind switching to PoCv2 inside vLLM and selecting a large model (Qwen3-235B-FP8) for it.
Note: if someone still tries to cheat and doesn't have 100% of their compute available at all times, the network can in general catch such behaviour through an increasing miss rate; this works for large models. But to significantly utilize compute with deployed small models, the chain would have to serve an enormous number of user inferences, and the overhead of recording their metadata would become more significant, which is a parallel engineering challenge the chain is solving now. So, for now, it is practically impossible to catch malicious participants serving small models: because of the underutilization of compute, their miss rate will never be high enough. This can be solved by the proposal for inference scaling.
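The miss-rate argument above can be made concrete with a toy binomial model. A host that quietly diverts a fraction of its compute misses roughly that fraction of the checks it receives, so detection depends entirely on how many checks an epoch actually generates. The numbers and threshold below are illustrative assumptions, not protocol parameters:

```python
from math import comb

def detection_probability(challenges: int, miss_rate: float, threshold: int) -> float:
    """Probability that a cheater accumulates at least `threshold` observed
    misses over `challenges` independent checks (binomial tail)."""
    return sum(
        comb(challenges, k) * miss_rate**k * (1 - miss_rate) ** (challenges - k)
        for k in range(threshold, challenges + 1)
    )

# A busy large model generates many checks per epoch: detection is near-certain.
busy = detection_probability(challenges=200, miss_rate=0.3, threshold=20)

# An underutilized small model generates almost none: the same cheater
# rarely crosses any plausible threshold.
idle = detection_probability(challenges=5, miss_rate=0.3, threshold=3)
```

This is why the miss-rate mechanism degrades for small models: the statistic is sound, but the sample size collapses with utilization.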
Proposal
The current implementation of PoCv2 has a rather artificial limitation: the GPU executes only one type of computation at a time, either inference or nonce checking. De facto, it is the exact same computation internally, and the two can run fully in parallel if the GPU allows it (in the same batch, utilizing all of the engine's optimizations such as dynamic batching).
If we get rid of this limitation, there is no need to have different phases: POC and INFERENCE. PoC can work in parallel with inference and verify all the hardware that is not utilized by real requests at the moment (maintaining a utilization level that does not slow down user requests).
So, essentially, if the host has 100 GPUs but the chain has requests utilizing only 0.5 of them, the remaining 99.5 will be verified by the PoC procedure.
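The fill-the-idle-capacity idea can be sketched as a scheduler that packs each batch with real requests first and tops the remaining slots up with PoC nonce checks. All names here are hypothetical, not the actual vLLM/PoCv2 interface:

```python
import random
from dataclasses import dataclass, field

@dataclass
class BatchScheduler:
    """Toy scheduler: real inference requests take priority; whatever
    batch capacity they leave unused is spent on PoC nonce checks, so
    idle compute is continuously being verified."""
    batch_capacity: int
    pending_requests: list = field(default_factory=list)

    def next_batch(self) -> tuple[list, list]:
        # Real requests first, up to capacity.
        real = self.pending_requests[: self.batch_capacity]
        self.pending_requests = self.pending_requests[self.batch_capacity :]
        # Remaining slots become nonce-check work items.
        free_slots = self.batch_capacity - len(real)
        nonces = [random.getrandbits(64) for _ in range(free_slots)]
        return real, nonces

sched = BatchScheduler(batch_capacity=8, pending_requests=["req1", "req2"])
real, nonces = sched.next_batch()
```

With 2 pending requests and capacity 8, the batch carries 2 real requests and 6 nonce checks; under full user load the nonce list is simply empty, so verification yields entirely to inference.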
There is an open question of how to properly weight how many nonces should be compensated by a particular inference with different input/output lengths. It is quite important to make sure that the UX of real inferences does not change (inference itself is not slower).
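One possible shape for such a weighting is to convert an inference's token counts into "nonce equivalents" via calibrated cost factors. The weights and the tokens-per-nonce ratio below are placeholder assumptions; in practice they would come from benchmarking prefill/decode throughput against nonce throughput on the reference model:

```python
def nonce_equivalent(input_tokens: int, output_tokens: int,
                     prefill_weight: float = 1.0,
                     decode_weight: float = 4.0,
                     tokens_per_nonce: float = 256.0) -> float:
    """Number of PoC nonces one inference 'compensates'.
    Decode tokens are weighted heavier than prefill tokens because
    sequential decoding costs more per token than batched prefill
    (the exact ratio is an assumption to be calibrated)."""
    weighted_tokens = prefill_weight * input_tokens + decode_weight * output_tokens
    return weighted_tokens / tokens_per_nonce
```

For example, a request with 512 input and 128 output tokens would count as (512 + 4·128) / 256 = 4 nonces under these placeholder weights.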
Such continuous PoC would not introduce significant overhead, as only lightweight commits are recorded on-chain. The real artifacts are stored locally and requested directly. The validation itself does not have to cover 100% of the epoch's blocks; it can be randomized and trigger re-validation of suspicious fragments.
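The commit-then-spot-check flow might look like the sketch below: the host posts a single hash over its locally stored nonce results, and a validator later re-derives a random subset instead of replaying the whole epoch. `recompute` stands in for the actual PoC computation; everything here is an illustrative assumption, not the protocol's data format:

```python
import hashlib
import random

def commit(nonce_results: list[bytes]) -> str:
    """Lightweight on-chain commitment: one hash over the epoch's
    locally stored nonce results."""
    h = hashlib.sha256()
    for r in nonce_results:
        h.update(r)
    return h.hexdigest()

def spot_check(nonce_results: list[bytes], recompute, sample_size: int = 4) -> bool:
    """Validator-side randomized check: re-derive a random subset of the
    claimed results and compare, triggering deeper re-validation only on
    mismatch."""
    idx = random.sample(range(len(nonce_results)), k=min(sample_size, len(nonce_results)))
    return all(nonce_results[i] == recompute(i) for i in idx)
```

Because each check samples independently, a host that falsifies even a small fraction of results accumulates a detectable mismatch rate over repeated epochs, while honest hosts pay only the cost of the sampled recomputations.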
Implementation
[to be discussed]