[3/4] `StartInference` and `FinishInference`

# Background

`MsgStartInference` and `MsgFinishInference` are too slow in production. Nodes should process a block within 1-2 seconds so that block time stays below 6 seconds. To handle 1000 inferences in a block, we need to record 1000 `MsgStartInference`, 1000 `MsgFinishInference`, and 100-200 `MsgValidation` transactions, so each of these transactions should be processed in less than 1ms. Although they are quite fast in tests, in production with a large state they take 10-20ms, and on some nodes 50ms or more.
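The per-transaction budget follows from simple division. A minimal sketch of that arithmetic, assuming the whole 1-2 second window is spent on these three message types (the helper name `perTxBudget` is illustrative only):

```go
package main

import (
	"fmt"
	"time"
)

// perTxBudget returns the average time available per transaction when
// txCount transactions must be processed within blockBudget.
func perTxBudget(blockBudget time.Duration, txCount int) time.Duration {
	return blockBudget / time.Duration(txCount)
}

func main() {
	// 1000 MsgStartInference + 1000 MsgFinishInference + 100-200 MsgValidation.
	minTx := 1000 + 1000 + 100
	maxTx := 1000 + 1000 + 200

	// Most relaxed case: 2 seconds for the smaller transaction count.
	fmt.Println("relaxed budget:", perTxBudget(2*time.Second, minTx))
	// Tightest case: 1 second for the larger transaction count.
	fmt.Println("tight budget:  ", perTxBudget(1*time.Second, maxTx))
}
```

Both ends of the range come out below 1ms (roughly 0.45ms to 0.95ms per transaction), which is where the sub-millisecond target comes from.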

Profiling identified 2 main areas that account for most of the transaction time:
- Signature validation (57% of `FinishInference` and 63% of `StartInference`)
- Stats query and recording (40% of `FinishInference` and 30% of `StartInference`)

Download the profiling file:
https://drive.google.com/file/d/1yxY91lzMHxv_MeloAxW1zczcpbkBjZ0t/
and open it with:
```
go tool pprof -http=:8080 /Users/davidliberman/Downloads/pprof.inferenced.samples.cpu.001.pb.gz
```

Then choose the flame graph view to explore.

Screen recording: https://drive.google.com/file/d/1yxDaJllxCQ-l3ZO6ZuBb5bTEUgZ5t7Yu/view?usp=sharing

_**Signature validation**_ can be significantly optimized, reducing the number of signatures to be validated in most scenarios by 5x (from 5 signatures to just 1).

https://github.com/gonka-ai/gonka/issues/608 (now implemented by @DimaOrekhovPS)

https://github.com/gonka-ai/gonka/pull/779 

**_Stats query and recording_** is designed to make it easier to query usage statistics for inference operations by storing this data on chain. However, it is too heavy for on-chain operations and should be removed. In the end, we shouldn't read or write any large state records in `MsgStartInference`, `MsgFinishInference`, or `MsgValidation`.

`SetInference` (including the second time it is executed in `HandleInferenceComplete`) accounts for:
- 10% of `FinishInference`
- 12% of `StartInference`
- 4% of Validation

Within `SetInference`:
- 33% is logging
- 38% is `SetOrUpdateInferenceStatsByEpoch`
- 22% is `SetOrUpdateInferenceStatusByTime` w/o logging

`HandleInferenceComplete`, excluding `SetInference`, accounts for 16% of `FinishInference` and 4% of `StartInference` (as it is rare for `StartInference` to come second).
- 20% is logging
- 45% is 2x `GetEpochGroupData`
- 5% `GetEpochIndex`
- 10% `SetEpochGroupData`
- 20% `SetParticipant`/`GetParticipants` w/o logging

`ProcessInferencePayment`: 14% of `FinishInference` and 12% of `StartInference` 
- 63% is Logging
- 18% `SetParticipant`/2x`GetParticipant` w/o logging
- 9% Add/GetTokenomicsData
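
To compare these breakdowns on one scale, the nested percentages can be multiplied out. The sketch below estimates how much of `FinishInference` is spent on logging inside the three functions above. It assumes the three functions do not overlap in the profile and that each inner percentage is relative to its parent function; the type and function names are illustrative only.

```go
package main

import "fmt"

// component describes one profiled function: its share of the whole
// FinishInference handler and the share of its own time spent on logging.
type component struct {
	name           string
	shareOfHandler float64 // fraction of FinishInference time
	loggingShare   float64 // fraction of the component's own time
}

// finishInferenceComponents returns the figures quoted in the breakdown above.
func finishInferenceComponents() []component {
	return []component{
		{"SetInference", 0.10, 0.33},
		{"HandleInferenceComplete (excl. SetInference)", 0.16, 0.20},
		{"ProcessInferencePayment", 0.14, 0.63},
	}
}

// loggingShareOfHandler sums shareOfHandler*loggingShare over all components.
func loggingShareOfHandler(cs []component) float64 {
	total := 0.0
	for _, c := range cs {
		total += c.shareOfHandler * c.loggingShare
	}
	return total
}

func main() {
	share := loggingShareOfHandler(finishInferenceComponents())
	fmt.Printf("logging share of FinishInference: %.2f%%\n", 100*share)
}
```

Under those assumptions, logging alone is about 15% of `FinishInference` (3.3% + 3.2% + 8.82%), with most of it coming from `ProcessInferencePayment`.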

# Tasks:
In `HandleInferenceComplete`, we also call `GetEpochGroupData` to add `ExecutorReputation`, `ExecutorPower`, and `TotalPower` (of the model group) to `InferenceValidationDetails`, which is then saved for future validation. We also increment `NumberOfRequests` of the epoch group and save it. These operations should also be moved to the `EndBlocker`, so that `GetEpochGroup` (for the main group and for each required model) and `SetEpochGroup` are executed only once per block.
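
The "once per block" idea can be sketched as follows. This is a toy model, not the module's keeper: `store`, `epochGroup`, and `blockAccumulator` are hypothetical stand-ins, and a real implementation would have to keep the pending data in chain state rather than in process memory. It only illustrates that per-inference work during transactions can be reduced to counting, with one read and one write per model group at the end of the block.

```go
package main

import (
	"fmt"
	"sort"
)

// epochGroup is a hypothetical stand-in for the stored epoch group record.
type epochGroup struct {
	NumberOfRequests uint64
}

// store is a toy key-value store that counts reads and writes.
type store struct {
	groups map[string]epochGroup
	reads  int
	writes int
}

func newStore() *store { return &store{groups: map[string]epochGroup{}} }

func (s *store) get(model string) epochGroup {
	s.reads++
	return s.groups[model]
}

func (s *store) set(model string, g epochGroup) {
	s.writes++
	s.groups[model] = g
}

// blockAccumulator collects per-model request counts while transactions
// are processed. It never touches the store.
type blockAccumulator struct {
	pending map[string]uint64
}

func newBlockAccumulator() *blockAccumulator {
	return &blockAccumulator{pending: map[string]uint64{}}
}

func (a *blockAccumulator) recordInference(model string) { a.pending[model]++ }

// flush is meant to run once per block (from the EndBlocker). It does one
// read and one write per model, in sorted order so the result is deterministic.
func (a *blockAccumulator) flush(s *store) {
	models := make([]string, 0, len(a.pending))
	for m := range a.pending {
		models = append(models, m)
	}
	sort.Strings(models)
	for _, m := range models {
		g := s.get(m)
		g.NumberOfRequests += a.pending[m]
		s.set(m, g)
	}
	a.pending = map[string]uint64{}
}

func main() {
	s, acc := newStore(), newBlockAccumulator()
	for i := 0; i < 1000; i++ {
		acc.recordInference("model-a")
	}
	acc.flush(s)
	fmt.Println(s.reads, s.writes, s.groups["model-a"].NumberOfRequests)
}
```

With 1000 inferences for a single model, the store sees one read and one write instead of 1000 of each.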

We should add a Block+InferenceId key in `HandleInferenceComplete`, then iterate through these keys during `EndBlocker` to get the inferences by ID and store `InferenceValidationDetails`. The keys should be deleted in the `EndBlocker` immediately after the iteration.
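
One possible shape for such a key is an 8-byte big-endian block height followed by the inference ID, so that all keys of one block share a prefix. The sketch below is an assumption about the layout, not the project's actual key format, and uses an in-memory map (`pendingIndex`, a hypothetical name) in place of the chain's KV store.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
	"strings"
)

// pendingKey builds a key of the form <8-byte big-endian height><inferenceID>.
// Big-endian encoding keeps keys of the same block adjacent in sorted order.
func pendingKey(height uint64, inferenceID string) string {
	var h [8]byte
	binary.BigEndian.PutUint64(h[:], height)
	return string(h[:]) + inferenceID
}

// pendingIndex is a toy in-memory stand-in for a prefix store.
type pendingIndex map[string]struct{}

// add records that inferenceID completed in the block at the given height.
func (p pendingIndex) add(height uint64, inferenceID string) {
	p[pendingKey(height, inferenceID)] = struct{}{}
}

// drain returns the inference IDs recorded for height in sorted order and
// deletes their keys, mirroring "iterate, then clean up" in the EndBlocker.
func (p pendingIndex) drain(height uint64) []string {
	prefix := pendingKey(height, "")
	var ids []string
	for k := range p {
		if strings.HasPrefix(k, prefix) {
			ids = append(ids, strings.TrimPrefix(k, prefix))
			delete(p, k)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	idx := pendingIndex{}
	idx.add(10, "inf-b")
	idx.add(10, "inf-a")
	idx.add(11, "inf-c")

	fmt.Println(idx.drain(10)) // IDs recorded for block 10
	fmt.Println(len(idx))      // keys left for other blocks
}
```

A real KV store iterator already returns keys in order, so the explicit sort is only needed for the map used here.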

After moving those operations to the `EndBlocker`, we need to verify that the `EndBlocker` time does not increase significantly. It should take no more than 50-100ms for 1000 inferences on a mainnet node. This can be tested by adding only the read operations (the `GetInference` iterations) to the `EndBlocker` of a mainnet node, without the set operations, so that the state of the node stays the same.
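
The read-only measurement could look roughly like this. The `reader` interface, `mapReader`, and `dryRun` are hypothetical names for illustration; on a real node the reader would be the keeper's `GetInference`, and the map-backed reader here says nothing about mainnet timings.

```go
package main

import (
	"fmt"
	"time"
)

// inference is a minimal stand-in for the stored inference record.
type inference struct{ ID string }

// reader is the read-only behaviour the dry run needs.
type reader interface {
	GetInference(id string) (inference, bool)
}

// mapReader is an in-memory reader used only for this sketch.
type mapReader map[string]inference

func (m mapReader) GetInference(id string) (inference, bool) {
	v, ok := m[id]
	return v, ok
}

// makeIDs returns n synthetic inference IDs.
func makeIDs(n int) []string {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = fmt.Sprintf("inf-%d", i)
	}
	return ids
}

// newMapReader builds a reader that contains every ID in ids.
func newMapReader(ids []string) mapReader {
	m := mapReader{}
	for _, id := range ids {
		m[id] = inference{ID: id}
	}
	return m
}

// dryRun reads every inference in ids without writing anything and
// reports how many were found and how long the reads took.
func dryRun(r reader, ids []string) (found int, elapsed time.Duration) {
	start := time.Now()
	for _, id := range ids {
		if _, ok := r.GetInference(id); ok {
			found++
		}
	}
	return found, time.Since(start)
}

// withinBudget reports whether elapsed fits into the allowed budget.
func withinBudget(elapsed, budget time.Duration) bool { return elapsed <= budget }

func main() {
	ids := makeIDs(1000)
	found, elapsed := dryRun(newMapReader(ids), ids)
	fmt.Println("found:", found, "elapsed:", elapsed,
		"within 100ms:", withinBudget(elapsed, 100*time.Millisecond))
}
```

Because `dryRun` only depends on the read interface, it cannot modify state, which matches the requirement that the node's state stays the same during the test.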

# Important
This issue is one of five issues in the `StartInference` and `FinishInference` series: [0/4], [1/4], [2/4], [3/4], and [4/4].
These tasks can be completed independently of each other by different contributors.
However, this specific task requires maintaining and operating a node on mainnet in order to test and validate the result.

All five issues [0/4], [1/4], [2/4], [3/4], [4/4] in this series must be completed as part of the v0.2.11 upgrade, which is scheduled for the week of February 23. After the v0.2.11 upgrade, these tasks will no longer be relevant, because a different solution can/will be proposed.
