
[4/4] StartInference and FinishInference #783

@tcharchian

Description


Background

MsgStartInference and MsgFinishInference are too slow in production. Nodes should process a block within 1-2 seconds so that block time stays below 6 seconds. To handle 1000 inferences per block, we need to record 1000 MsgStartInference, 1000 MsgFinishInference, and 100-200 MsgValidation transactions, which means each transaction must be processed in under 1 ms. These transactions are quite fast in tests, but in production with a large state they take 10-20 ms, and on some nodes 50 ms or more.

Two main areas were identified that account for most of the transaction time:

  • Signatures validation (57% of FinishInference and 63% of StartInference)
  • Stats query and recording (40% of FinishInference and 30% of StartInference)

Download the profiling file:
https://drive.google.com/file/d/1yxY91lzMHxv_MeloAxW1zczcpbkBjZ0t/

Then run:

```
go tool pprof -http=:8080 /Users/davidliberman/Downloads/pprof.inferenced.samples.cpu.001.pb.gz
```

and select the flame graph view to explore.

Screen recording: https://drive.google.com/file/d/1yxDaJllxCQ-l3ZO6ZuBb5bTEUgZ5t7Yu/view?usp=sharing

Signature validation can be optimized significantly: in most scenarios the number of signatures to validate drops 5x, from 5 signatures to just 1.

Related issues: #608 (now implemented by @DimaOrekhovPS) and #779.

Stats query and recording was designed to make it easier to query usage statistics for inference operations by storing this data on-chain. However, it is too heavy for on-chain execution and should be removed. Ultimately, MsgStartInference, MsgFinishInference, and MsgValidation should not read or write any large state records.

SetInference (including the second time it is executed in HandleInferenceComplete) accounts for:

  • 10% of FinishInference
  • 12% of StartInference
  • 14% of Validation

Within SetInference:

  • 33% Logging
  • 38% SetOrUpdateInferenceStatsByEpoch
  • 22% SetOrUpdateInferenceStatusByTime (excluding logging)

HandleInferenceComplete, excluding SetInference, accounts for 16% of FinishInference and 4% of StartInference (it is rare for StartInference to arrive second). Within it:

  • 20% Logging
  • 45% two calls to GetEpochGroupData
  • 5% GetEpochIndex
  • 10% SetEpochGroupData
  • 20% SetParticipant/GetParticipants (excluding logging)

ProcessInferencePayment accounts for 14% of FinishInference and 12% of StartInference. Within it:

  • 63% Logging
  • 18% SetParticipant plus two GetParticipant calls (excluding logging)
  • 9% Add/GetTokenomicsData

Tasks:

SetParticipant is executed twice (once in ProcessInferencePayment and once in HandleInferenceComplete) for the second Start/Finish transaction, when it could be executed just once.

Most of the time in SetParticipant is spent in ComputeStatus and GetParams (aside from Logging, which takes 50% and is discussed separately in #780). Decimal.Ln in ComputeStatus can be optimized significantly. As for GetParams, we already read the params once per transaction anyway, so we can pass them into SetParticipant and reuse what we have instead of reading them again.

Important

This issue is one of five in the StartInference and FinishInference series ([0/4] through [4/4]).
These tasks can be completed independently of each other by different contributors.
This specific task does not require maintaining and operating a mainnet node to test and validate the result.

All five issues ([0/4] through [4/4]) must be completed as part of the v0.2.11 upgrade, scheduled for the week of February 23. After the v0.2.11 upgrade, these tasks will no longer be relevant, because a different solution can and will be proposed.
