|
| 1 | +# Go App Profiling |
| 2 | + |
| 3 | +Go App Profiling uses pprof for sampling. |
| 4 | + |
| 5 | +pprof is bundled within the auto-instrument agent and corresponds to [In-Process Profiling](../../concepts-and-designs/profiling.md#in-process-profiling). |
| 6 | + |
| 7 | +It is delivered to the agent in the form of a task, allowing it to be enabled or disabled dynamically. |
| 8 | +When a service encounters performance issues (CPU usage, memory allocation, etc.), a pprof task can be created. |
| 9 | +When the agent receives a task, it enables pprof for sampling. |
| 10 | +After sampling is completed, the results are analyzed by requesting the server to render a flame graph, |
| 11 | +which helps locate the specific lines of business code that cause the performance problem. |
| 12 | +Note: trace profiling in the Go agent relies on the Go runtime’s global CPU sampling used by pprof. |
| 13 | +Since only one CPU profiler can run at a time within the same instance, trace profiling and pprof CPU profiling cannot be enabled simultaneously. |
| 14 | +If both are activated on the same instance, one task may fail to start. |
| 15 | + |
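The one-CPU-profiler limit can be observed directly with the standard `runtime/pprof` package. The following is a minimal sketch, not the agent's code; the helper name `startTwice` is hypothetical. It starts a CPU profile and then tries to start a second one in the same process, which `pprof.StartCPUProfile` rejects with an error.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// startTwice (hypothetical helper) tries to start two CPU profiles in the
// same process and returns the error from each attempt.
func startTwice() (first, second error) {
	var a, b bytes.Buffer
	first = pprof.StartCPUProfile(&a)
	if first == nil {
		// CPU profiling is process-wide, so a single stop is enough.
		defer pprof.StopCPUProfile()
	}
	// The runtime allows only one active CPU profile, so this call is
	// expected to return an error while the first profile is running.
	second = pprof.StartCPUProfile(&b)
	return first, second
}

func main() {
	first, second := startTwice()
	fmt.Println("first start error:", first)
	fmt.Println("second start error:", second)
}
```

This is why a pprof CPU task and a trace profiling task on the same instance compete: whichever starts second receives the error.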
| 16 | +## Activate pprof in the OAP |
| 17 | +OAP and the agent use a new protocol to exchange pprof data, so OAP must be started with the following configuration: |
| 18 | + |
| 19 | +```yaml |
| 20 | +receiver-pprof: |
| 21 | +  selector: ${SW_RECEIVER_PPROF:default} |
| 22 | +  default: |
| 23 | +    # Maximum size of a pprof file that can be received, in bytes. The default is 30 MiB. |
| 24 | +    pprofMaxSize: ${SW_RECEIVER_PPROF_MAX_SIZE:31457280} |
| 25 | +    # Whether to receive the pprof file in memory file mode or physical file mode. |
| 26 | +    # |
| 27 | +    # Memory file mode has fewer local file system limitations, so it is the default, but it costs more memory. |
| 28 | +    # |
| 29 | +    # Physical file mode uses less memory when parsing and handles large files better. |
| 30 | +    # However, if the tmp directory in the container runs out of storage, the OAP server instance may crash. |
| 31 | +    # Physical file mode is recommended when a volume is mounted or the tmp directory has sufficient storage. |
| 32 | +    memoryParserEnabled: ${SW_RECEIVER_PPROF_MEMORY_PARSER_ENABLED:true} |
| 33 | +``` |
| 34 | +
|
| 35 | +## pprof Task with Analysis |
| 36 | +
|
| 37 | +To use the pprof feature, please follow these steps: |
| 38 | +
|
| 39 | +1. **Create pprof task**: Use the UI or CLI tool to create a task. |
| 40 | +2. **Wait for the agent to collect and upload data**: Wait for the agent to collect pprof data and report it. |
| 41 | +3. **Query task progress**: Query the progress of the task, including the instances that succeeded or failed and the task logs. |
| 42 | +4. **Analyze the data**: Analyze the pprof data to determine where performance bottlenecks exist in the service. |
| 43 | +
|
| 44 | +### Create a pprof task |
| 45 | +
|
| 46 | +Create a pprof task to notify selected go-agent instances of the target service to start collecting data with pprof. |
| 47 | +
|
| 48 | +When creating a task, the following configuration fields are required: |
| 49 | +
|
| 50 | +1. **serviceId**: Define the service to execute the task. |
| 51 | +2. **serviceInstanceIds**: Define which instances need to execute tasks. |
| 52 | +3. **duration**: Define the duration of this task in minutes, required for CPU, BLOCK, MUTEX events. |
| 53 | +4. **events**: Define which event types this task needs to collect. |
| 54 | +5. **dumpPeriod**: Define the period of the pprof dump, required for BLOCK, MUTEX events. |
| 55 | +
|
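The field rules above can be sketched as a small validation routine. This is an illustration only: the `PprofTask` type, its field names, and the `Validate` method are hypothetical and are not the OAP or go-agent definitions. It also assumes a task carries a single event type.

```go
package main

import (
	"errors"
	"fmt"
)

// PprofTask is a hypothetical holder for the task fields; the type and
// field names are illustrative, not the OAP or go-agent definitions.
type PprofTask struct {
	ServiceID          string
	ServiceInstanceIDs []string
	Duration           int    // minutes
	Event              string // assumed: one event type per task
	DumpPeriod         int
}

// The two conditional rules from the field list.
var needsDuration = map[string]bool{"CPU": true, "BLOCK": true, "MUTEX": true}
var needsDumpPeriod = map[string]bool{"BLOCK": true, "MUTEX": true}

// Validate reports the first missing field, or nil if the task is complete.
func (t PprofTask) Validate() error {
	if t.ServiceID == "" {
		return errors.New("serviceId is required")
	}
	if len(t.ServiceInstanceIDs) == 0 {
		return errors.New("serviceInstanceIds must name at least one instance")
	}
	if t.Event == "" {
		return errors.New("events is required")
	}
	if needsDuration[t.Event] && t.Duration <= 0 {
		return fmt.Errorf("duration is required for %s", t.Event)
	}
	if needsDumpPeriod[t.Event] && t.DumpPeriod <= 0 {
		return fmt.Errorf("dumpPeriod is required for %s", t.Event)
	}
	return nil
}

func main() {
	cpu := PprofTask{ServiceID: "svc", ServiceInstanceIDs: []string{"inst-1"}, Duration: 5, Event: "CPU"}
	block := PprofTask{ServiceID: "svc", ServiceInstanceIDs: []string{"inst-1"}, Duration: 5, Event: "BLOCK"}
	fmt.Println(cpu.Validate())   // complete CPU task
	fmt.Println(block.Validate()) // BLOCK task without dumpPeriod
}
```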
| 56 | +When the Agent receives a pprof task from OAP, it automatically generates a log to notify that the task has been acknowledged. The log contains the following field information: |
| 57 | +
|
| 58 | +1. **Instance**: The name of the instance where the Agent is located. |
| 59 | +2. **Type**: One of "NOTIFIED", "EXECUTION_FINISHED", "PPROF_UPLOAD_FILE_TOO_LARGE_ERROR", or "EXECUTION_TASK_ERROR". This log displays "NOTIFIED". |
| 60 | +3. **Time**: The time when the Agent received the task. |
| 61 | +
|
| 62 | +### Wait for the agent to collect and upload data |
| 63 | +
|
| 64 | +At this point, pprof samples the events you selected when you created the task: |
| 65 | +
|
| 66 | +1. CPU: samples CPU usage over time to show which functions consume the most processing time. |
| 67 | +2. ALLOC, HEAP: |
| 68 | +   - HEAP: a sampling of memory allocations of live objects. |
| 69 | +   - ALLOC: a sampling of all past memory allocations. |
| 70 | +3. BLOCK, MUTEX: |
| 71 | +   - BLOCK: stack traces that led to blocking on synchronization primitives. |
| 72 | +   - MUTEX: stack traces of holders of contended mutexes. |
| 73 | +4. GOROUTINE, THREADCREATE: |
| 74 | +   - GOROUTINE: stack traces of all current goroutines. |
| 75 | +   - THREADCREATE: stack traces that led to the creation of new OS threads. |
| 76 | +
|
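Apart from CPU, the event types above line up with the named profiles of Go's standard `runtime/pprof` package. The sketch below assumes that mapping for illustration; how the agent maps events internally is not stated here, and the `profileNames` table and `dump` helper are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// profileNames maps the task event types to runtime/pprof named profiles.
// The mapping is an assumption for illustration. CPU is absent because it
// is collected with pprof.StartCPUProfile rather than a named profile.
var profileNames = map[string]string{
	"HEAP":         "heap",
	"ALLOC":        "allocs",
	"BLOCK":        "block",
	"MUTEX":        "mutex",
	"GOROUTINE":    "goroutine",
	"THREADCREATE": "threadcreate",
}

// dump (hypothetical helper) writes the named profile for an event into
// memory in the binary pprof format (debug level 0).
func dump(event string) ([]byte, error) {
	name, ok := profileNames[event]
	if !ok {
		return nil, fmt.Errorf("no named profile for event %q", event)
	}
	p := pprof.Lookup(name)
	if p == nil {
		return nil, fmt.Errorf("profile %q is not available", name)
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 0); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := dump("GOROUTINE")
	fmt.Println(len(data) > 0, err)
}
```

Note that the BLOCK and MUTEX profiles stay empty in a plain Go program until sampling is switched on with `runtime.SetBlockProfileRate` and `runtime.SetMutexProfileFraction`.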
| 77 | +Finally, the agent uploads the file produced by pprof to the OAP server for online performance analysis. |
| 78 | +
|
| 79 | +### Query the profiling task progress |
| 80 | +
|
| 81 | +Wait for pprof to complete data collection and upload successfully. |
| 82 | +We can query the execution logs of the pprof task and the task status, which includes the following information: |
| 83 | +
|
| 84 | +1. **successInstanceIds**: The instances that have executed the task successfully. |
| 85 | +2. **errorInstanceIds**: The instances that failed to execute the task. |
| 86 | +3. **logs**: All execution logs of the current task. |
| 87 | +   1. **id**: The task id. |
| 88 | +   2. **instanceId**: The id of the instance that reported this task log. |
| 89 | +   3. **instanceName**: The name of the instance that reported this task log. |
| 90 | +   4. **operationType**: One of "NOTIFIED", "EXECUTION_FINISHED", "PPROF_UPLOAD_FILE_TOO_LARGE_ERROR", or "EXECUTION_TASK_ERROR". |
| 91 | +   5. **operationTime**: The time when the operation occurred. |
| 92 | +
|
| 93 | +### Analyze the profiling data |
| 94 | +
|
| 95 | +Once some agents have completed the task, we can analyze the data through the following query: |
| 96 | +
|
| 97 | +1. **taskId**: The task id. |
| 98 | +2. **instanceIds**: The instances to be included in the analysis. |
| 99 | +
|
| 100 | +The query returns the following data, which is used to render a flame graph: |
| 101 | +1. **taskId**: The task id. |
| 102 | +2. **elements**: The stack elements. The `id` and `parentId` of each element determine the hierarchical relationship. |
| 103 | +   1. **id**: The identity of the stack element. |
| 104 | +   2. **parentId**: The parent element id. The dependency relationship between elements can be determined using the element id and parent element id. |
| 105 | +   3. **codeSignature**: The method signature of the tree node. |
| 106 | +   4. **total**: The total number of samples of the current tree node, including child nodes. |
| 107 | +   5. **self**: The number of samples of the current tree node, excluding samples of the children. |
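
To show how `id` and `parentId` turn the flat element list into a tree, here is a minimal sketch. The `Element` struct and `render` function are hypothetical, the sample values are made up, and treating an element whose parent id matches no other element as a root is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Element mirrors the fields returned by the analysis query. The struct
// itself is hypothetical and only serves this illustration.
type Element struct {
	ID            string
	ParentID      string
	CodeSignature string
	Total         int64
	Self          int64
}

// render returns an indented text view of the tree. Elements whose
// ParentID matches no element ID are treated as roots (an assumption).
func render(elements []Element) string {
	ids := make(map[string]bool, len(elements))
	for _, e := range elements {
		ids[e.ID] = true
	}
	children := make(map[string][]Element)
	var roots []Element
	for _, e := range elements {
		if ids[e.ParentID] {
			children[e.ParentID] = append(children[e.ParentID], e)
		} else {
			roots = append(roots, e)
		}
	}
	var sb strings.Builder
	var walk func(e Element, depth int)
	walk = func(e Element, depth int) {
		fmt.Fprintf(&sb, "%s%s total=%d self=%d\n",
			strings.Repeat("  ", depth), e.CodeSignature, e.Total, e.Self)
		for _, c := range children[e.ID] {
			walk(c, depth+1)
		}
	}
	for _, r := range roots {
		walk(r, 0)
	}
	return sb.String()
}

func main() {
	// Made-up sample data: total of the root = self + totals of children.
	fmt.Print(render([]Element{
		{ID: "1", ParentID: "0", CodeSignature: "main.main", Total: 10, Self: 1},
		{ID: "2", ParentID: "1", CodeSignature: "main.handle", Total: 6, Self: 6},
		{ID: "3", ParentID: "1", CodeSignature: "main.parse", Total: 3, Self: 3},
	}))
}
```

In a flame graph, the width of each frame would correspond to `total`, while `self` is the part of that width not covered by child frames.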