content/learning-paths/mobile-graphics-and-gaming/optimizing-vertex-efficiency/_index.md
Lines changed: 11 additions & 11 deletions
@@ -1,21 +1,17 @@
1
1
---
2
-
title: Optimizing graphics vertex efficiency for Arm GPUs
3
-
4
-
draft: true
5
-
cascade:
6
-
draft: true
2
+
title: Optimize graphics vertex efficiency for Arm GPUs
7
3
8
4
minutes_to_complete: 10
9
5
10
-
who_is_this_for: This is an advanced topic for Android graphics application developers.
6
+
who_is_this_for: This is an advanced topic for Android graphics application developers aiming to enhance GPU performance through smarter vertex optimization.
11
7
12
8
learning_objectives:
13
-
- Optimize vertex representations on Arm GPUs
14
-
- How to interpret Vertex Memory Efficiency in Arm Frame Advisor
9
+
- Optimize vertex representations on Arm GPUs.
10
+
- Analyze Vertex Memory Efficiency using Arm Frame Advisor.
15
11
16
12
prerequisites:
17
-
- An understanding of vertex attributes
18
-
- Familiarity with Arm Frame Advisor, part of Arm Performance Studio
13
+
- Understanding of vertex attributes.
14
+
- Familiarity with Arm Frame Advisor (part of Arm Performance Studio).
After profiling your frame with Arm Frame Advisor, you might notice that the shadow map draw calls have low Vertex Memory Efficiency (VME), as shown in the image below.
18
14
19
-
In this Learning Path, you will learn about a common source of rendering
20
-
inefficiency, how to spot the issue using Arm Frame Advisor, and how
21
-
to rectify it.
15
+
This raises an important question: what's causing the inefficiency, and how can you fix it?
This Learning Path shows you approaches to addressing this problem, by demonstrating:
22
+
23
+
* Common sources of rendering inefficiencies.
24
+
* How to identify and rectify issues using Arm Frame Advisor.
23
25
24
26
## Shadow mapping
25
27
26
-
In this scenario, draw calls in the shadow map render pass are the
27
-
source of our poor VME scores. Let's start by reviewing exactly what
28
-
these draws are doing.
28
+
In this scenario, draw calls in the shadow map render pass are responsible for the low Vertex Memory Efficiency (VME) scores. To understand why, let's begin by reviewing what these draws are doing.
29
29
30
-
Shadow mapping is the mechanism that decides, for every visible pixel,
31
-
whether it is lit or in shadow. A shadow map is a texture that is
32
-
created as the first part of this process. It is rendered from the
33
-
point of view of the light source, and stores the distance to all of
34
-
the objects that light can see. Parts of a surface that are visible
35
-
to the light are lit, and any part that is occluded must be in shadow.
30
+
*Shadow mapping* is the mechanism that decides whether each visible pixel is lit or in shadow. The process begins by creating a shadow map, which is a texture rendered from the point of view of the light source. This texture stores the distance to the nearest surfaces visible to the light.
31
+
32
+
During the final render pass, the GPU transforms each pixel visible from the camera into the light's coordinate space and compares its depth to the corresponding value in the shadow map. If the pixel is farther from the light than what the light "sees," it's considered occluded and rendered in shadow. Otherwise, it is lit.
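The depth comparison described above can be sketched on the CPU as follows. This is only an illustration: in a real renderer the test usually runs in a fragment shader, and the sketch assumes the pixel's depth has already been expressed in the light's coordinate space. The helper name `is_in_shadow` and the `depth_bias` parameter are hypothetical and do not come from this Learning Path.

```cpp
#include <cassert>

// Hypothetical helper: returns true when the pixel is farther from the light
// than the nearest surface recorded in the shadow map.
// depth_bias is an assumed small offset that keeps a surface from shadowing
// itself because of limited depth precision.
bool is_in_shadow(float pixel_depth_from_light,
                  float shadow_map_depth,
                  float depth_bias = 0.005f) {
    assert(depth_bias >= 0.0f);
    return (pixel_depth_from_light - depth_bias) > shadow_map_depth;
}
```

A pixel whose depth equals the stored shadow map value is treated as lit, because it is the nearest surface that the light can see.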
36
33
37
34
## Mesh layout
38
35
39
-
The primary input into shadow map creation is the object geometry for
40
-
all of the objects that cast shadows. In this scenario, let's
41
-
assume that the vertex data for each object is stored in memory as an
42
-
array structure, which is a commonly used layout in many applications:
36
+
The primary input for shadow map creation is the geometry of all objects that cast shadows. In this scenario, assume that each object’s vertex data is stored in memory as an array structure, a layout commonly used in many applications:
43
37
44
38
```C++
45
39
struct Vertex {
46
40
float position[3];
47
-
float color[3].
41
+
float color[3];
48
42
float normal[3];
49
43
};
50
44
@@ -54,67 +48,46 @@ std::vector<Vertex> mesh {
54
48
55
49
```
56
50
57
-
This would give the mesh the following layout in memory:
51
+
This gives the mesh the following layout in memory:
This looks like a standard way of passing mesh data into a GPU,
57
+
At first glance, this looks like a standard way of passing mesh data into a GPU,
64
58
so where is the inefficiency coming from?
65
59
66
-
The vertex data that is defined contains all of the attributes that
67
-
you need for your object, including those that are needed to compute
68
-
color in the main lighting pass. When generating the shadow map,
69
-
you only need to compute the position of the object, so most
70
-
of your vertex attributes will be unused by the shadow map generation
71
-
draw calls.
72
-
73
-
The inefficiency comes from how hardware gets the data it needs from
74
-
main memory so that computation can proceed. Processors do not fetch
75
-
single values from DRAM, but instead fetch a small neighborhood of
76
-
data, because this is the most efficient way to read from DRAM. For Arm
77
-
GPUs, the hardware will read an entire 64 byte cache line at a time.
60
+
The vertex data that is defined contains all of the attributes that you need for your object, including those that are needed to compute color in the main lighting pass. When generating the shadow map, you only need to compute the position of the object, so most of your vertex attributes will be unused by the shadow map generation draw calls.
78
61
79
-
In this example, an attempt to fetch a vertex position during shadow
80
-
map creation would also load the nearby color and normal values,
81
-
even though you do not need them.
62
+
The inefficiency comes from how GPUs fetch vertex data from main memory. GPUs don't retrieve individual values from DRAM. Instead, they fetch a small neighborhood of data at once, which is more efficient for memory access. On Arm GPUs, this typically means reading an entire 64-byte cache line at a time.
82
63
64
+
In this example, fetching a vertex position for shadow map rendering also loads the adjacent color and normal attributes into cache, even though they're not needed. This wastes memory bandwidth and contributes to poor Vertex Memory Efficiency (VME).
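To make the cost concrete, the sketch below counts how many 64-byte cache lines are touched when reading only the position of each vertex. It assumes the vertex buffer starts on a cache-line boundary and that vertices are read in order. The function name `cache_lines_touched` is hypothetical, and real GPU fetch behavior may differ from this simple model.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: counts the distinct cache lines covered when reading
// attr_size bytes from the start of every vertex in a buffer.
// Assumes the buffer is cache-line aligned and vertices are visited in order.
std::size_t cache_lines_touched(std::size_t vertex_count,
                                std::size_t stride,
                                std::size_t attr_size,
                                std::size_t line_size = 64) {
    assert(attr_size > 0 && line_size > 0);
    std::size_t count = 0;
    bool have_last = false;
    std::size_t last_line = 0;
    for (std::size_t i = 0; i < vertex_count; ++i) {
        const std::size_t first = (i * stride) / line_size;
        const std::size_t end = (i * stride + attr_size - 1) / line_size;
        for (std::size_t line = first; line <= end; ++line) {
            // Offsets only increase, so a line is new if it is past the last one seen.
            if (!have_last || line > last_line) {
                ++count;
                last_line = line;
                have_last = true;
            }
        }
    }
    return count;
}
```

Under these assumptions, reading positions for 16 vertices with the 36-byte interleaved stride touches 9 cache lines (576 bytes), while a tightly packed 12-byte position stream touches only 3 (192 bytes).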
83
65
84
-
## Detecting a sub-optimal layout
66
+
## Detecting a suboptimal layout
85
67
86
-
Arm Frame Advisor analyzes the attribute memory layout for each draw
87
-
call the application makes, and provides the Vertex Memory Efficiency
88
-
(VME) metric to show how efficiently that attribute layout is working.
68
+
Arm Frame Advisor analyzes the vertex attribute memory layout for each draw call and reports a Vertex Memory Efficiency (VME) metric to show how efficiently the GPU accesses vertex data.
89
69
90
-

70
+

91
71
92
-
A VME of 1.0 would indicate that the draw call is making an optimal
93
-
use of the memory bandwidth, with no unnecessary data fetches.
72
+
A VME of 1.0 indicates that the draw call is making optimal use of memory bandwidth, with no unnecessary data fetches.
94
73
95
-
A VME of less than one indicates that unnecessary data is being loaded
96
-
from memory, wasting bandwidth on data that is not being used in the
97
-
computation on the GPU.
74
+
A VME score below 1.0 indicates that unnecessary data is being loaded from memory, wasting bandwidth on attributes not being used in the computation on the GPU.
98
75
99
-
In this mesh layout you are only using 12 bytes for the `position`
100
-
field, out of a total vertex size of 36 bytes, so your VME score would
101
-
be only 0.33.
76
+
In this mesh layout you are only using 12 bytes for the `position` field, out of a 36-byte vertex, resulting in a VME score of 0.33.
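The arithmetic behind that score can be sketched as the ratio of useful bytes to fetched bytes per vertex. The helper `approx_vme` is a hypothetical name, and this is a simplified approximation: the exact calculation Arm Frame Advisor performs is not described here and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved vertex as defined in this Learning Path.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// Hypothetical helper: approximates VME as the bytes the shader actually
// uses divided by the bytes fetched for each vertex.
double approx_vme(std::size_t used_bytes, std::size_t stride_bytes) {
    assert(stride_bytes > 0 && used_bytes <= stride_bytes);
    return static_cast<double>(used_bytes) / static_cast<double>(stride_bytes);
}
```

For the shadow map pass, `approx_vme(sizeof(Vertex::position), sizeof(Vertex))` evaluates 12 / 36, which is approximately 0.33 on platforms where `float` is 4 bytes.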
102
77
78
+
## Fixing a suboptimal layout
103
79
104
-
## Fixing a sub-optimal layout
80
+
Shadow mapping only needs to load position, so to fix this issue you need to use a memory layout that allows position to be fetched in isolation from the other data. It is still preferable to leave the other attributes interleaved.
105
81
106
-
Shadow mapping only needs to load position, so to fix this issue you
107
-
need to use a memory layout that allows position to be fetched in
108
-
isolation from the other data. It is still preferable to leave the
109
-
other attributes interleaved. On the CPU, this would look like the following:
This allows the shadow map creation pass to read only useful position
131
-
data, without any waste. The main lighting pass that renders the full
132
-
object will then read from both memory regions.
133
-
134
-
The good news is that this technique is actually a useful one to apply
135
-
all of the time, even for the main lighting pass! Many mobile GPUs,
136
-
including Arm GPUs, process geometry in two passes. The first pass
137
-
computes only the primitive position, and second pass will process
138
-
the remainder of the vertex shader only for the primitives that are
139
-
visible after primitive culling has been performed. By splitting
140
-
the position attributes into a separate stream, you avoid wasting
141
-
memory bandwidth fetching non-position data for primitives that are
142
-
ultimately discarded by primitive culling tests.
103
+
This allows the shadow map creation pass to read only useful position data, without any waste. The main lighting pass that renders the full object will then read from both memory regions.
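As a rough sketch of this split layout, the code below converts an interleaved mesh into a position-only stream plus a second stream that keeps the remaining attributes interleaved. The names `VertexPosition`, `VertexAttributes`, `SplitMesh`, and `split_streams` are hypothetical and chosen only for illustration; they are not taken from the Learning Path's own code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original interleaved vertex from this Learning Path.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// Hypothetical split: position alone in one stream...
struct VertexPosition {
    float position[3];
};

// ...and the remaining attributes interleaved in a second stream.
struct VertexAttributes {
    float color[3];
    float normal[3];
};

struct SplitMesh {
    std::vector<VertexPosition> positions;
    std::vector<VertexAttributes> attributes;
};

// Copies an interleaved mesh into the two separate streams.
SplitMesh split_streams(const std::vector<Vertex>& mesh) {
    SplitMesh out;
    out.positions.reserve(mesh.size());
    out.attributes.reserve(mesh.size());
    for (const Vertex& v : mesh) {
        VertexPosition p{{v.position[0], v.position[1], v.position[2]}};
        VertexAttributes a{{v.color[0], v.color[1], v.color[2]},
                           {v.normal[0], v.normal[1], v.normal[2]}};
        out.positions.push_back(p);
        out.attributes.push_back(a);
    }
    // Both streams must describe the same number of vertices.
    assert(out.positions.size() == out.attributes.size());
    return out;
}
```

With this arrangement, a shadow map draw call would bind only the `positions` stream, so every byte it fetches is position data, while the main lighting pass would bind both streams.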
143
104
105
+
The good news is that this technique is actually a useful one to apply all of the time, even for the main lighting pass! Many mobile GPUs, including Arm GPUs, process geometry in two passes: an initial pass that computes only primitive positions, followed by a second pass that runs the full vertex shader only for primitives that survive culling. By placing position data in a separate buffer or stream, you reduce memory bandwidth wasted on fetching attributes like color or normals for primitives that are ultimately discarded.
144
106
145
-
# Conclusion
107
+
## Conclusion
146
108
147
-
Arm Frame Advisor can give you actionable metrics that can identify
148
-
specific inefficiencies in your application to optimize.
109
+
Arm Frame Advisor provides actionable metrics that can help identify specific inefficiencies in your graphics application. The Vertex Memory Efficiency metric measures how efficiently you are using your input vertex memory bandwidth, indicating what proportion of the input data is actually consumed by the shader program. You can improve VME by adjusting your vertex memory layout to separate attribute data into distinct streams, ensuring that each render pass only loads the data it needs. Avoid packing unused attributes into memory regions accessed by draw calls, as this wastes bandwidth and reduces performance.
149
110
150
-
The VME metric shows how efficiently you are using your input
151
-
vertex memory bandwidth, indicating what proportion of the input
152
-
data is actually used by the shader program. VME can be improved by
153
-
changing vertex memory layout to separate the different streams of
154
-
data such that only the data needed for type of computation is packed
155
-
together. Try not to mix data in that a computation would not use.
title: Accelerate Bitmap Scanning with NEON and SVE Instructions on Arm servers
3
+
4
+
draft: true
5
+
cascade:
6
+
draft: true
7
+
8
+
minutes_to_complete: 20
9
+
10
+
who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances.
11
+
12
+
13
+
learning_objectives:
14
+
- Understand bitmap scanning operations in database systems
15
+
- Implement bitmap scanning with scalar, NEON, and SVE instructions
16
+
- Compare performance between different implementations
17
+
- Measure performance improvements on Graviton4 instances
18
+
19
+
prerequisites:
20
+
- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
21
+
cloud service provider.
22
+
23
+
author: Pareena Verma
24
+
25
+
26
+
### Tags
27
+
skilllevels: Introductory
28
+
subjects: Performance and Architecture
29
+
armips:
30
+
- Neoverse
31
+
operatingsystems:
32
+
- Linux
33
+
tools_software_languages:
34
+
- SVE
35
+
- NEON
36
+
- Runbook
37
+
38
+
further_reading:
39
+
- resource:
40
+
title: Accelerate multi-token search in strings with SVE2 SVMATCH instruction