Commit 3007b1b

Merge pull request #1993 from andrewkilroyarm/main
Add learning path for vertex efficiency
2 parents ce131c7 + 6c97b4c commit 3007b1b

7 files changed: +544 -0 lines changed

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
title: Optimizing graphics vertex efficiency for Arm GPUs

minutes_to_complete: 10

who_is_this_for: Android graphics application developers

learning_objectives:
    - Optimize vertex representations on Arm GPUs
    - Interpret Vertex Memory Efficiency in Arm Frame Advisor

prerequisites:
    - An understanding of vertex attributes
    - Familiarity with Arm Frame Advisor, part of Arm Performance Studio

author:
    - Andrew Kilroy
    - Peter Harris

### Tags
skilllevels: Intermediate
subjects: Performance
armips:
    - Immortalis
    - Mali
tools_software_languages:
    - C
    - C++
operatingsystems:
    - Android

further_reading:
    - resource:
        title: Arm GPU Best Practices Developer Guide
        link: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout
        type: documentation
    - resource:
        title: Frame Advisor user guide
        link: https://developer.arm.com/documentation/102693/latest/
        type: documentation
    - resource:
        title: Analyse a Frame with Frame Advisor
        link: https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/analyze_a_frame_with_frame_advisor/
        type: blog
    - resource:
        title: Arm Performance Studio
        link: https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio%20for%20Mobile
        type: website



### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1                       # _index.md always has weight of 1 to order correctly
layout: "learningpathall"       # All files under learning paths have this same wrapper
learning_path_main_page: "yes"  # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
---
# ================================================================================
#       FIXED, DO NOT MODIFY THIS FILE
# ================================================================================
weight: 21                  # Set to always be larger than the content in this path to be at the end of the navigation.
title: "Next Steps"         # Always the same, html page title.
layout: "learningpathall"   # All files under learning paths have this same wrapper for Hugo processing.
---

content/learning-paths/mobile-graphics-and-gaming/optimizing-vertex-efficiency/initial-memory-layout.svg

Lines changed: 318 additions & 0 deletions
Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
---
title: Optimizing graphics vertex efficiency for Arm GPUs
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

# Optimizing graphics vertex efficiency for Arm GPUs

You're writing a graphics application targeting an Arm Immortalis
GPU, and not hitting your desired performance. When running the Arm
Frame Advisor tool, you spot that the draw calls in your shadow map
creation pass have poor Vertex Memory Efficiency (VME) scores. How
should you go about improving this?

![Frame Advisor screenshot](fa-found-bad-vme-in-content-metrics.png)

In this article, I will outline a common source of rendering
inefficiency, how to spot the issue using Arm Frame Advisor, and how
to rectify it.


## Shadow mapping

In this scenario, draw calls in our shadow map render pass are the
source of our poor VME scores. Let's start by reviewing exactly what
these draws are doing.

Shadow mapping is the mechanism that decides, for every visible pixel,
whether it is lit or in shadow. A shadow map is a texture that is
created as the first part of this process. It is rendered from the
point of view of the light source, and stores the distance to all of
the objects that light can see. Parts of a surface that are visible
to the light are lit, and any part that is occluded must be in shadow.
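
To make that comparison concrete, here is a minimal CPU-side sketch of the per-pixel test. The `ShadowMap` type, the `isLit` function, and the `bias` value are hypothetical names and choices made for this illustration; a real application would normally do this in a fragment shader sampling a depth texture.

``` C++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a shadow map texture: each texel stores the
// distance from the light to the nearest surface the light can see.
struct ShadowMap {
    std::size_t width;
    std::size_t height;
    std::vector<float> depth;  // width * height texels, row-major

    float nearestDepth(std::size_t x, std::size_t y) const {
        assert(x < width && y < height);
        return depth[y * width + x];
    }
};

// A point is lit if nothing closer to the light covers the same texel.
// The small bias is an assumption of this sketch; it stops a surface from
// shadowing itself because of limited depth precision.
bool isLit(const ShadowMap& map, std::size_t x, std::size_t y,
           float distanceToLight, float bias = 0.005f) {
    return distanceToLight <= map.nearestDepth(x, y) + bias;
}
```

Note that building the `depth` values only requires the position of each shadow caster, which is why the rest of this article focuses on how position data is fetched.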

## Mesh layout

The primary input into shadow map creation is the object geometry for
all of the objects that are shadow casters. In our scenario, let's
assume that the vertex data for each object is stored in memory as an
array of structures, which is a commonly used layout in many applications:

``` C++
#include <vector>

// One interleaved record per vertex: all attributes stored side by side.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

std::vector<Vertex> mesh {
    // Model data ...
};
```

This would give the mesh the following layout in memory:

![Initial memory layout](initial-memory-layout.png)

## Why is this sub-optimal?

This looks like a standard way of passing mesh data into a GPU,
so where is the inefficiency coming from?

The vertex data we have defined contains all of the attributes that
we need for our object, including those that are needed to compute
color in the main lighting pass. When generating the shadow map,
we only actually need to compute the position of the object, so most
of our vertex attributes will be unused by the shadow map generation
draw calls.

The inefficiency comes from how hardware gets the data it needs from
main memory so that computation can proceed. Processors do not fetch
single values from DRAM, but instead fetch a small neighborhood of
data, because this is the most efficient way to read from DRAM. For Arm
GPUs, the hardware will read an entire 64-byte cache line at a time.

In our example, an attempt to fetch a vertex position during shadow
map creation would also load the nearby color and normal values,
even though we do not need them.
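
To see how much extra data this pulls in, the sketch below counts the 64-byte cache lines touched when reading one attribute from each vertex. The function name `cacheLinesTouched` is hypothetical, and the model assumes the buffer starts on a cache line boundary and ignores any caching between draws.

``` C++
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;  // cache line size quoted above

// Counts the distinct cache lines touched when reading `attributeBytes`
// bytes from the start of each of `vertexCount` vertices that are laid out
// `strideBytes` apart in memory.
std::size_t cacheLinesTouched(std::size_t vertexCount, std::size_t strideBytes,
                              std::size_t attributeBytes) {
    assert(attributeBytes > 0 && attributeBytes <= strideBytes);
    std::size_t count = 0;
    std::size_t nextUncounted = 0;  // first cache line not yet counted
    for (std::size_t i = 0; i < vertexCount; ++i) {
        std::size_t first = (i * strideBytes) / kCacheLineBytes;
        std::size_t last = (i * strideBytes + attributeBytes - 1) / kCacheLineBytes;
        if (first < nextUncounted) {
            first = nextUncounted;  // skip lines already counted
        }
        if (last >= first) {
            count += last - first + 1;
            nextUncounted = last + 1;
        }
    }
    return count;
}
```

For 16 vertices, `cacheLinesTouched(16, 36, 12)` gives 9 cache lines for the interleaved layout, while `cacheLinesTouched(16, 12, 12)` gives 3 cache lines when positions are tightly packed. Both cases deliver the same 192 bytes of useful position data.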


## Detecting a sub-optimal layout

Arm Frame Advisor analyzes the attribute memory layout for each draw
call the application makes, and provides the Vertex Memory Efficiency
(VME) metric to show how efficiently that attribute layout is working.

![Location of vertex memory efficiency in FA](fa-navigate-to-call.png)

A VME of 1.0 indicates that the draw call is making optimal
use of the memory bandwidth, with no unnecessary data fetches.

A VME of less than 1.0 indicates that unnecessary data is being loaded
from memory, wasting bandwidth on data that is not being used in the
computation on the GPU.

In our mesh layout we are only using 12 bytes for the `position`
field, out of a total vertex size of 36 bytes, so our VME score would
be only 0.33.
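
The arithmetic behind that score can be written out as follows. This is a simplified model (useful bytes divided by total bytes per vertex) that matches the ratio described here; it is an assumption for illustration and not necessarily the exact formula Frame Advisor uses. The names `estimateVme` and `shadowPassVme` are hypothetical.

``` C++
#include <cstddef>

struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// This sketch assumes 4-byte floats and no padding, as in the text above.
static_assert(sizeof(Vertex) == 36, "expected a 36-byte vertex");

// Simplified estimate: bytes the draw call uses per vertex, divided by the
// bytes that are fetched per vertex.
constexpr double estimateVme(std::size_t usedBytes, std::size_t fetchedBytes) {
    return static_cast<double>(usedBytes) / static_cast<double>(fetchedBytes);
}

// The shadow pass uses only position: 12 / 36, or roughly 0.33.
constexpr double shadowPassVme =
    estimateVme(sizeof(Vertex::position), sizeof(Vertex));
```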


## Fixing a sub-optimal layout

Shadow mapping only needs to load position, so to fix this issue we
need to use a memory layout that allows position to be fetched in
isolation from the other data. It is still preferable to leave the
other attributes interleaved. On the CPU, this looks like:

``` C++
#include <vector>

// Stream 1: position only, read by every pass.
struct VertexPart1 {
    float position[3];
};

// Stream 2: the remaining attributes, kept interleaved.
struct VertexPart2 {
    float color[3];
    float normal[3];
};

std::vector<VertexPart1> meshPositions {
    // Model data ...
};

std::vector<VertexPart2> meshAttributes {
    // Model data ...
};
```

This allows the shadow map creation pass to read only useful position
data, without any waste. The main lighting pass that renders the full
object will then read from both memory regions.
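
If your assets are already stored in the interleaved format, one way to produce the two streams is a one-off conversion at load time or in your asset pipeline. The sketch below shows the idea; `SplitMesh` and `splitStreams` are hypothetical names introduced here, and uploading the resulting buffers to the GPU is left out because it depends on your graphics API.

``` C++
#include <cassert>
#include <vector>

struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

struct VertexPart1 {
    float position[3];
};

struct VertexPart2 {
    float color[3];
    float normal[3];
};

// Hypothetical container for the two output streams.
struct SplitMesh {
    std::vector<VertexPart1> positions;
    std::vector<VertexPart2> attributes;
};

// Copies each interleaved vertex into a position-only stream and a stream
// holding the remaining attributes. Element i of each stream describes the
// same vertex, so existing index buffers remain valid.
SplitMesh splitStreams(const std::vector<Vertex>& mesh) {
    SplitMesh out;
    out.positions.reserve(mesh.size());
    out.attributes.reserve(mesh.size());
    for (const Vertex& v : mesh) {
        VertexPart1 p{{v.position[0], v.position[1], v.position[2]}};
        VertexPart2 a{{v.color[0], v.color[1], v.color[2]},
                      {v.normal[0], v.normal[1], v.normal[2]}};
        out.positions.push_back(p);
        out.attributes.push_back(a);
    }
    assert(out.positions.size() == out.attributes.size());
    return out;
}
```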

The good news is that this technique is actually a useful one to apply
all of the time, even for the main lighting pass! Many mobile GPUs,
including Arm GPUs, process geometry in two passes. The first pass
computes only the primitive position, and the second pass processes
the remainder of the vertex shader only for the primitives that are
visible after primitive culling has been performed. By splitting
the position attributes into a separate stream, we avoid wasting
memory bandwidth fetching non-position data for primitives that are
ultimately discarded by primitive culling tests.
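
As a rough feel for the saving, the sketch below compares the two layouts using a deliberately idealized model. It ignores caches, index reuse, and alignment, and the function names `interleavedBytes` and `splitBytes` are hypothetical. The byte sizes come from the example vertex used in this article.

``` C++
#include <cassert>
#include <cstddef>

constexpr std::size_t kPositionBytes = 12;  // float position[3]
constexpr std::size_t kOtherBytes = 24;     // float color[3] + float normal[3]

// Interleaved layout: reading a position also pulls in the neighboring
// attributes, so this model charges the full vertex for every vertex.
std::size_t interleavedBytes(std::size_t vertexCount) {
    return vertexCount * (kPositionBytes + kOtherBytes);
}

// Split layout: every vertex pays for its position, but only vertices that
// belong to primitives surviving culling pay for the other attributes.
std::size_t splitBytes(std::size_t vertexCount, std::size_t visibleCount) {
    assert(visibleCount <= vertexCount);
    return vertexCount * kPositionBytes + visibleCount * kOtherBytes;
}
```

With 1,000 vertices of which 400 survive culling, this model gives 36,000 bytes for the interleaved layout and 21,600 bytes for the split layout.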


## Conclusion

Arm Frame Advisor gives you actionable metrics that identify
specific inefficiencies in your application.

The VME metric shows how efficiently you are using your input
vertex memory bandwidth, indicating what proportion of the input
data is actually used by the shader program. VME can be improved by
changing vertex memory layout to separate the different streams of
data so that only the data needed for each type of computation is packed
together. Avoid mixing in data that a computation does not use.

## Other links

Arm's advice on [attribute layouts][2]

[2]: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout
