Commit 698ed1d

Merge pull request #2017 from ArmDeveloperEcosystem/main
production update
2 parents a2a4c99 + 4fd14f6 commit 698ed1d

19 files changed

+1791
-105
lines changed

.wordlist.txt

Lines changed: 11 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -4200,4 +4200,14 @@ multithreaded
42004200
Wix's
42014201
ngrok's
42024202
qs
4203-
qu
4203+
qu
4204+
Mbps
4205+
SVMATCH
4206+
abd
4207+
bitvector
4208+
bitvectors
4209+
iperf
4210+
normals
4211+
svcntb
4212+
svmatch
4213+
tc

content/install-guides/aws-q-cli.md

Lines changed: 27 additions & 0 deletions
@@ -206,6 +206,33 @@ based distributions like Ubuntu, which is your preference.
206206

207207
Give it a try by asking questions like "How do I install the AWS CLI?" and verify that the answers match the provided context.
208208

209+
## How do I change the model Amazon Q uses?
210+
211+
When you start `q chat` the model is printed:
212+
213+
```output
214+
🤖 You are chatting with claude-3.7-sonnet
215+
```
216+
217+
You can use the `/model` command to list other available models.
218+
219+
```console
220+
/model
221+
```
222+
223+
The model options are displayed:
224+
225+
```output
226+
? Select a model for this chat session ›
227+
❯ claude-4-sonnet
228+
claude-3.7-sonnet (active)
229+
claude-3.5-sonnet
230+
```
231+
232+
Use the arrow keys and select the model you want to use.
233+
234+
You can ask Amazon Q to set the default model for future sessions.
235+
209236
## Install an MCP server
210237

211238
As an example of using MCP with Amazon Q, you can configure the GitHub MCP server.

content/learning-paths/mobile-graphics-and-gaming/optimizing-vertex-efficiency/_index.md

Lines changed: 11 additions & 11 deletions
@@ -1,21 +1,17 @@
11
---
2-
title: Optimizing graphics vertex efficiency for Arm GPUs
3-
4-
draft: true
5-
cascade:
6-
draft: true
2+
title: Optimize graphics vertex efficiency for Arm GPUs
73

84
minutes_to_complete: 10
95

10-
who_is_this_for: This is an advanced topic for Android graphics application developers.
6+
who_is_this_for: This is an advanced topic for Android graphics application developers aiming to enhance GPU performance through smarter vertex optimization.
117

128
learning_objectives:
13-
- Optimize vertex representations on Arm GPUs
14-
- How to interpret Vertex Memory Efficiency in Arm Frame Advisor
9+
- Optimize vertex representations on Arm GPUs.
10+
- Analyze Vertex Memory Efficiency using Arm Frame Advisor.
1511

1612
prerequisites:
17-
- An understanding of vertex attributes
18-
- Familiarity with Arm Frame Advisor, part of Arm Performance Studio
13+
- Understanding of vertex attributes.
14+
- Familiarity with Arm Frame Advisor (part of Arm Performance Studio).
1915

2016
author:
2117
- Andrew Kilroy
@@ -43,13 +39,17 @@ further_reading:
4339
link: https://developer.arm.com/documentation/102693/latest/
4440
type: documentation
4541
- resource:
46-
title: Analyse a Frame with Frame Advisor
42+
title: Analyze a Frame with Frame Advisor
4743
link: https://learn.arm.com/learning-paths/mobile-graphics-and-gaming/analyze_a_frame_with_frame_advisor/
4844
type: blog
4945
- resource:
5046
title: Arm Performance Studio
5147
link: https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Studio%20for%20Mobile
5248
type: website
49+
- resource:
50+
title: Attribute Layouts
51+
link: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout
52+
type: website
5353

5454

5555

content/learning-paths/mobile-graphics-and-gaming/optimizing-vertex-efficiency/vme-learning-path.md

Lines changed: 39 additions & 87 deletions
@@ -6,45 +6,39 @@ weight: 5
66
layout: learningpathall
77
---
88

9-
# Optimizing graphics vertex efficiency for Arm GPUs
9+
## Diagnosing poor Vertex Memory Efficiency with Frame Advisor
1010

11-
You are writing a graphics application targeting an Arm Immortalis
12-
GPU, and not hitting your desired performance. When running the Arm
13-
Frame Advisor tool, you spot that the draw calls in your shadow map
14-
creation pass have poor Vertex Memory Efficiency (VME) scores. How
15-
should you go about improving this?
11+
Imagine you're developing a graphics application targeting an Arm Immortalis GPU, but you're seeing subpar performance.
1612

17-
![Frame Advisor screenshot](fa-found-bad-vme-in-content-metrics.png)
13+
After profiling your frame with Arm Frame Advisor, you might notice that the shadow map draw calls have low Vertex Memory Efficiency (VME), as shown in the image below.
1814

19-
In this Learning Path, you will learn about a common source of rendering
20-
inefficiency, how to spot the issue using Arm Frame Advisor, and how
21-
to rectify it.
15+
This raises an important question: what's causing the inefficiency, and how can you fix it?
2216

17+
![Frame Advisor screenshot#center](fa-found-bad-vme-in-content-metrics.png "Arm Frame Advisor showing poor Vertex Memory Efficiency (VME) in shadow map draw calls.")
18+
19+
## Finding a solution
20+
21+
This Learning Path shows you approaches to addressing this problem, by demonstrating:
22+
23+
* Common sources of rendering inefficiencies.
24+
* How to identify and rectify issues using Arm Frame Advisor.
2325

2426
## Shadow mapping
2527

26-
In this scenario, draw calls in the shadow map render pass are the
27-
source of our poor VME scores. Let's start by reviewing exactly what
28-
these draws are doing.
28+
In this scenario, draw calls in the shadow map render pass are responsible for the low Vertex Memory Efficiency (VME) scores. To understand why, let's begin by reviewing what these draws are doing.
2929

30-
Shadow mapping is the mechanism that decides, for every visible pixel,
31-
whether it is lit or in shadow. A shadow map is a texture that is
32-
created as the first part of this process. It is rendered from the
33-
point of view of the light source, and stores the distance to all of
34-
the objects that light can see. Parts of a surface that are visible
35-
to the light are lit, and any part that is occluded must be in shadow.
30+
*Shadow mapping* is the mechanism that decides whether each visible pixel is lit or in shadow. The process begins by rendering a shadow map - a texture rendered from the point of view of the light source. This texture stores the distance to the nearest surfaces visible to the light.
31+
32+
During the final render pass, the GPU compares the depth of each pixel from the camera’s viewpoint to the corresponding value in the shadow map. If the pixel is farther away than what the light "sees," it’s considered occluded and rendered in shadow. Otherwise, it is lit.
3633

3734
## Mesh layout
3835

39-
The primary input into shadow map creation is the object geometry for
40-
all of the objects that cast shadows. In this scenario, let's
41-
assume that the vertex data for each object is stored in memory as an
42-
array structure, which is a commonly used layout in many applications:
36+
The primary input for shadow map creation is the geometry of all objects that cast shadows. In this scenario, assume that each object’s vertex data is stored in memory as an array structure, a layout commonly used in many applications:
4337

4438
``` C++
4539
struct Vertex {
4640
float position[3];
47-
float color[3].
41+
float color[3];
4842
float normal[3];
4943
};
5044

@@ -54,67 +48,46 @@ std::vector<Vertex> mesh {
5448

5549
```
5650
57-
This would give the mesh the following layout in memory:
51+
This gives the mesh the following layout in memory:
5852
59-
![Initial memory layout](initial-memory-layout.png)
53+
![Initial memory layout#center](initial-memory-layout.png "Initial memory layout")
6054
61-
## Why is this sub-optimal?
55+
## Why is this suboptimal?
6256
63-
This looks like a standard way of passing mesh data into a GPU,
57+
At first glance, this looks like a standard way of passing mesh data into a GPU,
6458
so where is the inefficiency coming from?
6559
66-
The vertex data that is defined contains all of the attributes that
67-
you need for your object, including those that are needed to compute
68-
color in the main lighting pass. When generating the shadow map,
69-
you only need to compute the position of the object, so most
70-
of your vertex attributes will be unused by the shadow map generation
71-
draw calls.
72-
73-
The inefficiency comes from how hardware gets the data it needs from
74-
main memory so that computation can proceed. Processors do not fetch
75-
single values from DRAM, but instead fetch a small neighborhood of
76-
data, because this is the most efficient way to read from DRAM. For Arm
77-
GPUs, the hardware will read an entire 64 byte cache line at a time.
60+
The vertex data that is defined contains all of the attributes that you need for your object, including those that are needed to compute color in the main lighting pass. When generating the shadow map, you only need to compute the position of the object, so most of your vertex attributes will be unused by the shadow map generation draw calls.
7861
79-
In this example, an attempt to fetch a vertex position during shadow
80-
map creation would also load the nearby color and normal values,
81-
even though you do not need them.
62+
The inefficiency comes from how GPUs fetch vertex data from main memory. GPUs don't retrieve individual values from DRAM. Instead, they fetch a small neighborhood of data at once, which is more efficient for memory access. On Arm GPUs, this typically means reading an entire 64-byte cache line at a time.
8263
64+
In this example, fetching a vertex position for shadow map rendering also loads the adjacent color and normal attributes into cache, even though they're not needed. This wastes memory bandwidth and contributes to poor Vertex Memory Efficiency (VME).
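
As a rough illustration of the bandwidth cost, the sketch below counts how many 64-byte cache lines a run of vertices covers in each case. It is a simplified model rather than a description of the hardware: it assumes 4-byte floats, tightly packed data, a buffer that starts on a cache-line boundary, and that every vertex in the run is read. The helper `cacheLinesTouched` and the constants are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>

// Assumption taken from the text: data is fetched in 64-byte cache lines.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical helper: number of cache lines covered by `count` tightly
// packed elements of `strideBytes` each, assuming the buffer starts on a
// cache-line boundary.
constexpr std::size_t cacheLinesTouched(std::size_t count, std::size_t strideBytes) {
    return (count * strideBytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// Interleaved vertex (position + color + normal): 9 floats.
constexpr std::size_t kInterleavedStride = 9 * sizeof(float);
// Position-only data: 3 floats.
constexpr std::size_t kPositionStride = 3 * sizeof(float);

static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// Reading positions for 16 vertices pulls in 9 cache lines (576 bytes) when
// the data is interleaved, but only 3 cache lines (192 bytes) when positions
// are stored on their own.
static_assert(cacheLinesTouched(16, kInterleavedStride) == 9, "interleaved");
static_assert(cacheLinesTouched(16, kPositionStride) == 3, "position only");
```

Under these assumptions, the shadow map pass fetches three times as much memory as it uses, because the color and normal values share cache lines with the positions.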
8365
84-
## Detecting a sub-optimal layout
66+
## Detecting a suboptimal layout
8567
86-
Arm Frame Advisor analyzes the attribute memory layout for each draw
87-
call the application makes, and provides the Vertex Memory Efficiency
88-
(VME) metric to show how efficiently that attribute layout is working.
68+
Arm Frame Advisor analyzes the vertex attribute memory layout for each draw call and reports a Vertex Memory Efficiency (VME) metric to show how efficiently the GPU accesses vertex data.
8969
90-
![Location of vertex memory efficiency in FA](fa-navigate-to-call.png)
70+
![Location of vertex memory efficiency in FA#center](fa-navigate-to-call.png "Location of vertex memory efficiency in Frame Advisor")
9171
92-
A VME of 1.0 would indicate that the draw call is making an optimal
93-
use of the memory bandwidth, with no unnecessary data fetches.
72+
A VME of 1.0 indicates that the draw call is making optimal use of memory bandwidth, with no unnecessary data fetches.
9473
95-
A VME of less than one indicates that unnecessary data is being loaded
96-
from memory, wasting bandwidth on data that is not being used in the
97-
computation on the GPU.
74+
A VME score below 1.0 indicates that unnecessary data is being loaded from memory, wasting bandwidth on attributes not being used in the computation on the GPU.
9875
99-
In this mesh layout you are only using 12 bytes for the `position`
100-
field, out of a total vertex size of 36 bytes, so your VME score would
101-
be only 0.33.
76+
In this mesh layout you are only using 12 bytes for the `position` field, out of a 36-byte vertex, resulting in a VME score of 0.33.
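
The arithmetic behind that score can be reproduced with a short sketch. It treats VME as the ratio of bytes the shader uses to bytes stored per vertex, which matches the 12-out-of-36 figure above. How Frame Advisor computes the metric internally is not described here, so `estimateVme` is a hypothetical helper and not part of any Arm tool or API.

```cpp
#include <cassert>
#include <cstddef>

// Same interleaved layout as the example mesh.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// Hypothetical helper: fraction of each vertex's bytes that the draw call
// actually consumes.
constexpr double estimateVme(std::size_t usedBytes, std::size_t vertexBytes) {
    return static_cast<double>(usedBytes) / static_cast<double>(vertexBytes);
}

static_assert(sizeof(Vertex) == 36, "this sketch assumes 4-byte floats and no padding");

// Shadow map pass: only `position` (12 bytes) is used out of 36 bytes.
constexpr double kShadowPassVme = estimateVme(sizeof(Vertex::position), sizeof(Vertex));

// Main lighting pass: all three attributes are used.
constexpr double kLightingPassVme = estimateVme(sizeof(Vertex), sizeof(Vertex));
```

With this layout, `kShadowPassVme` is one third and `kLightingPassVme` is 1.0, so only the shadow map draws are flagged.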
10277
78+
## Fixing a suboptimal layout
10379
104-
## Fixing a sub-optimal layout
80+
Shadow mapping only needs to load position, so to fix this issue you need to use a memory layout that allows position to be fetched in isolation from the other data. It is still preferable to leave the other attributes interleaved.
10581
106-
Shadow mapping only needs to load position, so to fix this issue you
107-
need to use a memory layout that allows position to be fetched in
108-
isolation from the other data. It is still preferable to leave the
109-
other attributes interleaved. On the CPU, this would look like the following:
82+
On the CPU, this looks like the following:
11083
11184
``` C++
11285
struct VertexPart1 {
11386
float position[3];
11487
};
11588
11689
struct VertexPart2 {
117-
float color[3].
90+
float color[3];
11891
float normal[3];
11992
};
12093
@@ -127,35 +100,14 @@ std::vector<VertexPart2> mesh {
127100
};
128101
```
129102

130-
This allows the shadow map creation pass to read only useful position
131-
data, without any waste. The main lighting pass that renders the full
132-
object will then read from both memory regions.
133-
134-
The good news is that this technique is actually a useful one to apply
135-
all of the time, even for the main lighting pass! Many mobile GPUs,
136-
including Arm GPUs, process geometry in two passes. The first pass
137-
computes only the primitive position, and second pass will process
138-
the remainder of the vertex shader only for the primitives that are
139-
visible after primitive culling has been performed. By splitting
140-
the position attributes into a separate stream, you avoid wasting
141-
memory bandwidth fetching non-position data for primitives that are
142-
ultimately discarded by primitive culling tests.
103+
This allows the shadow map creation pass to read only useful position data, without any waste. The main lighting pass that renders the full object will then read from both memory regions.
143104

105+
The good news is that this technique is actually a useful one to apply all of the time, even for the main lighting pass! Many mobile GPUs, including Arm GPUs, process geometry in two passes: an initial pass that computes only primitive positions, followed by a second pass that runs the full vertex shader only for primitives that survive culling. By placing position data in a separate buffer or stream, you reduce memory bandwidth wasted on fetching attributes like color or normals for primitives that are ultimately discarded.
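
If your assets are already stored in the interleaved layout, one way to produce the two streams is to convert them at load time. The following is a minimal sketch under that assumption. `SplitMesh` and `splitStreams` are hypothetical names introduced for illustration, and uploading the resulting arrays to GPU buffers is left to your graphics API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Original interleaved layout.
struct Vertex {
    float position[3];
    float color[3];
    float normal[3];
};

// Split layout: positions in one stream, remaining attributes in another.
struct VertexPart1 {
    float position[3];
};

struct VertexPart2 {
    float color[3];
    float normal[3];
};

// Hypothetical container holding both streams for one mesh.
struct SplitMesh {
    std::vector<VertexPart1> positions;   // read by the shadow map and lighting passes
    std::vector<VertexPart2> attributes;  // read by the lighting pass only
};

// Copies each interleaved vertex into the two separate streams, keeping
// the vertex order so that existing index buffers remain valid.
SplitMesh splitStreams(const std::vector<Vertex>& mesh) {
    SplitMesh out;
    out.positions.reserve(mesh.size());
    out.attributes.reserve(mesh.size());
    for (const Vertex& v : mesh) {
        VertexPart1 p{};
        std::copy(std::begin(v.position), std::end(v.position), std::begin(p.position));
        out.positions.push_back(p);

        VertexPart2 a{};
        std::copy(std::begin(v.color), std::end(v.color), std::begin(a.color));
        std::copy(std::begin(v.normal), std::end(v.normal), std::begin(a.normal));
        out.attributes.push_back(a);
    }
    return out;
}
```

The shadow map pass would then bind only the `positions` array as a vertex buffer, while the main lighting pass binds both arrays.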
144106

145-
# Conclusion
107+
## Conclusion
146108

147-
Arm Frame Advisor can give you actionable metrics that can identify
148-
specific inefficiencies in your application to optimize.
109+
Arm Frame Advisor provides actionable metrics that can help identify specific inefficiencies in your graphics application. The Vertex Memory Efficiency metric measures how efficiently you are using your input vertex memory bandwidth, indicating what proportion of the input data is actually consumed by the shader program. You can improve VME by adjusting your vertex memory layout to separate attribute data into distinct streams, ensuring that each render pass only loads the data it needs. Avoid packing unused attributes into memory regions accessed by draw calls, as this wastes bandwidth and reduces performance.
149110

150-
The VME metric shows how efficiently you are using your input
151-
vertex memory bandwidth, indicating what proportion of the input
152-
data is actually used by the shader program. VME can be improved by
153-
changing vertex memory layout to separate the different streams of
154-
data such that only the data needed for type of computation is packed
155-
together. Try not to mix data in that a computation would not use.
156111

157-
# Other links
158112

159-
Arm's advice on [attribute layouts][2]
160113

161-
[2]: https://developer.arm.com/documentation/101897/0304/Vertex-shading/Attribute-layout

content/learning-paths/servers-and-cloud-computing/_index.md

Lines changed: 7 additions & 6 deletions
@@ -8,7 +8,7 @@ key_ip:
88
maintopic: true
99
operatingsystems_filter:
1010
- Android: 2
11-
- Linux: 143
11+
- Linux: 146
1212
- macOS: 10
1313
- Windows: 14
1414
pinned_modules:
@@ -23,7 +23,7 @@ subjects_filter:
2323
- Databases: 15
2424
- Libraries: 9
2525
- ML: 27
26-
- Performance and Architecture: 53
26+
- Performance and Architecture: 56
2727
- Storage: 1
2828
- Web: 10
2929
subtitle: Optimize cloud native apps on Arm for performance and cost
@@ -98,6 +98,7 @@ tools_software_languages_filter:
9898
- Hugging Face: 9
9999
- InnoDB: 1
100100
- Intrinsics: 1
101+
- iperf3: 1
101102
- Java: 3
102103
- JAX: 1
103104
- Kafka: 1
@@ -117,8 +118,8 @@ tools_software_languages_filter:
117118
- MongoDB: 2
118119
- mpi: 1
119120
- MySQL: 9
121+
- NEON: 4
120122
- Neon: 3
121-
- NEON: 2
122123
- Nexmark: 1
123124
- Nginx: 3
124125
- Node.js: 3
@@ -135,16 +136,16 @@ tools_software_languages_filter:
135136
- Redis: 3
136137
- Remote.It: 2
137138
- RME: 6
138-
- Runbook: 68
139+
- Runbook: 70
139140
- Rust: 2
140141
- snappy: 1
141142
- Snort3: 1
142143
- SQL: 7
143144
- Streamline CLI: 1
144145
- Streamlit: 2
145146
- Supervisor: 1
146-
- SVE: 4
147-
- SVE2: 1
147+
- SVE: 5
148+
- SVE2: 2
148149
- Sysbench: 1
149150
- Telemetry: 1
150151
- TensorFlow: 2
Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,54 @@
1+
---
2+
title: Accelerate Bitmap Scanning with NEON and SVE Instructions on Arm servers
3+
4+
draft: true
5+
cascade:
6+
draft: true
7+
8+
minutes_to_complete: 20
9+
10+
who_is_this_for: This is an introductory topic for database developers, performance engineers, and anyone optimizing data processing workloads on Arm-based cloud instances.
11+
12+
13+
learning_objectives:
14+
- Understand bitmap scanning operations in database systems
15+
- Implement bitmap scanning with scalar, NEON, and SVE instructions
16+
- Compare performance between different implementations
17+
- Measure performance improvements on Graviton4 instances
18+
19+
prerequisites:
20+
- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
21+
cloud service provider.
22+
23+
author: Pareena Verma
24+
25+
26+
### Tags
27+
skilllevels: Introductory
28+
subjects: Performance and Architecture
29+
armips:
30+
- Neoverse
31+
operatingsystems:
32+
- Linux
33+
tools_software_languages:
34+
- SVE
35+
- NEON
36+
- Runbook
37+
38+
further_reading:
39+
- resource:
40+
title: Accelerate multi-token search in strings with SVE2 SVMATCH instruction
41+
link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-token-search-strings-svmatch-instruction
42+
type: blog
43+
- resource:
44+
title: Arm SVE2 Programming Guide
45+
link: https://developer.arm.com/documentation/102340/latest/
46+
type: documentation
47+
48+
### FIXED, DO NOT MODIFY
49+
# ================================================================================
50+
weight: 1 # _index.md always has weight of 1 to order correctly
51+
layout: "learningpathall" # All files under learning paths have this same wrapper
52+
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
53+
layout: learningpathall
54+
---
