---
title: "From 2D to BEV: How Does LSS Reshape Autonomous-Driving Perception?"
tags:
- BEV
- LSS
categories:
- Algorithms
date: 2025-03-21 19:20:16
---

## 1. BEV and LSS

**BEV** (Bird's-Eye View) is a viewpoint that is more intuitive and information-rich than the camera view. It is well suited to multi-sensor fusion, and for cross-camera object tracking in particular it provides much more consistent information.

**LSS** (Lift-Splat-Shoot) is a technique for converting multi-view images into a BEV representation. As a classic method in the BEV field, LSS balances efficiency and accuracy, and it is widely used in autonomous driving, robot perception, and related tasks.

Paper: [Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590188.pdf)

Code: [https://github.com/nv-tlabs/lift-splat-shoot](https://github.com/nv-tlabs/lift-splat-shoot)

## 2. Motivation

- First estimate depth and extract image features, then project them into the BEV view
- Fuse features in the BEV view, and run detection, planning, and other tasks on the fused feature map

## 3. Method

### 3.1 Lift

The 2D image (W×H×3) is augmented with depth information and lifted to 3D (W×H×D), where D is the number of discrete depth bins. Learning a C-dimensional feature for each depth then yields a frustum point cloud of shape W×H×D×C.

![LSS-Lift](https://s21.ax1x.com/2025/03/17/pEdFjnx.md.png)
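The core of this lifting step — a per-pixel depth distribution combined with a per-pixel context feature via an outer product — can be sketched as follows (the tensor sizes are illustrative, not the exact values used in the repo):

```python
import torch

B, C, D, H, W = 1, 64, 41, 8, 22  # illustrative sizes

# per-pixel depth logits and context features (stand-ins for network outputs)
depth_logits = torch.randn(B, D, H, W)
context = torch.randn(B, C, H, W)

depth = depth_logits.softmax(dim=1)  # B x D x H x W, sums to 1 over depth

# outer product: each pixel's C-dim feature is spread across D depth bins
frustum_feats = depth.unsqueeze(1) * context.unsqueeze(2)  # B x C x D x H x W

assert frustum_feats.shape == (B, C, D, H, W)
```

Because the depth weights sum to one, summing the frustum features over the depth axis recovers the original context feature — the network only redistributes each feature along the ray.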

### 3.2 Splat

The image features are projected into BEV space, and the features falling into each cell are summed using the cumsum trick, producing a BEV feature map of shape C×X×Y. At this point the BEV feature extraction is complete.

![LSS-Splat](https://s21.ax1x.com/2025/03/17/pEdkP9H.md.png)

The summation is needed because multiple frustum points can land in the same BEV cell, which happens in two situations:

- Frustum points at different heights fall into the same cell, e.g. pixels at different heights on a utility pole
- Cameras have overlapping fields of view, so the same object observed by two cameras lands in the same BEV cell

### 3.3 Shoot

Candidate trajectories are projected onto the BEV space via templates and each trajectory's cost is computed, enabling end-to-end trajectory planning. This step is icing on the cake.

![LSS-Shoot](https://s21.ax1x.com/2025/03/17/pEdkpND.md.png)

## 4. Training

### 4.1 Cost

- The image backbone is a pretrained EfficientNet, from which the depth estimate is obtained; the labels are the BEV projections of the annotated objects
- The ground truth consists of instance segmentation results and the drivable area; the loss is the cross-entropy between predictions and ground truth
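The cross-entropy objective can be sketched as a binary segmentation loss over the BEV grid. This is a minimal illustration, not the repo's exact training loop; the shapes and the `pos_weight` value below are assumptions (a positive-class weight is commonly used because occupied BEV cells are rare):

```python
import torch
import torch.nn as nn

# binary cross-entropy over BEV logits; pos_weight upweights the rare
# positive (occupied) cells — the value here is an assumed example
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(2.13))

preds = torch.randn(4, 1, 200, 200)                   # BEV logits from the model
target = (torch.rand(4, 1, 200, 200) > 0.9).float()   # stand-in binary GT map

loss = loss_fn(preds, target)
```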

## 5. Code Walkthrough

Start with the model's initialization function, which contains three modules:

- camencode: image feature extraction
- bevencode: BEV detection backbone
- frustum: the frustum, used to convert between image point-cloud coordinates and BEV grid coordinates

The remaining variables are model parameters whose meanings are clear from their names, so they are not expanded on here.

```python
class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        self.grid_conf = grid_conf
        self.data_aug_conf = data_aug_conf

        dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
                               self.grid_conf['ybound'],
                               self.grid_conf['zbound'],
                               )
        self.dx = nn.Parameter(dx, requires_grad=False)
        self.bx = nn.Parameter(bx, requires_grad=False)
        self.nx = nn.Parameter(nx, requires_grad=False)

        self.downsample = 16
        self.camC = 64
        self.frustum = self.create_frustum()
        self.D, _, _, _ = self.frustum.shape
        self.camencode = CamEncode(self.D, self.camC, self.downsample)
        self.bevencode = BevEncode(inC=self.camC, outC=outC)

        # toggle using QuickCumsum vs. autograd
        self.use_quickcumsum = True
```
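For reference, `gen_dx_bx` derives, per axis, the cell size `dx`, the center of the first cell `bx`, and the number of cells `nx` from `[lower, upper, step]` bounds. A sketch consistent with how `dx`/`bx`/`nx` are used later (the bound values below are just an example):

```python
import torch

def gen_dx_bx(xbound, ybound, zbound):
    # each bound is [lower, upper, step]
    bounds = [xbound, ybound, zbound]
    dx = torch.tensor([row[2] for row in bounds])               # cell size per axis
    bx = torch.tensor([row[0] + row[2] / 2.0 for row in bounds])  # center of the first cell
    nx = torch.tensor([int((row[1] - row[0]) / row[2]) for row in bounds])  # cell counts
    return dx, bx, nx

# example: a 100 m x 100 m area at 0.5 m resolution, one 20 m-tall Z slab
dx, bx, nx = gen_dx_bx([-50.0, 50.0, 0.5], [-50.0, 50.0, 0.5], [-10.0, 10.0, 20.0])
assert nx.tolist() == [200, 200, 1]  # a 200 x 200 BEV grid
```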

The model's inference pipeline is as follows. forward() calls get_voxels(), inside which:

- get_geometry() performs the Lift step
- voxel_pooling() performs the Splat step

The next subsections cover these two steps in detail. First, the parameters of the forward function:

- x: B×N images
- rots, trans: camera extrinsics, expressed as rotation and translation matrices
- intrins: camera intrinsics
- post_rots, post_trans: the rotation and translation matrices applied during image augmentation, used at training time to undo the pose changes the augmentation introduced

```python
def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
    geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
    x = self.get_cam_feats(x)

    x = self.voxel_pooling(geom, x)

    return x

def forward(self, x, rots, trans, intrins, post_rots, post_trans):
    x = self.get_voxels(x, rots, trans, intrins, post_rots, post_trans)
    x = self.bevencode(x)
    return x
```

### 5.1 Lift

The Lift step consists of two parts:

- generating the image-coordinate → frustum-coordinate transform
- computing the frustum-coordinate → projected-coordinate transform

The first part is straightforward given the standard pinhole camera model; the resulting xy are pixel coordinates at unit depth.

```python
def create_frustum(self):
    # make grid in image plane
    ogfH, ogfW = self.data_aug_conf['final_dim']
    fH, fW = ogfH // self.downsample, ogfW // self.downsample
    # ds: D x fH x fW, the depth of each point
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    D, _, _ = ds.shape
    # xs: D x fH x fW, the x coordinate of each point
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
    # ys: D x fH x fW, the y coordinate of each point
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)

    # D x H x W x 3
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)
```
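As a concrete sanity check, with example values dbound = [4.0, 45.0, 1.0], final_dim = (128, 352), and downsample = 16 (assumed config values, chosen to match the shapes discussed above), the frustum comes out D × fH × fW = 41 × 8 × 22:

```python
import torch

# assumed example config: depth bins from 4 m to 44 m in 1 m steps
dbound = [4.0, 45.0, 1.0]
ogfH, ogfW = 128, 352   # final image size after augmentation
downsample = 16

fH, fW = ogfH // downsample, ogfW // downsample
ds = torch.arange(*dbound).view(-1, 1, 1).expand(-1, fH, fW)
D = ds.shape[0]

assert (D, fH, fW) == (41, 8, 22)
```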

The second part has three main steps:

- undo the pose transform applied by camera augmentation
- convert to camera coordinates by multiplying by the inverse of the intrinsics; note that the xy from the previous step are pixel coordinates at unit depth (the same xy for every depth), so they must be multiplied by the depth d to recover real-world coordinates
- transform into the BEV (ego) frame via the extrinsics

```python
def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """Determine the (x,y,z) locations (in the ego frame)
    of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape

    # undo the image transforms applied during augmentation
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # cam_to_ego
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)

    return points
```
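The cam_to_ego block is the standard pinhole unprojection: a pixel (u, v) at depth d maps to d · K⁻¹ · (u, v, 1)ᵀ in camera coordinates, after which the extrinsics rotate and translate it into the ego frame. A tiny standalone check with made-up intrinsics:

```python
import torch

# made-up intrinsics: focal length 500 px, principal point (320, 240)
K = torch.tensor([[500.,   0., 320.],
                  [  0., 500., 240.],
                  [  0.,   0.,   1.]])

u, v, d = 400., 300., 10.0  # pixel coordinates and depth in meters

# form (u*d, v*d, d) and multiply by K^-1, mirroring get_geometry's cam_to_ego
pix = torch.tensor([u * d, v * d, d])
cam_pt = torch.inverse(K) @ pix

# x = (u - cx) * d / fx = 1.6, y = (v - cy) * d / fy = 1.2, z = d = 10.0
assert torch.allclose(cam_pt, torch.tensor([1.6, 1.2, 10.0]), atol=1e-4)
```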

### 5.2 Splat

- First, flatten the input tensor to obtain one-dimensional indices
- Because the 1-D indices are generated over the BEV space in X → Y → Z → B order, adjacent voxels get adjacent indices
- Reorder the camera features by the sorted indices, so that features belonging to adjacent voxels are also adjacent
- Run a cumsum over the sorted features, which makes the subsequent feature computation and differentiation convenient

Note: sum pooling is used here, rather than max or average pooling, to preserve voxels with strong feature responses. Qualitatively, features that are observed more often end up with stronger responses in BEV space.

```python
def voxel_pooling(self, geom_feats, x):
    B, N, D, H, W, C = x.shape
    Nprime = B*N*D*H*W

    # flatten x
    x = x.reshape(Nprime, C)

    # flatten indices
    geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()
    geom_feats = geom_feats.view(Nprime, 3)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                          device=x.device, dtype=torch.long) for ix in range(B)])
    geom_feats = torch.cat((geom_feats, batch_ix), 1)

    # filter out points that are outside box
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]
    geom_feats = geom_feats[kept]

    # get tensors from the same voxel next to each other
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
        + geom_feats[:, 1] * (self.nx[2] * B)\
        + geom_feats[:, 2] * B\
        + geom_feats[:, 3]
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

    # cumsum trick
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # griddify (B x C x Z x X x Y)
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x

    # collapse Z
    final = torch.cat(final.unbind(dim=2), 1)

    return final
```
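The cumsum trick itself implements sum pooling over the sorted ranks: take a running sum of the features, keep only the last row of each rank group, then subtract the previous group's running total. A minimal sketch with toy data (this mirrors the forward computation only, not QuickCumsum's custom backward pass):

```python
import torch

# toy features, already sorted so equal ranks are adjacent
x = torch.tensor([[1.], [2.], [3.], [10.], [20.]])
ranks = torch.tensor([0, 0, 0, 1, 1])

x = x.cumsum(0)  # running sum: [1, 3, 6, 16, 36]

# keep only the last row of each rank group (where the rank changes)
kept = torch.ones(x.shape[0], dtype=torch.bool)
kept[:-1] = ranks[1:] != ranks[:-1]
x = x[kept]      # [6, 36]

# subtract the previous group's running total to isolate per-group sums
x = torch.cat((x[:1], x[1:] - x[:-1]))

assert torch.equal(x, torch.tensor([[6.], [30.]]))  # 1+2+3 and 10+20
```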

## 6. Summary

- The overall idea of LSS is quite intuitive: image features are projected onto the BEV plane according to depth, and a pooling step produces the BEV feature
- Learning a depth distribution D and features C separately, then taking their outer product to get a D×C tensor, is a clever design
- The cumsum trick is ingenious but adds to the reading difficulty; it exists purely for speed, and at heart it is still sum pooling, so the same result should be obtainable without it

Next up: BEVFormer.
