---
title: "From 2D to BEV: How LSS Reshapes Autonomous Driving Perception"
tags:
  - BEV
  - LSS
categories:
  - Algorithms
date: 2025-03-21 19:20:16
---

## 1. BEV and LSS

**BEV** (Bird's-Eye View) is a top-down perspective that is more intuitive and information-rich than the camera view. It is well suited to multi-sensor fusion, and in particular it provides more consistent information for cross-camera object tracking.

**LSS** (Lift-Splat-Shoot) is a technique that converts multi-view images into a BEV representation. As a classic method in the BEV family, LSS balances efficiency and accuracy, and it is widely used in autonomous driving, robot perception, and related tasks.

Paper: [Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123590188.pdf)

Code: [https://github.com/nv-tlabs/lift-splat-shoot](https://github.com/nv-tlabs/lift-splat-shoot)

## 2. Motivation

- First estimate depth and extract image features, then project them into the BEV view
- Fuse features in the BEV view, and run detection, planning, and other tasks on the fused feature map

## 3. Method

### 3.1 Lift

A 2D image (W x H x 3) is augmented with depth information and lifted to 3D (W x H x D), where D indexes a set of discrete depth bins. Learning a C-dimensional feature for each depth bin yields a frustum-shaped point cloud of shape W x H x D x C.



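
The lift step above can be sketched as an outer product between a per-pixel depth distribution and a per-pixel context feature. This is a minimal illustration; the tensor names and sizes here are made up, not the paper's actual configuration:

```python
import torch

# illustrative sizes: D depth bins, C feature channels, H x W feature map
D, C, H, W = 4, 8, 3, 5

depth_logits = torch.randn(D, H, W)   # predicted per-pixel depth scores
context = torch.randn(C, H, W)        # per-pixel context features

alpha = depth_logits.softmax(dim=0)   # depth distribution, sums to 1 over D
# outer product: each depth bin receives the context vector scaled by its probability
frustum_feats = alpha.unsqueeze(1) * context.unsqueeze(0)  # D x C x H x W

assert frustum_feats.shape == (D, C, H, W)
# summing over depth recovers the context features, since alpha sums to 1
assert torch.allclose(frustum_feats.sum(dim=0), context, atol=1e-5)
```

Because the depth weights sum to one, a pixel with a confident depth estimate concentrates its feature at a single depth bin, while an uncertain pixel smears it along the ray.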
### 3.2 Splat

The image features are projected into BEV space, and features falling into the same cell are summed using the cumsum trick, producing a BEV feature map of shape C x X x Y. At this point the BEV feature extraction is complete.



The summation is needed because multiple frustum points can land in the same BEV cell. This happens in two cases:

- Points at different heights fall into the same cell, e.g. different pixels on a utility pole
- Cameras have overlapping fields of view, so the same object observed by two cameras lands in the same BEV cell
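
A toy illustration of this summation, using `index_add_` as a stand-in for what the cumsum trick computes; the cell indices and features below are made up:

```python
import torch

C = 3          # feature channels
num_cells = 4  # flattened BEV grid cells
feats = torch.tensor([[1., 0., 0.],   # point A -> cell 2
                      [0., 1., 0.],   # point B -> cell 2 (same cell, e.g. camera overlap)
                      [0., 0., 1.]])  # point C -> cell 0
cell_idx = torch.tensor([2, 2, 0])

bev = torch.zeros(num_cells, C)
# sum the features of all points that land in the same BEV cell
bev.index_add_(0, cell_idx, feats)

assert torch.equal(bev[2], torch.tensor([1., 1., 0.]))  # A + B summed
assert torch.equal(bev[0], torch.tensor([0., 0., 1.]))
```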

### 3.3 Shoot

Candidate trajectories are projected onto the BEV space via templates, and a cost is computed for each trajectory, enabling end-to-end motion planning. This step is icing on the cake rather than the core contribution.


## 4. Training

### 4.1 Loss

- The image backbone is EfficientNet. Depth estimation comes from pretraining; detected objects must be labeled with their projections in the BEV view
- Supervision comes from instance segmentation results and drivable-area maps; the loss is the cross-entropy between the prediction and the ground truth

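A minimal sketch of this supervision, assuming a binary BEV target (e.g. vehicle occupancy) and raw logits from the network; all names and sizes here are illustrative:

```python
import torch
import torch.nn.functional as F

B, X, Y = 2, 8, 8                      # batch size and BEV grid size (illustrative)
pred_logits = torch.randn(B, 1, X, Y)  # raw BEV predictions from the network
gt = torch.randint(0, 2, (B, 1, X, Y)).float()  # binary ground-truth occupancy

# cross-entropy between prediction and ground truth, as described above
loss = F.binary_cross_entropy_with_logits(pred_logits, gt)
assert loss.item() >= 0.0
```
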
## 5. Code Walkthrough

First, look at the model's initialization, which contains three modules:
- camencode: image feature extraction
- bevencode: BEV detection backbone
- frustum: the frustum, used to convert between image-space point-cloud coordinates and BEV grid coordinates

The remaining attributes are model parameters whose meanings are clear from their names, so they are not expanded on here.

```python
class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        self.grid_conf = grid_conf
        self.data_aug_conf = data_aug_conf

        dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
                               self.grid_conf['ybound'],
                               self.grid_conf['zbound'],
                               )
        self.dx = nn.Parameter(dx, requires_grad=False)
        self.bx = nn.Parameter(bx, requires_grad=False)
        self.nx = nn.Parameter(nx, requires_grad=False)

        self.downsample = 16
        self.camC = 64
        self.frustum = self.create_frustum()
        self.D, _, _, _ = self.frustum.shape
        self.camencode = CamEncode(self.D, self.camC, self.downsample)
        self.bevencode = BevEncode(inC=self.camC, outC=outC)

        # toggle using QuickCumsum vs. autograd
        self.use_quickcumsum = True
```

The inference pipeline is shown below. forward() calls get_voxels(), and inside get_voxels():
- get_geometry() performs the Lift step
- voxel_pooling() performs the Splat step

The following subsections cover these two steps in detail. First, the parameters of the forward function:
- x: the B x N input images
- rots, trans: camera extrinsics, expressed as rotation and translation matrices
- intrins: camera intrinsics
- post_rots, post_trans: the rotation and translation applied during image augmentation, used at training time to undo the pose change that augmentation introduces

```python
def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
    geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
    x = self.get_cam_feats(x)

    x = self.voxel_pooling(geom, x)

    return x

def forward(self, x, rots, trans, intrins, post_rots, post_trans):
    x = self.get_voxels(x, rots, trans, intrins, post_rots, post_trans)
    x = self.bevencode(x)
    return x
```

### 5.1 Lift

The Lift step has two parts:

- Generate the transform from image coordinates to frustum coordinates
- Compute the transform from frustum coordinates to the ego (BEV) frame

The first part is straightforward given the pinhole camera model; the resulting x, y are pixel coordinates at unit depth.

```python
def create_frustum(self):
    # make grid in image plane
    ogfH, ogfW = self.data_aug_conf['final_dim']
    fH, fW = ogfH // self.downsample, ogfW // self.downsample
    # ds: D x fH x fW, the depth of each point
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    D, _, _ = ds.shape
    # xs: D x fH x fW, the x coordinate of each point
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
    # ys: D x fH x fW, the y coordinate of each point
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)

    # D x H x W x 3
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)
```
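
To see what this produces, the same construction can be run standalone. The sizes below (a 128 x 352 input, depth bins from 4 m to 44 m in 1 m steps) are assumptions chosen for illustration, not values taken from this code:

```python
import torch

ogfH, ogfW, downsample = 128, 352, 16   # assumed input size and downsample factor
fH, fW = ogfH // downsample, ogfW // downsample
dbound = [4.0, 45.0, 1.0]               # assumed depth bins: 4 m to 44 m, 1 m apart

ds = torch.arange(*dbound).view(-1, 1, 1).expand(-1, fH, fW)
D = ds.shape[0]
xs = torch.linspace(0, ogfW - 1, fW).view(1, 1, fW).expand(D, fH, fW)
ys = torch.linspace(0, ogfH - 1, fH).view(1, fH, 1).expand(D, fH, fW)
frustum = torch.stack((xs, ys, ds), -1)

# one (x, y, depth) triple per depth bin per downsampled pixel
assert frustum.shape == (41, 8, 22, 3)  # D x fH x fW x 3
```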

The second part (get_geometry) has three steps:
- Undo the pose change introduced by image augmentation
- Move into camera coordinates by multiplying by the inverse intrinsics. Note that the x, y from the previous step are coordinates at unit depth and are identical across depth bins, so they must first be multiplied by the depth d to recover real-world coordinates
- Transform into the BEV (ego) frame via the extrinsics

```python
def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """Determine the (x,y,z) locations (in the ego frame)
    of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape

    # undo the image transforms applied during augmentation
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # cam_to_ego
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)

    return points
```
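
The "multiply x, y by depth, then apply the inverse intrinsics" logic in `cam_to_ego` can be checked on a single pixel with a toy intrinsics matrix (the numbers below are made up for illustration):

```python
import torch

# toy pinhole intrinsics: fx = fy = 100, principal point (50, 50)
K = torch.tensor([[100., 0., 50.],
                  [0., 100., 50.],
                  [0., 0., 1.]])

u, v, d = 150., 50., 2.0                  # pixel (u, v) at depth d
# scale pixel coordinates by depth, as get_geometry does, then unproject
p = torch.tensor([u * d, v * d, d])
cam = torch.inverse(K) @ p                # point in camera coordinates

# x = (u - cx) / fx * d = (150 - 50) / 100 * 2 = 2.0, and z equals the depth
assert torch.allclose(cam, torch.tensor([2.0, 0.0, 2.0]), atol=1e-4)
```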

### 5.2 Splat

- First, flatten the input tensor to obtain one-dimensional indices
- Because the 1-D indices are generated in BEV space in X -> Y -> Z -> B order, neighboring voxels receive neighboring indices
- Reorder the camera features by the sorted indices, so that features of neighboring voxels are also adjacent in memory
- Run a cumulative sum over the sorted features, which makes the subsequent pooling and its gradient cheap to compute

Note: sum pooling is used here, rather than max or average pooling, so that voxels with strong feature responses are preserved. Qualitatively, features that are observed more often end up with a stronger response in BEV space.

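The `cumsum_trick` itself is not shown in this snippet. A simplified standalone version (features only, without carrying `geom_feats` along as the repo's implementation does) looks like this:

```python
import torch

def cumsum_trick(x, ranks):
    """Sum-pool features x (N x C) over groups of equal, sorted rank ids (N,)."""
    x = x.cumsum(0)
    # keep only the last element of each run of equal ranks
    kept = torch.ones(x.shape[0], dtype=torch.bool)
    kept[:-1] = ranks[1:] != ranks[:-1]
    x = x[kept]
    # subtract the previous group's cumulative sum to recover per-group sums
    x = torch.cat((x[:1], x[1:] - x[:-1]))
    return x

# three points: the first two share a voxel (rank 0), the third is alone (rank 1)
feats = torch.tensor([[1., 2.], [3., 4.], [5., 6.]])
ranks = torch.tensor([0, 0, 1])
pooled = cumsum_trick(feats, ranks)
assert torch.equal(pooled, torch.tensor([[4., 6.], [5., 6.]]))
```

One cumulative sum plus one gather replaces a per-voxel loop, which is why the trick is fast, and its gradient is equally simple to express.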
```python
def voxel_pooling(self, geom_feats, x):
    B, N, D, H, W, C = x.shape
    Nprime = B*N*D*H*W

    # flatten x
    x = x.reshape(Nprime, C)

    # flatten indices
    geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()
    geom_feats = geom_feats.view(Nprime, 3)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                          device=x.device, dtype=torch.long) for ix in range(B)])
    geom_feats = torch.cat((geom_feats, batch_ix), 1)

    # filter out points that are outside box
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]
    geom_feats = geom_feats[kept]

    # get tensors from the same voxel next to each other
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
        + geom_feats[:, 1] * (self.nx[2] * B)\
        + geom_feats[:, 2] * B\
        + geom_feats[:, 3]
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

    # cumsum trick
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # griddify (B x C x Z x X x Y)
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x

    # collapse Z
    final = torch.cat(final.unbind(dim=2), 1)

    return final
```

## 6. Summary

- The overall idea of LSS is quite intuitive: image features are projected onto the BEV plane according to their depth, and a pooling step yields the BEV feature map
- Learning the depth distribution D and the features C separately, then taking their outer product to form a D x C tensor, is a clever design
- The cumsum trick, while elegant, adds complexity for the reader. It is purely an acceleration; the operation is still sum pooling, and the same result should be obtainable without it

Next up: BEVFormer.