gavinsun0921.github.io/index.json at main · GavinSun0921/gavinsun0921.github.io · GitHub

[{"content":"TL;DR 动态场景的 3D 重建一直是个硬骨头，通常需要堆叠光流、深度、位姿等多个模型。Google DeepMind 刚刚发布的 D4RT 提出了一种大道至简的思路：将所有几何任务降维成一个通用的“坐标查询”函数。它不仅在单次前馈中解决了 SLAM + 重建 + 跟踪，还跑出了 200+ FPS 的惊人速度。\n1. 痛点：动态世界的“拼图游戏” 在计算机视觉领域，如果你想从一段视频中重建出 4D 动态场景（即 3D 几何 + 时间），传统的做法往往很累。\n你需要一个模型算光流（Optical Flow），一个模型估深度（Depth），再来一个系统算相机位姿（SfM/SLAM）。像 MegaSaM 这样的工作就是典型的“拼装车”，虽然效果不错，但系统极其复杂，且需要昂贵的测试时优化（Test-time Optimization）。而像 VGGT 这样的端到端模型，虽然快了，但往往通过不同的 Head 去回归不同的任务，导致几何一致性较差。\nDeepMind 的 D4RT (Dynamic 4D Reconstruction and Tracking) 给了我们一个全新的解法：能不能不要那么多 Head，只用一个 Decoder 搞定所有事？\n2. 核心方法：万物皆可 Query D4RT 的架构核心是一个**以查询为中心（Query-centric）**的 Transformer 解码器。\n不同于输出固定的深度图或体素，D4RT 将解码器看作一个函数接口。你给它一个查询 $q$，它还你一个 3D 坐标 $P$：\n$$ P = \\mathcal{D}(q, F) $$\n其中 $F$ 是编码器提取的全局场景表示。最精妙的地方在于这个 $q$ 的设计：\n$$ q = (u, v, t_{src}, t_{tgt}, t_{cam}) $$\n这 5 个参数实现了空间、时间和参考系的完全解耦：\n$(u, v)$: 你关注的像素点在哪里？ $t_{src}$: 这个点是在哪一帧被选中的？ $t_{tgt}$: 你想知道它在哪个时刻的状态？ $t_{cam}$: 你希望输出的坐标基于哪个时刻的相机坐标系？ 这种设计带来了什么？（One Interface for All）\n通过排列组合这几个参数，D4RT 在同一个模型内实现了所有几何任务，如下图所示，同一个 Encoder，同一个 Decoder，仅仅是输入参数的不同，就能输出轨迹、深度或点云。\nFig. 1. D4RT is a unified, efficient, feedforward method for Dynamic 4D Reconstruction and Tracking, unlocking a variety of outputs including point cloud (Orange), point tracks (Blue), camera parameters (Pink) through a single interface.\nTab. 1. Unified decoding – A diverse set of geometry-related tasks can be inferred by querying the Cartesian product of the respective entries.\n3. D4RT 的位姿估计比单独 Pose Head 还准？ 在评估中，D4RT 在相机位姿估计（Camera Pose Estimation）上击败了专门设计 Pose 回归头的 VGGT。这反直觉吗？并不。\nD4RT 揭示了一个深刻的工程哲学：让神经网络做它擅长的，让数学做它擅长的。\nVGGT 的做法：让网络直接回归旋转矩阵 $R$ 和平移向量 $T$。这是个黑盒过程，网络得隐式学习刚体变换规律，很难泛化。 D4RT 的做法：网络只负责预测 3D 点的对应关系（这是感知的强项）。拿到两组对应的 3D 点云后，D4RT 使用经典的 Umeyama 算法（SVD 分解）来计算位姿 。 这是一个具有数学封闭解（Closed-form solution）的过程。只要网络能把特征点匹配对，位姿解算就是纯几何的、精确的“集体投票”过程，天然比单点回归更鲁棒。\n4. 
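The Devil Is in the Details: Local RGB Patch Transformers are powerful, but they tend to blur high-frequency details such as hair strands and edges. D4RT introduces a simple trick: the Local RGB Patch.\nWhen building a query, besides the coordinate $(u,v)$, the authors also feed in the $9\\times9$ RGB pixel patch around that coordinate as an embedding. This effectively hands the decoder a “cheat sheet” that says: “this is what the point you are looking for looks like.”\nExperiments show that this markedly sharpens edge details and depth maps and achieves **subpixel** precision.\n5. Performance and Speed D4RT is a purely feedforward network and needs no test-time optimization.\nSpeed: on an A100 GPU, D4RT's camera pose estimation throughput reaches 200+ FPS. By comparison, MegaSaM runs at only about 2 FPS and VGGT at about 20 FPS. Accuracy: on Sintel (dynamic scenes) and ScanNet (static scenes), both depth error and pose error are better than the current SOTA. Fig. 2. Pose accuracy vs. speed.\n6. Summary and Reflections D4RT is a paper with real engineering elegance. Rather than stacking complex modules, it designs a sufficiently flexible underlying representation (General Scene Representation) and a general query interface, which resolves the long-standing “task fragmentation” problem in 4D reconstruction.\nTakeaways for follow-up work:\nDecoupling is key: decoupling time, space, and reference frame brings great flexibility. Explicit geometry still matters: do not blindly trust end-to-end regression; bringing in classical geometric solvers (such as SVD and Umeyama) where appropriate improves generalization and interpretability. Efficiency from sparse queries: the query-based design lets us compute only a small number of points during training (fast) and the full image on demand at inference (complete), which is key to real-time 4D perception. ","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-09/","summary":"Dynamic 3D scene reconstruction has long been a hard nut to crack and usually requires stacking several models for optical flow, depth, and pose. D4RT, just released by Google DeepMind, takes a radically simple approach: it reduces all geometric tasks to one general “coordinate query” function. It not only handles SLAM + reconstruction + tracking in a single feedforward pass, but also runs at an impressive 200+ FPS.","title":"[arXiv'2512] Efficiently Reconstructing Dynamic Scenes One 🎯 D4RT at a Time Reading Notes"},{"content":" This paper proposes a new method for reconstructing dynamic scenes from monocular video and estimating long-range 3D motion trajectories.\n1. Background and Motivation (Motivation \u0026amp; Challenges) 1.1 Problem Definition The work tackles the problem of recovering the geometry (3D Geometry) and long-range 3D motion trajectories (Long-range 3D Motion) of complex dynamic scenes from **monocular video**.\n1.2 Existing Challenges First, the problem is ill-posed: recovering a dynamic 3D scene from a single viewpoint is extremely hard, because at each instant the moving objects are observed from only one viewpoint.\nLimitations of existing methods:\nMost methods rely on multi-view video or LiDAR depth sensors. Existing monocular methods usually model only short-range scene flow, or apply only to quasi-static scenes or camera-teleportation settings, and cannot capture persistent long-range 3D trajectories. Pure 2D tracking methods (such as TAPIR) are powerful but lack 3D geometry and motion awareness. 1.3 Key Insights Low-dimensional motion: although 2D dynamics in image space can be complex, the underlying 3D motion is usually composed of simple rigid motion units (for example, a combination of several rigid parts). Fusing data-driven priors: the noisy signals from existing strong prior models (such as monocular depth estimation and long-range 2D tracking) can be fused into a globally consistent 3D representation through an optimization framework. 2. 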
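Methodology The method represents a dynamic scene as a set of **persistent 3D Gaussians** that translate and rotate over time.\n2.1 Scene Representation Canonical 3D Gaussians: the scene is represented by N 3D Gaussians defined in a canonical frame ($t_0$), each with position, rotation, scale, opacity, and color parameters. SE(3) Motion Bases: to exploit the low-dimensional nature of motion, the authors define a set of globally shared SE(3) motion bases $\\left\\{ T^{(b)} \\right\\}^B_{b=1}$ with $B \\ll N$. The motion of each Gaussian is not optimized independently but is determined by a linear combination of these bases: $T_{0\\to t} = \\sum_{b=1}^{B} w^{(b)} T^{(b)}_{0\\to t}$, where $w^{(b)}$ are the motion coefficients specific to each Gaussian. This design enforces a low-rank constraint on motion, so that Gaussians with similar motion (belonging to the same rigid body) have similar coefficients. 2.2 Optimization and Prior Fusion (Optimization \u0026amp; Priors) The system is a test-time optimization framework that takes priors extracted with off-the-shelf tools as input:\nInput preparation: Camera pose: estimated with MegaSaM or COLMAP. Moving-object masks: obtained with Track-Anything. Monocular depth maps: obtained with Depth-Anything and then aligned. Long-range 2D tracks: 2D tracks of foreground points extracted with TAPIR. Initialization: The 2D tracks are lifted with the depth maps into noisy 3D tracks. The SE(3) motion bases are initialized by K-means clustering on the velocities of these noisy tracks. Supervision losses (Loss Functions): Reconstruction loss: the rendered RGB images, depth maps, and masks should agree with the input video and the prior depth and masks. 2D track loss: the rendered 3D tracks, projected back to the 2D screen, should match the 2D tracks predicted by TAPIR. Rigidity / physical prior: the distance between a dynamic Gaussian and its neighbors is forced to stay constant over time (local rigidity constraint). 3. Experiments The authors evaluate extensively on a synthetic dataset (Kubric MOVi-F) and real-world datasets (iPhone Dataset, NVIDIA Dataset).\n3.1 Tasks and Metrics Long-Range 3D Tracking: measured by 3D EPE and accuracy. Long-Range 2D Tracking: metrics include Average Jaccard (AJ) and occlusion accuracy (OA). Novel View Synthesis: metrics include PSNR, SSIM, and LPIPS. 3.2 Main Results iPhone Dataset (real scenes): SOTA on all three tasks. 3D tracking: compared with the simple \u0026ldquo;TAPIR + Depth Anything\u0026rdquo; combination, the method clearly reduces error (EPE drops from 0.114 to 0.082), showing that global optimization can effectively correct noisy priors. Versus NeRF/3DGS: compared with HyperNeRF, DynIBaR, and Deformable-3D-GS, the method provides more accurate motion trajectories while maintaining high-quality rendering. Kubric Dataset (synthetic scenes): in scenes with fast motion and motion blur, the method's 3D tracking accuracy is better than the baseline that relies only on a 2D tracker plus depth lifting. Visualization: it can produce colored 3D trajectories, called the \u0026ldquo;Shape of Motion\u0026rdquo;, that reveal the geometric pattern of object motion (such as a spinning windmill or a thrown object). PCA analysis shows that the optimized motion coefficients automatically decompose the scene into different rigid motion groups. 3.3 Ablation Studies Importance of SE(3) bases: using SE(3) bases works better than using only translation bases or per-Gaussian independent motion, effectively reducing artifacts and improving accuracy. Importance of 2D track supervision: removing the 2D track loss causes a significant performance drop, showing that long-range 2D tracking priors are essential for recovering 3D motion. 4. 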
总结与讨论 (Conclusion) 4.1 主要贡献 提出了一种 4D 场景表示，结合了持久的 3D 高斯体和紧凑的 SE(3) 运动基，支持实时 NVS 和全局一致的 3D 跟踪。 设计了一个优化框架，成功地将单目深度和长程 2D 轨迹等噪声先验融合为一个物理一致的动态场景模型。 在单目视频的 3D/2D 跟踪和 NVS 任务上取得了 SOTA 性能。 4.2 局限性 需要针对每个场景进行测试时优化（Test-time optimization），无法做到实时流式处理。 依赖于现成的先验模型（如相机位姿、掩码），如果这些先验在无纹理区域或剧烈运动下失效，重建质量会下降。 目前需要用户交互来指定移动物体的掩码。 4.3 总结 这篇工作是单目动态场景重建领域的重要进展，它通过显式建模 3D 运动轨迹并融合多模态先验，解决了传统方法难以兼顾渲染质量和运动估计精度的问题。\n","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-08/","summary":"这篇文章提出了一种从单目视频中重建动态场景并估计长程 3D 运动轨迹的新方法。","title":"[ICCV'25 Highlight] Shape of Motion: 4D Reconstruction from a Single Video 阅读报告"},{"content":" 这篇文章提出了一种基于 3D 点轨迹 (3D Point Tracks) 的视频生成式编辑框架，能够同时精确控制摄像机运动和物体运动。\n1. 研究动机 (Motivation) 在视频编辑领域，精确控制“运动”（Motion）一直是一个巨大挑战，现有的方法存在明显的局限性：\nImage-to-Video (I2V) 方法的缺陷：如 TrajAttn, DaS 等。它们通常只基于第一帧图像生成视频，导致丢失了原视频后续帧的上下文信息，背景和物体的一致性较差。 Video-to-Video (V2V) 方法的缺陷： 摄像机控制类 (Camera-controlled)：如 ReCamMaster，只能改变视角，无法编辑物体本身的动作。 Inpainting 类：依赖于简单的扭曲和补全，当物体发生复杂运动或遮挡时，容易产生伪影（如无法正确处理物体移动后的阴影或水花）。 目标：提出一种统一的框架，既能利用原视频的全部上下文，又能精确地联合编辑摄像机视角和物体运动。\n2. 核心方法：Edit-by-Track (Methodology) 该论文提出了一种基于 3D 点轨迹 条件的 V2V 扩散模型框架。\n2.1 为什么选择 3D 点轨迹？ 统一表征：3D 点轨迹可以同时表示摄像机运动（背景点）和物体运动（前景点）。 深度感知：相比 2D 轨迹，3D 轨迹提供了显式的深度线索，能够帮助模型处理遮挡关系（Occlusion）和深度排序，实现更精确的编辑。 2.2 模型架构 基于预训练的 Text-to-Video (T2V) 模型 Wan-2.1 进行微调 ，引入了核心组件 3D Track Conditioner：\n输入处理： 输入源视频 ($V_{\\text{src}}$) 被编码为 Latent tokens。 用户编辑后的 3D 轨迹被投影到 2D 屏幕坐标，并保留深度信息 ($z$)。 3D Track Conditioner (核心创新)： Sampling (采样)：利用 Cross-Attention，根据源轨迹从源视频特征中“提取”视觉上下文。 Splatting (抛雪球/泼溅)：利用 Cross-Attention，将提取的特征根据目标轨迹“泼溅”到目标视频的特征空间中。 这种机制建立了源视频和目标视频之间的稀疏对应关系，实现了像素级的搬运和重组。 3. 
训练策略：两阶段微调 (Two-Stage Training) 由于缺乏完美的“成对”训练数据（即：同一场景、不同运动的视频对），作者设计了两阶段策略：\nStage 1: 合成数据启动 (Synthetic Data Bootstrapping) 数据来源：使用 Blender 生成合成数据（Mixamo 人体动画 + Kubric 背景）。 目的：拥有完美的 Ground Truth 3D 轨迹，让模型初步学会听从轨迹指令进行运动控制。 Stage 2: 真实数据微调 (Real Data Fine-tuning) 数据构建：从单目真实视频中采样两个不连续的片段（Non-contiguous clips）。利用视频本身的时间跨度来模拟“源视频”到“目标视频”的运动变化（例如摄像机移动了，物体动作变了）。 轨迹扰动 (Track Perturbation)：为了应对真实视频中 3D 轨迹估计的噪声，训练时主动给目标轨迹添加噪声（如沿极线抖动、线性漂移），提高推理时的鲁棒性。 4. 应用场景 (Applications) 得益于 3D 轨迹的灵活性，该模型支持多种编辑任务：\n联合运动编辑：同时改变摄像机视角和物体运动（例如：让滑板少年换个方向滑，同时摄像机拉高）。 非刚体形变 (Non-rigid Deformation)：例如拉长一只狗的身体，或改变其形状。 人体动作迁移 (Human Motion Transfer)：结合 SMPL-X 参数，将一个人的动作迁移到视频中的人物上。 物体移除与复制：通过将轨迹移出画面实现移除，或复制轨迹实现物体克隆。 以上示例都可以去该工作的主页预览：edit-by-track\n5. 实验结果 (Experiments) 5.1 定量评估 在 DyCheck 数据集上，该方法在 PSNR, SSIM, LPIPS 等指标上均优于现有的 I2V (如 TrajAttn) 和 V2V (如 GEN3C) 方法。 在 MiraData 数据集（野外真实视频）上，取得了最低的端点误差 (EPE)，证明了其运动控制的准确性。 5.2 定性对比 相比 I2V 方法：Edit-by-Track 能够保持背景和物体外观的高度一致性，不会出现“失忆”或形变。 相比 Inpainting V2V 方法：能够正确生成物体移动后的物理伴随效应（如原来的位置不会留下鬼影，新位置有正确的光影）。 6. 局限性和总结 (Conclusion) 6.1 局限性 对于密集的小物体轨迹（如小物体的大幅度翻转），可能会出现视觉失真。 难以生成复杂的物理流体效果（如倒咖啡时，咖啡与牛奶的混合效果无法凭空生成）。 6.2 总结 Edit-by-Track 是第一个通过 3D 点轨迹实现联合摄像机与物体运动编辑的 V2V 框架。通过创新的 3D Track Conditioner 和两阶段训练策略，它解决了视频编辑中上下文保持和精确运动控制的难题。\n","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-07/","summary":"这篇文章提出了一种基于 3D 点轨迹 (3D Point Tracks) 的视频生成式编辑框架，能够同时精确控制摄像机运动和物体运动。","title":"[arXiv'2512] Generative Video Motion Editing with 3D Point Tracks 阅读报告"},{"content":"0. 引言 随着NeRF、Gaussian Splatting等表示方法的发展，动态场景的4D重建与生成 已成为计算机视觉与计算机图形学的研究前沿。与静态3D重建相比，4D重建不仅要求恢复三维结构与外观，还要建模其随时间变化的运动与形变，因此本质上是一个“不适定”的逆问题，具有极大挑战。\n然而，目前许多4D生成相关研究依赖大规模扩散模型或生成式NeRF，往往需要昂贵的算力和海量训练数据。这使得资源有限的研究团队难以直接切入。相比之下，4D重建 为我们提供了一条更加务实且具创新潜力的研究路径：通过结合预训练模型、点跟踪、以及经典几何优化等轻量先验，可以在有限算力下实现高质量的动态场景恢复。\n本综述旨在：\n回顾近年来4D动态场景重建的最新进展，重点关注新型表示方法与3D点跟踪结合的两大方向； 分析这些工作的核心思想、算力需求与适用性； 基于此提出若干可行的研究选题，特别是适合有限算力条件下的创新路线，为后续 CVPR 2026 等会议的投稿提供参考。 1. 背景与动机 1.1. 
4D 重建与生成概念 “4D Reconstruction \u0026amp; Generation”指的是对动态3D场景（即随时间变化的三维场景）的建模与合成。其中4D通常表示在三维空间上随时间推移的变化。动态场景重建要求我们从输入（如单目视频、多目视频等）中重建出场景在整个时间序列中的几何结构和外观，进而可以实现动态场景的新视角合成或动画重现。\n这一问题非常具有挑战性，因为相比静态3D重建，多了时间维度的运动/形变因素，解的不确定性更高，是一个“不适定”的逆问题。同时，4D生成（直接从文本/图片/视频生成动态场景）的任务更为困难，往往需要大规模模型（例如扩散模型、生成式NeRF等）和海量算力支持。\n1.2. 算力限制考量 由于实验室算力有限，直接训练或微调基础模型 (foundation models) 来完成4D生成是不现实的。同时，目前不少4D生成相关工作（如大规模视频生成模型）对算力要求极高，不适合作为初期研究切入点。因此，聚焦于4D重建是较为明智的选择：我们可以利用现有的视频/图像数据，通过算法和较小规模的模型，在有限算力下实现动态场景的重建。这也契合当前计算机视觉领域的一个趋势，即尽可能借助已有的预训练模型或高效算法，避开从零训练超大模型。接下来，我们将回归近期4D重建领域的重要进展，并基于这些工作提出的可行的研究课题设想。\n2. 4D重建最新进展简述 近年来，随着NeRF及相关技术的发展，动态场景的4D重建成为计算机视觉和图形学研究的热点。一方面，有研究致力于设计新的4D表示方法来高效表达动态场景；另一方面，也有工作尝试将3D点追踪 (point tracking) 与重建结合，以提升动态重建的鲁棒性和精度。以下我们按照这两个脉络分别介绍最新成果：\n2.1. 基于新型表示的动态4D重建 Gaussian Splatting方法与动态NeRF：静态NeRF都发展催生了Gaussian Splatting等更高效的表示方法，将场景表示为一系列高斯体元以加速渲染和优化。最近，这类思路被拓展到动态场景中，例如CVPR 2025的Mosca和FreeTimeGS。\nMoSca (CVPR 2025) - 4D Motion Scaffolds：MoSca提出了一套现代4D重建系统，用于从随手拍摄的单目视频中重建动态场景并合成新视角。它的核心是引入了一种运动脚手架 (Motion Scaffold) 的中间表示，将视频提升到一个能紧凑平滑编码底层运动/形变的4D表示。MoSca利用基础视觉模型的先验知识（如预训练的深度、光流等模型输出）来辅助这个Motion Scaffold的建立。在此基础上，场景的几何和外观通过锚定在脚手架上的高斯来表示，并使用Gaussian Splatting进行优化。值得一提的是，MoSca还集成了传统视觉的技巧：通过在不依赖外部工具的情况下，对相机焦距和位姿执行bundle adjustment（捆绑调整）优化，从而提升重建的准确性。实验结果表明，MoSca在动态新视角合成基准上达到了新的SOTA（state-of-the-art），在真实视频上也展现了出色效果。它证明了利用预训练模型先验+显式4D表示（运动场+高斯体元）+传统BA优化的混合方案，可以在有限数据和算力下取得优异的动态重建和渲染效果。 Fig. 1. MoSca Overview\nFreeTimeGS (CVPR 2025) – Free Gaussian Splatting: FreeTimeGS侧重于解决动态场景中复杂运动带来的挑战。之前的方法往往假设一个公共的规范空间 (canonical space)，并学习一个变形场将规范空间下的静态场景映射到各帧，这样可以实现实时的动态新视角合成。但这种思路在剧烈或复杂运动场景下难以奏效，一个主要原因是全局变形场的优化难度很大。FreeTimeGS针对这一问题提出了一种全新的4D高斯表示：允许高斯原语在时间和空间上“自由”出现或消失。也就是说，并不假设场景中每个点始终存在于所有帧，而是允许在需要时引入新的高斯来表示新出现的结构。这种表示灵活性更强，显著提高了对复杂动态场景的刻画能力。此外，每个高斯还被赋予一个运动函数，使其能够随时间平滑移动到相邻位置，以减少冗余并捕捉连续运动。通过在多个数据集上的实验，FreeTimeGS的渲染质量大幅超越近期方法。总的来看，FreeTimeGS表明，通过放弃固定的规范场景假设、采用可动态生成/消亡的点云高斯表示并给每个点添加运动模型，可以更好地适应复杂的动态变化。 Fig. 2. 
FreeTimeGS Overview\n以上两项工作代表了动态NeRF/场景表示方向的前沿进展。它们共同特点是：利用更灵活高效的表示（如高斯体元）来编码4D场景，并结合一定的先验（无论是基础模型知识还是对运动规律的显式假设），从而在无需极端算力的情况下取得优秀效果。这类方法通常需要对每个新视频进行优化（类似于NeRF的test-time optimization），但由于表示简洁，优化效率相对可控。此外，它们也能自然产生密集的动态几何和纹理结果，利于后续应用。\n2.2 结合3D点追踪的4D重建 另一条重要路线是将3D点的时空跟踪（tracking）与4D重建融合。传统上，结构重建和运动跟踪是不同任务，但在动态场景下它们密不可分：正确的重建依赖准确的点对应和相机运动估计，而动态物体的准确跟踪也依赖良好的三维结构感知。2025年前后有多项工作尝试统一或协同解决这两个问题，在有限算力条件下，这类方法尤其有吸引力，因为它们往往引入较少参数（有时结合传统算法）或利用预训练跟踪模型来简化重建。下面介绍三项代表性工作：\nSpatialTrackerV2 (ICCV 2025) – 端到端3D点跟踪框架: SpatialTrackerV2提出了首个统一的端到端3D点跟踪模型，可从单目视频中直接估计任意2D像素对应的世界坐标系下3D运动轨迹。与以往需要分别调用深度估计、光流、SLAM等模块的方法不同，SpatialTrackerV2在一个差分可训练的架构中同时学习场景几何、相机运动（自运动）以及逐像素的细粒度3D运动。这种统一设计使其能够在大规模多样的数据上进行可扩展训练（包括合成序列、有姿态和深度标注的视频，甚至无标注的野外视频）。训练好的模型对任意新视频推理非常高效，报告称每段序列仅需10–20秒即可输出完整的相机轨迹、稠密点轨迹和场景结构。得益于联合学习几何和运动，SpatialTrackerV2在3D点跟踪准确性上明显超越以往方法，同时在2D跟踪和动态3D重建任务上也取得了优秀结果。这一成果表明，通过充分利用数据驱动的方法将SLAM式的多模块流程融合为单一网络，可以极大提升动态场景理解能力。不过，需要注意的是训练这样的大模型本身算力需求不菲（作者结合了高校和企业资源，训练集包括合成和真实数据）。幸而作者提供了预训练模型和演示——这意味着研究者可以在有限算力下使用该模型的能力，而不必从头训练。 Fig. 3. SpatialTrackerV2 Overview\nSt4RTrack (ICCV 2025) – 同时4D重建与跟踪: St4RTrack关注在世界坐标系下同时完成动态重建和点跟踪。它是一个前向(feed-forward)的框架：通过输入两帧（时间上不同）的图像，网络输出这两帧在统一世界坐标和相同时间戳下的两幅3D点图(point maps)。直观地说，St4RTrack预测了第一帧中的各点在第j帧中的对应3D位置（实现跟踪)，同时预测第j帧自身的3D几何结构在世界坐标下的位置（实现重建）。通过将第一帧与序列中每一个后续帧两两喂入并链式相连，能得到全视频范围的长时对应关系和重建结果。重要的是，St4RTrack不需要4D真值监督：作者先在三种合成数据上进行了基本训练（即使数据规模较小且合成），然后提出一种自适应fine-tune策略，利用重投影损失来自适应任意真实视频。例如，只需利用2D轨迹(光流)和单目深度等可从现有模型获取的弱监督信号，就能对模型进行细调，使其适应复杂真实场景。这种“先合成训练+无真值自适应”的思路使得St4RTrack能在不依赖昂贵标注的情况下应用于野外视频，并取得优异效果。作者还构建了一个新的世界坐标系跟踪基准 (WorldTrack) 来评测方法效果，结果显示相比将重建和跟踪拆开的组合方法，St4RTrack在长程跟踪精度(APD指标)和ATE等误差上表现最佳。其重建质量也与专用的动态重建方法相当，有竞争力。总之，St4RTrack展示了通过新颖的网络表示（两帧点图、世界坐标统一）和自监督适配，可以实现动态场景高效、统一的跟踪与重建。这一方法依赖深度网络推理，但由于是前向计算，推理效率高；需要的训练数据少且偏合成，使得复现实验的算力需求相对可控。 Fig. 4. 
St4RTrack\nBA-Track (ICCV 2025 Oral) – Bundle Adjustment与学习跟踪融合: BA-Track（论文题目“Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction”）旨在将经典SLAM里的Bundle Adjustment(BA)思想重新带回动态场景重建中。传统SLAM假定场景静止，直接对动态视频BA会失败或需要滤除运动物体。而BA-Track的策略是借助学习的3D点跟踪前端来进行运动分解：它利用一个3D点跟踪器获取观测到的动态物体运动，将其分解为相机引起的运动和物体本身的运动。通过只将相机运动分量提交给BA模块优化，相当于在虚拟的“静态场景”上做BA，从而可以稳健地联合优化相机位姿和所有场景点。动态物体的不一致运动不再破坏BA，因为这些运动被分解、隔离了。此外，作者还提出结合尺度地图(scale maps)的轻量后处理，保证跨帧深度的一致尺度。整个框架融合了传统BA内核和鲁棒的学习型3D点跟踪前端，并集成了运动分解、BA优化和深度一致性步骤。结果是：BA-Track能够准确地估计相机轨迹，并生成时间上一致、尺度正确的稠密重建，包括场景中的静态和动态元素。在有挑战性的动态视频数据集上，BA-Track显著提升了相机位姿估计和3D重建精度，相比现有技术取得了大的改进。这一工作直接体现了“用3D点跟踪简化4D重建”的理念：通过点跟踪获取关于运动的先验知识，再利用经典优化算法解算结构，从而避免了纯学习方法可能出现的不稳定，也克服了传统方法不擅长动态的弱点。 Fig. 5. BA-Track Overview\n上述跟踪融合类方法从不同角度将时空对应融入了重建过程：SpatialTrackerV2和St4RTrack走的是端到端学习路线，通过统一网络架构直接输出跟踪+重建效果；BA-Track则是深度+优化融合，利用学习方法提供线索再用经典优化确保精度。这些方法充分利用了3D点轨迹这一中间信息：一方面，点轨迹提供了跨帧关联，可以将动态物体的运动与相机运动解耦；另一方面，点轨迹也约束了场景的几何形状，使得重建算法有据可依、更快收敛。对于算力有限的研究者来说，跟踪融合方案很有吸引力，因为我们可以借助预训练的跟踪模型或已有的光流、深度工具来获取这些轨迹/对应关系，而不必从零训练庞大模型，同时再通过优化或小规模网络完成精细重建。\n3. 
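Potential Research Topics and Ideas Building on the literature review above, and with the CVPR 2026 timeline in mind, this section proposes several feasible and novel research directions. They are designed with limited compute in mind and aim to make progress on 4D reconstruction through algorithmic innovation and careful reuse of existing tools.\n3.1 Direction 1: Efficient 4D Reconstruction Based on 3D Point Tracking Core idea: in the spirit of the envisioned “4D Reconstruction Made Easier with 3D Point Tracking”, make full use of 3D point tracks to reduce the difficulty and computational cost of dynamic reconstruction. Concretely, one can design a pipeline that first uses 3D point tracking to obtain dense or sparse point correspondences in the video, and then feeds these correspondences into the dynamic reconstruction model as priors or constraints. For example:\nInitialize or guide the deformation field with point tracks: in classical dynamic NeRF or Gaussian Splatting optimization, the hardest part is solving for how each point deforms over time. If correspondences are obtained in advance from 3D point tracking, the position of each point in different frames can be inferred, giving an initial estimate of the motion field. The reconstruction algorithm can optimize from this initial value, or add a constraint to the loss so that the predicted motion does not stray from the observed tracks. This greatly shrinks the search space of the deformation field and makes optimization more robust and efficient. Use tracking results as sparse supervision: even without using them for initialization, the tracked 3D correspondences can supervise the training of a dynamic reconstruction model (for example, a small network that represents the dynamic scene). For instance, the points the model outputs at times $t_1$ and $t_2$ should lie at near-zero distance from the tracked positions of the same point (temporal consistency). This sparse supervision needs no ground-truth 4D labels; it uses weak labels produced by an algorithm, somewhat like the reprojection-based self-supervision in St4RTrack. Because tracking itself can be noisy, a robust loss or a filter that keeps only reliable tracks can improve results. Camera optimization in the spirit of BA-Track: point tracks also help solve for camera motion. BA-Track has shown that once the motion of dynamic objects is factored out, BA can be run over the whole scene to recover camera poses. In the envisioned scheme, full-sequence tracking gives a 3D point-cloud registration of every frame to a reference frame, so the point correspondences can be held fixed while BA optimizes camera parameters and point depths, yielding an accurate camera trajectory and an initial structure. This result can then be passed to dynamic NeRF and similar methods as a prior that locks the camera parameters and part of the geometry, making the remaining tasks (such as fine texture and deformation detail) easier. 3.2 Direction 2: Object-Oriented Layered 4D Reconstruction Core idea: a dynamic scene usually consists of several moving subjects plus a background. Representing all dynamics of the whole scene with one model can be overly complex and inefficient. A natural idea is divide and conquer: split the scene into several sub-parts according to semantics and motion patterns, reconstruct each separately, and then fuse them. Specifically:\nMotion grouping and object segmentation: a pretrained segmentation model (such as the Segment Anything Model, SAM) can separate the objects, people, and background in key frames into coarse masks. Combined with 3D point tracking, one can identify which point tracks belong to the same rigid body or the same object. This amounts to building a “track cluster” for each main moving body. For example, the points on a moving car or a walking person should share a common rigid motion component and can be clustered by analyzing the relative motion of their tracks. Sub-space reconstruction: build an independent dynamic reconstruction model for each separated moving object. For example, each object can have a small NeRF or Gaussian model for its 3D shape and texture, together with a learned motion field of its own (which can usually be assumed rigid, or rigid up to negligible small deformations). The static background is obtained with standard 3D reconstruction. If the object itself is non-rigid (such as the soft-body motion of a person), a small dedicated 4D model (such as a canonical volume plus a deformation field) can handle it. Because each model only needs to attend to a local region, both the parameter count and the optimization difficulty drop, and so does the compute required. Global fusion: register the reconstruction of each object in the global world coordinate system. Since each object model's global pose can be determined from its tracks and the camera motion in the earlier steps, the models can be composed at render time, or the boundary transitions can be further optimized so that objects and background blend into a complete scene. BA-Track in effect distinguishes only two classes, “static background” and “dynamic objects”; the idea here is to extend this to multiple objects with finer-grained layers. Advantages and novelty: this object-level 4D reconstruction idea can be seen as an extension of classical multi-object tracking (MOT) and multi-body SLAM. In computer vision, dynamic reconstruction of a single human or a single object has been studied extensively and works well in single-object scenes. When several people or objects interact in the same scene, however, a single model often falls short. The proposed divide-and-conquer scheme decomposes a complex scene and handles each part with a dedicated model, which greatly reduces the complexity of any one model. The hard part, and the source of research novelty, is how to perform motion grouping and final fusion automatically. The recent SpatialTrackerV2 and St4RTrack already provide ways to obtain dense point tracks, which can be analyzed further to automate motion decomposition. If successful, this would be a novel result: it sits between fully end-to-end (black-box) methods and fully hand-specified methods, using algorithms to mine scene structure from data and then applying optimization to each part, with the prospect of high-quality reconstruction of complex dynamic scenes under limited compute.\n3.3 Direction 3: A Lightweight 4D Reconstruction Pipeline Fusing Multi-Source Priors 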
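Core idea: borrowing MoSca's use of foundation-model priors and BA-Track's combination with classical BA, we can build a hybrid pipeline that fuses several lightweight pretrained models with classical algorithms, with the goal of solving 4D reconstruction at a small computational cost. Specifically:\nDepth and optical-flow priors: make full use of off-the-shelf monocular depth and optical-flow estimators (for example, MiDaS for depth and RAFT for flow). These models have been trained at large scale and give good results on ordinary hardware. We can run them on the input video first to obtain a coarse depth estimate for each frame and pixel correspondences between adjacent frames. This provides initial geometry and motion constraints for reconstruction. MoSca uses exactly such “foundation model” priors to lift the video into its Motion Scaffold representation; for instance, it may use pretrained optical flow to infer an initial motion-field structure. In our design, the depth prior can initialize the Gaussian point cloud or the NeRF density field, and the flow prior can initialize or supervise the motion field. Inserting classical optimization modules: introduce robust classical optimization at key steps of the pipeline to improve accuracy and consistency. For example, after initial camera poses are obtained with PnP or similar, Bundle Adjustment can be inserted to globally optimize cameras and keypoints (similar to BA-Track); or, once an initial dynamic deformation is available, multi-view geometry can impose smoothness constraints on the deformation field (such as enforcing rigidity or local rigidity to reduce non-physical distortion). These optimizers are iterative but relatively small (far fewer parameters than a full neural network) and converge with little computation on CPU or GPU, which suits limited-compute settings. Lightweight network refinement: feed the result of pretrained priors plus classical optimization into a small neural network for end-to-end refinement. For example, one can train a small U-Net or Transformer to refine coarse depth and texture estimates into high-quality output, or train a conditional NeRF that starts from the initial Gaussian point cloud and approaches the real images in a few optimization iterations. The parameter count and training needs of this network are relatively small, because the model does not learn structure from scratch but corrects on top of the priors. With these prior constraints, even a small network can deliver a significant improvement. Feasibility and highlights: the appeal of this multi-module fusion strategy is that each part contributes what it does best. Foundation models provide strong out-of-the-box perception (but per frame, not necessarily temporally consistent); classical BA provides globally consistent optimization (but needs a good initial value); and small networks provide flexible nonlinear representation (and converge faster when priors are available). Together they can approach the quality of large end-to-end models under limited compute. For example, MoSca has shown that foundation-model priors plus Gaussian Splatting alone can reach SOTA performance, and BA-Track has shown that classical BA is usable in dynamic scenes as long as it is paired with a learned front-end. By extension, our scheme would be a practical system: it does not require training a huge model and relies mostly on assembling existing components, with careful design to make them cooperate. The academic contribution is that earlier work tended to pursue these lines separately, whereas we explore their synergy so that dynamic reconstruction becomes both “fast” and “good”. If this idea can be validated, it would be a useful reference for practitioners.\n4. Summary and Outlook In summary, 4D dynamic scene reconstruction is developing rapidly: from innovations in representation (such as the Gaussian-fusion dynamic representations proposed by MoSca and FreeTimeGS) to new algorithmic pipelines (such as the end-to-end joint tracking and reconstruction of SpatialTrackerV2 and St4RTrack, and BA-Track's fusion of classical and learned methods). These recent works offer a wealth of inspiration and tools. The recommendation is to focus on the theme of “how to use existing information to complete 4D reconstruction more efficiently”. Introducing 3D point tracking as an aid, decomposing the scene to reduce complexity, or fusing multiple priors into an efficient pipeline all fit this theme and rest on a solid base of related work. The key is to find a concrete entry point and demonstrate the corresponding improvement.\n5. References Jiahui Lei et al., \u0026ldquo;MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds,\u0026rdquo; CVPR 2025. Yifan Wang et al., \u0026ldquo;FreeTimeGS: Free Gaussian Primitives at Anytime and Anywhere for Dynamic Scene Reconstruction,\u0026rdquo; CVPR 2025. Yuxi Xiao et al., \u0026ldquo;SpatialTrackerV2: 3D Point Tracking Made Easy,\u0026rdquo; ICCV 2025. Haiwen Feng et al., \u0026ldquo;St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World,\u0026rdquo; ICCV 2025. 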
Weirong Chen et al., \u0026ldquo;Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction (BA-Track),\u0026rdquo; ICCV 2025. ","permalink":"https://gavinsun0921.github.io/posts/paper-research-03/","summary":"探讨一下当前时代下博士生在 4D Reconstruction \u0026amp; Generation 研究方向上，面对高校科研环境有限算力，以及目标是CVPR 2026的情况下，对该方向的一些选题思考。","title":"4D动态场景重建研究方向综述与选题建议"},{"content":"TL;DR 本文提出了一个 feed-forward 框架，通过引入一种创新的、依赖于时间的 pointmap 表示，并利用一个双分支 Transformer 架构，实现了在统一的世界坐标系中同时进行动态场景的密集追踪与三维重建。\n1. 研究动机 (Motivation) 1.1 背景 (Background) 在计算机视觉中，“对应关系”是三维重建的基石。在静态场景中，三维几何和二维对应是同一枚硬币的两面。\n1.2 现有方法的局限性 (Gap/Limitations of Existing Work) 当场景变为动态时，这种几何与对应的协同关系似乎被打破了。现有方法，特别是数据驱动的方法，往往将动态场景重建和点追踪（即寻找对应关系）视为两个独立且不相关的任务。作者认为，这是一种“错失的机会”，因为动态场景中的这种协同关系并未消失，只是需要额外理解场景内容如何随时间演化，即3D运动估计（3D点追踪）。\n\u0026ldquo;We argue that this is a missed opportunity; the synergy between 3D reconstruction and 2D correspondence is not lost in dynamic scenes—it simply requires an additional element: understanding how the scene content evolves over time.\u0026rdquo; (p.1, Section 1)\n1.3 本文价值 (Value Proposition) 本文旨在重新建立动态场景下三维重建与追踪之间的联系。\nSt4RTrack 提出了一个统一的学习框架，能够直接从RGB视频中，在一个一致的世界坐标系里，同时完成动态内容的重建与追踪。这种在世界坐标系中进行追踪的方式，能从根本上解耦场景运动和相机运动。\nFig. 1. St4RTrack\n2. 解决的关键问题与贡献 (Key Problem Solved \u0026amp; Contribution) 2.1 解决的关键技术问题 如何设计一个统一的 feed-forward 模型，它能够仅通过重新定义其输出表示，就能自然地将动态场景的三维重建任务和三维点追踪任务融合在一起，并直接在统一的世界坐标系中输出结果？\n2.2 核心贡献 统一的 4D 表示: 本文的核心思想源于一个关键的观察：一个静态的 3D 重建方法（DUSt3R）只需改变其 pointmap 的标注方式，就能适应动态场景（MonST3R）。基于此，本文提出了一种新的、依赖于时间的 pointmap 定义，通过预测两张精心定义的 pointmap 来统一重建与追踪任务。 同时重建与追踪的架构: 实现了一个双分支的 Transformer 架构。其中“重建分支”负责重建目标帧的几何，“追踪分支”则负责预测参考帧的几何内容如何运动到目标帧的时刻。 无需4D真值的自适应方案: 提出了一种新颖的 test-time adaptation 方案。通过一个可微的PnP模块来求解相机参数，进而利用2D追踪的伪标签和单目深度先验构成 reprojection loss，使得模型能够从未标注的真实视频中学习，适应新领域。 新的评测基准：针对世界坐标系下的 3D 追踪任务，建立了一个新的评测基准 WorldTrack，以评估和推动相关研究。 3. 
方法详述 (Method) St4RTrack 的方法核心在于对 pointmap 概念的重新思考和扩展，并围绕此构建了一个双分支 feed-forward 网络，最终通过 reprojection loss 实现自适应。\n3.1 统一的 4D 表示 (Unified 4D Representation) 这是理解本文方法的关键。作者引入了时间作为 pointmap 的一个决定性因素 。\n时间依赖的 Pointmap 定义：作者提出了一个更泛化的 pointmap 表示: $$^{\\color{red}a}\\mathbf{X}_{\\color{green}t}^{\\color{blue}b}$$ $\\color{blue}b$: pointmap 所描述的物理内容来源是第 b 帧图像。 $\\color{green}t$: pointmap 所描述的是在 t 时刻的场景状态。 $\\color{red}a$: pointmap 的三维坐标是在第 a 帧的相机坐标下表达的。 St4RTrack 的核心 pipeline：如图 3 所示，对于输入的一对图像 $(\\mathbf{I}_1, \\mathbf{I}_j)$，St4RTrack 模型 $f_\\theta$ 会输出两个 pointmap：$$f_\\theta(\\mathbf{I}_1, \\mathbf{I}_j)={^1\\mathbf{X}^1_j, ^1\\mathbf{X}^j_j}$$ $^1\\mathbf{X}^j_j$（重建分支）：这个 pointmap 描述的是第 j 帧的内容，在第 j 帧的时刻，在第 1 帧的坐标系下表达。这本质上就是动态场景重建：将 j 帧的场景重建到 1 帧的坐标系下。 $^1\\mathbf{X}^1_j$（追踪分支）：这个 pointmap 描述的是第 1 帧的内容，在第 j 帧的时刻，在第 1 帧的坐标系下表达。这本质上就是 3D 点追踪：它回答了“第 1 帧的那些点，在第 j 帧的那个时刻，移动到了世界坐标系下的什么位置？” 当处理整个视频时，模型始终将第一帧 $\\mathbf{I}_1$ 作为参考（即世界坐标），依次计算 $f_\\theta(\\mathbf{I}_1, \\mathbf{I}_j)$ 对 $(j=1,2,\\cdots, T)$。这样，输出的 $^1\\mathbf{X}^1_j$ 序列就构成了对第一帧所有点的密集 3D 追踪轨迹，而 $^1\\mathbf{X}^j_j$ 序列则构成了整个视频的动态三维重建。\nFig. 3. 
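Overview of St4RTrack\n3.2 Joint Learning Network architecture: St4RTrack uses a dual-branch (siamese) Transformer architecture similar to DUSt3R. The two input images $\\mathbf{I}_1, \\mathbf{I}_j$ each pass through a ViT encoder, then exchange information in the decoder through self-attention and cross-attention, and finally separate heads output their respective pointmaps. Although the two branches share the same structure, they have different goals, corresponding to “tracking” and “reconstruction” respectively. Supervised pretraining: because the tracking branch needs to know the true motion of points in the world coordinate system, the model is first pretrained on synthetic datasets that provide full 4D information (such as Point Odyssey and Dynamic Replica). Ground-truth camera parameters, depth maps, and vertex trajectories supervise the pointmap outputs of both branches. 3.3 Adaptation Without 4D Labels (Adapt to Any Video) This is another highlight of the paper; it lets the model be applied to real videos that have no 4D ground truth.\n3.3.1 Differentiable Camera Parameter Solving The model first estimates the camera intrinsics $\\mathbf{K}$ from the tracking-branch output $^1\\mathbf{X}^1_1$, as DUSt3R does. It then uses the reconstruction-branch output $^1\\mathbf{X}^j_j$, which gives every pixel $\\mathbf{x}^{j,n}$ of frame j a 3D coordinate $\\mathbf{X}^{j,n}_j$ in the world coordinate system (that is, the frame-1 coordinate system). This forms a set of 2D-3D correspondences. From these correspondences, the extrinsics of frame j, $\\mathbf{P}^j = [ \\mathbf{R}^j | \\mathbf{T}^j ]$, can be solved with a PnP algorithm. So that the loss can be backpropagated, the authors use a differentiable PnP solver (based on Gauss-Newton). 3.3.2 Reprojection Loss Once the differentiable camera pose $\\mathbf{P}^j$ is available, a reprojection loss for self-supervised optimization can be built. The loss has three parts:\n$\\mathcal{L}_\\text{traj}$ (trajectory loss): the 3D points $^1\\mathbf{X}^1_j$ output by the tracking branch are projected back onto the image plane of frame j, giving predicted 2D track points $\\hat{\\mathbf{x}}^{j,n}$. These are compared against pseudo-labels $\\mathbf{x}^{j,n}_\\text{trk}$ from a strong off-the-shelf 2D tracker (such as CoTracker3) with a scale-invariant L2 loss. $\\mathcal{L}_\\text{depth}$ (depth loss): the 3D points $^1\\mathbf{X}^j_j$ output by the reconstruction branch are transformed into the camera coordinate system of frame j, giving predicted depths $z^{j,n}_\\text{proj}$. These are compared against pseudo-labels $z^{j,n}_\\text{mono}$ from a strong off-the-shelf monocular depth model (such as MoGe) with a scale-invariant L2 loss. $\\mathcal{L}_\\text{align}$ (3D self-consistency loss): a consistency constraint in 3D space. For points of frame 1 that are still visible in frame j, the 3D position $^1\\mathbf{X}^{1,n}_j$ from the tracking branch should be as close as possible to the 3D position $^1\\mathbf{X}^{j,n'}_j$ of the corresponding point from the reconstruction branch. This ensures that both branches predict the same physical point consistently at the same time instant. By minimizing the total reprojection loss, the model can be fine-tuned at test time on new, unlabeled videos (test-time adaptation), which bridges the domain gap between synthetic data and the real world. During adaptation, the authors freeze the reconstruction branch to preserve the view-alignment ability learned in pretraining.\n4. Experiments 4.1 3D Tracking in World Coordinates Tab. 1. World Coordinate 3D Point Tracking\nThe results show that St4RTrack achieves SOTA performance across the board on the newly proposed WorldTrack benchmark. Notably, it clearly outperforms the more complex combined baselines, which supports the benefit of unified modeling. Even on the Panoptic Studio dataset, which has no camera motion, it outperforms the dedicated camera-space tracker SpatialTracker.\n4.2 Dynamic 3D Reconstruction Tab. 2. 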
World Coordinate 3D Reconstruction\n在重建任务上，St4RTrack 同样达到了SOTA水平 。它甚至超过了那些使用了额外全局对齐（Global Alignment）步骤的 MonST3R 等方法。这进一步凸显了其联合进行追踪与重建所带来的好处。\n4.3 Ablation Study Fig. 5. Ablation Study\n预训练的必要性: 图 5 的定性比较清晰地显示，如果没有在合成数据集上进行预训练来学习本文提出的4D表示，即使进行 test-time adaptation，模型的追踪和重建两个分支的输出也无法对齐，效果很差。 Test-Time Adaptation (TTA)的有效性: 图 5 同样证明，TTA能够有效修正模型在真实数据上的漂移问题，使追踪和重建结果更精确。表 6 的结果也显示，TTA带来了显著的性能提升。 Reprojection Loss 各部分的贡献：表 6 的最后三行显示，在TTA中去掉轨迹损失、深度损失或3D自洽损失中的任何一项，都会导致性能下降，证明了这三个损失分量对于模型的自适应都至关重要。 Tab. 6. World Coordinate 3D Tracking (Median-Scale) （这个表在原论文的补充材料中）\n5. 批判性思考 (Critical Analysis \u0026amp; Personal Thoughts) 5.1 优点 (Strengths) 概念的优雅与统一: 本文最大的亮点在于其思想的统一性。通过对 pointmap 表示进行巧妙的重新定义，将追踪和重建这两个看似独立的任务，内生地、优雅地统一到了一个框架下。这种“表示即方案”的思路极具启发性。 直击问题本质: 它直接在世界坐标系中进行操作，从根本上解决了相机运动和物体运动的纠缠问题，而不是像其他方法那样进行“解耦”或“分离”。这是一种更直接、更符合第一性原理的解决方案。 创新的自适应机制: Test-time adaptation 的设计非常巧妙。它通过可微的 PnP 求解器和利用现成模型作为伪标签的 reprojection loss，为如何让一个在合成数据上训练的复杂 4D 模型成功应用于无标签的真实世界视频，提供了一个非常有效的范本。 5.2 潜在缺点/疑问 (Weaknesses/Questions) 对锚定帧的依赖: 整个框架将第一帧作为世界坐标系的绝对参考。这意味着如果视频的第一帧质量不佳（例如，模糊、遮挡严重），可能会影响后续所有帧的重建和追踪精度。整个系统的“地基”完全由第一帧决定。 长视频的可扩展性: St4RTrack 采用的是将每一帧都与第一帧配对的策略。对于非常长的视频，这种方法可能会忽略相邻帧之间丰富的时序信息。作者在讨论部分也承认了这是一个局限，并提出未来可以引入跨多帧的 temporal attention 来缓解。 Test-Time Adaptation 的成本: 虽然 TTA 效果显著，但它需要在测试时对每个视频序列进行额外的优化（在4块A100上约需5分钟）。这相对于纯粹的 feed-forward 推理（30 FPS），在需要即时响应的应用中是一个不可忽视的成本。 5.3 启发/可借鉴点 (Insights/Takeaways) 表示的力量: 这篇论文再次证明，一个好的数据表示（Representation）本身就是一种解决方案。通过引入时间维度，作者将一个复杂的多任务问题转化为了一个统一的表示预测问题。 利用先验进行自监督: “利用现成的、强大的模型（如CoTracker3, MoGe）的输出作为伪标签来构建损失函数”是一种非常实用的策略。这使得模型可以在没有昂贵真值标注的情况下，从海量真实数据中学习。 Sim-to-Real的有效路径: “在合成数据上预训练以学习核心概念和表示” + “在真实数据上通过自监督损失进行微调/自适应”，是解决模拟到现实（Sim-to-Real）领域鸿沟的一条黄金路径。 ","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-06/","summary":"本文提出了一个 feed-forward 框架，通过引入一种创新的、依赖于时间的 pointmap 表示，并利用一个双分支 Transformer 架构，实现了在统一的世界坐标系中同时进行动态场景的密集追踪与三维重建。","title":"[ICCV'25] St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World 阅读报告"},{"content":"TL;DR 
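This paper proposes a “motion decoupling” mechanism: a learned 3D tracker strips the object's own motion out of the observed motion, so that classical Bundle Adjustment can, for the first time, be applied uniformly to scenes containing dynamic objects, which greatly improves camera pose accuracy and 3D reconstruction quality in dynamic scene reconstruction.\n1 Motivation 1.1 Background Reconstructing dynamic scenes from casual video is essential for applications such as AR and robotics. Classical SLAM and SfM methods rely on Bundle Adjustment (BA) and perform very well in static scenes, recovering camera poses and scene geometry with high accuracy.\n1.2 Gap/Limitations of Existing Work The authors identify a “no man's land” among existing methods:\nClassical methods break down: classical BA and SLAM systems depend heavily on the static-world assumption (that is, the epipolar constraint). Dynamic objects violate this assumption entirely, so these methods fail or produce severe errors in dynamic scenes. The “deletion” strategy falls short: some methods detect and filter out dynamic regions so that BA can still run. This leaves the reconstruction badly incomplete, since all geometry of the dynamic objects is lost, which is unacceptable in many applications. The “independent modeling” strategy is hard: other methods try to build separate motion models for dynamic objects, but this is usually complex and tends to produce inconsistent motion estimates. The “depth prior” strategy hits a bottleneck: recent work uses monocular depth estimation as a prior, but these depth maps are often temporally inconsistent, especially in scale, which makes fusing them into a globally consistent 3D model very difficult. 1.3 Value Proposition The paper's central claim is a disruptive one: there is no longer any need to work around dynamic content. Instead of deleting dynamic points or building complex models for them, the authors propose a new idea: can the motion of dynamic points be “neutralized” so that BA treats them as “static” too? This puts the powerful and mature BA optimization framework back on track, at the center of dynamic scene reconstruction.\n2. Key Problem Solved \u0026amp; Contribution 2.1 Key Technical Problem How can the input be transformed so that the classical and powerful Bundle Adjustment framework can handle scenes with many dynamic objects directly and uniformly, achieving both high-accuracy camera poses and temporally consistent dense reconstruction without discarding dynamic information?\n2.2 Core Contributions Core idea, Motion Decoupling: this is the heart of the paper. The authors observe that the observed motion of any point (total motion) can be decomposed into a static component induced by camera motion and a dynamic component induced by the object's own motion. By feeding only the static component to BA, BA can treat dynamic points just like static ones, which unifies the whole optimization. Dual-Tracker Architecture: to decouple motion accurately, the authors design a front-end made of two Transformer trackers. One tracker $T$ predicts the total motion $\\mathbf{X}_{\\text{total}}$, and a smaller, more efficient tracker ${T}_{\\text{dyn}}$ predicts only the dynamic component $\\mathbf{X}_{\\text{dyn}}$. Experiments show that this design outperforms a single tracker that predicts the static component directly. Unified framework BA-Track: a complete three-stage pipeline: Stage 1: a learning-based motion-decoupled 3D tracker that separates out the static component of motion. Stage 2: classical Bundle Adjustment, which uses the static motion component to robustly estimate camera poses and sparse scene structure. Stage 3: a lightweight global refinement, which uses the accurate sparse points from BA to refine the initial dense depth maps into a globally consistent dense reconstruction. 3. Method Fig. 1. 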
BA-Track Framework\n3.1 Stage I: 运动解耦的3D追踪器 (Motion-Decoupled 3D Tracker) 这一阶段是整个框架的基石。它的核心目标不是简单地追踪点，而是要从任何一个点的总观测运动中，精准地分离出纯粹由相机运动引起的静态分量（static component）。\n这个思想在图 2 中非常直观的展示了，对于一个动态物体上的点，我们观测到的总运动 $\\mathbf{X}_{\\text{total}}$（蓝色向量），实际上是相机自身运动 $\\mathbf{X}_{\\text{static}}$（橙色向量）和物体自身运动 $\\mathbf{X}_{\\text{dyn}}$（紫色向量）的叠加。\nFig. 2. Illustration of motion decoupling\n为了实现这个解耦，作者设计了一个巧妙的 Front-end：\n双追踪器架构 (Dual-Tracker Architecture): 作者没有用一个网络去硬解这个难题，而是用了两个基于 Transformer 的追踪器。 主追踪器 $T$: 负责预测点的总观测运动 $\\mathbf{X}_{\\text{total}}$，同时还预测该点的可见性 $v$ 和一个动态标签 $m$（一个0到1之间的软标签，表示该点属于动态物体的概率）。 动态追踪器 $T_{\\text{dyn}}$: 这是一个更小、更高效的网络（层数是 $T$ 的一半），专门负责预测物体自身的动态运动分量 $\\mathbf{X}_{\\text{dyn}}$。 为什么用两个？论文在消融实验中证明，让一个网络同时学习追踪和理解复杂的运动模式是次优的。通过将任务分解给两个网络，每个网络可以更专注，从而实现更准确的解耦。 运动解耦公式: $$ \\mathbf{X}_{\\text{static}} = \\mathbf{X}_{\\text{total}} - m \\cdot \\mathbf{X}_{\\text{dyn}} \\tag{1} $$ 如果一个点是静态的（$m$ 趋近于0），那么 $\\mathbf{X}_{\\text{static}}$ 就约等于 $\\mathbf{X}_{\\text{total}}$，这符合事实。 如果一个点是动态的（$m$ 趋近于1），那么它的静态分量 $\\mathbf{X}_{\\text{static}}$ 就是其总运动减去物体自身运动。这样一来，即使是动态物体上的点，其 $\\mathbf{X}_{\\text{static}}$ 也只反映了相机的运动，使得它在 BA 看来和一个静态点是等价的。 3.2 Stage II: Bundle Adjustment (BA) 利用第一阶段“净化”后的运动信息 $\\mathbf{X}_{\\text{static}}$，运行一个鲁棒的 BA，以高精度地恢复相机位姿和场景的稀疏三维结构。\nBA 是一个经典的优化问题，通过最小化重投影误差来同时优化相机参数和三维点坐标。本作的关键在于，它将 BA 的输入进行了“改造”。\n“BA友好”的输入: BA 处理的不再是原始的、带有动态噪声的观测点，而是 Front-end 输出的运动静态分量 $\\mathbf{X}_{\\text{static}}$。这使得经典的BA框架几乎无需修改就可以直接应用于动态场景，因为动态物体的“干扰”已经在输入端被消除了。 优化目标: $$\\argmin_{{\\mathbf{T}_t},{\\mathbf{Y}}} \\sum_{|i-j|\\leq S} \\sum_n W^i_n(j) | \\mathcal{P}_j(\\mathbf{x}^i_n, y^i_n) - X^t_n(j) |_\\rho + \\alpha | y^i_n - d(\\mathbf{X}^i_n) |^2 \\tag{2}$$ 优化变量: 所有帧的相机位姿 $\\{\\mathbf{T}_t\\}$ 和所有稀疏查询点的精确深度 $\\{\\mathbf{Y}\\}$。 第一项：重投影误差: $\\mathcal{P}_j$ 是标准的投影函数，它将第 $i$ 帧的点 $\\mathbf{x}^i_n$ 根据当前的相机位姿估计 $\\{\\mathbf{T}_t\\}$ 投影到第 $j$ 帧。误差是这个投影点与 Front-end 输出的追踪点 $X^i_n(j)$ 之间的距离。 第二项：深度一致性: 这是一个正则项，它鼓励优化后的深度 $y_n^i$ 不要偏离初始的深度先验 $d(\\mathbf{X}^i_n)$ 太远，由超参数 $\\alpha$ 控制权重。 权重 
$W$: 权重 $W_n^i(j)$ 包含了可见性 $v$ 和动态标签 $m$ 的信息，形式为 $v \\cdot (1-m)$。这意味着，在优化位姿时，BA 会更相信那些可见且被判断为静态的点，从而进一步提升了相机位姿估计的鲁棒性。 3.3 Stage III: 全局优化 (Global Refinement) BA 只输出了精确的相机位姿和稀疏的三维点。而初始的深度图是密集但可能不准确且时序不一致的。此阶段的目标就是利用 BA 输出的精确稀疏点，来对齐和优化整个视频的密集深度图，得到全局一致的密集重建结果。\n作者没有直接对密集的深度图进行复杂变形，而是采用了一种更高效的 scale map 方法。\n可变形的 scale map: 对每一帧，模型学习一个比原图分辨率低的 2D scale map $\\theta_t$ 深度优化公式: 最终的优化后深度由一个简单的乘法得到 $$ \\hat{D}_t[\\mathbf{x}] = \\theta_t[\\mathbf{x}]\\cdot D_t[\\mathbf{x}] $$ 其中 $D_t[\\mathbf{x}]$ 是初始深度，$\\theta_t[\\mathbf{x}]$ 是从低分辨率 scale map 中通过双线性插值得到的缩放因子。这意味着整个优化过程是学习一个平滑的、空间可变的缩放场来校正初始深度。 两个核心损失函数: $\\mathcal{L}_\\text{depth}$: 深度一致性损失。它强制要求，在稀疏点的那些位置上，优化后的密集深度 $\\hat{D}_t$ 应该要等于 BA 阶段得到的精确稀疏深度。这相当于用精确的稀疏点作为“锚点”来校准整个密集深度场。 $\\mathcal{L}_\\text{rigid}$: 场景刚性损失，它要求场景中静态部分的任意两点之间的三维距离，在不同帧之间应该保持不变。这个损失通过一个权重 $W_\\text{static}$ 来确保只对静态点生效，从而防止了静态背景在优化过程中发生不应有的形变。 通过这三个阶段的紧密协作，BA-Track 成功地将经典 BA 的强大优化能力引入到充满挑战的动态世界中，实现了非常出色的效果。\n4. 实验分析 (Experiments) 4.1 Camera Pose Estimation Tab. 1. Camera pose evaluation results on Sintel, Shibuya, and Epic Fields datasets\nFig. 3. Qualitative camera pose estimation results on Sintel, Shibuya, and Epic Fields datasets\n如图 3 所示，直观展示了 BA-Track 相机轨迹相比其他方法更平滑、更接近 ground truth，而其他方法在这些场景下往往表现不佳。在具有快速动态内容和复杂相机运动的 Epic Fields 数据集上，BA-Track 仍能保持出色的 VO accuracy。\n4.2 Depth Evaluation Tab. 2. Depth evaluation results on Sintel, Shibuya, and Bonn datasets\n基于尺度地图的优化方法在参数较少的情况下实现了高效的优化，同时保持了相对简单的结构。\n4.3 Motion Decoupling Fig. 4. Motion decoupling on the DAVIS dataset\n图 4 展示了在 DAVIS 数据集上观测到的点运动与静态点运动情况。红色点表示动态轨迹，绿色点代表静态点，突显了三维追踪器分离观测运动中动态成分的能力。\n4.4 Ablation Study 4.4.1 Dual-Tracker Architecture Tab. 3. Ablation study of dynamic handling methods on Sintel\n4.4.2 Global Refinement Tab. 4. Ablation study on depth refinement on Bonn crowd2 sequence\n分别尝试关闭 $\\mathcal{L}_\\text{depth}$ 和 $\\mathcal{L}_\\text{rigid}$ 进行联合关闭与交替关闭的实验。完全移除两项损失函数会导致重建精度下降。\nFig. 5. 
Visualization of global refinement on the DAVIS\n图 5 展示了采用与不采用我们深度优化方法进行重建的两种直观对比。直接将单目深度图与估计的相机位姿融合，会导致重建结果不一致，并出现重复的三维结构，而我们的全局优化显著提升了三维一致性。\n5. 批判性思考 (Critical Analysis \u0026amp; Personal Thoughts) 5.1 优点 (Strengths) 思想极其优雅 (Conceptual Elegance): 本文的核心 idea 非常漂亮。它没有去硬碰硬地为动态物体建模，而是通过“运动解耦”四两拨千斤，让经典的 BA 工具重获新生。这种解决问题的思路本身就极具启发性，是顶会 Oral 的典范。 效果非常扎实: 不仅提出了新颖的理论，更在多个高难度动态场景基准上取得了 SOTA 的性能，用实验结果充分证明了想法的有效性。 完美的混合系统: 它是“传统几何优化”与“深度学习感知”成功结合的又一力作。用学习的追踪器来做它擅长的事（提供鲁棒的对应和运动先验），用经典的 BA 来做它擅长的事（高精度、有几何约束的优化），强强联合。 5.2 潜在缺点/疑问 (Weaknesses/Questions) 对深度先验的依赖: 整个 pipeline 的起点是第三方的单目深度估计网络。如果初始深度质量很差（例如在透明、反光物体上），可能会影响第一阶段运动解耦的准确性，进而误差会传导至后续的 BA 和全局优化。 运动解耦的鲁棒性: 核心公式 $\\mathbf{X}_\\text{static} = \\mathbf{X}_\\text{total} - m \\cdot \\mathbf{X}_\\text{dyn}$ 的成败，高度依赖于动态标签 $m$ 和动态运动 $\\mathbf{X}_\\text{dyn}$ 的预测精度。当动态和静态界限模糊，或物体运动模式罕见时，Front-end 的预测错误可能会对最终结果产生较大影响。 对相机内参的假设: 论文中假设相机内参是已知的 。虽然作者在附录中讨论了未来可以联合优化内参，但这在当前版本中仍是一个限制。 ","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-05/","summary":"本文巧妙地提出了一种“运动解耦”机制，通过一个学习的 3D Tracker 将动态物体的自身运动从观测运动中剥离，使得经典的 Bundle Adjustment 能够首次被统一地应用于含动态物体的场景中，极大地提升了动态场景重建中的相机位姿精度和三维重建质量。","title":"[ICCV'25 Oral] Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction 阅读报告"},{"content":"TL;DR 本文提出了一个 feed-forward 3D point tracking architecture，它将 video depth、camera pose 和 object motion 进行统一建模和 end-to-end 优化，并通过在 17 个异构数据集上的可扩展训练，实现了 SOTA 的 3D 追踪精度和推理速度。\n1. 
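Motivation 1.1 Background 3D point tracking, that is, recovering long-term 3D trajectories from monocular video, is a general-purpose dynamic scene representation with great potential in robotics, video generation, and 3D/4D reconstruction.\n1.2 Gap/Limitation of Existing Work The authors identify two core pain points of current methods:\nError accumulation from modular pipelines: most existing methods build modular pipelines from off-the-shelf vision models (such as optical flow and monocular depth estimation). Treating these parts separately ignores the strong intrinsic coupling among scene geometry, camera motion, and object motion, so errors propagate and accumulate across modules. Training data limits generalization: earlier feed-forward 3D tracking models rely heavily on supervised training with datasets that have ground-truth 3D tracks. Such datasets are hard to collect at scale, so the models perform poorly on diverse in-the-wild videos and scale badly. Optimization-based methods work well, but their per-scene optimization design makes inference slow. 1.3 Value Proposition The paper argues that scene geometry, camera motion, and object motion must be reasoned about jointly and disentangled explicitly, within a framework that can exploit diverse, weakly supervised data sources. Its value lies in a unified, differentiable end-to-end pipeline that yields a general 3D point tracker with high accuracy, high speed, and strong generalization.\n2. Key Problem Solved \u0026amp; Contribution 2.1 Key Technical Problem How can one design a scalable, feed-forward 3D tracking model that explicitly disentangles and jointly optimizes scene geometry (depth), camera ego-motion (pose), and object motion, so that it no longer depends heavily on ground-truth supervision and can use large amounts of heterogeneous video data to improve generalization and robustness?\n2.2 Core Contributions Unified Optimization Framework: a new architecture that decomposes video depth, camera pose, and pixel-wise 3D motion and integrates them into a fully differentiable, end-to-end pipeline. SyncFormer module: a core module named SyncFormer with a dual-branch (2D \u0026amp; 3D) structure that exchanges information through cross-attention. It effectively decouples trajectory updates in image space (2D) from those in camera coordinate space (3D), and supports differentiable Bundle Adjustment inside the loop. Scalable Heterogeneous Training: the framework makes large-scale joint training on 17 datasets of different types possible, with varying forms of supervision (annotated RGB-D videos, videos with poses only, and even unlabeled videos). SOTA performance: experiments show that the method improves over existing methods by more than 30% on the 3D tracking benchmark (TAPVid-3D); on dynamic reconstruction its performance matches the leading optimization-based methods while inference is 50 times faster. 3. Method Fig. 1. 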
SpatialTrackerV2 Pipeline Overview\nSpatialTrackerV2 adopts a front-end/back-end design.\n3.1 Front-end: scale-aligned video depth \u0026amp; camera pose estimation A Temporal Encoder predicts consistent video depth, while a Neural Camera Tracker produces a coarse camera estimate (pose, scale, and shift).\n$$\\mathcal{P}^{t},a,b = \\mathcal{H}(\\mathbf{P}_{tok}, \\mathbf{S}_{tok}) \\tag{1}$$ 3.2 Back-end: Joint Motion Optimization Fig. 2. SyncFormer\nCore component: SyncFormer, an iterative Transformer module. It jointly refines the 2D trajectories $\\mathcal{T}^{2d} \\in \\mathbb{R}^{T \\times N \\times 2}$ in UV space and the 3D trajectories $\\mathcal{T}^{3d} \\in \\mathbb{R}^{T \\times N \\times 3}$ in the camera coordinate system. For each trajectory it also dynamically estimates a visibility probability $p^{vis}$ and a dynamic probability $p^{dyn}$.\n$$ \\mathcal{T}^{2d}_{k+1}, \\mathcal{T}^{3d}_{k+1}, p^{vis}_{k+1}, p^{dyn}_{k+1} = f_{sync}(\\mathcal{T}^{2d}_{k}, \\mathcal{T}^{3d}_{k}, p^{vis}_{k}, p^{dyn}_{k}, \\mathcal{P}_k) \\tag{2} $$ In each iteration, SyncFormer updates the 2D trajectories, the 3D trajectories, and the camera pose simultaneously.\nThe camera pose is refined by a differentiable Bundle Adjustment step that exploits the reprojection consistency constraint between the 2D and 3D trajectories.\nThe key to SyncFormer is its decoupled two-branch (2D \u0026amp; 3D) design. The 2D and 3D embeddings are processed by self-attention within their own branches and exchange information through cross-attention between proxy tokens. This keeps the update signals of the two different spaces (image space vs. camera space) from interfering with each other.\n4. Experiments 4.1 3D Point Tracking Tab. 1. 3D Point Tracking Results\n4.2 Dynamic 3D Reconstruction 4.2.1 Video Depth Evaluation Tab. 2. Video Depth Evaluation Results\n4.2.2 Camera Poses Tab. 3. Camera Poses Evaluation Results\n4.3 Ablation Analysis Tab. 4. Ablation Study Results\nThe ablation shows that naive 3D lifting (the CoTracker3-3D baseline) causes 2D tracking performance to drop sharply (AJ falls from 64.4 to 51.6). This indicates that the decoupled two-branch design of SyncFormer is both effective and necessary, because it avoids entangling signals from different modalities.\nTab. 5. Heterogeneous Training Analysis\nThe experiments show that joint training on more, and more realistic, video datasets significantly improves performance on real-world scenes.\n5. 
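Critical Analysis \u0026amp; Personal Thoughts 5.1 Strengths Ambitious and on target: The paper pinpoints the core weakness of existing modular pipelines and proposes a logically coherent, elegant “grand unified” solution. Clever architecture: The decoupled two-branch design of SyncFormer and the differentiable BA inside the loop are a smart answer to joint 2D/3D tracking. Strong engineering: The authors carried out complex multi-stage training on 17 heterogeneous datasets, demonstrating solid engineering and the scalability of the model, which is key to its SOTA performance. 5.2 Weaknesses/Questionable Points Very high barrier to reproduction: Training is complex, split into three stages, and used 64 H20 GPUs. For researchers with limited compute, reproducing or building on it is nearly impossible. Generalization to long videos: Training clips in the paper are 12-48 frames long, and the longest test video has 300 frames. How accumulated error and compute cost behave on much longer videos (several minutes, say) is not explored in depth. Insufficient analysis of failure cases: The qualitative results are impressive, but the paper lacks an in-depth analysis of typical failure cases, such as extreme lighting, fast motion blur, or large textureless regions. 5.3 Ideas to Borrow “Decompose + unify”: Breaking a complex problem into several better-defined subproblems and then designing a unified framework to optimize them jointly is an idea worth borrowing. Heterogeneous-data training strategy: For a new task, this work is a useful reference on how to combine datasets with different forms of supervision to improve generalization. SyncFormer Architecture Pattern: In multimodal or multi-task learning, when the feature spaces or update dynamics of different tasks do not match, a similar decouple-then-interact structure may be a generally effective strategy. ","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-04/","summary":"This paper proposes a feed-forward 3D point tracking architecture that jointly models video depth, camera pose, and object motion and optimizes them end-to-end. Through scalable training on 17 heterogeneous datasets, it achieves SOTA 3D tracking accuracy and inference speed.","title":"[ICCV'25] SpatialTrackerV2: 3D Point Tracking Made Easy Reading Report"},{"content":"0. Environment Platform configuration: Fig. 1. Operating system environment Neovim version: $ nvim --version NVIM v0.11.3 Build type: Release LuaJIT 2.1.1748459687 Run \u0026#34;nvim -V1 -v\u0026#34; for more info Goal: a lightweight C++ development environment with auto-completion, syntax highlighting, commenting, bracket completion, and similar features. 1. Install NvChad git clone https://github.com/NvChad/starter ~/.config/nvim \u0026amp;\u0026amp; nvim After installation, the default file layout under ~/.config/nvim looks like this:\n$ tree nvim/ Wed Jul 23 19:44:25 2025 nvim/ ├── init.lua ├── LICENSE ├── lua │ ├── autocmds.lua │ ├── chadrc.lua │ ├── configs │ │ ├── conform.lua │ │ ├── lazy.lua │ │ └── lspconfig.lua │ ├── mappings.lua │ ├── options.lua │ └── plugins │ └── init.lua └── README.md 4 directories, 11 files 2. 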
Configure NvChad 2.1 Configure LSP The Language Server Protocol (LSP) is used for communication between an editor or IDE and a language server. The server provides language-specific features such as code completion, error checking, and go-to-definition, while LSP ensures interoperability between different editors and language servers.\nclangd is already installed system-wide and can be invoked directly from the command line, so there is no need to install another copy through Mason. Just add one line to ~/.config/nvim/lua/configs/lspconfig.lua:\nrequire(\u0026#34;lspconfig\u0026#34;).clangd.setup {} Editing C/C++ files in Neovim now comes with code completion and related features.\n2.2 Configure indentation and formatting Configure this in ~/.config/nvim/lua/options.lua. The original file uses \u0026ldquo;nvchad.options\u0026rdquo;, whose default indentation is 2; I changed it to 4. The modified file is:\n-- ~/.config/nvim/lua/options.lua require \u0026#34;nvchad.options\u0026#34; -- add yours here! local o = vim.o o.cursorlineopt =\u0026#39;both\u0026#39; -- to enable cursorline! -- Indenting o.expandtab = true o.shiftwidth = 4 o.tabstop = 4 o.softtabstop = 4 Then configure ~/.config/nvim/lua/autocmds.lua so that clangd formats the code according to .clang-format.\n-- ~/.config/nvim/lua/autocmds.lua require \u0026#34;nvchad.autocmds\u0026#34; local autocmd = vim.api.nvim_create_autocmd -- Format C/C++ files with the applicable .clang-format on save autocmd(\u0026#34;BufWritePre\u0026#34;, { pattern = { \u0026#34;*.cpp\u0026#34;, \u0026#34;*.cc\u0026#34;, \u0026#34;*.c\u0026#34;, \u0026#34;*.h\u0026#34;, \u0026#34;*.hpp\u0026#34; }, callback = function() vim.lsp.buf.format({ async = false }) end, }) The LSP server (clangd) looks for a .clang-format file automatically, in this order:\ncurrent file directory -\u0026gt; parent directory -\u0026gt; ... 
-\u0026gt; root directory -\u0026gt; user home directory ($HOME) So a global configuration file can be placed in the home directory (it is picked up for any file that lives somewhere under $HOME, since the search walks up through parent directories). If a specific project has stricter requirements, just add another configuration file in a directory closer to the file or in the project directory.\nThe .clang-format configuration I currently use, for reference:\nLanguage: Cpp BasedOnStyle: LLVM AccessModifierOffset: -4 NamespaceIndentation: All IndentWidth: 4 TabWidth: 4 UseTab: Never BreakBeforeBraces: Attach IndentCaseLabels: true ColumnLimit: 120 # Align all equals signs in consecutive assignments AlignConsecutiveAssignments: true # Align all declared variable names in consecutive declarations AlignConsecutiveDeclarations: true AlignTrailingComments: true AllowAllArgumentsOnNextLine: true AllowAllConstructorInitializersOnNextLine: true AllowAllParametersOfDeclarationOnNextLine: true AllowShortBlocksOnASingleLine: Empty AlwaysBreakAfterReturnType: None AllowShortIfStatementsOnASingleLine: true BinPackArguments: false BinPackParameters: false 2.3 Configure one-key compile and run This sets up a shortcut to compile and run a single file, mainly for lightweight C++ programming. Projects built with tools such as CMake can be configured along the same lines. Add the following to ~/.config/nvim/lua/mappings.lua:\n-- Compile \u0026amp; run current C++ file map(\u0026#39;n\u0026#39;, \u0026#39;\u0026lt;leader\u0026gt;rr\u0026#39;, function() local filename = vim.fn.expand(\u0026#39;%:t\u0026#39;) local output = vim.fn.expand(\u0026#39;%:r\u0026#39;) local cmd = string.format(\u0026#39;g++ -std=c++14 -O2 -Wall \u0026#34;%s\u0026#34; -o \u0026#34;%s\u0026#34; \u0026amp;\u0026amp; ./%s; rm \u0026#34;%s\u0026#34;\u0026#39;, filename, output, output, output) vim.cmd(\u0026#39;split | terminal \u0026#39; .. cmd) end, { noremap = true, silent = true, desc = \u0026#34;Compile \u0026amp; run current C++ file\u0026#34; }) Pressing the \u0026lt;leader\u0026gt;rr shortcut (the default \u0026lt;leader\u0026gt; key is Space) compiles the currently open file with the C++14 standard and runs it, then deletes the compiled executable once the run finishes. While running, a split buffer is opened to show the program output.\nFig. 2. Hints shown after pressing \u0026lt;leader\u0026gt; Fig. 3. Hints shown after pressing r Fig. 4. Pressing r again starts the program Fig. 5. Press i to enter INSERT mode and type input to the running program ⚠️Note: the execution pane also requires pressing i to enter INSERT mode.\n3. 
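Extra configuration for competitive programming (personal) 3.1 One-key copy After running code locally I often need to copy it and submit it to an online judge (OJ), so I added a shortcut that copies the code being edited to the clipboard. Append it to the end of ~/.config/nvim/lua/mappings.lua:\n-- Copy the entire current file to the system clipboard map(\u0026#39;n\u0026#39;, \u0026#39;\u0026lt;leader\u0026gt;rc\u0026#39;, function() vim.cmd(\u0026#39;:%y+\u0026#39;) end, { noremap = true, silent = true, desc = \u0026#34;Copy entire file to clipboard\u0026#34; }) Press Space, r, c in sequence and the code is copied to the clipboard.\n3.2 More shortcuts More shortcuts can be defined by following the examples above.\n4. Enjoy It is now ready for writing simple single-file C++ programs, a perfect fit for competitive programming (ICPC/OI, etc.)!\nLet us try a real contest problem: P11361 [NOIP2024] 编辑字符串\nFig. 6. Accepted (AC) code for P11361 in Neovim Fig. 7. Submission result for P11361 ","permalink":"https://gavinsun0921.github.io/posts/neovim01/","summary":"This tutorial records how to set up a simple C++ development environment in Neovim with NvChad on macOS.","title":"Setting Up a Simple C++ Development Environment in Neovim"},{"content":"1. What is this tutorial for? In research work, the CUDA Toolkit version a regular user needs on a server does not necessarily match the version already installed there. Even without ROOT privileges, you can install the CUDA Toolkit version you need together with the matching cuDNN. This tutorial was carried out on Ubuntu.\n⚠️Note: Without ROOT privileges you should not (and cannot) modify the driver (CUDA Driver) in any way. If the driver version is too old, contact the server administrator.\n1.1 Use the CUDA Toolkit and cuDNN already on the server A typical server already has a CUDA Toolkit and cuDNN installed, usually under /usr/local/. To use them you only need to set environment variables!\nGenerally, add the commands below to ~/.bashrc. The CUDA_HOME environment variable is the CUDA location (if several versions exist, choose which one to enable).\nexport CUDA_HOME=/usr/local/cuda export CPATH=$CUDA_HOME/include:$CPATH export PATH=$CUDA_HOME/bin:$PATH export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH export LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH ⚠️Note: After setting the environment variables, log in again or run source ~/.bashrc for them to take effect.\n2. Install CUDA Toolkit and cuDNN without ROOT privileges 2.1 Install CUDA Toolkit First decide which CUDA Toolkit version you need, then download it from the official site: CUDA Toolkit Archive.\nTaking CUDA Toolkit 12.1.0 as an example, choose according to the server architecture and OS version, and select runfile(local) as the Installer Type.\nFig. 1. CUDA installer download page, screenshot 1 After the selections are made, the download commands appear below. Run only the wget command (the one in the red box) to download the file; since we have no ROOT privileges, we cannot run the script with sudo.\nFig. 2. 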
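CUDA installer download page, screenshot 2 Because we have no ROOT privileges, the CUDA Toolkit cannot be installed into /usr/local/. Create a directory for the CUDA Toolkit in advance, which is needed later when running the installer, for example: /home/gavin/environment/cuda-12.1.\nThen make the script executable with the commands below and run the installer (it may appear to hang for a while):\nchmod +x cuda_*.run sh cuda_*.run The CUDA installer we downloaded bundles both the CUDA Driver and the CUDA Toolkit, which is why this prompt appears. We are not installing the driver, so it does not matter; choose Continue.\nFig. 3. CUDA installer, screenshot 1 Type accept to agree to the license agreement (there is no way around it).\nFig. 4. CUDA installer, screenshot 2 Use Enter to uncheck everything except CUDA Toolkit 12.1. Then do not choose Install; choose Options.\nFig. 5. CUDA installer, screenshot 3 Choose Toolkit Options.\nFig. 6. CUDA installer, screenshot 4 Uncheck all options and change the install path (choose Change Toolkit Install Path).\nFig. 7. CUDA installer, screenshot 5 Enter the directory created earlier (copy and paste is recommended to avoid mistakes; pwd prints the absolute path of the current location), then press Enter to confirm.\nFig. 8. CUDA installer, screenshot 6 Then choose Done repeatedly to return to the screen shown in Fig. 5 and choose Install. When it finishes, the following message is shown.\nFig. 9. CUDA installer, screenshot 7 2.2 Enable the newly installed CUDA Toolkit Set the environment variables as in Section 1.1.\nAdd the commands below to ~/.bashrc. The CUDA_HOME environment variable is the CUDA location (if several versions exist, choose which one to enable).\nexport CUDA_HOME=/home/gavin/environment/cuda-12.1 # use your own install path export CPATH=$CUDA_HOME/include:$CPATH export PATH=$CUDA_HOME/bin:$PATH export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH export LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH ⚠️Note: After setting the environment variables, log in again or run source ~/.bashrc for them to take effect.\nAfter running nvcc -V in a terminal, the output should match the CUDA Toolkit version you installed.\nFig. 10. Check after setting the environment variables 2.3 Install the matching cuDNN Choose the cuDNN package that matches your CUDA Toolkit version from the cuDNN Archive.\nFig. 11. Screenshot of the cuDNN Archive page Choose the Linux x86_64 (Tar) package rather than the Deb package, because we are installing to a custom path without ROOT privileges, and installing cuDNN is simple: just copy the files.\nFig. 12. Downloading the cuDNN Tar package After downloading, extract the archive with the commands below:\nxz -d cudnn-*-archive.tar.xz tar -xvf cudnn-*-archive.tar Then copy the files into the CUDA Toolkit install directory\ncp cudnn-*-archive/include/cudnn*.h $CUDA_HOME/include cp -P cudnn-*-archive/lib/libcudnn* $CUDA_HOME/lib64 chmod a+r $CUDA_HOME/include/cudnn*.h $CUDA_HOME/lib64/libcudnn* 2.4 ⚠️Note: PyTorch does not report the local CUDA and cuDNN versions This value is the CUDA version the installed PyTorch build was compiled against, not the version of the locally installed CUDA Toolkit.\nFig. 13. Screenshot of the torch.version.cuda output This value is the cuDNN version bundled inside PyTorch, not the locally installed cuDNN version. An output of 90100 means the bundled cuDNN version is 9.1.0.\nFig. 14. 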
Screenshot of the torch.backends.cudnn.version() output ","permalink":"https://gavinsun0921.github.io/posts/install_cuda_and_cudnn/","summary":"This tutorial is a complete guide to installing the CUDA Toolkit and cuDNN for users of Ubuntu servers without ROOT privileges. It deploys the CUDA Toolkit into a user directory with the runfile installer, changing the install path through the interactive interface and keeping only the necessary components. It also points out the difference between the versions bundled with PyTorch and the locally installed versions, so that developers do not misjudge the state of their environment.","title":"Installing CUDA and cuDNN on a Server without ROOT Privileges"},{"content":"This is the second post in the Paper Research series, in which I keep adding personal study notes on papers I read. This post introduces the basics of the variational autoencoder (VAE), including the derivation of the formulas and a simple code verification.\nAutoencoder An autoencoder is a neural network designed to learn an identity function in an unsupervised way: it reconstructs the original input while compressing the data in the process, so as to discover a more efficient, compressed representation. The autoencoder was first proposed as a nonlinear generalization of principal component analysis (PCA) by Kramer (1991), and was later popularized by the seminal paper of Hinton \u0026amp; Salakhutdinov (2006).\nIt consists of two networks:\nEncoder network $g_\\phi$: It translates the original high-dimensional input into the latent low-dimensional code. The input size is larger than the output size. Decoder network $f_\\theta$: The decoder network recovers the high-dimensional data from the latent low-dimensional code. The input size is smaller than the output size. Fig. 1. Illustration of autoencoder model architecture. (Image source: Weng, 2018) The encoder network essentially performs dimensionality reduction, much as Principal Component Analysis (PCA) or Matrix Factorization (MF) would. In addition, the autoencoder is explicitly optimized for reconstructing the data from the code. 
A good intermediate representation not only captures latent variables but also benefits the full decompression process.\nThe model contains an encoder function $g_\\phi(\\cdot)$ parameterized by $\\phi$ and a decoder function $f_\\theta(\\cdot)$ parameterized by $\\theta$. The latent low-dimensional code learned for input $\\mathbf{x}$ in the bottleneck layer is $\\mathbf{z} = g_\\phi(\\mathbf{x})$ and the reconstructed input is $\\mathbf{x}' = f_\\theta(\\mathbf{z}) = f_\\theta(g_\\phi(\\mathbf{x}))$. The parameters $(\\phi, \\theta)$ are learned together so that the reconstructed sample matches the original input, $\\mathbf{x} \\approx f_\\theta(g_\\phi(\\mathbf{x}))$.\nVAE The idea of the Variational Autoencoder (Kingma \u0026amp; Welling, 2014), VAE for short, is actually less similar to the autoencoder model above and is instead deeply rooted in variational Bayesian methods and graphical models.\nFig. 2. The type of directed graphical model under consideration. (Image source: Kingma \u0026 Welling, 2014) Instead of mapping the input into a fixed vector, we want to map it into a distribution. In Fig. 2, solid lines denote the generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$ with the analytically tractable prior distribution $p_\\theta(\\mathbf{z})$; dashed lines denote the variational approximation $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ to the intractable posterior distribution $p_\\text{data}(\\mathbf{z} \\mid \\mathbf{x})$. The variational parameters $\\phi$ are learned jointly with the generative model parameters $\\theta$.\nProblem Scenario Suppose our dataset consists of i.i.d. samples $\\{ \\mathbf{x}_i \\in \\mathbb{R}^D \\} _ {i=1} ^N$ from an unknown data distribution $p_\\text{data}(\\mathbf{x})$. 
We wish to represent the distribution $p_\\text{data}(\\mathbf{x})$ of $\\mathbf{x}$ with the help of the latent variable $\\mathbf{z}$.\n$$ p_\\theta(\\mathbf{x}) = \\int p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) p_\\theta(\\mathbf{z}) \\mathrm{d} \\mathbf{z} \\tag{1} $$\nWe want $p_\\theta(\\mathbf{x})$ to approximate $p_\\text{data}(\\mathbf{x})$, so that (theoretically) we both represent $p_\\text{data}(\\mathbf{x})$ in terms of the latent variable $\\mathbf{z}$ and obtain the generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$, killing two birds with one stone.\nLoss Function: ELBO Assume that we know the true parameter $\\theta^*$ of this distribution. To generate a sample that looks like a real data point $\\mathbf{x}^{(i)}$, we follow these steps:\nSample a $\\mathbf{z}^{(i)}$ from the prior distribution $p_{\\theta^*}(\\mathbf{z})$. Generate $\\mathbf{x}^{(i)}$ from the conditional distribution (generative model) $p_{\\theta^*}(\\mathbf{x} \\mid \\mathbf{z} = \\mathbf{z}^{(i)})$. The optimal parameter $\\theta^{*}$ is the one that maximizes the probability of generating the real data samples:\n$$ \\theta^* = \\argmax_\\theta \\prod_{i=1}^N p_\\theta(\\mathbf{x}^{(i)}) \\tag{2} $$\nWe commonly take the logarithm to convert the product on the RHS into a sum:\n$$ \\theta^* = \\argmax_\\theta \\sum_{i=1}^N \\log p_\\theta(\\mathbf{x}^{(i)}) \\tag{3} $$\nUnfortunately, the integral for $p_\\theta(\\mathbf{x})$ in Eq. (1) is intractable to compute directly. Kingma \u0026amp; Welling (2014) chose to use $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ to approximate $p_\\text{data}(\\mathbf{z} \\mid \\mathbf{x})$. They focused primarily on the posterior $p_\\text{data}(\\mathbf{z} \\mid \\mathbf{x})$, which is difficult to compute, so the EM algorithm cannot be applied to this problem. Su (2018) offers another approach: working directly with the joint distribution. 
First we write out the joint probability distribution of the prior distribution $p_\\theta(\\mathbf{z})$ and the conditional distribution (generative model) $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$:\n$$ p_\\theta(\\mathbf{x}, \\mathbf{z}) = p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) p_\\theta(\\mathbf{z}) \\tag{4} $$\nWe define the joint probability distribution $q_\\phi(\\mathbf{x}, \\mathbf{z})$ based on the data distribution $p_\\text{data}(\\mathbf{x})$ and the variational approximation $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ of the posterior distribution.\n$$ q_\\phi(\\mathbf{x}, \\mathbf{z}) = q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) p_\\text{data}(\\mathbf{x}) \\tag{5} $$\nWe want these two joint probability distributions to be as close as possible, so we measure the distance between them with the KL divergence and make it as small as possible.\n$$ \\begin{align} D_\\text{KL}(q_\\phi(\\mathbf{x},\\mathbf{z}) \\| p_\\theta(\\mathbf{x}, \\mathbf{z})) \u0026amp; = \\int\\int q_\\phi(\\mathbf{x}, \\mathbf{z}) \\ln \\frac{q_\\phi(\\mathbf{x}, \\mathbf{z})}{p_\\theta(\\mathbf{x}, \\mathbf{z})} \\mathrm{d}\\mathbf{z} \\mathrm{d}\\mathbf{x} \\\\ \u0026amp;= \\int\\int p_\\text{data}(\\mathbf{x})q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{p_\\text{data}(\\mathbf{x})q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\mathrm{d} \\mathbf{x} \\\\ \u0026amp;= \\int p_\\text{data}(\\mathbf{x}) \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{p_\\text{data}(\\mathbf{x})q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d}\\mathbf{z} \\right ] \\mathrm{d} \\mathbf{x} \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{p_\\text{data}(\\mathbf{x})q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} 
\\mathrm{d}\\mathbf{z} \\right ] \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln p_\\text{data}(\\mathbf{x}) \\mathrm{d} \\mathbf{z} + \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d}\\mathbf{z} \\right ] \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\ln p_\\text{data}(\\mathbf{x}) {\\color{blue} \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\mathrm{d} \\mathbf{z}} \\right ] + \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\right ] \\\\ \u0026amp;= {\\color{red} \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\ln p_\\text{data}(\\mathbf{x}) \\right ]} + \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\right ] \\end{align} \\tag{6} $$ where the integral in the blue part equals 1, because $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ is a normalized density over $\\mathbf{z}$.\n$p_\\text{data}(\\mathbf{x})$ is the distribution of $\\mathbf{x}$ determined by the samples $\\mathbf{x}_1, \\mathbf{x}_2, \\cdots, \\mathbf{x}_N$. Although we cannot write its expression explicitly, it does exist. So for any particular dataset, the red part in Eq. 
(6) is a constant.\nSo the loss function can be written as:\n$$ \\mathcal{L} = D_\\text{KL}(q_\\phi(\\mathbf{x},\\mathbf{z}) \\| p_\\theta(\\mathbf{x}, \\mathbf{z})) - {\\color{red} C_\\text{data}} = \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{\\color{blue} p_\\theta(\\mathbf{x},\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\right ] \\tag{7} $$\nBecause of the nonnegativity of the KL divergence, our loss function possesses a lower bound $-\\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\ln p_\\text{data}(\\mathbf{x}) \\right ]$.\nTo obtain the generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$, we write $p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) p_\\theta(\\mathbf{z})$ for the joint probability distribution $p_\\theta(\\mathbf{x},\\mathbf{z})$ of the blue part in Eq. (7):\n$$ \\begin{align} \\mathcal{L} \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) p_\\theta(\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\right ] \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\left ( \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{z})} - \\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) \\right ) \\mathrm{d} \\mathbf{z} \\right ] \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{z})} \\mathrm{d} \\mathbf{z} - \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) \\mathrm{d} \\mathbf{z} \\right ] \\\\ \u0026amp;= \\mathbb{E}_{\\mathbf{x} \\sim p_\\text{data}(\\mathbf{x})} \\left [ 
D_\\text{KL}(q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\| p_\\theta(\\mathbf{z})) - \\mathbb{E}_{\\mathbf{z} \\sim q_\\phi(\\mathbf{z} \\mid \\mathbf{x})} [\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z})] \\right ] \\end{align} \\tag{8} $$\nThe expression inside the outer brackets of Eq. (8) is the loss function of the VAE. Note that although the loss function in Eq. (8) is composed of two parts, it cannot be viewed as two optimization problems in which the parts are minimized separately.\nWhen $D_\\text{KL}(q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\| p_\\theta(\\mathbf{z}))$ is 0, there is no difference between the two distributions $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ and $p_\\theta(\\mathbf{z})$, i.e., $\\mathbf{x}$ and $\\mathbf{z}$ are independent of each other. Predicting $\\mathbf{x}$ from $\\mathbf{z}$ must then be inaccurate, i.e., $-\\mathbb{E}_{\\mathbf{z} \\sim q_\\phi(\\mathbf{z} \\mid \\mathbf{x})} [\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z})]$ cannot be small. When $-\\mathbb{E}_{\\mathbf{z} \\sim q_\\phi(\\mathbf{z} \\mid \\mathbf{x})} [\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z})]$ is small, $\\mathbb{E}_{\\mathbf{z} \\sim q_\\phi(\\mathbf{z} \\mid \\mathbf{x})} \\left [ \\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) \\right ]$ is large, i.e., predicting $\\mathbf{x}$ from $\\mathbf{z}$ is very accurate. The relationship between $\\mathbf{x}$ and $\\mathbf{z}$ is then very strong, i.e., $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ is not too random, so $D_\\text{KL}(q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\| p_\\theta(\\mathbf{z}))$ will not be small. So these two parts of the loss are actually antagonistic, and the loss function must be viewed as a whole rather than as separate parts. In fact, this is exactly what GANs dream of: a single overall metric that indicates the training progress of the generative model. 
This capability comes naturally with VAE models, whereas GANs lacked it until WGAN.\nBuild the Network So far, there are three distributions in the loss function in Eq. (8) that we don\u0026rsquo;t know: $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$, $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$, and $p_\\theta(\\mathbf{z})$. As for $p_\\text{data}(\\mathbf{x})$, while we can\u0026rsquo;t write its expression explicitly, sampling from it is easy (just draw samples from the dataset). To solve the problem in practice, we need to identify the three unknown distributions mentioned above or determine their form.\nFig. 3. Illustration of variational autoencoder model with the multivariate Gaussian assumption.\n(Image source: Weng, 2018) 1) latent variable distribution\nTo make it easy for the generative model to sample the latent variable $\\mathbf{z}$ when generating samples, we assume that $\\mathbf{z} \\sim p_\\theta(\\mathbf{z}) = \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$.\n2) posterior distribution approximation\nWe assume that $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ is also multivariate normally distributed (with independent components), with its mean and variance determined by $\\mathbf{x}$. The \u0026ldquo;determination\u0026rdquo; process is in fact a neural network with parameter $\\phi$. $$ \\begin{align} q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) = \\frac{1}{\\prod\\limits_{j=1}^{\\mathcal{J}} \\sqrt{2\\pi[\\pmb{\\sigma}_\\phi(\\mathbf{x})]_j^2}} \\exp \\left ( - \\frac{1}{2} \\left \\| \\frac{\\mathbf{z} - \\pmb{\\mu}_\\phi(\\mathbf{x})}{\\pmb{\\sigma}_\\phi(\\mathbf{x})} \\right \\|^2 \\right ) \\end{align} \\tag{9} $$\nHere the product runs over the $\\mathcal{J}$ components of $\\mathbf{z}$ and the division is element-wise. Therefore, the KL divergence part of the loss function in Eq. 
(8) can be written in closed form by referring to Appendix B in Kingma \u0026amp; Welling (2014):\n$$ \\begin{align} D_\\text{KL}(q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\| p_\\theta(\\mathbf{z})) \u0026amp;= \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln \\frac{q_\\phi(\\mathbf{z} \\mid \\mathbf{x})}{p_\\theta(\\mathbf{z})} \\mathrm{d} \\mathbf{z} \\\\ \u0026amp;= \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) (\\ln q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) - \\ln p_\\theta(\\mathbf{z})) \\mathrm{d} \\mathbf{z} \\\\ \u0026amp;= {\\color{blue} \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\mathrm{d} \\mathbf{z}} - {\\color{red} \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln p_\\theta(\\mathbf{z}) \\mathrm{d} \\mathbf{z}} \\end{align} \\tag{10} $$\nLet $\\mathcal{J}$ be the dimensionality of $\\mathbf{z}$, and let $\\mu_j$ and $\\sigma_j$ denote the $j\\text{-th}$ element of $\\pmb{\\mu}_\\phi(\\mathbf{x})$ and $\\pmb{\\sigma}_\\phi(\\mathbf{x})$. The blue part of Eq. (10): $$ \\begin{align} {\\color{blue} \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\mathrm{d} \\mathbf{z}} \u0026amp;= \\int \\mathcal{N}(\\mathbf{z}; \\pmb{\\mu}_\\phi(\\mathbf{x}), \\pmb{\\sigma}_\\phi^2(\\mathbf{x})) \\ln \\mathcal{N}(\\mathbf{z}; \\pmb{\\mu}_\\phi(\\mathbf{x}), \\pmb{\\sigma}_\\phi^2(\\mathbf{x})) \\mathrm{d} \\mathbf{z} \\\\ \u0026amp;= - \\frac{\\mathcal{J}}{2} \\ln(2\\pi) - \\frac{1}{2} \\sum_{j=1}^{\\mathcal{J}} (1 + \\ln \\sigma_j^2) \\end{align} \\tag{11} $$\nThe red part of Eq. 
(10): $$ \\begin{align} {\\color{red} \\int q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\ln p_\\theta(\\mathbf{z}) \\mathrm{d} \\mathbf{z}} \u0026amp;= \\int \\mathcal{N}(\\mathbf{z}; \\pmb{\\mu}_\\phi(\\mathbf{x}), \\pmb{\\sigma}_\\phi^2(\\mathbf{x})) \\ln \\mathcal{N}(\\mathbf{z}; \\mathbf{0}, \\mathbf{I}) \\mathrm{d} \\mathbf{z} \\\\ \u0026amp;= - \\frac{\\mathcal{J}}{2} \\ln(2\\pi) - \\frac{1}{2} \\sum_{j=1}^{\\mathcal{J}} (\\mu_j^2 + \\sigma_j^2) \\end{align} \\tag{12} $$\nSubtracting Eq. (12) from Eq. (11), the $\\ln(2\\pi)$ terms cancel, and Eq. (10) can be written as\n$$ D_\\text{KL}(q_\\phi(\\mathbf{z} \\mid \\mathbf{x}) \\| p_\\theta(\\mathbf{z})) = \\frac{1}{2} \\sum_{j=1}^{\\mathcal{J}} (\\mu_j^2 + \\sigma_j^2 - \\ln \\sigma_j^2 - 1) \\tag{13} $$\n3) generative model approximation\nFor the distributional assumption on the generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$, Kingma \u0026amp; Welling (2014) give two options: a Bernoulli distribution or a Normal distribution.\n3.1) Bernoulli distribution\nThe Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability $\\rho$ and the value 0 with probability $1-\\rho$. The probability mass function $\\operatorname{f}$ of this distribution, over possible outcomes $\\xi$, is\n$$ \\operatorname{f}(\\xi) = \\begin{cases} \\rho \u0026amp;,\\text{if } \\xi = 1 \\\\ 1 - \\rho \u0026amp;,\\text{if } \\xi = 0 \\end{cases} \\tag{14} $$\nSo a Bernoulli generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$ is only appropriate when $\\mathbf{x}$ is a multivariate binary vector, since the Bernoulli distribution can only produce 0s and 1s. The MNIST dataset we\u0026rsquo;ll be working on later for a simple code demonstration can be viewed as this case. 
In this case, we use the neural network $\\rho(\\mathbf{z})$ to compute the parameter $\\rho$ and thus obtain\n$$ p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) = \\prod_{k=1}^D \\left ( \\rho_{(k)}(\\mathbf{z}) \\right )^{\\mathbf{x}_{(k)}} \\left ( 1 - \\rho_{(k)}(\\mathbf{z}) \\right )^{1 - \\mathbf{x}_{(k)}} \\tag{15} $$\nwhere $D$ is the dimensionality of $\\mathbf{x}$. Thus, from the preceding equation, we deduce that\n$$ -\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) = \\sum_{k=1}^D \\left [ -\\mathbf{x}_{(k)} \\ln \\rho_{(k)}(\\mathbf{z}) - (1 - \\mathbf{x}_{(k)}) \\ln \\left (1 - \\rho_{(k)}(\\mathbf{z}) \\right ) \\right] \\tag{16} $$\nThis means that $\\rho(\\mathbf{z})$ has to be squashed into the range between 0 and 1 (e.g., with a sigmoid activation), and cross entropy is then used as the loss function, with $\\rho(\\mathbf{z})$ playing a role similar to that of a decoder.\n3.2) Normal distribution / Gaussian distribution\nIn statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is\n$$ \\operatorname{f}(x) = \\frac{1}{\\sigma\\sqrt{2\\pi}} \\exp \\left \\{ -\\frac{1}{2}(\\frac{x-\\mu}{\\sigma})^2 \\right \\} \\tag{17} $$\nWhen the generative model $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$ is normally distributed, it is suitable for general data. 
In this case, we use the neural networks $\\tilde{\\mu}(\\mathbf{z})$ and $\\tilde{\\sigma}(\\mathbf{z})$ to compute the mean and standard deviation, which yields $$ p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) = \\frac{1}{\\prod\\limits_{k=1}^D \\sqrt{2 \\pi \\tilde{\\sigma}_{(k)}^2(\\mathbf{z})}} \\exp \\left ( -\\frac{1}{2} \\left \\| \\frac{\\mathbf{x} - \\tilde{\\mu}(\\mathbf{z})}{\\tilde{\\sigma}(\\mathbf{z})} \\right \\|^2 \\right ) \\tag{18} $$\nThus, from the preceding equation, we deduce that $$ -\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) = \\frac{1}{2} \\left \\| \\frac{\\mathbf{x} - \\tilde{\\mu}(\\mathbf{z})}{\\tilde{\\sigma}(\\mathbf{z})} \\right \\|^2 + \\frac{D}{2} \\ln 2\\pi + \\frac{1}{2} \\sum_{k=1}^D \\ln \\tilde{\\sigma}_{(k)}^2(\\mathbf{z}) \\tag{19} $$\nFor ease of computation, we usually fix the standard deviation $\\tilde{\\sigma}(\\mathbf{z})$ to a constant $\\sigma^*$. Then the above equation can be written as\n$$ -\\ln p_\\theta(\\mathbf{x} \\mid \\mathbf{z}) = \\frac{1}{2 (\\sigma^*)^2} \\left \\| \\mathbf{x} - \\tilde{\\mu}(\\mathbf{z}) \\right \\|^2 + C \\tag{20} $$\nSo this part becomes the MSE loss function, up to a constant scale and offset.\nSampling Calculation After the previous efforts, we can finally write the loss function in Eq. (8) concretely, having made assumptions about each of the three previously undetermined distributions $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$, $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$, and $p_\\theta(\\mathbf{z})$. When $q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$ and $p_\\theta(\\mathbf{z})$ are both Gaussian distributions, the KL divergence part of Eq. (8) becomes Eq. (13). We have also written out the generative-model part of the loss when $p_\\theta(\\mathbf{x} \\mid \\mathbf{z})$ is either Bernoulli or Gaussian distributed. So the only thing still missing is sampling from the model.\nThe expectation term in the loss function requires generating samples $\\mathbf{z}\\sim q_\\phi(\\mathbf{z} \\mid \\mathbf{x})$. 
Sampling is a stochastic process, so we cannot backpropagate the gradient through it. To make it trainable, the reparameterization trick is introduced: It is often possible to express the random variable $\\mathbf{z}$ as a deterministic variable $\\mathbf{z} = \\mathcal{F}_\\phi(\\mathbf{x}, \\epsilon)$, where $\\epsilon$ is an auxiliary independent random variable, and the transformation function $\\mathcal{F}_\\phi$ parameterized by $\\phi$ converts $\\epsilon$ to $\\mathbf{z}$.\nReparameterization Trick in Lil\u0026rsquo;Log\nSimple Code Implementation You can open it in Colab and run the code for free.\nReferences [1] Diederik P. Kingma \u0026amp; Max Welling. “Auto-Encoding Variational Bayes.” ICLR 2014.\n[2] Mark A. Kramer. “Nonlinear Principal Component Analysis Using Autoassociative Neural Networks.” AIChE Journal 1991.\n[3] Geoffrey E. Hinton \u0026amp; Ruslan R. Salakhutdinov. “Reducing the Dimensionality of Data with Neural Networks.” Science 2006.\n[4] Lilian Weng. “From Autoencoder to Beta-VAE.” [Blog post] Lil\u0026rsquo;Log 2018.\n[5] Jianlin Su. “变分自编码器（二）：从贝叶斯观点出发.” [Blog post] Scientific Spaces 2018.\n","permalink":"https://gavinsun0921.github.io/posts/paper-research-02/","summary":"Learn variational autoencoder (VAE) by reading and analyzing the paper: \u0026ldquo;Auto-Encoding Variational Bayes\u0026rdquo;. This post will introduce the basic work of VAE, including the derivation of formulas and simple code verification.","title":"A Brief Exploration to Variational Autoencoder (VAE) with Code Implementation"},{"content":"Overview This paper introduces a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. It is also important background for learning about score-based generative networks and Ito diffusion SDEs. In this paper, the training and inference phases are analyzed separately and solutions are proposed for different problems. 
Different levels of noise are used during training to overcome the problem that gradients can be ill-defined and hard to estimate when the data resides on low-dimensional manifolds. For sampling, the authors propose annealed Langevin dynamics, which uses gradients corresponding to gradually decreasing noise levels as the sampling process gets closer to the data manifold. The models in this paper produce samples comparable to GANs on MNIST, CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score of 8.87 on CIFAR-10.\nScore-based Generative Modeling Definition of Score Suppose our dataset consists of i.i.d. samples $\\{ \\mathbf{x}_i \\in \\mathbb{R}^D \\} _ {i=1} ^N$ from an unknown data distribution $p_\\text{data}(\\mathbf{x})$.\nWe define the score of a probability density $p(\\mathbf{x})$ to be $\\nabla_\\mathbf{x}\\log p(\\mathbf{x})$. The score network $\\mathbf{s}_\\mathbf{\\theta}:\\mathbb{R}^D \\to \\mathbb{R}^D$ is a neural network parameterized by $\\mathbf{\\theta}$, which will be trained to approximate the score of $p_\\text{data}(\\mathbf{x})$. The goal of generative modeling is to use the dataset to learn a model for generating new samples from $p_\\text{data}(\\mathbf{x})$. The framework of score-based generative modeling has two ingredients: score matching and Langevin dynamics.\nScore Matching for Score Estimation Score matching (Aapo Hyvärinen, 2005) was originally designed for learning non-normalized statistical models based on i.i.d. samples from an unknown data distribution. Following Song et al. (2019), the authors repurpose it for score estimation. Using score matching, the authors can directly train a score network $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})$ to estimate $\\nabla_\\mathbf{x}\\log p_\\text{data}(\\mathbf{x})$ without training a model to estimate $p_\\text{data}(\\mathbf{x})$ first. 
Different from the typical usage of score matching, the authors opt not to use the gradient of an energy-based model as the score network, to avoid extra computation due to higher-order gradients.\nThe objective minimizes $\\frac{1}{2}\\mathbb{E}_{p_\\text{data}(\\mathbf{x})}\\left[\\left \\| \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}) - \\nabla_\\mathbf{x} \\log p_\\text{data}(\\mathbf{x}) \\right \\|^2_2 \\right]$, which can be shown equivalent to the following up to a constant $$ \\mathbb{E}_{p_\\text{data}(\\mathbf{x})}\\left[ \\text{tr}(\\nabla _\\mathbf{x} \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})) + \\frac{1}{2} \\left \\| \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}) \\right \\|^2_2 \\right] \\tag{1} $$ where $\\nabla _\\mathbf{x} \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})$ denotes the Jacobian of $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})$. However, score matching is not scalable to deep networks and high dimensional data due to the computation of $\\text{tr}(\\nabla _\\mathbf{x} \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}))$. Below, the authors discuss two popular methods for large scale score matching.\nDenoising Score Matching This is the main method used by the authors in the methodology below.\nDenoising score matching (Pascal Vincent, 2011) is a variant of score matching that completely circumvents $\\text{tr}(\\nabla _\\mathbf{x} \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}))$. It first perturbs the data point $\\mathbf{x}$ with a pre-specified noise distribution $q_\\sigma(\\tilde{\\mathbf{x}} \\mid \\mathbf{x})$ and then employs score matching to estimate the score of the perturbed data distribution $q_\\sigma(\\tilde{\\mathbf{x}}) \\triangleq \\int q_\\sigma (\\tilde{\\mathbf{x}} \\mid \\mathbf{x}) p_\\text{data}(\\mathbf{x}) \\mathrm{d}\\mathbf{x}$. 
The objective was proved equivalent to the following: $$ \\frac{1}{2}\\mathbb{E}_{q_\\sigma(\\tilde{\\mathbf{x}} \\mid \\mathbf{x}) p_\\text{data}(\\mathbf{x}) } \\left [ \\| \\mathbf{s}_\\mathbf{\\theta}(\\tilde{\\mathbf{x}}) - \\nabla_{\\tilde{\\mathbf{x}}} \\log q_\\sigma (\\tilde{\\mathbf{x}} \\mid \\mathbf{x}) \\|^2_2 \\right ] \\tag{2} $$ However, $\\mathbf{s}_\\mathbf{\\theta}^{*} = \\nabla_\\mathbf{x} \\log q_\\sigma(\\mathbf{x}) \\approx \\nabla_\\mathbf{x} \\log p_{\\text{data}}(\\mathbf{x})$ is true only when the noise is small enough such that $q_\\sigma(\\mathbf{x}) \\approx p_\\text{data}(\\mathbf{x})$.\nSliced Score Matching Sliced score matching (Song et al. 2019) uses random projections to approximate $\\text{tr}(\\nabla _\\mathbf{x} \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}))$ in score matching. The objective is $$ \\mathbb{E}_{p_\\mathbf{v}}\\mathbb{E}_{p_\\text{data}} \\left [ \\mathbf{v}^\\top \\nabla_\\mathbf{x}\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})\\mathbf{v} + \\frac{1}{2} \\| \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}) \\|^2_2 \\right ] \\tag{3} $$ where $p_\\mathbf{v}$ is a simple distribution of random vectors, e.g., the multivariate standard normal. The term $\\mathbf{v}^\\top \\nabla_\\mathbf{x}\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})\\mathbf{v}$ can be efficiently computed by forward mode auto-differentiation. Unlike denoising score matching, which estimates the scores of perturbed data, sliced score matching provides score estimation for the original unperturbed data distribution, but requires around four times more computation due to the forward mode auto-differentiation.\nSampling with Langevin Dynamics Langevin dynamics can produce samples from a probability density $p(\\mathbf{x})$ using only the score function $\\nabla_\\mathbf{x} \\log p(\\mathbf{x})$. 
Given a fixed step size $\\epsilon \u0026gt; 0$, and an initial value $\\tilde{\\mathbf{x}}_0 \\sim \\pi(\\mathbf{x})$ with $\\pi$ being an (arbitrary) prior distribution, the Langevin method recursively computes the following $$ \\tilde{\\mathbf{x}}_t = \\tilde{\\mathbf{x}}_{t-1} + \\frac{\\epsilon}{2} \\nabla_\\mathbf{x} \\log p(\\tilde{\\mathbf{x}}_{t-1}) + \\sqrt{\\epsilon} \\mathbf{z}_t \\tag{4} $$ where $\\mathbf{z}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$. The distribution of $\\tilde{\\mathbf{x}}_T$ equals $p(\\mathbf{x})$ when $\\epsilon \\to 0$ and $T \\to \\infty$, in which case $\\tilde{\\mathbf{x}}_T$ becomes an exact sample from $p(\\mathbf{x})$ under some regularity conditions. When $\\epsilon \u0026gt; 0$ and $T \u0026lt; \\infty$, a Metropolis-Hastings update is needed to correct the error of Eq. (4), but it can often be ignored in practice. In this work, the authors assume this error is negligible when $\\epsilon$ is small and $T$ is large.\nThe authors then give the goal of, and the reason for, training a score network. Sampling from Eq. (4) only requires the score function $\\nabla_\\mathbf{x} \\log p(\\mathbf{x})$. Therefore, in order to obtain samples from $p_\\text{data}(\\mathbf{x})$, the authors first train a score network such that $\\mathbf{s}_\\theta(\\mathbf{x}) \\approx \\nabla_\\mathbf{x} \\log p_\\text{data}(\\mathbf{x})$ and then approximately obtain samples with Langevin dynamics using $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x})$. This is the key idea of the framework of score-based generative modeling.\nChallenges of Low Data Density Regions In regions of low data density, score matching may not have enough evidence to estimate score functions accurately, due to the lack of data samples. When sampling with Langevin dynamics, the initial sample is highly likely to lie in a low density region when the data reside in a high dimensional space. 
Therefore, having an inaccurate score-based model will derail Langevin dynamics from the very beginning of the procedure, preventing it from generating high quality samples that are representative of the data.\nFig. 1. Estimated scores are only accurate in high density regions. (Image source: Yang Song\u0026rsquo;s blog, 2021)\nThe authors\u0026rsquo; solution is to perturb data points with noise and train score-based models on the noisy data points instead. When the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores. For example, here is what happens when a mixture of two Gaussians is perturbed by additional Gaussian noise.\nFig. 2. Estimated scores are accurate everywhere for the noise-perturbed data distribution due to reduced low data density regions. (Image source: Yang Song\u0026rsquo;s blog, 2021)\nYet another question remains: how to choose an appropriate noise scale for the perturbation process? Larger noise can obviously cover more low density regions for better score estimation, but it over-corrupts the data and alters it significantly from the original distribution. Smaller noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low density regions as well as we would like. To achieve the best of both worlds, the authors use multiple scales of noise perturbations simultaneously.\nNoise Conditional Score Networks The approach has two parts: (1) perturbing the data using various levels of noise; (2) simultaneously estimating the scores corresponding to all noise levels by training a single conditional score network. After training, when using Langevin dynamics to generate samples, we initially use scores corresponding to large noise, and gradually anneal down the noise level. Note that the network is conditioned only on the noise level; the image generation task itself remains unconditional. 
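The recipe just described (several noise levels, one noise-conditional score function, sampling that anneals from large to small noise) can be tried on a toy problem. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the data are a 1-D mixture of two Gaussians, the exact score of the noise-perturbed mixture stands in for a trained NCSN, the names geometric_sigmas, perturbed_score and annealed_langevin are hypothetical, and the step size follows the rule alpha_i = eps * sigma_i^2 / sigma_L^2 used by annealed Langevin dynamics in Song and Ermon (2019).

```python
import numpy as np

def geometric_sigmas(sigma_1, sigma_L, L):
    # Noise levels forming a geometric sequence with
    # sigma_1 / sigma_2 = ... = sigma_{L-1} / sigma_L > 1.
    return np.geomspace(sigma_1, sigma_L, L)

def perturbed_score(x, sigma, means=(-4.0, 4.0), weights=(0.5, 0.5), std=0.5):
    # Exact score of a 1-D Gaussian mixture after adding N(0, sigma^2) noise.
    # Assumption: in a real NCSN this is the trained network s_theta(x, sigma).
    var = std ** 2 + sigma ** 2
    means = np.asarray(means)
    weights = np.asarray(weights)
    diff = x[:, None] - means[None, :]
    # Posterior responsibility of each component (computed stably in log space).
    log_w = np.log(weights)[None, :] - 0.5 * diff ** 2 / var
    log_w -= log_w.max(axis=1, keepdims=True)
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)
    return -(resp * diff).sum(axis=1) / var

def annealed_langevin(score_fn, sigmas, n_samples=2000, steps=100, eps=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a fixed prior, here uniform noise.
    x = rng.uniform(-8.0, 8.0, size=n_samples)
    for sigma in sigmas:
        # Step size shrinks together with the noise level.
        alpha = eps * sigma ** 2 / sigmas[-1] ** 2
        for _ in range(steps):
            z = rng.standard_normal(n_samples)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

sigmas = geometric_sigmas(10.0, 0.1, 10)
samples = annealed_langevin(perturbed_score, sigmas)
print(np.mean(samples > 0), np.mean(np.abs(samples)))
```

With these toy settings the samples should split roughly evenly between the two modes near -4 and 4; the printed fraction and mean magnitude are a quick sanity check rather than a quantitative result.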
Define Noise Conditional Score Networks Let $ \\{ \\sigma_i \\} _{i=1}^L $ be a positive geometric sequence that satisfies $\\frac{\\sigma_1}{\\sigma_2} = \\cdots = \\frac{\\sigma_{L-1}}{\\sigma_{L}} \u0026gt; 1$.\nLet $q_\\sigma(\\mathbf{x}) \\triangleq \\int p_\\text{data}(\\mathbf{t}) \\mathcal{N}(\\mathbf{x} \\mid \\mathbf{t}, \\sigma^2 \\mathbf{I}) \\mathrm{d} \\mathbf{t}$ denote the perturbed data distribution.\nThe authors choose the noise levels $\\{ \\sigma_i \\}_{i=1}^L$ such that $\\sigma_1$ is large enough to mitigate the difficulties discussed above (inaccurate score estimates in low data density regions), and $\\sigma_L$ is small enough to minimize the effect on data.\nThe authors aim to train a conditional score network to jointly estimate the scores of all perturbed data distributions, i.e., $\\forall \\sigma \\in \\{ \\sigma_i \\}_{i=1}^L : \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma) \\approx \\nabla_\\mathbf{x} \\log q_\\sigma(\\mathbf{x})$. Note that $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma) \\in \\mathbb{R}^D$ when $\\mathbf{x} \\in \\mathbb{R}^D$. The authors call $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma)$ a Noise Conditional Score Network (NCSN).\nTraining NCSNs via score matching Both sliced and denoising score matching can train NCSNs. The authors adopt denoising score matching as it is slightly faster and naturally fits the task of estimating scores of noise-perturbed data distributions.\nThe authors choose the noise distribution to be $q_\\sigma(\\tilde{\\mathbf{x}} \\mid \\mathbf{x}) = \\mathcal{N}(\\tilde{\\mathbf{x}} \\mid \\mathbf{x}, \\sigma^2\\mathbf{I})$; therefore $\\nabla_{\\tilde{\\mathbf{x}}} \\log q_\\sigma (\\tilde{\\mathbf{x}} \\mid \\mathbf{x}) = - \\frac{\\tilde{\\mathbf{x}} - \\mathbf{x}}{\\sigma^2}$. 
For a given $\\sigma$, the denoising score matching objective is $$ \\ell(\\mathbf{\\theta}; \\sigma) \\triangleq \\frac{1}{2} \\mathbb{E}_{p_\\text{data}(\\mathbf{x})} \\mathbb{E}_{\\tilde{\\mathbf{x}} \\sim \\mathcal{N}(\\mathbf{x}, \\sigma^2\\mathbf{I})} \\left [ \\left \\| \\mathbf{s}_\\mathbf{\\theta}(\\tilde{\\mathbf{x}}, \\sigma) + \\frac{\\tilde{\\mathbf{x}} - \\mathbf{x}}{\\sigma^2} \\right \\|^2_2 \\right ] \\tag{5} $$ Then, the authors combine Eq. (5) for all $\\sigma \\in \\{ \\sigma_i \\}_{i=1}^L$ to get one unified objective $$ \\mathcal{L}(\\mathbf{\\theta}; \\{ \\sigma_i \\}_{i=1}^L) \\triangleq \\frac{1}{L} \\sum_{i=1}^L \\lambda(\\sigma_i) \\ell(\\mathbf{\\theta}; \\sigma_i) \\tag{6} $$ where $\\lambda(\\sigma_i) \u0026gt; 0$ is a coefficient function depending on $\\sigma_i$. Assuming $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma)$ has enough capacity, $\\mathbf{s}_\\mathbf{\\theta}^*(\\mathbf{x}, \\sigma)$ minimizes Eq. (6) iff $\\mathbf{s}_\\mathbf{\\theta}^*(\\mathbf{x}, \\sigma_i) = \\nabla_\\mathbf{x} \\log q_{\\sigma_i}(\\mathbf{x})$ a.s. for all $i \\in \\{ 1, 2, \\cdots, L \\}$, because Eq. (6) is a conical combination of $L$ denoising score matching objectives.\nHere, iff means if and only if, and a.s. means almost surely.\nThere can be many possible choices of $\\lambda(\\cdot)$. Ideally, the authors hope that the values of $\\lambda(\\sigma_i)\\ell(\\mathbf{\\theta};\\sigma_i)$ for all $\\{ \\sigma_i \\}_{i=1}^L$ are roughly of the same order of magnitude. Empirically, the authors observe that when the score networks are trained to optimality, approximately $\\| \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma) \\|_2 \\propto \\frac{1}{\\sigma}$. This inspires the authors to choose $\\lambda(\\sigma) = \\sigma^2$. 
Under this choice, we have $\\lambda(\\sigma)\\ell(\\mathbf{\\theta};\\sigma) = \\sigma^2 \\ell(\\mathbf{\\theta}; \\sigma) = \\frac{1}{2} \\mathbb{E} [ \\| \\sigma \\mathbf{s}_\\mathbf{\\theta}(\\tilde{\\mathbf{x}}, \\sigma) + \\frac{\\tilde{\\mathbf{x}} - \\mathbf{x}}{\\sigma} \\|_2^2 ]$. Since $\\frac{\\tilde{\\mathbf{x}} - \\mathbf{x}}{\\sigma} \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$ and $\\| \\sigma \\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma) \\|_2 \\propto 1$, the authors conclude that the order of magnitude of $\\lambda(\\sigma)\\ell(\\mathbf{\\theta};\\sigma)$ does not depend on $\\sigma$.\nThe specific benefit of this is not stated by the authors in the original article, but I think it is to standardize the magnitude across noise levels, so that each single loss function (Eq. (5)), computed after perturbing the data with a different level of noise, has the same weight in the overall loss function (Eq. (6)). In other words, the supervision for matching scores is weighted equally across all noise levels at training time.\nNCSN inference via annealed Langevin dynamics After the NCSN $\\mathbf{s}_\\mathbf{\\theta}(\\mathbf{x}, \\sigma)$ is trained, the authors propose a sampling approach: annealed Langevin dynamics (Fig. 3).\nFig. 3. Algorithm of annealed Langevin dynamics. (Algorithm source: Song \u0026amp; Ermon, 2019)\nThis algorithm is inspired by simulated annealing and annealed importance sampling. It starts annealed Langevin dynamics by initializing the samples from some fixed prior distribution, e.g., uniform noise. Then it runs Langevin dynamics to sample from $q_{\\sigma_1}(\\mathbf{x})$ with step size $\\alpha_1$. Next, it runs Langevin dynamics to sample from $q_{\\sigma_2}(\\mathbf{x})$, starting from the final samples of the previous simulation and using a reduced step size $\\alpha_2$. 
The authors continue in this fashion, using the final samples of Langevin dynamics for $q_{\\sigma_{i-1}}(\\mathbf{x})$ as the initial samples of Langevin dynamics for $q_{\\sigma_i}(\\mathbf{x})$, and tuning down the step size $\\alpha_i$ gradually with $\\alpha_i = \\epsilon \\cdot \\sigma_i^2 / \\sigma_L^2$. Finally, they run Langevin dynamics to sample from $q_{\\sigma_L}(\\mathbf{x})$, which is close to $p_\\text{data}(\\mathbf{x})$ when $\\sigma_L \\approx 0$.\nResult and Conclusion The authors conducted quantitative tests with excellent results, but of more interest in this article is the theoretical foundation of the Score-Based Generative Model; much of the knowledge and many of the assumptions in this article were reused in Yang Song\u0026rsquo;s subsequent diffusion work.\nAs an unconditional model, the proposed model achieves a state-of-the-art inception score of 8.87, which is even better than most reported values for class-conditional generative models.\nTable. 1. Inception and FID scores for CIFAR-10. (Table source: Song \u0026amp; Ermon, 2019)\nReferences [1] Yang Song \u0026amp; Stefano Ermon. \u0026ldquo;Generative Modeling by Estimating Gradients of the Data Distribution.\u0026rdquo; NeurIPS 2019.\n[2] Aapo Hyvärinen. \u0026ldquo;Estimation of Non-Normalized Statistical Models by Score Matching.\u0026rdquo; JMLR 2005.\n[3] Yang Song et al. \u0026ldquo;Sliced Score Matching: A Scalable Approach to Density and Score Estimation.\u0026rdquo; Uncertainty in Artificial Intelligence 2019.\n[4] Pascal Vincent. \u0026ldquo;A Connection Between Score Matching and Denoising Autoencoders.\u0026rdquo; Neural Computation 2011.\n","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-03/","summary":"This paper introduces a new generative model where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. 
It is also important background for understanding score-based generative networks and Ito diffusion SDEs.","title":"[NeurIPS'19 Oral] Generative Modeling by Estimating Gradients of the Data Distribution Reading Report"},{"content":" This paper was published in TPAMI 2023. Overview Problem In the field of image super-resolution, existing approaches often suffer from various limitations; e.g., autoregressive models are prohibitively expensive for high-resolution image generation, Normalizing Flows (NFs) and variational autoencoders (VAEs) often yield sub-optimal sample quality, and GANs require carefully designed regularization and optimization tricks to tame optimization instability and mode collapse. Solution Present SR3, an approach to image super-resolution via repeated refinement based on DDPM. Results The high-frequency information of the image can be well restored compared to other methods. Despite mediocre performance in SSIM and PSNR metrics, visualization and consistency are good. Related Work Diffusion Probabilistic Models I\u0026rsquo;ve written a blog about diffusion probabilistic models (DPM). It has the derivation of the basic formulas of the DPM as well as a simple code implementation.\nA Brief Exploration to Diffusion Probabilistic Models with Code Implementation. Method Fig. 1. The forward diffusion process $q$ (left to right) gradually adds Gaussian noise to the target image. The reverse inference process $p$ (right to left) iteratively denoises the target image conditioned on a source image x. (Image source: Saharia et al. 2023) SR3 is a model obtained by improving on DDPM. Instead of randomly generating images, low resolution images are used as conditions to generate images. The main changes in SR3 are:\nThe low resolution image is concatenated to the original input (x_t-1) after bicubic interpolation to get a 6-channel tensor as the new input to the DDPM. The authors experimented with more sophisticated methods of conditioning, such as using FiLM (Perez et al. 
2018), but found that the simple concatenation yielded similar generation quality.\nInstead of using the $\\bar{\\alpha}_t$ determined by the timestep $t$ to compute the related variables and the loss, a random value is sampled from the distribution $\\bar{\\alpha} \\sim p(\\bar{\\alpha}) = U(\\bar{\\alpha}_{t-1}, \\bar{\\alpha}_{t})$. (Section 2.4 in Saharia et al. 2023) The model receives the noise level $\\bar{\\alpha}_t$ directly instead of the timestep $t$. This allows flexibility in adjusting the noise level and the number of sampling steps during inference. Experimental Study New metric: Consistency As a measure of the consistency of the super-resolution outputs, the authors compute the MSE between the downsampled outputs and the low resolution inputs.\nNew metric: Classification Accuracy In the field of low-level vision, metrics often do not comprehensively represent the quality of images. Therefore the effectiveness of low-level models is often evaluated in terms of proxy tasks.\nThis paper mirrors the evaluation setup of Zhang et al. (2018) and applies 4$\\times$ super-resolution models to 56$\\times$56 center crops from the validation set of ImageNet.\nQuantitative Results Compared to PULSE (Menon et al. 2020), FSRGAN (Chen et al. 2018), and regression models, the results in terms of PSNR and SSIM are relatively average. This is because traditional super-resolution models are typically trained to optimize PSNR, which SR3 is not. Therefore, it is normal for these metrics to be relatively low. The consistency metric, on the other hand, is very good.\nTable 1. PSNR \u0026 SSIM on 16$\\times$16 $\\to$ 128$\\times$128 face super-resolution. Consistency measures MSE ($\\times10^{−5}$) between the low-resolution inputs and the down-sampled super-resolution outputs. (Table source: Saharia et al. 2023 as a screenshot) Table 2. Performance comparison between SR3 and Regression baseline on natural image super-resolution using standard metrics computed on the ImageNet validation set. 
(Table source: Saharia et al. 2023 as a screenshot) Evaluation of Proxy Task Object recognition baseline: ResNet-50 (He et al. 2016). Table 3. Comparison of classification accuracy scores for 4$\\times$ natural image super-resolution on the first 1K images from the ImageNet Validation set. (Table source: Saharia et al. 2023 as a screenshot) Human Evaluation (2AFC) This paper use a 2-alternative forced-choice (2AFC) paradigm to measure how well humans can discriminate true images from those generated from a model.\nFig. 2. Face super-resolution human fool rates (higher is better, photo-realistic samples yield a fool rate of 50%). Outputs of 4 models are compared against ground truth. (top) Subjects are shown low-resolution inputs. (bottom) Inputs are not shown. (Image source: Saharia et al. 2023) Fig. 3. ImageNet super-resolution fool rates (higher is better, photo-realistic samples yield a fool rate of 50%). SR3 and Regression outputs are compared against ground truth. (top) Subjects are shown low-resolution inputs. (bottom) Inputs are not shown. (Image source: Saharia et al. 2023) Visualization Fig. 3. Comparison of different methods on the 16$\\times$16 $\\to$ 128$\\times$128 face super-resolution task. Reference image has not been included because of privacy concerns. (Image source: Saharia et al. 2023) Fig. 4. Results of a SR3 model (64$\\times$64 $\\to$ 512$\\times$512), trained on FFHQ, and applied to images outside of the training set, along with enlarged patches to show finer details. (Image source: Saharia et al. 2023) Fig. 4 shows that the image obtained by SR3 has more details (high-frequency information of the image) compared to the regression model.\nSummary SR3 employs a completely novel approach to super-resolution, distinct from previous approaches based on GANs and CNNs. It primarily generates high-resolution images by denoising progressively from low resolution images conditioned on diffusion models. 
In the experimental section, the PSNR and SSIM metrics show relatively less impressive performance compared to other methods. However, it outperforms the Regression model in terms of FID and IS metrics, which would be more convincing if PULSE and FSRGAN were also evaluated. Personally, I find the consistency metric not very meaningful. Still, its remarkable performance in the proxy task compared to the Regression model is worth attention (though there is still a lack of experimental comparisons with PULSE and FSRGAN). The approach of using diffusion models for image super-resolution is effective, and there is potential for further research in the future.\nReference [1] Chitwan Saharia et al. \u0026ldquo;Image Super-Resolution via Iterative Refinement.\u0026rdquo; TPAMI 2023.\n[2] Ethan Perez et al. \u0026ldquo;FiLM: Visual Reasoning with a General Conditioning Layer.\u0026rdquo; AAAI 2018.\n[3] Yulun Zhang et al. \u0026ldquo;Image Super-Resolution Using Very Deep Residual Channel Attention Networks.\u0026rdquo; ECCV 2018.\n[4] Sachit Menon et al. \u0026ldquo;PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models.\u0026rdquo; CVPR 2020.\n[5] Yu Chen et al. \u0026ldquo;FSRNet: End-to-End Learning Face Super-Resolution With Facial Priors.\u0026rdquo; CVPR 2018.\n[6] Kaiming He et al. \u0026ldquo;Deep residual learning for image recognition.\u0026rdquo; CVPR 2016.\n","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-02/","summary":"Image super-resolution with conditional diffusion model.","title":"[T-PAMI'23] Image Super-Resolution via Iterative Refinement Reading Report"},{"content":" This paper was published in CVPR 2022. Overview Problem Image deblurring is an ill-posed problem, and most existing methods are ineffective because they produce a deterministic estimate of the clean image. Point-estimators that directly minimize a distortion loss suffer from the problem of \u0026ldquo;regression to the mean\u0026rdquo;. 
Solution Present a new framework for blind deblurring based on conditional diffusion models. Producing a diverse set of plausible reconstructions for a given input. Results A significant improvement in perceptual quality over existing state-of-the-art methods across multiple standard benchmarks. Much more efficient sampling compared to typical diffusion models. Challenging the widely used strategy of producing a single, deterministic reconstruction. Related Work Diffusion Probabilistic Models I\u0026rsquo;ve written a blog about diffusion probabilistic models (DPM). It has the derivation of the basic formulas of the DPM as well as a simple code implementation.\nA Brief Exploration to Diffusion Probabilistic Models with Code Implementation. The Perception-Distortion Tradeoff Fig. 1. The perception-distortion tradeoff. (Image source: Blau et al. 2018 with a few additional annotations) All models lying on the same curve have the same capability. When perceptual metrics get better (smaller y-axis), distortion metrics get worse (bigger x-axis), and vice versa. Usually non-GAN models tend to lie towards the upper left corner, while GAN models tend to lie towards the lower right corner.\nMethod This paper introduces a \u0026ldquo;predict-and-refine\u0026rdquo; conditional diffusion model, where a deterministic data-adaptive predictor is jointly trained with a stochastic sampler that refines the output of the said predictor (see Fig. 2).\nFig. 2. Diagram of dual-network architecture. (Image source: Jay et al. 2022) The method in this paper uses an initial predictor $g_\\theta$ to process the blurry image and obtain an initial prediction, and then models the residual between the ground truth and the initial prediction using a conditional diffusion model. Fig. 3. Diagram of U-Net architecture used for both the denoiser network and the initial predictor. Note that the input and output depicted here are for the denoiser network. (Image source: Jay et al. 
2022) The loss function of the predict-and-refine diffusion model in the paper is Eq. (6). $$ L_{\\text{Ours}}(\\theta) = \\mathbb{E} \\left \\| \\mathbf{\\epsilon} - f_\\theta \\left ( \\sqrt{\\bar{\\alpha}} (\\underbrace{ \\mathbf{x}_0 - g_\\theta({\\color{red} \\mathbf{x}_0})}_{\\text{residual}}) + \\sqrt{1 - \\bar{\\alpha}} \\epsilon, \\bar{\\alpha}, \\mathbf{y} \\right ) \\right \\| \\tag{6} $$ Caution! The red part of Eq. (6) is wrong. Here, $\\mathbf{x}_0$ and $g_\\theta$ stand for the ground truth and the initial predictor. The residual portion above the underbrace should be the residual between the ground truth and the initial prediction. Therefore, the input to $g_\\theta$ should be the blurry image $\\mathbf{y}$.\nExperimental Study Quantitative Results The \u0026ldquo;SA\u0026rdquo; suffix in the table stands for Sample Averaging, i.e. averaging over multiple samples. This operation can significantly improve the distortion metrics at the expense of the perception metrics.\nTable 1. Image deblurring results on the GoPro dataset (Nah et al. 2017). Best values and second-best values for each metric are color-coded. (Table source: Jay et al. 2022 as a screenshot) Table 2. Image deblurring results on the HIDE dataset (Shen et al. 2019). Best values and second-best values for each metric are color-coded. (Table source: Jay et al. 2022 as a screenshot) The model in this paper achieves state-of-the-art performance across all perceptual metrics while maintaining PSNR and SSIM competitive with existing methods.\nP-D tradeoff Inference steps (T): 10, 20, 30, 50, 100, 200, 300, 500. Fig. 4. Additional Perception-Distortion plots with respect to different metrics. Left column contains perceptual metrics vs. PSNR, and the right column contains SSIM comparisons. (Image source: Jay et al. 2022) Traversing the Perception-Distortion curve: the more sampling steps, the better the perceptual (subjective) quality; the fewer sampling steps, the better the distortion (objective) metrics.\nNetwork Architecture Ablation Study Table 3. 
Ablation study on the effects of various hyperparameters. (Table source: Jay et al. 2022 as a screenshot) As the results show, all three hyperparameters were critical to the model\u0026rsquo;s performance.\nSummary The diffusion model is successfully applied to the deblurring task, making the generation of clear images a stochastic process. Good results were achieved in the reconstruction quality perceived by humans, and distortion metrics such as PSNR are comparable to existing methods. The biggest highlight is the idea of modeling the residual, which makes inference faster and, together with the initial prediction, achieves good results.\nReferences [1] Whang Jay et al. \u0026ldquo;Deblurring via Stochastic Refinement.\u0026rdquo; CVPR 2022.\n[2] Yochai Blau \u0026amp; Tomer Michaeli. \u0026ldquo;The Perception-Distortion Tradeoff.\u0026rdquo; CVPR 2018.\n[3] Seungjun Nah et al. \u0026ldquo;Deep Multi-Scale Convolutional Neural Network for Dynamic Scene Deblurring.\u0026rdquo; CVPR 2017.\n[4] Ziyi Shen et al. \u0026ldquo;Human-Aware Motion Deblurring.\u0026rdquo; ICCV 2019.\n","permalink":"https://gavinsun0921.github.io/posts/fast-paper-reading-01/","summary":"Image deblurring with \u0026ldquo;predict-and-refine\u0026rdquo; conditional diffusion model. A brand new strategy for ill-posed problem.","title":"[CVPR'22] Deblurring via Stochastic Refinement Reading Report"},{"content":"This is the first post in the Paper Research series. In this series I will continue to update some personal study notes on reading papers. This post will introduce the basic work of diffusion probabilistic models (DPM), including the derivation of formulas and simple code verification. The content is mainly from Sohl-Dickstein et al. (2015) and Ho et al. (2020). If you have any suggestions on this post or would like to communicate with me, please leave comments below.\nDiffusion Models What are Diffusion Models? 
According to Weng (2021):\nDiffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise.\nFig. 1. Framework Diagram of Diffusion Models.\nMy personal understanding of Diffusion Models is that they are a framework (Fig. 1) in which the forward process has no trainable parameters and the reverse process has trainable parameters. There is also no restriction on the type of neural network used to implicitly express the distribution needed in the reverse process.\nFig. 2. Flowchart of Diffusion Models. (Image source: Das, 2021)\nForward Process The forward process is a Markov process. According to Wikipedia, a Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, \u0026ldquo;What happens next depends only on the state of affairs now.\u0026rdquo;\nThe main goal of the forward process is to gradually convert the data distribution $q(\\mathbf{x}_0)$ into an analytically tractable distribution $\\pi(\\mathbf{y})$ by repeated application of a Markov diffusion kernel $T_\\pi(\\mathbf{y}|\\mathbf{y}'; \\beta)$ for $\\pi(\\mathbf{y})$, where $\\beta$ is the diffusion rate,\n$$\\pi(\\mathbf{y}) = \\int T_\\pi(\\mathbf{y}|\\mathbf{y}'; \\beta) \\pi(\\mathbf{y}') \\mathrm{d}\\mathbf{y}' \\tag{1}$$ $$q(\\mathbf{x}_t|\\mathbf{x}_{t-1}) = T_\\pi(\\mathbf{x}_t|\\mathbf{x}_{t-1}; \\beta_t) \\tag{2}$$ The forward trajectory (joint distribution), corresponding to starting at the data distribution and performing $T$ steps of diffusion, is thus\n$$q(\\mathbf{x}_0, \\mathbf{x}_1, \\cdots, \\mathbf{x}_T) = q(\\mathbf{x}_{(0\\cdots T)}) = q(\\mathbf{x}_0)\\prod_{t=1}^T q(\\mathbf{x}_t | \\mathbf{x}_{t-1}) \\tag{3}$$ Fig. 3. 
Illustration of forward (diffusion) trajectory.\nReverse Process The reverse process is also a Markov process. If we can reverse the above process and sample from $q(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$, we will be able to recreate the true sample from a Gaussian noise input, $\\mathbf{x}_T \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$. Unfortunately, we cannot easily estimate $q(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$, and therefore we need to learn a model $p_\\theta$ to approximate these conditional probabilities in order to run the reverse process. We want $p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$ to approximate $q(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$ as closely as possible for all $t$.\nThe generative distribution $p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$ will be trained to describe the same trajectory (also a joint distribution), but in reverse,\n$$p_\\theta (\\mathbf{x}_T) = \\pi(\\mathbf{x}_T) \\tag{4}$$ $$ p_\\theta(\\mathbf{x}_0, \\mathbf{x}_1, \\cdots, \\mathbf{x}_T) = p_\\theta(\\mathbf{x}_{(0\\cdots T)}) = p_\\theta(\\mathbf{x}_T)\\prod_{t=1}^T p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}) \\tag{5}$$ Generative Model The forward trajectory (joint distribution): image to noise. The reverse trajectory (joint distribution): noise to image. The Model Probability (marginal distribution): the probability the generative model assigns to the data. The probability the generative model assigns to the data is $$p_\\theta(\\mathbf{x}_0) = \\int \\int \\cdots \\int p_\\theta(\\mathbf{x}_0, \\mathbf{x}_1, \\cdots, \\mathbf{x}_T) \\mathrm{d}\\mathbf{x}_1 \\mathrm{d}\\mathbf{x}_2 \\cdots \\mathrm{d}\\mathbf{x}_T \\tag{6}$$ For convenience, we simply denote it as: $$p_\\theta(\\mathbf{x}_0) = \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ p_\\theta(\\mathbf{x}_{(0\\cdots T)}) \\tag{7}$$\nBut the integral in Eq. (7) is intractable! We can handle it in a way similar to the techniques used for VAEs. 
Taking a cue from annealed importance sampling and the Jarzynski equality, we instead evaluate the relative probability of the forward and reverse trajectories, averaged over forward trajectories,\n$$ \\begin{equation*} \\begin{split} p_\\theta(\\mathbf{x}_0) \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ p_\\theta(\\mathbf{x}_{(0\\cdots T)}) \\frac{q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0)}{q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0)} \\\\ \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\frac{\\color{red} p_\\theta(\\mathbf{x}_{(0\\cdots T)})}{\\color{blue} q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0)} \\\\ \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\frac{\\color{red} p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{\\color{blue} \\frac{q(\\mathbf{x}_0, \\mathbf{x}_{(1\\cdots T)})}{q(\\mathbf{x}_0)}} \\\\ \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\frac{p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{\\color{blue} \\frac{q(\\mathbf{x}_0) \\prod_{t=1}^T q(\\mathbf{x}_t | \\mathbf{x}_{t-1})}{q(\\mathbf{x}_0)}} \\\\ \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\frac{p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{\\prod_{t=1}^T q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\\\ \u0026= \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\end{split} \\end{equation*} \\tag{8} $$ Model Log Likelihood We want the estimated data distribution ($p_\\theta(\\mathbf{x}_0)$) to be 
as close as possible to the actual data distribution ($q(\\mathbf{x}_0)$). So training amounts to maximizing the model log likelihood, $$ \\begin{equation*} \\begin{split} \\mathcal{L} \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_0 \\ q(\\mathbf{x}_0) \\log p_\\theta (\\mathbf{x}_0) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_0 \\ q(\\mathbf{x}_0) { \\log \\left [ \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_0 \\ q(\\mathbf{x}_0) {\\color{blue} \\log \\left \\{\\mathbb{E}_{\\mathbf{x}_{(1\\cdots T)} \\sim q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0)} \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\right \\}} \\\\ \u0026amp;\\geq \\int \\mathrm{d}\\mathbf{x}_0 \\ q(\\mathbf{x}_0) {\\color{blue} \\mathbb{E}_{\\mathbf{x}_{(1\\cdots T)} \\sim q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0)} \\left \\{ \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\right \\}} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_0 \\ q(\\mathbf{x}_0) \\int \\mathrm{d}\\mathbf{x}_{(1\\cdots T)} \\ q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_0) q(\\mathbf{x}_{(1\\cdots T)} | \\mathbf{x}_0) \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ 
q(\\mathbf{x}_{(0\\cdots T)}) \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\\\ \\end{split} \\end{equation*} \\tag{9} $$\nThe blue part in Eq. (9) follows from Jensen\u0026rsquo;s inequality, as illustrated in Fig. 4.\nFig. 4. Visualization of Jensen's inequality in logarithmic function. This gives a lower bound on $\\mathcal{L}$, which we denote by $$\\mathcal{K} = \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\tag{10}$$\n1) Peel off $p_\\theta(\\mathbf{x}_T)$ in $\\mathcal{K}$ as an entropy\nWe can peel off the contribution from $p_\\theta(\\mathbf{x}_T)$, and rewrite it as an entropy, $$ \\begin{equation*} \\begin{split} \\mathcal{K} \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\log \\left [ p_\\theta(\\mathbf{x}_T) \\prod_{t=1}^T \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\left \\{ \\log p_\\theta(\\mathbf{x}_T) + \\sum_{t=1}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] \\right \\} } \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\sum_{t=1}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} + \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\log p_\\theta(\\mathbf{x}_T)} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) { \\sum_{t=1}^T \\log 
\\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} + {\\color{red} \\int \\mathrm{d}\\mathbf{x}_T \\ q(\\mathbf{x}_T) \\log \\underbrace{p_\\theta(\\mathbf{x}_T)}_{{\\normalsize \\pi}(\\mathbf{x}_T)}} \\\\ \\end{split} \\end{equation*} \\tag{11} $$\nBy design, the cross entropy to $\\pi(\\mathbf{x}_T)$ is constant under our diffusion kernels, and equal to the entropy of $p_\\theta(\\mathbf{x}_T)$. Therefore, $$ \\mathcal{K} = \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=1}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] {\\color{red} - \\ \\mathcal{H}_p(\\mathbf{x}_T)} \\tag{12} $$\n2) Remove the edge effect at $t=0$\nIn order to avoid edge effects, we set the final step of the reverse trajectory to be identical to the corresponding forward diffusion step, $$p_\\theta(\\mathbf{x}_0 | \\mathbf{x}_1) = q(\\mathbf{x}_1 | \\mathbf{x}_0) \\frac{\\pi(\\mathbf{x}_{0})}{\\pi(\\mathbf{x}_{1})} = T_\\pi(\\mathbf{x}_0 | \\mathbf{x}_1 ; \\beta) \\tag{13}$$\nWe then use this equivalence to remove the contribution of the first time-step in the sum, $$ \\begin{equation*} \\begin{split} \\mathcal{K} \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\sum_{t=1}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ]} + \\underbrace{\\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) {\\color{blue} \\log \\frac{p_\\theta(\\mathbf{x}_0 | \\mathbf{x}_1)}{q(\\mathbf{x}_1 | \\mathbf{x}_0)}}}_{ {\\large 
\\mathbb{E}}_{\\mathbf{x}_{(0\\cdots T)} \\sim q(\\mathbf{x}_{(0\\cdots T)})} {\\normalsize \\log \\frac{p_\\theta(\\mathbf{x}_0 | \\mathbf{x}_1)}{q(\\mathbf{x}_1 | \\mathbf{x}_0)}}} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] + \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\frac{\\color{red} p_\\theta(\\mathbf{x}_0 | \\mathbf{x}_1)}{q(\\mathbf{x}_1 | \\mathbf{x}_0)} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] + \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\frac{\\color{red} q(\\mathbf{x}_1 | \\mathbf{x}_0) \\pi(\\mathbf{x}_0)}{q(\\mathbf{x}_1 | \\mathbf{x}_0) {\\color{red} \\pi(\\mathbf{x}_1)}} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] + {\\color{green} \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\frac{\\pi(\\mathbf{x}_0)}{\\pi(\\mathbf{x}_1)}} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \\end{split} \\end{equation*} \\tag{14} $$\nFor ease of presentation, the green part of Eq. 
(14) is derived separately, $$ \\begin{equation*} \\begin{split} {\\color{green} \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\frac{\\pi(\\mathbf{x}_0)}{\\pi(\\mathbf{x}_1)}} \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\left [ \\log \\pi(\\mathbf{x}_0) - \\log \\pi(\\mathbf{x}_1) \\right ] \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\pi(\\mathbf{x}_0) - \\int \\mathrm{d}\\mathbf{x}_{(0, 1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log \\pi(\\mathbf{x}_1) \\\\ \u0026amp;= {\\color{red} \\int \\mathrm{d}\\mathbf{x}_{0} \\ q(\\mathbf{x}_{0}) \\log \\pi(\\mathbf{x}_0)} - {\\color{red} \\int \\mathrm{d}\\mathbf{x}_{1} \\ q(\\mathbf{x}_{1}) \\log \\pi(\\mathbf{x}_1)} \\\\ \u0026amp;= - {\\color{red} \\mathcal{H}_p(\\mathbf{x}_T)} + {\\color{red} \\mathcal{H}_p(\\mathbf{x}_T)} \\\\ \u0026amp;= 0 \\\\ \\end{split} \\end{equation*} \\tag{15} $$\nwhere we again used the fact that by design $\\color{red} -\\int \\mathrm{d}\\mathbf{x}_t \\ q(\\mathbf{x}_t) \\log \\pi(\\mathbf{x}_t) = \\mathcal{H}_p(\\mathbf{x}_T)$ is a constant for all $t$.\nTherefore, the lower bound in Eq. (14) becomes $$\\mathcal{K} = \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_t | \\mathbf{x}_{t-1})} \\right ] - \\mathcal{H}_p(\\mathbf{x}_T) \\tag{16}$$\n3) Rewrite in terms of $q(\\mathbf{x}_{t-1} | \\mathbf{x}_t)$\nBecause the forward trajectory is a Markov process, $$ \\begin{equation*} q(\\mathbf{x}_t | \\mathbf{x}_{t-1}) = \\left \\{ \\begin{matrix} q(\\mathbf{x}_t | \\mathbf{x}_{t-1}, \\mathbf{x}_0) \u0026amp; , t \u0026gt; 1 \\\\ q(\\mathbf{x}_1 | \\mathbf{x}_{0}, \\mathbf{x}_0) = q(\\mathbf{x}_1 | \\mathbf{x}_{0}) \u0026amp; , t = 1 \\end{matrix} \\right . 
\\end{equation*} \\tag{17} $$\n$$ \\mathcal{K} = \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{\\color{blue} q(\\mathbf{x}_t | \\mathbf{x}_{t-1}, \\mathbf{x}_0)} \\right ] - \\mathcal{H}_p(\\mathbf{x}_T) \\tag{18} $$\nUsing Bayes’ rule we can rewrite this in terms of a posterior and marginals from the forward trajectory, $$ \\mathcal{K} = \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{\\color{blue} q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\frac{\\color{blue} q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{\\color{blue} q(\\mathbf{x}_{t} | \\mathbf{x}_0)} \\right ] - \\mathcal{H}_p(\\mathbf{x}_T) \\tag{19} $$\n4) Rewrite in terms of KL divergences and entropies\n$$ \\begin{equation*} \\begin{split} \\mathcal{K} \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\left [ \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)} \\right ] - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\left [ \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} + \\log \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)} \\right ] - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\\\ \u0026amp;\\quad + {\\color{green} \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ 
q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)}} - \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \\end{split} \\end{equation*} \\tag{20} $$\nFor ease of presentation, the green part of Eq. (20) is derived separately, $$ \\begin{equation*} \\begin{split} {\\color{green} \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)}} \u0026amp;\\ {\\color{green} q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)}} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log {\\color{blue} \\prod_{t=2}^T \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)}} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log {\\color{blue} \\frac{q(\\mathbf{x}_{1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{T} | \\mathbf{x}_0)}} \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\left [ \\log q(\\mathbf{x}_{1} | \\mathbf{x}_0) - \\log q(\\mathbf{x}_{T} | \\mathbf{x}_0) \\right ] \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log q(\\mathbf{x}_{1} | \\mathbf{x}_0) - \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log q(\\mathbf{x}_{T} | \\mathbf{x}_0) \\\\ \u0026amp;= {\\color{red} \\int \\mathrm{d}\\mathbf{x}_{(0,1)} \\ q(\\mathbf{x}_0, \\mathbf{x}_1) \\log q(\\mathbf{x}_{1} | \\mathbf{x}_0)} - {\\color{red} \\int \\mathrm{d}\\mathbf{x}_{(0,T)} \\ q(\\mathbf{x}_0, \\mathbf{x}_T) \\log q(\\mathbf{x}_{T} | \\mathbf{x}_0)} \\\\ \u0026amp;= {\\color{red} \\mathcal{H}_q(\\mathbf{x}_T | \\mathbf{x}_0)} - {\\color{red} \\mathcal{H}_q(\\mathbf{x}_1 | \\mathbf{x}_0)} \\quad ; \\text{(conditional entropy)} \\end{split} \\end{equation*} \\tag{21} $$\nTherefore, the lower bound in Eq. 
(20) becomes $$ \\mathcal{K} = {\\color{brown} \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)}} + \\mathcal{H}_q(\\mathbf{x}_T | \\mathbf{x}_0) - \\mathcal{H}_q(\\mathbf{x}_1 | \\mathbf{x}_0) - \\mathcal{H}_p(\\mathbf{x}_T) \\tag{22} $$\nFor ease of presentation, the brown part of Eq. (22) is derived separately, $$ \\begin{equation*} \\begin{split} {\\color{brown} \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)}} \u0026amp; \\ {\\color{brown} q(\\mathbf{x}_{(0\\cdots T)}) \\sum_{t=2}^T \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)}} \\\\ \u0026amp;= \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{(0\\cdots T)} \\ q(\\mathbf{x}_{(0\\cdots T)}) \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\\\ \u0026amp;= \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t-1}\\mathrm{d}\\mathbf{x}_{t} \\ {\\color{blue} q(\\mathbf{x}_0, \\mathbf{x}_{t-1}, \\mathbf{x}_t)} \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\\\ \u0026amp;= \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t-1}\\mathrm{d}\\mathbf{x}_{t} \\ {\\color{blue} q(\\mathbf{x}_0, \\mathbf{x}_t) q(\\mathbf{x}_{t-1}| \\mathbf{x}_t, \\mathbf{x}_0)} \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} \\\\ \u0026amp;= \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t} \\ q(\\mathbf{x}_0, \\mathbf{x}_t) \\underbrace{\\color{red} \\left \\{ \\int \\mathrm{d}\\mathbf{x}_{t-1} \\ q(\\mathbf{x}_{t-1}| \\mathbf{x}_t, \\mathbf{x}_0) \\log \\frac{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})}{q(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}, \\mathbf{x}_0)} 
\\right \\} }_{ \\begin{array}{c} \\small \\text{KL Divergence (also called relative entropy)} \\\\ {\\color{violet} \\mathcal{D}_{KL}(P \\| Q) = \\int_{-\\infty}^{+\\infty} p(x) \\log \\frac{p(x)}{q(x)} \\mathrm{d}x} \\end{array} } \\\\ \u0026amp;= {\\color{red} -} \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t} \\ q(\\mathbf{x}_0, \\mathbf{x}_t) {\\color{red} \\mathcal{D}_{KL}(q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\| p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t))} \\end{split} \\end{equation*} \\tag{23} $$\nTherefore, the lower bound in Eq. (22) becomes $$ \\begin{equation*} \\begin{split} \\mathcal{K} = \u0026amp;- \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t} \\ q(\\mathbf{x}_0, \\mathbf{x}_t) \\mathcal{D}_{KL}(q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\| p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)) \\\\ \u0026amp;+ \\mathcal{H}_q(\\mathbf{x}_T | \\mathbf{x}_0) - \\mathcal{H}_q(\\mathbf{x}_1 | \\mathbf{x}_0) - \\mathcal{H}_p(\\mathbf{x}_T) \\end{split} \\end{equation*} \\tag{24} $$\nNote that the entropies can be analytically computed, and the KL divergence can be analytically computed given $\\mathbf{x}_0$ and $\\mathbf{x}_t$.\nTraining consists of finding the reverse Markov transitions which maximize this lower bound on the log likelihood, $$ p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t) = \\argmax_{\\theta} \\mathcal{K} \\tag{25} $$\nSpecific Diffusion Kernel Forward Process We specify the Markov diffusion kernel in Eq. (2) to be Gaussian, $$ q(\\mathbf{x}_t | \\mathbf{x}_{t-1}) = T_\\pi(\\mathbf{x}_t | \\mathbf{x}_{t-1}; \\beta_t) = \\mathcal{N}(\\mathbf{x}_t; \\sqrt{1 - \\beta_t} \\mathbf{x}_{t-1}, \\beta_t \\mathbf{I}) \\tag{26} $$\nA nice property of the above process is that we can sample $\\mathbf{x}_t$ at any arbitrary time step $t$ in closed form using the reparameterization trick. 
Let $\\alpha_t = 1 - \\beta_t$ and $\\bar{\\alpha}_t = \\prod_{i=1}^t \\alpha_i$ : $$ \\begin{equation*} \\begin{split} \\mathbf{x}_t \u0026amp;= \\sqrt{\\alpha_t} {\\color{blue} \\mathbf{x}_{t-1}} + \\sqrt{1 - \\alpha_t} \\bm{\\epsilon}_{t-1} \\\\ \u0026amp;= \\sqrt{\\alpha_t} {\\color{blue} (\\sqrt{\\alpha_{t-1}} \\mathbf{x}_{t-2} + \\sqrt{1 - \\alpha_{t-1}} \\bm{\\epsilon}_{t-2})} + \\sqrt{1 - \\alpha_t} \\bm{\\epsilon}_{t-1} \\\\ \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}} \\mathbf{x}_{t-2} + {\\color{red} \\sqrt{\\alpha_t - \\alpha_t \\alpha_{t-1}} \\bm{\\epsilon}_{t-2} + \\sqrt{1 - \\alpha_t} \\bm{\\epsilon}_{t-1}} \\\\ \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}} \\mathbf{x}_{t-2} + {\\color{red} \\sqrt{\\sqrt{\\alpha_t - \\alpha_t \\alpha_{t-1}}^2 + \\sqrt{1 - \\alpha_t}^2} \\bar{\\bm{\\epsilon}}_{t-2}} \\\\ \u0026amp;= \\sqrt{\\alpha_t \\alpha_{t-1}} \\mathbf{x}_{t-2} + \\sqrt{1 - \\alpha_t \\alpha_{t-1}} \\bar{\\bm{\\epsilon}}_{t-2} \\\\ \u0026amp;= \\cdots \\\\ \u0026amp;= \\sqrt{\\bar{\\alpha}_t} \\mathbf{x}_{0} + \\sqrt{1 - \\bar{\\alpha}_t} \\bar{\\bm{\\epsilon}}_0 \\\\ \u0026amp;= \\sqrt{\\bar{\\alpha}_t} \\mathbf{x}_{0} + \\sqrt{1 - \\bar{\\alpha}_t} {\\color{green} \\bm{\\epsilon}_{t}} \\quad \\text{; to correspond to the subscript of } \\mathbf{x}_t \\\\ \\end{split} \\end{equation*} \\tag{27} $$ where ${\\color{green} \\bm{\\epsilon}_{t}}, \\bm{\\epsilon}_{t-1}, \\bm{\\epsilon}_{t-2}, \\cdots \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$.\nRecall the red part of Eq. 
(27): when we add two independent zero-mean Gaussian variables with different variances, $\\mathcal{N}(\\mathbf{0}, \\sigma_1^2\\mathbf{I})$ and $\\mathcal{N}(\\mathbf{0}, \\sigma_2^2\\mathbf{I})$, the sum is distributed as $\\mathcal{N}(\\mathbf{0}, (\\sigma_1^2 + \\sigma_2^2)\\mathbf{I})$.\nThus, we have $$ q(\\mathbf{x}_t | \\mathbf{x}_0) = \\mathcal{N}(\\mathbf{x}_t; \\sqrt{\\bar{\\alpha}_t} \\mathbf{x}_0, (1 - \\bar{\\alpha}_t) \\mathbf{I}) \\tag{28} $$\nUsually, we can afford a larger update step when the sample gets noisier, so $\\beta_1 \u0026lt; \\beta_2 \u0026lt; \\cdots \u0026lt; \\beta_T$ and therefore $\\bar{\\alpha}_1 \u0026gt; \\bar{\\alpha}_2 \u0026gt; \\cdots \u0026gt; \\bar{\\alpha}_T$.\nReverse Process Since the forward kernel in Eq. (26) is Gaussian and $\\beta_t$ is small, the reverse kernel can also be modeled as a Gaussian, $$ p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t) = \\mathcal{N}(\\mathbf{x}_{t-1}; \\bm{\\mu}_\\theta(\\mathbf{x}_t, t), \\bm{\\sigma}_\\theta(\\mathbf{x}_t, t)) \\tag{29} $$\nIt is noteworthy that the reverse conditional probability is tractable when conditioned on $\\mathbf{x}_0$: $$ q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) = \\mathcal{N}(\\mathbf{x}_{t-1}; { \\bm{\\tilde{\\mu}}_t(\\mathbf{x}_t, \\mathbf{x}_0)}, { \\tilde{\\beta}_t \\mathbf{I}}) \\tag{30} $$\nUsing Bayes\u0026rsquo; rule, we have: $$ \\begin{equation*} \\begin{split} q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \u0026amp;= q(\\mathbf{x}_{t} | \\mathbf{x}_{t-1}, \\mathbf{x}_0) \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_0)}{q(\\mathbf{x}_{t} | \\mathbf{x}_0)} \\quad ; \\text{bringing in Eq. (26) and Eq. 
(28)} \\\\ \u0026amp;\\propto \\exp \\left ( -\\frac{1}{2}(\\frac{(\\mathbf{x}_t - \\sqrt{\\alpha_t} \\mathbf{x}_{t-1})^2}{\\beta_t}) \\right ) \\frac{\\displaystyle \\exp \\left ( -\\frac{1}{2}( \\frac{(\\mathbf{x}_{t-1} - \\sqrt{\\bar{\\alpha}_{t-1}} \\mathbf{x}_0)^2}{1 - \\bar{\\alpha}_{t-1}} ) \\right ) }{\\displaystyle \\exp \\left ( -\\frac{1}{2}( \\frac{(\\mathbf{x}_{t} - \\sqrt{\\bar{\\alpha}_{t}} \\mathbf{x}_0)^2}{1 - \\bar{\\alpha}_{t}} ) \\right ) } \\\\ \u0026amp;= \\exp \\left ( -\\frac{1}{2} ( \\frac{(\\mathbf{x}_t - \\sqrt{\\alpha_t} \\mathbf{x}_{t-1})^2}{\\beta_t} + \\frac{(\\mathbf{x}_{t-1} - \\sqrt{\\bar{\\alpha}_{t-1}} \\mathbf{x}_0)^2}{1 - \\bar{\\alpha}_{t-1}} - \\frac{(\\mathbf{x}_{t} - \\sqrt{\\bar{\\alpha}_{t}} \\mathbf{x}_0)^2}{1 - \\bar{\\alpha}_{t}} ) \\right ) \\\\ \u0026amp;= \\exp \\left ( -\\frac{1}{2}( \\frac{ \\mathbf{x}_t^2 - 2\\sqrt{\\alpha_t} \\mathbf{x}_t {\\color{blue} \\mathbf{x}_{t-1}} + \\alpha_t {\\color{red} \\mathbf{x}_{t-1}^2} }{\\beta_t} + \\frac{ {\\color{red} \\mathbf{x}_{t-1}^2} -2\\sqrt{\\bar{\\alpha}_{t-1}}\\mathbf{x}_0 {\\color{blue} \\mathbf{x}_{t-1}} + \\bar{\\alpha}_{t-1} \\mathbf{x}_0^2 }{1 - \\bar{\\alpha}_{t-1}} \\right. \\\\ \u0026amp;\\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\left. 
- \\ \\frac{(\\mathbf{x}_{t} - \\sqrt{\\bar{\\alpha}_{t}} \\mathbf{x}_0)^2}{1 - \\bar{\\alpha}_{t}} ) \\right ) \\\\ \u0026amp;= \\exp \\left ( -\\frac{1}{2} {\\Large (} (\\frac{\\alpha_t}{\\beta_t} + \\frac{1}{1 - \\bar{\\alpha}_{t-1}}) {\\color{red} \\mathbf{x}_{t-1}^2} - (\\frac{2\\sqrt{\\alpha_t}}{\\beta_t} \\mathbf{x}_t + \\frac{2\\sqrt{\\bar{\\alpha}_{t-1}}}{1 - \\bar{\\alpha}_{t-1}} \\mathbf{x}_0) {\\color{blue} \\mathbf{x}_{t-1}} + C(\\mathbf{x}_t, \\mathbf{x}_0) {\\Large )} \\right ) \\end{split} \\end{equation*} \\tag{31} $$ where $C(\\mathbf{x}_t, \\mathbf{x}_0)$ is some function not involving $\\mathbf{x}_{t-1}$; its details are omitted.\nMatching this against the general form of the $\\mathcal{N}(\\mu, \\sigma^2)$ probability density function $f(x) = \\frac{1}{\\sigma \\sqrt{2 \\pi}} \\exp \\left ( -\\frac{1}{2} (\\frac{x - \\mu}{\\sigma})^2 \\right )$, $$ (\\frac{\\alpha_t}{\\beta_t} + \\frac{1}{1 - \\bar{\\alpha}_{t-1}}) {\\color{red} \\mathbf{x}_{t-1}^2} - (\\frac{2\\sqrt{\\alpha_t}}{\\beta_t} \\mathbf{x}_t + \\frac{2\\sqrt{\\bar{\\alpha}_{t-1}}}{1 - \\bar{\\alpha}_{t-1}} \\mathbf{x}_0) {\\color{blue} \\mathbf{x}_{t-1}} + C(\\mathbf{x}_t, \\mathbf{x}_0) = (\\frac{x-\\mu}{\\sigma})^2 = \\frac{{\\color{red} x^2} - 2\\mu {\\color{blue} x} + \\mu^2}{\\sigma^2} \\tag{32} $$\nThe variance $(\\tilde{\\beta}_t \\mathbf{I})$ and mean $(\\bm{\\tilde{\\mu}}_t(\\mathbf{x}_t, \\mathbf{x}_0))$ in Eq. 
(30) can be parameterized as follows: $$ \\begin{equation*} \\begin{split} \\tilde{\\beta}_t = 1 / (\\frac{\\alpha_t}{\\beta_t} + \\frac{1}{1 - \\bar{\\alpha}_{t-1}}) = \\frac{1 - \\bar{\\alpha}_{t-1}}{1 - \\bar{\\alpha}_{t}} \\beta_t \\end{split} \\end{equation*} \\tag{33} $$\n$$ \\begin{equation*} \\begin{split} \\bm{\\tilde{\\mu}}_t(\\mathbf{x}_t, \\mathbf{x}_0) \u0026amp;= \\frac{(\\frac{2\\sqrt{\\alpha_t}}{\\beta_t} \\mathbf{x}_t + \\frac{2\\sqrt{\\bar{\\alpha}_{t-1}}}{1 - \\bar{\\alpha}_{t-1}} \\mathbf{x}_0) \\tilde{\\beta}_t }{2} = \\frac{\\sqrt{\\alpha_t} (1 - \\bar{\\alpha}_{t-1})}{1 - \\bar{\\alpha}_t} \\mathbf{x}_t + \\frac{\\sqrt{\\bar{\\alpha}_{t-1}} \\beta_t}{1 - \\bar{\\alpha}_t} \\mathbf{x}_0 \\end{split} \\end{equation*} \\tag{34} $$\nThanks to the nice property in Eq. (27), we can write $\\mathbf{x}_0 = (\\mathbf{x}_t - \\sqrt{1 - \\bar{\\alpha}_t} \\bm{\\epsilon}_t) / \\sqrt{\\bar{\\alpha}_t}$ and substitute it into Eq. (34), $$ \\begin{equation*} \\begin{split} \\bm{\\mu}_t(\\mathbf{x}_t) \u0026amp;= \\bm{\\tilde{\\mu}}_t\\left (\\mathbf{x}_t, (\\mathbf{x}_t - \\sqrt{1 - \\bar{\\alpha}_t} \\bm{\\epsilon}_t) / \\sqrt{\\bar{\\alpha}_t} \\right ) \\\\ \u0026amp;= \\frac{\\sqrt{\\alpha_t} (1 - \\bar{\\alpha}_{t-1})}{1 - \\bar{\\alpha}_t} \\mathbf{x}_t + \\frac{\\sqrt{\\bar{\\alpha}_{t-1}} \\beta_t}{1 - \\bar{\\alpha}_t} (\\frac{(\\mathbf{x}_t - \\sqrt{1 - \\bar{\\alpha}_t} \\bm{\\epsilon}_t)}{\\sqrt{\\bar{\\alpha}_t}}) \\\\ \u0026amp;= {\\color{red} \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_t \\right ) } \\end{split} \\end{equation*} \\tag{35} $$\nLoss Function We define the negative of the lower bound $\\mathcal{K}$, which upper-bounds the negative log likelihood, as the variational lower bound loss function, $$ \\begin{equation*} \\begin{split} \\mathcal{L}_{VLB} \u0026amp;= - \\mathcal{K} \\\\ \u0026amp;= \\sum_{t=2}^T \\int \\mathrm{d}\\mathbf{x}_{0}\\mathrm{d}\\mathbf{x}_{t} \\ q(\\mathbf{x}_0, \\mathbf{x}_t) 
{\\color{blue} \\mathcal{D}_{KL}(q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\| p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t))} \\\\ \u0026amp;\\quad - \\mathcal{H}_q(\\mathbf{x}_T | \\mathbf{x}_0) + \\mathcal{H}_q(\\mathbf{x}_1 | \\mathbf{x}_0) + \\mathcal{H}_p(\\mathbf{x}_T) \\\\ \u0026amp;= \\sum_{t=2}^T {\\Large \\mathbb{E}}_{\\mathbf{x}_0, \\mathbf{x}_t \\sim q(\\mathbf{x}_0, \\mathbf{x}_t)} {\\Large [} \\underbrace{\\color{blue} \\mathcal{D}_{KL}(q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\| p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t))}_{\\color{blue} \\mathcal{L}_{t-1}} {\\Large ]} - \\mathcal{H}_q(\\mathbf{x}_T | \\mathbf{x}_0) + \\mathcal{H}_q(\\mathbf{x}_1 | \\mathbf{x}_0) + \\mathcal{H}_p(\\mathbf{x}_T) \\end{split} \\end{equation*} \\tag{36} $$\nRecall that we need to learn a model to approximate the conditional probability distributions in the reverse diffusion process, $p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t) = \\mathcal{N}(\\mathbf{x}_{t-1}; \\bm{\\mu}_\\theta(\\mathbf{x}_t, t), \\bm{\\sigma}_\\theta(\\mathbf{x}_t, t))$. We would like to train $\\bm{\\mu}_\\theta(\\mathbf{x}_t, t)$ to predict $\\bm{\\mu}_t(\\mathbf{x}_t)$ in Eq. (35), and set $\\bm{\\sigma}_\\theta(\\mathbf{x}_t, t)$ to $\\sigma_t^2\\mathbf{I}$, where $\\sigma_t^2$ is either $\\tilde{\\beta}_t$ from Eq. (33) or, for simplicity, $\\beta_t$. The loss term $\\mathcal{L}_{t-1}$ in Eq. 
(36) is parameterized to minimize the difference from $\\bm{\\mu}_t(\\mathbf{x}_t)$: $$ \\begin{equation*} \\begin{split} \\mathcal{L}_{t-1} \u0026amp;= \\mathcal{D}_{KL}(q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\| p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)) \\\\ \u0026amp;= \\int \\mathrm{d}\\mathbf{x}_{t-1} \\ q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0) \\log \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0)}{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)} \\\\ \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_{t-1} \\sim q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0)} \\left [ \\log \\frac{q(\\mathbf{x}_{t-1} | \\mathbf{x}_t, \\mathbf{x}_0)}{p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_t)} \\right ] \\\\ \u0026amp;\\propto {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{1}{2\\sigma_t^2} \\left \\| {\\color{blue} \\bm{\\mu}_t(\\mathbf{x}_t)} - {\\color{red} \\bm{\\mu}_\\theta(\\mathbf{x}_t, t)} \\right \\| ^2 \\right ] \\\\ \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{1}{2\\sigma_t^2} \\left \\| {\\color{blue} \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_t \\right )} - {\\color{red} \\bm{\\mu}_\\theta(\\mathbf{x}_t, t)} \\right \\| ^2 \\right ] \\end{split} \\end{equation*} \\tag{37} $$\nBecause $\\mathbf{x}_t$ is available as input at training time, we can instead reparameterize the Gaussian noise term so that the model predicts $\\bm{\\epsilon}$ from the input $\\mathbf{x}_t$ at time step $t$: $$ \\bm{\\mu}_\\theta(\\mathbf{x}_t, t) = \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right ) \\tag{38} $$ where $\\bm{\\epsilon}_\\theta$ is 
a function approximator (the model) intended to predict $\\bm{\\epsilon}$ from $\\mathbf{x}_t$.\nThus, Eq. (29) can be written as $$ p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t}) = \\mathcal{N} \\left ( \\mathbf{x}_{t-1} ; \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right ) , \\tilde{\\beta}_t \\mathbf{I} \\right ) \\tag{39} $$\nAccording to Eq. (39), sampling $\\mathbf{x}_{t-1} \\sim p_\\theta(\\mathbf{x}_{t-1} | \\mathbf{x}_{t})$ is: $$ \\mathbf{x}_{t-1} = \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right ) + \\sqrt{\\tilde{\\beta}_t} \\bm{\\epsilon}^* \\tag{40} $$ where $\\bm{\\epsilon}^* \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})$.\nFurthermore, with the parameterization Eq. (38), the $\\mathcal{L}_{t-1}$ in Eq. (37) simplifies to: $$ \\begin{equation*} \\begin{split} \\mathcal{L}_{t-1} \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{1}{2\\sigma_t^2} \\left \\| {\\color{blue} \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_t \\right )} - {\\color{red} \\bm{\\mu}_\\theta(\\mathbf{x}_t, t)} \\right \\| ^2 \\right ] \\\\ \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{1}{2\\sigma_t^2} \\left \\| {\\color{blue} \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_t \\right )} - {\\color{red} \\frac{1}{\\sqrt{\\alpha_t}} \\left ( \\mathbf{x}_t - \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\bm{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right )} \\right \\| ^2 \\right ] \\\\ \u0026amp;= 
{\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{1}{2\\sigma_t^2} \\left \\| \\frac{1}{\\sqrt{\\alpha_t}} \\frac{1 - \\alpha_t}{\\sqrt{1 - \\bar{\\alpha}_t}} \\left (\\bm{\\epsilon}_t - \\bm{\\epsilon}_\\theta(\\mathbf{x}_t, t) \\right ) \\right \\| ^2 \\right ] \\\\ \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\frac{\\beta_t^2}{2 \\alpha_t (1 - \\bar{\\alpha}_t) \\sigma_t^2} \\left \\| \\bm{\\epsilon}_t - \\bm{\\epsilon}_\\theta({\\color{orange} \\mathbf{x}_t}, t) \\right \\| ^2 \\right ] \\quad ; \\text{bringing in Eq. (27)} \\\\ \u0026amp;= {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ {\\color{green} \\frac{\\beta_t^2}{2 \\alpha_t (1 - \\bar{\\alpha}_t) \\sigma_t^2}} \\left \\| \\bm{\\epsilon}_t - \\bm{\\epsilon}_\\theta({\\color{orange} \\sqrt{\\bar{\\alpha}_t}\\mathbf{x}_0 + \\sqrt{1 - \\bar{\\alpha}_t} \\bm{\\epsilon}_t}, t) \\right \\| ^2 \\right ] \\\\ \\end{split} \\end{equation*} \\tag{41} $$\nSimplification Empirically, Ho et al. (2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term (the green part in Eq. (41)): $$ \\mathcal{L}_t^{\\text{simple}} = {\\Large \\mathbb{E}}_{\\small \\mathbf{x}_0 \\sim q(\\mathbf{x}_0), \\bm{\\epsilon}_t \\sim \\mathcal{N}(\\mathbf{0}, \\mathbf{I})} \\left [ \\left \\| \\bm{\\epsilon}_t - \\bm{\\epsilon}_\\theta(\\sqrt{\\bar{\\alpha}_t}\\mathbf{x}_0 + \\sqrt{1 - \\bar{\\alpha}_t} \\bm{\\epsilon}_t, t) \\right \\| ^2 \\right ] \\tag{42} $$\nSimple Code Implementation The Jupyter notebook is available at GitHub Gist.\nYou can click the button at the top of the notebook to open it in Colab and run the code for free.\nCitation Cited as:\nGavin, Sun. (May 2023). 
A Brief Exploration to Diffusion Probabilistic Models [Blog post]. Retrieved from https://gavinsun0921.github.io/posts/paper-reading-01/. Or\n@online{gavin2023diffusion, title = {A Brief Exploration to Diffusion Probabilistic Models}, author = {Gavin, Sun}, year = {2023}, month = {May}, url = {\\url{https://gavinsun0921.github.io/posts/paper-reading-01/}} } References [1] Jascha Sohl-Dickstein et al. \u0026ldquo;Deep Unsupervised Learning using Nonequilibrium Thermodynamics.\u0026rdquo; ICML 2015.\n[2] Jonathan Ho et al. \u0026ldquo;Denoising Diffusion Probabilistic Models.\u0026rdquo; NeurIPS 2020.\n[3] Jiaming Song et al. \u0026ldquo;Denoising Diffusion Implicit Models.\u0026rdquo; ICLR 2021.\n[4] Alex Nichol \u0026amp; Prafulla Dhariwal. \u0026ldquo;Improved Denoising Diffusion Probabilistic Models.\u0026rdquo; ICML 2021.\n[5] Lilian Weng. \u0026ldquo;What are Diffusion Models?\u0026rdquo; [Blog post] Lil\u0026rsquo;Log 2021.\n[6] Ayan Das. \u0026ldquo;An Introduction to Diffusion Probabilistic Models.\u0026rdquo; [Blog post] Ayan\u0026rsquo;s Blog 2021.\n","permalink":"https://gavinsun0921.github.io/posts/paper-research-01/","summary":"Learn diffusion probabilistic models (DPM) by reading and analyzing the papers: \u0026ldquo;Deep Unsupervised Learning using Nonequilibrium Thermodynamics\u0026rdquo; and \u0026ldquo;Denoising Diffusion Probabilistic Models\u0026rdquo;. This post will introduce the basic work of DPM, including the derivation of formulas and simple code verification.","title":"A Brief Exploration to Diffusion Probabilistic Models with Code Implementation"},{"content":"Gavin Sun (孙国栋) I\u0026rsquo;m a PhD student at Northwestern Polytechnical University, advised by Prof. Yuchao Dai.\nMy research focuses on 4D Reconstruction \u0026amp; Generation and Spatial Intelligence. I am particularly interested in understanding dynamic scenes and building coherent spatio-temporal representations of the physical world. 
My current work explores how tracking, geometric structure, and generative priors can be integrated to achieve robust and scalable 4D representations.\nThis page serves as a brief academic profile, while the rest of the site functions as a personal research log containing notes, experiments, and occasional audio reflections.\n📚 Education PhD Student (2025 - Present) Northwestern Polytechnical University Majoring in Information and Communication Engineering Master\u0026rsquo;s Degree (2022 - 2025) Northwestern Polytechnical University Majoring in Artificial Intelligence Bachelor\u0026rsquo;s Degree (2018 - 2022) Shandong University of Science and Technology Majoring in Statistics 🏆 Selected Awards 2022 \u0026ldquo;航天宏图杯\u0026rdquo; Remote Sensing Image Intelligent Processing Algorithm Competition · National Champion (CNY 100,000) Department of Information Sciences, National Natural Science Foundation of China 2023 \u0026ldquo;国丰东方慧眼杯\u0026rdquo; Remote Sensing Image Intelligent Processing Algorithm Competition · National Runner-up (CNY 30,000) Department of Information Sciences, National Natural Science Foundation of China China Collegiate Computing Contest 2020 Group Programming Ladder Tournament · Team Gold Award 全国高等学校计算机教育研究会 The 2020 ICPC Asia Nanjing Regional Contest · Silver Medal The International Collegiate Programming Contest Foundation The 2019 ICPC Asia Shanghai Regional Contest · Bronze Medal The International Collegiate Programming Contest Foundation 2024 Huawei Software Elite Challenge (Planck Plan) · Second Prize, Northwest Division Huawei Technologies Co., Ltd. Ascend AI Innovation Competition 2023 National Finals · Bronze Award Huawei Technologies Co., Ltd. 📖 Selected Publications Cong Wang, Guodong Sun (co-first author \u0026amp; first student author), Cailing Wang, Zixuan Gao and Hongwei Wang, \u0026ldquo;Monte Carlo-Based Restoration of Images Degraded by Atmospheric Turbulence,\u0026rdquo; in IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 54, no. 11, pp. 6610-6620, Nov. 2024, doi: 10.1109/TSMC.2024.3399464. Qixiang Ma, Jian Wu, Runze Fan, Guodong Sun, Xuehuai Shi, \u0026ldquo;ViP-Fluid: Visual Perception Driven Method for VR Fluid Rendering,\u0026rdquo; 2024 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Bellevue, WA, USA, 2024, pp. 359-367, doi: 10.1109/ISMAR62088.2024.00050. 
Guodong Sun, Qixiang Ma, Liqiang Zhang, Hongwei Wang, Zixuan Gao, Haotian Zhang, \u0026ldquo;Probabilistic Prior Driven Attention Mechanism Based on Diffusion Model for Imaging Through Atmospheric Turbulence,\u0026rdquo; arXiv preprint, arXiv:2411.10321, 2024. 🌐 Connect With Me Google Scholar: Gavin Sun GitHub: @GavinSun0921 Email: gavinsun0921@foxmail.com ","permalink":"https://gavinsun0921.github.io/about/","summary":"\u003ch2 id=\"gavin-sun-孙国栋\"\u003eGavin Sun (孙国栋)\u003c/h2\u003e\n\u003cp\u003eI\u0026rsquo;m a PhD student at the \u003cstrong\u003eNorthwestern Polytechnical University\u003c/strong\u003e, advised by Prof. \u003ca href=\"https://scholar.google.com/citations?user=fddAbqsAAAAJ\u0026amp;hl=en\"\u003eYuchao Dai\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eMy research focuses on \u003cstrong\u003e4D Reconstruction \u0026amp; Generation\u003c/strong\u003e and \u003cstrong\u003eSpatial Intelligence\u003c/strong\u003e. I am particularly interested in understanding dynamic scenes and building coherent spatio-temporal representations of the physical world. My current work explores how \u003cstrong\u003etracking\u003c/strong\u003e, \u003cstrong\u003egeometric structure\u003c/strong\u003e, and \u003cstrong\u003egenerative priors\u003c/strong\u003e can be integrated to achieve robust and scalable 4D representations.\u003c/p\u003e\n\u003cp\u003eThis page serves as a brief academic profile, while the rest of the site functions as a personal research log containing notes, experiments, and occasional audio reflections.\u003c/p\u003e","title":"About Me"}]