Commit 800eca3

update doc
1 parent 10ad266 commit 800eca3

1 file changed: +43 -10 lines changed

readme.md

Lines changed: 43 additions & 10 deletions
@@ -2,7 +2,7 @@
22

33
- **Can be 5-10x faster than the original software CUDA rasterizer ([diff-gaussian-rasterization](https://github.com/graphdeco-inria/diff-gaussian-rasterization)).**
44
- **Can be 2-3x faster if using offline rendering. (Bottleneck: copying rendered images around, thinking about improvements.)**
5-
- **Speedup most visible with high pixel-to-point ratio (large gaussians, small point count, high-res rendering).**
5+
- **Speedup most visible with high pixel-to-point ratio (large Gaussians, small point count, high-res rendering).**
66

77
https://github.com/dendenxu/fast-gaussian-splatting/assets/43734697/f50afd6f-bbd5-4e18-aca6-a7356a5d3f75
88

@@ -13,13 +13,13 @@ Discussion welcomed.
1313

1414
## Installation
1515

16-
Latest release from PyPI:
16+
Install the latest release from PyPI:
1717

1818
```shell
1919
pip install fast_gauss
2020
```
2121

22-
Latest commit from GitHub:
22+
Or the latest commit from GitHub:
2323

2424
```shell
2525
pip install git+https://github.com/dendenxu/fast-gaussian-rasterization
@@ -61,14 +61,30 @@ Thus if you're running in a GUI (OpenGL-based) environment, the output of our ra
6161

6262
**Note: the speedup is the most visible when the pixel-to-point ratio is high.**
6363

64-
That is, when there're large gaussians and very high resolution rendering, the speedup is more visible.
64+
That is, when there are large Gaussians and very high-resolution rendering, the speedup is more visible.
6565

6666
The CUDA-based software implementation is more resolution sensitive and for some extremely dense point clouds (> 1 million points), the CUDA implementation might be faster.
6767

68-
This is because the typical rasterization-based pipeline on modern graphics are [not well-optimized for small triangles](https://www.youtube.com/watch?v=hf27qsQPRLQ&list=WL).
68+
This is because the typical rasterization-based pipeline on modern graphics hardware is [not well-optimized for small triangles](https://www.youtube.com/watch?v=hf27qsQPRLQ&list=WL).
6969

7070

71-
**Note: it's recommended to pass in a CPU tensor in the GaussianRasterizationSettings to avoid explicit synchronizations for even better performance.**
71+
**Note: for best performance, cache the persistent results (for example, the 6 elements of the covariance matrix).**
72+
73+
This is more of a general tip and not directly related to `fast_gauss`.
74+
75+
However, the impact is more observable here since we haven't implemented a fast 3D covariance computation (from scales and rotations) in the shader yet.
76+
Only a PyTorch implementation is available for now.
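
As a rough illustration of the precomputation mentioned above, the sketch below builds the 6 unique covariance elements from scales and rotations in PyTorch. The function name `build_covariance_6`, the `(w, x, y, z)` quaternion order, and the `xx, xy, xz, yy, yz, zz` output order are assumptions made for this example, not something this README specifies.

```python
import torch


def build_covariance_6(scales: torch.Tensor, quats: torch.Tensor) -> torch.Tensor:
    """Return the 6 unique elements of each 3x3 covariance, shape (N, 6).

    scales: (N, 3) per-axis scales (assumed already activated).
    quats:  (N, 4) rotations, assumed (w, x, y, z); normalized here.
    """
    q = torch.nn.functional.normalize(quats, dim=-1)
    w, x, y, z = q.unbind(-1)
    # Rotation matrix of a unit quaternion, written out row by row.
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y),
        2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    M = R * scales[:, None, :]       # same as R @ diag(scales)
    cov = M @ M.transpose(1, 2)      # R S S^T R^T, symmetric
    rows, cols = torch.triu_indices(3, 3, device=cov.device)
    return cov[:, rows, cols]        # assumed order: xx, xy, xz, yy, yz, zz


scales = torch.tensor([[1.0, 2.0, 3.0]])
quats = torch.tensor([[1.0, 0.0, 0.0, 0.0]])  # identity rotation
cov6 = build_covariance_6(scales, quats)      # compute once, reuse every frame
```

For a static scene, the result only needs to be recomputed when the scales or rotations change.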
77+
78+
When the point count increases, even the smallest `precomputation` can help.
79+
An example is the concatenation of the base 0-degree SH parameter and the rest: that small maneuver might cost us 10ms on a 3060 with 5 million points.
80+
Thus, store the concatenated tensors instead and avoid concatenating them in every frame.
81+
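
A minimal sketch of the SH caching tip, assuming hypothetical tensor names and shapes (`sh_dc` holding the degree-0 coefficient, `sh_rest` holding 15 higher-degree ones) and a placeholder `render_frame` standing in for the real per-frame call:

```python
import torch

# Hypothetical shapes: N points, 1 base (degree-0) SH coefficient, 15 higher-degree ones.
N = 1000
sh_dc = torch.rand(N, 1, 3)
sh_rest = torch.rand(N, 15, 3)

# Concatenate once, after loading or whenever the parameters change.
sh_all = torch.cat([sh_dc, sh_rest], dim=1)  # (N, 16, 3)


def render_frame(shs: torch.Tensor) -> torch.Size:
    # Placeholder for the per-frame work; a real loop would pass `shs` to the rasterizer.
    return shs.shape


for _ in range(3):
    render_frame(sh_all)  # no torch.cat inside the render loop
```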
82+
- [ ] TODO: Implement SH eval in the vertex shader.
83+
- [ ] TODO: Warn users if they're not properly precomputing the covariance matrix.
84+
- [ ] TODO: Implement a more optimized `OptimizedGaussians` for precomputing things and applying a cache, similar to that of the vertex shader (see [Invocation frequency](https://www.khronos.org/opengl/wiki/Vertex_Shader)).
85+
86+
87+
**Note: it's recommended to pass in a CPU tensor in the `GaussianRasterizationSettings` to avoid explicit synchronizations for even better performance.**
7288

7389
- [ ] TODO: Add a warning to the user if GPU tensors are detected.
7490

@@ -77,26 +93,43 @@ This is because the typical rasterization-based pipeline on modern graphics are
7793

7894
And the alpha channel content seems to be bugged currently, will debug.
7995

80-
- [ ] TODO: Debug alpha channel
81-
96+
- [ ] TODO: Debug alpha channel values
8297

8398
## TODOs
8499

85100
- [ ] TODO: Apply more of the optimization techniques used by similar shaders, including packing the data into a texture and bit reduction during computation.
86101
- [ ] TODO: Think of ways to implement a backward pass. Discussion welcome!
87102
- [ ] TODO: Compute covariance from scaling and rotation in the shader, currently it's on the CUDA (PyTorch) side.
88103
- [ ] TODO: Compute SH in the shader, currently it's on the CUDA (PyTorch) side.
104+
- [ ] TODO: Try to align the rendering results at the pixel level, small deviation exists currently.
105+
- [ ] TODO: Use indexed draw calls to minimize data passing and shuffling.
106+
- [ ] TODO: Do incremental sorting based on viewport change, currently it's a full re-sort with CUDA (PyTorch).
107+
108+
## Implementation
109+
110+
**Goal:**
111+
112+
- Let the professionals do the work.
113+
- Let the GPU do the large-scale sorting.
114+
- Let the graphics pipeline do the rasterization for us, not the other way around.
115+
- Let OpenGL directly write to your framebuffer.
116+
- Minimize repeated work.
117+
- Compute the 3D to 2D covariance projection only once for each Gaussian, instead of 4 times for the quad, enabled by the geometry shader.
118+
- Minimize stalls (minimize explicit synchronizations between GPU and CPU).
119+
- Enabled by using `non_blocking=True` data passing and moving sync points to as early as possible.
120+
- Boosted by the fact that we're sorting on the GPU, thus no need to perform synchronized host-to-device copies.
121+
122+
- [ ] TODO: Expand implementation details.
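
The stall-minimization points in the goals above can be sketched in PyTorch as follows. This is only an illustration under stated assumptions: the helper name `upload` is hypothetical, and sorting by the raw z coordinate stands in for whatever depth key the rasterizer actually uses.

```python
import torch


def upload(host: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Queue a host-to-device copy without blocking the CPU where possible."""
    if device.type == "cuda":
        host = host.pin_memory()  # asynchronous copies need page-locked host memory
    return host.to(device, non_blocking=True)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xyz = upload(torch.rand(1000, 3), device)

# Sort on the device where the data already lives, so no host round trip is needed.
depth = xyz[:, 2]             # assumption: z stands in for the view-space depth
order = torch.argsort(depth)
xyz_sorted = xyz[order]
```

Reading a result back on the CPU (for example with `.item()` or `.cpu()`) forces a synchronization, so such calls are best kept out of the per-frame path.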
89123

90124
## Environment
91125

92126
This project requires you to have an NVIDIA GPU with the ability to interop between CUDA and OpenGL.
93127
Thus, WSL is [not supported](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#features-not-yet-supported) and OSX (MacOS) is not supported.
128+
Tested on Linux and Windows.
94129

95130
For offline rendering (the drop-in replacement of the original CUDA rasterizer), we also need a valid EGL environment.
96131
It can sometimes be hard to set up for virtualized machines. [Potential fix](https://github.com/zju3dv/4K4D/issues/27#issuecomment-2026747401).
97132

98-
- [ ] TODO: Test on more platforms.
99-
100133
## Credits
101134

102135
Inspired by those insanely fast WebGL-based 3DGS viewers:
