readme.md: 43 additions & 10 deletions
@@ -2,7 +2,7 @@
2
2
3
3
-**Can be 5-10x faster than the original software CUDA rasterizer ([diff-gaussian-rasterization](https://github.com/graphdeco-inria/diff-gaussian-rasterization)).**
4
4
-**Can be 2-3x faster if using offline rendering. (Bottleneck: copying rendered images around, thinking about improvements.)**
5
-
-**Speedup most visible with high pixel-to-point ratio (large gaussians, small point count, high-res rendering).**
5
+
-**Speedup most visible with high pixel-to-point ratio (large Gaussians, small point count, high-res rendering).**
@@ -61,14 +61,30 @@ Thus if you're running in a GUI (OpenGL-based) environment, the output of our ra
61
61
62
62
**Note: the speedup is the most visible when the pixel-to-point ratio is high.**
63
63
64
-
That is, when there're large gaussians and very highresolution rendering, the speedup is more visible.
64
+
That is, when there are large Gaussians and very high-resolution rendering, the speedup is more visible.
65
65
66
66
The CUDA-based software implementation is more resolution-sensitive, and for some extremely dense point clouds (> 1 million points) it might be faster.
67
67
68
-
This is because the typical rasterization-based pipeline on modern graphics are [not well-optimized for small triangles](https://www.youtube.com/watch?v=hf27qsQPRLQ&list=WL).
68
+
This is because the typical rasterization-based pipeline on modern graphics hardware is [not well-optimized for small triangles](https://www.youtube.com/watch?v=hf27qsQPRLQ&list=WL).
69
69
70
70
71
-
**Note: it's recommended to pass in a CPU tensor in the GaussianRasterizationSettings to avoid explicit synchronizations for even better performance.**
71
+
**Note: for best performance, cache the persistent results (for example, the 6 unique elements of the symmetric 3D covariance matrix).**
72
+
73
+
This is more of a general tip and not directly related to `fast_gauss`.
74
+
75
+
However, the impact is more observable here since we haven't implemented a fast 3D covariance computation (from scales and rotations) in the shader yet.
76
+
Only a PyTorch implementation is available for now.
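
For reference, here is a minimal pure-Python sketch of the 6 cached covariance elements for a single Gaussian. It assumes the usual 3DGS construction Σ = R S Sᵀ Rᵀ, with the rotation given as a `(w, x, y, z)` quaternion. The function name `covariance_upper_triangle` is hypothetical and not part of `fast_gauss`; a batched PyTorch version would follow the same arithmetic.

```python
import math


def covariance_upper_triangle(scale, quat):
    """Return (xx, xy, xz, yy, yz, zz) of Sigma = R S S^T R^T for one Gaussian.

    Assumptions (not taken from fast_gauss): `scale` holds the 3 axis scales
    and `quat` is a (w, x, y, z) quaternion that may be unnormalized.
    """
    w, x, y, z = quat
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm

    # Rotation matrix of the unit quaternion.
    R = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

    # M = R @ diag(scale), so that Sigma = M @ M^T.
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    sigma = [
        [sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

    # Sigma is symmetric, so only the upper triangle needs to be stored.
    return (
        sigma[0][0], sigma[0][1], sigma[0][2],
        sigma[1][1], sigma[1][2], sigma[2][2],
    )
```

Because the result depends only on scales and rotations, it can be computed once for a static scene and reused on every frame.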
77
+
78
+
When the point count increases, even the smallest `precomputation` can help.
79
+
An example is the concatenation of the base 0-degree SH parameters with the rest: that small operation might cost 10 ms on a 3060 with 5 million points.
80
+
Thus, store the concatenated tensors instead and avoid concatenating them in every frame.
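
The caching tip above can be sketched as follows. This is only an illustration: the class name `PrecomputedSH` and its attributes are hypothetical and not part of `fast_gauss`, and the concatenation function is passed in so the sketch runs without PyTorch. With PyTorch tensors, one would presumably pass something like `lambda a, b: torch.cat([a, b], dim=1)`, where the `dim=1` layout is itself an assumption.

```python
class PrecomputedSH:
    """Concatenate the SH parameters once and reuse the result every frame."""

    def __init__(self, sh_dc, sh_rest, concat):
        self._sh_dc = sh_dc
        self._sh_rest = sh_rest
        self._concat = concat  # e.g. a torch.cat wrapper (assumption)
        self._cached = None
        self.num_concats = 0  # counts how often the concatenation really ran

    def update(self, sh_dc, sh_rest):
        """Replace the parameters and invalidate the cached concatenation."""
        self._sh_dc = sh_dc
        self._sh_rest = sh_rest
        self._cached = None

    @property
    def sh(self):
        """Return the concatenated parameters, building them only if needed."""
        if self._cached is None:
            self._cached = self._concat(self._sh_dc, self._sh_rest)
            self.num_concats += 1
        return self._cached


def concat_lists(a, b):
    """Stand-in for a tensor concatenation, using plain Python lists."""
    return a + b


model = PrecomputedSH([0.5], [0.1, 0.2], concat_lists)
for _ in range(3):  # three "frames" share one concatenation
    frame_sh = model.sh
```

The render loop then reads `model.sh` on every frame, and the concatenation runs again only after `update` is called.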
81
+
82
+
-[ ] TODO: Implement SH eval in the vertex shader.
83
+
-[ ] TODO: Warn users if they're not properly precomputing the covariance matrix.
84
+
-[ ] TODO: Implement a more optimized `OptimizedGaussians` for precomputing things and applying a cache, similar to that of the vertex shader (see [Invocation frequency](https://www.khronos.org/opengl/wiki/Vertex_Shader)).
85
+
86
+
87
+
**Note: it's recommended to pass in a CPU tensor in the `GaussianRasterizationSettings` to avoid explicit synchronizations for even better performance.**
72
88
73
89
-[ ] TODO: Add a warning to the user if GPU tensors are detected.
74
90
@@ -77,26 +93,43 @@ This is because the typical rasterization-based pipeline on modern graphics are
77
93
78
94
The alpha channel content also seems to be bugged currently and needs debugging.
79
95
80
-
-[ ] TODO: Debug alpha channel
81
-
96
+
-[ ] TODO: Debug alpha channel values
82
97
83
98
## TODOs
84
99
85
100
-[ ] TODO: Apply more of the optimization techniques used by similar shaders, including packing the data into a texture and bit reduction during computation.
86
101
-[ ] TODO: Think of ways to implement a backward pass. Discussion is welcome!
87
102
-[ ] TODO: Compute covariance from scaling and rotation in the shader, currently it's on the CUDA (PyTorch) side.
88
103
-[ ] TODO: Compute SH in the shader, currently it's on the CUDA (PyTorch) side.
104
+
-[ ] TODO: Try to align the rendering results at the pixel level; small deviations currently exist.
105
+
-[ ] TODO: Use indexed draw calls to minimize data passing and shuffling.
106
+
-[ ] TODO: Do incremental sorting based on viewport change; currently it's a full re-sort on the CUDA (PyTorch) side.
107
+
108
+
## Implementation
109
+
110
+
**Goal:**
111
+
112
+
- Let the professionals do the work.
113
+
- Let the GPU do the large-scale sorting.
114
+
- Let the graphics pipeline do the rasterization for us, not the other way around.
115
+
- Let OpenGL directly write to your framebuffer.
116
+
- Minimize repeated work.
117
+
  - Compute the 3D-to-2D covariance projection only once for each Gaussian, instead of 4 times for the quad, enabled by the geometry shader.
118
+
- Minimize stalls (minimize explicit synchronizations between GPU and CPU).
119
+
  - Enabled by using `non_blocking=True` data passing and moving sync points as early as possible.
120
+
  - Boosted by the fact that we're sorting on the GPU, so there is no need for synchronized host-to-device copies.
121
+
122
+
-[ ] TODO: Expand implementation details.
89
123
90
124
## Environment
91
125
92
126
This project requires you to have an NVIDIA GPU with the ability to interop between CUDA and OpenGL.
93
127
Thus, WSL is [not supported](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#features-not-yet-supported), and neither is macOS (OS X).
128
+
Tested on Linux and Windows.
94
129
95
130
For offline rendering (the drop-in replacement of the original CUDA rasterizer), we also need a valid EGL environment.
96
131
It can sometimes be hard to set up for virtualized machines. [Potential fix](https://github.com/zju3dv/4K4D/issues/27#issuecomment-2026747401).
97
132
98
-
-[ ] TODO: Test on more platforms.
99
-
100
133
## Credits
101
134
102
135
Inspired by those insanely fast WebGL-based 3DGS viewers: