Voxy achieves render distances of 1-2km+ while FarHorizons is currently limited to ~256 blocks (16 chunks). The fundamental difference: Voxy doesn't render distant terrain at full detail - it uses a sophisticated LOD system that FarHorizons lacks entirely.
The Core Idea: Instead of rendering every block at full detail, voxy uses 5 LOD levels (0-4) where distant terrain is progressively simplified.
| Level | Resolution | Coverage |
|---|---|---|
| 0 | 1 block per voxel | 32×32×32 blocks |
| 1 | 2 blocks per voxel | 64×64×64 blocks |
| 2 | 4 blocks per voxel | 128×128×128 blocks |
| 3 | 8 blocks per voxel | 256×256×256 blocks |
| 4 | 16 blocks per voxel | 512×512×512 blocks |
Key Files:
WorldEngine.java- MAX_LOD_LAYER = 4, section ID encodingWorldSection.java- 32×32×32 voxel sections withnonEmptyChildrentrackingNodeManager.java- Hierarchical octree management
The Magic: The GPU decides what's visible, not the CPU. This is huge for performance.
Technique Stack:
- HiZ (Hierarchical Z-Buffer): Builds mipmap pyramid from depth buffer for fast depth tests
- Frustum Culling: Standard 6-plane AABB tests in compute shader
- Screen-Space Size Culling: Nodes below a pixel threshold aren't subdivided further
- Rasterization-Based Visibility: Renders AABBs to visibility buffer, then generates draw commands
Key Files:
HierarchicalOcclusionTraverser.java- Orchestrates GPU cullingHiZBuffer.java- Hierarchical depth buffertraversal_dev.comp- Core traversal compute shadercmdgen.comp- Indirect draw command generation
Why It's Fast: Single draw call renders potentially hundreds of thousands of sections.
CPU doesn't know what to render → GPU decides visibility → GPU generates draw commands → GPU renders
- Uses
glMultiDrawElementsIndirectCountARB - Up to 400K opaque + 100K translucent draw commands
- No CPU-GPU round-trips for visibility decisions
Key File: MDICSectionRenderer.java
Block Storage: 64-bit packed format per block
Light (8 bits) | Block ID (20 bits) | Biome ID (9 bits) | Reserved (27 bits)
Quad Storage: 8 bytes per quad containing position, size, face, block state, biome, light
Compression Pipeline:
- Palette encoding - Only stores unique blocks/biomes per section
- Bit-packing - Dynamic bit width (1-12 bits per entry)
- Light optimization - Single byte if all lights identical
- ZSTD/LZ4 - Optional external compression layer
Key Files:
SaveLoadSystem2.java- Palette + bitpacking serializationMapper.java- Block/biome ID mappingZSTDCompressor.java,LZ4Compressor.java
Thread Distribution:
| Thread | Purpose |
|---|---|
| Main/Render | Frame rendering, GPU uploads |
| AsyncNodeManager | Dedicated thread for node lifecycle |
| 10 Worker Threads | Mesh generation (RenderGenerationService) |
| Model Baking | Block model processing |
| Storage I/O | Database read/write |
Key Sync Mechanism: Lock-free VarHandle operations between AsyncNodeManager and render thread
Key Files:
AsyncNodeManager.java- Separate thread for node managementRenderGenerationService.java- 10-worker priority queueServiceManager.java- Weighted job distribution
GPU Memory:
AllocationArena- Custom allocator with coalescing, RB-tree tracking- 256MB-4GB geometry buffer (auto-sized to GPU)
- Sparse buffer fallback for low-memory situations
CPU Memory:
- 3-tier caching: Primary sharded (64 shards), LRU (1024-2048), Array pool (400)
- Section array reuse eliminates GC pressure
- StampedLock for high-concurrency reads
Key Files:
AllocationArena.java- GPU memory allocatorActiveSectionTracker.java- Multi-tier section cachingBufferArena.java- SSBO management
Technique: ScanMesher2D aggregates adjacent same-material faces into larger quads
Result: Dramatically fewer quads to render. A flat wall becomes 1 quad instead of hundreds.
Key File: RenderDataFactory.java
VoxyRenderSystem
├── ModelBakerySubsystem (block model caching)
├── RenderGenerationService (10 mesh generation workers)
├── AsyncNodeManager (dedicated thread)
│ ├── NodeManager (hierarchical octree)
│ ├── BasicAsyncGeometryManager (GPU uploads)
│ └── GeometryCache (mesh deduplication)
├── HierarchicalOcclusionTraverser (GPU culling)
├── RenderDistanceTracker (spatial load management)
├── MDICSectionRenderer (indirect rendering)
└── WorldEngine
├── SectionStorage (RocksDB/LMDB)
├── Mapper (block/biome IDs)
└── ActiveSectionTracker (LRU cache)
| Standard Minecraft | Voxy |
|---|---|
| CPU decides visibility | GPU decides visibility |
| Fixed detail everywhere | 5 LOD levels |
| One draw call per chunk | Single MDIC draw call |
| Vanilla chunk format | 64-bit packed + palette compression |
| Single-threaded meshing | 10-thread priority queue |
| No distant terrain | Renders 1-2km+ with LOD |
- prep.comp - Reset counters, initialize buffers
- traversal_dev.comp - Hierarchical tree traversal with frustum/HiZ culling
- raster.vert/frag - AABB rasterization for visibility buffer
- cmdgen.comp - Generate indirect draw commands from visibility
- quads3.vert + quads.frag - Actual rendering with model atlas sampling
sectionRenderDistance- How many sections to render (default 16)subDivisionSize- Screen-space size threshold (28-256, default 64)serviceThreads- Worker thread count (auto: CPU cores / 1.5)- Dynamic FPS targeting (55-65 FPS) with adaptive subdivision
- LOD = Don't render what you can't see in detail
- GPU culling = Don't let CPU be the bottleneck
- MDIC = One draw call to rule them all
- Compression = Fit more in memory
- Async = Never stall the render thread
- Greedy meshing = Fewer triangles
This is a highly sophisticated system that pushes modern GPU capabilities to their limits while maintaining efficient CPU-side data management.
| Aspect | FarHorizons Implementation |
|---|---|
| Language/API | Zig + Vulkan |
| Chunk Size | 16×16×16 blocks |
| Block Storage | 16-bit packed (8-bit ID + 8-bit state) = 8KB/chunk |
| View Distance | 16 chunks horizontal (~256 blocks) |
| LOD System | ❌ None |
| GPU Culling | ❌ None (relies on Z-buffer) |
| Frustum Culling | ❌ None detected |
| Indirect Rendering | ✅ Yes (vkCmdDrawIndexedIndirect) |
| Multi-threading | ✅ C2ME-style 2-phase loading |
| Memory Management | ✅ Growable buffer arenas |
- C2ME-Style Two-Phase Loading - Terrain gen and meshing decoupled
- Growable Buffer Arenas - Async expansion, no frame stalls
- Ring Buffer Chunk Storage - O(1) access, no hash collisions
- Indirect Draw Batching - Groups by layer/arena
- Per-Layer Rendering - Solid/cutout/translucent separation
- Worker Buffer Reuse - 3.6MB per worker, eliminates GC
- Staging Ring Buffer - Efficient GPU uploads
- Ambient Occlusion - Per-vertex smooth lighting
- Occlusion Cache - LRU cache for face culling tests
Voxy: 5 LOD levels (0-4), distant terrain at 16:1 simplification FarHorizons: ❌ All chunks rendered at full detail
Impact: This is THE major bottleneck. Without LOD, you cannot render far terrain - the geometry count explodes.
What Voxy Does:
- Level 0: 32×32×32 at 1:1 (near)
- Level 4: 32×32×32 representing 512×512×512 blocks (far)
WorldSectionwithnonEmptyChildrentracking for sparse storage- Hierarchical octree structure managed by
NodeManager
Voxy: Compute shader traverses octree, GPU decides visibility FarHorizons: ❌ No frustum culling, no occlusion culling
Impact: Without GPU culling, CPU must decide what to render. This doesn't scale.
What Voxy Does:
HiZBuffer- Hierarchical depth buffer for fast occlusion teststraversal_dev.comp- GPU compute shader traverses LOD tree- Screen-space size culling - nodes below pixel threshold skipped
- Frustum culling in shader with 6-plane AABB tests
Voxy: GPU generates draw commands, single MDIC call renders everything
FarHorizons:
Impact: FarHorizons batches draws but CPU still iterates chunks. Voxy's GPU generates commands.
What Voxy Does:
cmdgen.comp- GPU generatesDrawIndexedIndirectcommandsglMultiDrawElementsIndirectCountARB- GPU knows count- Up to 500K draw commands from single CPU call
Voxy: Multi-level sections with sparse storage FarHorizons: ❌ Flat chunk storage, no hierarchy
Impact: Can't represent distant terrain compactly without hierarchy.
What Voxy Does:
WorldEnginewith 5 LOD layersWorldSection- 32³ voxels at each levelnonEmptyChildrenbyte for sparse traversal- Section ID packs level + position into 64-bit key
Voxy: GPU-side frustum test in traversal shader FarHorizons: ❌ All ready chunks rendered
Impact: Rendering chunks behind camera wastes GPU time.
Voxy: ScanMesher2D merges adjacent faces into larger quads
FarHorizons:
Impact: A flat wall is many quads instead of one. More triangles = slower.
Voxy: 8 bytes per quad (packed position, size, face, block, biome, light) FarHorizons: 36 bytes per vertex × 4 vertices per face = 144 bytes per quad
Impact: ~18× more VRAM per quad. Limits how much geometry fits in memory.
Voxy: Palette encoding + bit-packing + ZSTD/LZ4 FarHorizons: Fixed 16-bit per block, no compression
Impact: For storage/streaming, voxy can compress 50-80%.
Voxy: Dedicated thread for LOD tree management with lock-free sync FarHorizons: Main thread handles chunk state
Impact: Render thread never stalls on node operations in voxy.
| Feature | Voxy | FarHorizons |
|---|---|---|
| Multi-threaded meshing | 10 workers | Configurable workers |
| Async GPU uploads | Persistent mapped buffers | Staging ring buffer |
| Per-layer rendering | 3 layers | 3 layers (solid/cutout/translucent) |
| Indirect draw calls | ✅ MDIC | ✅ Indirect indexed |
| Buffer arena management | AllocationArena | GrowableBufferArena |
| Block model system | Runtime baking | JSON-based model loading |
| Texture arrays | Bindless-like | Texture array indexing |
- Implement LOD System - Multi-level world sections (most critical)
- Add Frustum Culling - CPU-side first, then GPU
- Hierarchical Storage - Octree or level-based sections
- GPU Occlusion Culling - HiZ buffer + compute shader
- GPU Command Generation - Move draw decision to GPU
- MDIC Rendering -
vkCmdDrawIndexedIndirectCount
- Greedy Meshing - Merge adjacent same-material faces
- Geometry Compression - Pack quads into fewer bytes
- Block Data Compression - Palette + bit-packing for storage
| Technique | Voxy | FarHorizons | Priority |
|---|---|---|---|
| LOD System | ✅ 5 levels | ❌ None | 🔴 Critical |
| GPU Culling | ✅ HiZ + Compute | ❌ None | 🔴 Critical |
| Frustum Culling | ✅ GPU | ❌ None | 🔴 Critical |
| Hierarchical Storage | ✅ Octree-like | ❌ Flat | 🔴 Critical |
| MDIC Rendering | ✅ GPU count | 🟡 Important | |
| Greedy Meshing | ✅ ScanMesher2D | ❌ Per-face | 🟡 Important |
| Geometry Compression | ✅ 8 bytes/quad | ❌ 144 bytes/quad | 🟡 Important |
| Block Compression | ✅ Palette+ZSTD | ❌ Fixed 16-bit | 🟢 Nice-to-have |
| Multi-thread Meshing | ✅ | ✅ | ✅ Done |
| Async GPU Upload | ✅ | ✅ | ✅ Done |
| Per-layer Render | ✅ | ✅ | ✅ Done |
| Indirect Draw | ✅ | ✅ | ✅ Done |
FarHorizons can't achieve high render distances without an LOD system. Everything else is optimization. The core architectural difference is:
- Voxy: "Render distant terrain at lower detail"
- FarHorizons: "Render all terrain at full detail" (limits max distance)
To match voxy's capabilities, FarHorizons needs:
- Hierarchical LOD world representation
- GPU-driven visibility determination
- Efficient distant terrain mesh generation
The good news: FarHorizons already has solid foundations (Vulkan, indirect draw, multi-threading, buffer management) that will support these additions.
Voxy's LOD is NOT about simplifying geometry at distance. It's about:
- Hierarchical spatial organization - World divided into multi-level sections
- GPU-driven traversal - Compute shader decides what to render
- Screen-space LOD selection - If a node covers fewer pixels than threshold, don't subdivide
Every LOD level uses the same 32×32×32 section structure:
WorldSection {
long[] data; // 32,768 blocks (32³)
byte nonEmptyChildren; // 8-bit mask: which octants have data
int atomicState; // Bit 0 = loaded, bits 1-30 = ref count
}
Index calculation:
index = (y << 10) | (z << 5) | x
// x: bits 0-4, z: bits 5-9, y: bits 10-14
NonEmptyChildren bitmask (for sparse traversal):
Bit 0: child at (x&1=0, y&1=0, z&1=0)
Bit 1: child at (x&1=1, y&1=0, z&1=0)
Bit 2: child at (x&1=0, y&1=0, z&1=1)
...
Bit 7: child at (x&1=1, y&1=1, z&1=1)
Bits 60-63 (4 bits): LOD Level (0-4)
Bits 52-59 (8 bits): Y coordinate
Bits 28-51 (24 bits): Z coordinate
Bits 4-27 (24 bits): X coordinate
Bits 0-3 (4 bits): Reserved
Decoding:
level = (id >> 60) & 0xF
y = (id >> 52) & 0xFF
z = (id >> 28) & 0xFFFFFF
x = (id >> 4) & 0xFFFFFF| Level | Section Size | World Coverage | Block Resolution |
|---|---|---|---|
| 0 | 32×32×32 | 32×32×32 blocks | 1:1 (full detail) |
| 1 | 32×32×32 | 64×64×64 blocks | 2:1 |
| 2 | 32×32×32 | 128×128×128 blocks | 4:1 |
| 3 | 32×32×32 | 256×256×256 blocks | 8:1 |
| 4 | 32×32×32 | 512×512×512 blocks | 16:1 |
Key insight: Every level stores 32³ voxels. Higher levels just represent larger world areas.
World coordinate calculation:
worldBlockX = sectionX * 32 * (1 << level)
worldBlockY = sectionY * 32 * (1 << level)
worldBlockZ = sectionZ * 32 * (1 << level)
Each node is 16 bytes (4 × uint32):
node[0]: Position (packed section ID)
node[1]:
- bits 0-23: Geometry pointer (mesh ID)
- bits 24-47: Child pointer (first child index)
- bits 48-55: Child existence mask (8 bits)
- bits 56-58: Child count - 1 (3 bits, 1-8 children)
- bit 59: Geometry in-flight flag
- bit 63: Request in-flight flag
node[2]: Request ID + flags
node[3]: Reserved
Parent → Children:
parentLevel = N
childLevel = N - 1
For each bit i in nonEmptyChildren (0-7):
childX = (parentX << 1) | (i & 1)
childY = (parentY << 1) | ((i >> 2) & 1)
childZ = (parentZ << 1) | ((i >> 1) & 1)
Child → Parent:
parentX = childX >> 1
parentY = childY >> 1
parentZ = childZ >> 1
parentLevel = childLevel + 1
void traverse(UnpackedNode node) {
// 1. Compute screen-space projection
setupScreenspace(node);
// 2. Early culling
if (outsideFrustum() || isCulledByHiz()) {
return; // DISCARD
}
// 3. LOD decision
if (node.lodLevel != 0 && shouldDescend()) {
// Need more detail
if (hasChildren(node)) {
enqueueChildren(node); // Process children next iteration
} else {
addRequest(node); // Request children to be loaded
enqueueSelfForRender(node); // Render self as fallback
}
} else {
// Current detail is enough
if (hasMesh(node)) {
enqueueSelfForRender(node);
} else {
addRequest(node);
if (node.lodLevel != 0) {
enqueueChildren(node);
}
}
}
}void setupScreenspace(UnpackedNode node) {
// World position
vec3 worldPos = vec3((node.pos << node.lodLevel) - camPos) * 32.0;
float nodeSize = 32.0 * (1 << node.lodLevel);
// Project 8 AABB corners
vec4 corners[8];
corners[0] = VP * vec4(worldPos, 1);
corners[1] = VP * vec4(worldPos + vec3(nodeSize, 0, 0), 1);
// ... all 8 corners
// Perspective divide to NDC
for (int i = 0; i < 8; i++) {
corners[i].xyz /= corners[i].w;
corners[i].xyz = corners[i].xyz * 0.5 + 0.5; // [0,1] range
}
// Compute 2D projected area using cross products
screenSize = computeProjectedArea(corners);
// Store min/max for HiZ culling
minBB = min(all corners);
maxBB = max(all corners);
}uniform float minSSS; // Minimum screen-space size threshold
bool shouldDescend() {
return screenSize > minSSS;
}minSSS calculation (CPU-side):
float subDivisionSize = 64; // pixels (configurable 28-256)
float minSSS = (subDivisionSize * subDivisionSize) / (viewportWidth * viewportHeight);Auto-tuning based on FPS:
if (fps < 55) {
subDivisionSize = min(subDivisionSize + 2, 256); // Coarser LOD
}
if (fps > 65) {
subDivisionSize = max(subDivisionSize - 1, 28); // Finer LOD
}bool isCulledByHiz() {
// Compute mip level based on screen coverage
vec2 screenExtent = (maxBB.xy - minBB.xy) * hizSize;
float mipLevel = floor(log2(max(screenExtent.x, screenExtent.y))) - 1;
mipLevel = clamp(mipLevel, 0, maxMipLevel);
// Sample HiZ at that mip level
float maxDepth = texelFetch(hizSampler, screenCoord, int(mipLevel)).r;
// Cull if all sampled depths are closer than node's minimum depth
return maxDepth <= minBB.z;
}The traversal runs in layers, one compute dispatch per LOD level:
Iteration 0: Process top-level nodes (LOD 4)
└─ Enqueue visible children to scratch buffer A
Iteration 1: Process LOD 3 nodes from buffer A
└─ Enqueue visible children to scratch buffer B
Iteration 2: Process LOD 2 nodes from buffer B
└─ Enqueue visible children to scratch buffer A (flip-flop)
... up to MAX_ITERATIONS (16)
Queue metadata (per iteration):
uvec4 {
.x = dispatch width (workgroups)
.y = 1
.z = 1
.w = queue size (node count)
}
When the GPU needs children that aren't loaded:
- GPU adds request:
requestQueue[atomicAdd(requestCount, 1)] = nodePosition - CPU reads request buffer after frame
- AsyncNodeManager thread processes requests:
- Load/generate child sections from storage
- Generate meshes for children
- Update node's child pointer
- Next frame: Children available for traversal
Rate limiting:
// Limit requests based on mesh generation queue depth
double fillness = (4000 - meshGenQueue.size()) / 4000.0;
int maxRequests = (int)(fillness * fillness * 50); // 0-50 requestsKey insight: Voxy uses the SAME mesher for ALL LOD levels. There's no geometry simplification.
The "LOD" comes from:
- Spatial aggregation: Level 2 section represents 4× the world area
- Rendering selection: GPU decides which level to render based on screen size
- Priority: Near sections (LOD 0) mesh first, distant (LOD 4) mesh later
Greedy meshing (ScanMesher2D):
- Processes 32×32 layers
- Merges adjacent same-material blocks horizontally (up to 16 wide)
- Merges identical rows vertically (up to 16 tall)
- Result: 1 quad instead of up to 256 per-block quads
-
Multi-level section storage
- Same 16×16×16 or 32×32×32 structure at each level
- Section ID encoding with level + coordinates
nonEmptyChildrenbitmask for sparse traversal
-
Hierarchical node tree
- Nodes track: position, mesh pointer, child pointer, child existence
- Parent-child coordinate relationships (shift + mask)
-
Screen-space LOD selection
- Project node AABB to screen
- Compute projected area
- Compare to threshold (
shouldDescend = screenSize > minSSS)
-
Iterative traversal
- Process one LOD level per iteration
- Flip-flop between queue buffers
- Enqueue children or render self
-
Request system
- GPU outputs requests for missing children
- CPU processes requests async
- Children available next frame
- HiZ culling - Build depth pyramid, cull occluded nodes
- Frustum culling - 6-plane AABB test
- Adaptive threshold - Auto-tune
minSSSbased on FPS - Greedy meshing - Reduce quad count dramatically
| Voxy (OpenGL) | FarHorizons (Vulkan) | Purpose |
|---|---|---|
GL_SHADER_STORAGE_BUFFER (SSBO) |
VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
Node data, queues, requests |
GL_DRAW_INDIRECT_BUFFER |
VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT |
Indirect draw commands |
GL_UNIFORM_BUFFER |
VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT |
Per-frame constants |
| Persistent mapped buffer | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT + vkMapMemory |
CPU→GPU uploads |
// Bind SSBOs
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, nodeBuffer);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, queueSourceBuffer);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, queueSinkBuffer);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, renderQueueBuffer);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 4, requestQueueBuffer);
// Dispatch
glDispatchCompute(workgroupsX, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);// Descriptor set layout
const bindings = [_]vk.DescriptorSetLayoutBinding{
.{ .binding = 0, .descriptor_type = .storage_buffer, ... }, // nodes
.{ .binding = 1, .descriptor_type = .storage_buffer, ... }, // queue source
.{ .binding = 2, .descriptor_type = .storage_buffer, ... }, // queue sink
.{ .binding =3, .descriptor_type = .storage_buffer, ... }, // render queue
.{ .binding = 4, .descriptor_type = .storage_buffer, ... }, // request queue
};
// Record commands
vk.cmdBindPipeline(cmd, .compute, traversal_pipeline);
vk.cmdBindDescriptorSets(cmd, .compute, layout, 0, &.{descriptor_set}, &.{});
vk.cmdDispatch(cmd, workgroups_x, 1, 1);
// Memory barrier
const barrier = vk.MemoryBarrier{
.src_access_mask = .{ .shader_write_bit = true },
.dst_access_mask = .{ .shader_read_bit = true, .indirect_command_read_bit = true },
};
vk.cmdPipelineBarrier(cmd, .compute_shader_bit, .compute_shader_bit | .draw_indirect_bit, ...);// Queue metadata buffer contains dispatch parameters
// struct { uvec4 metadata[16]; } where .xyz = dispatch size
glDispatchComputeIndirect(iteration * 16); // Offset into metadata buffer// VkDispatchIndirectCommand struct: { u32 x, y, z }
// Store at offset = iteration * sizeof(VkDispatchIndirectCommand)
vk.cmdDispatchIndirect(cmd, metadata_buffer, iteration * 12);Buffer layout:
Offset 0: { workgroups_x, 1, 1 } // Iteration 0
Offset 12: { workgroups_x, 1, 1 } // Iteration 1
Offset 24: { workgroups_x, 1, 1 } // Iteration 2
...
for (int iter = 0; iter < MAX_ITERATIONS; iter++) {
// Alternate between buffers A and B
glBindBufferBase(SSBO, SOURCE_BINDING, (iter & 1) == 0 ? bufferA : bufferB);
glBindBufferBase(SSBO, SINK_BINDING, (iter & 1) == 0 ? bufferB : bufferA);
glDispatchComputeIndirect(iter * 16);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
}// Option 1: Multiple descriptor sets (preferred)
const desc_sets = [2]vk.DescriptorSet{ desc_set_a_to_b, desc_set_b_to_a };
for (0..MAX_ITERATIONS) |iter| {
vk.cmdBindDescriptorSets(cmd, .compute, layout, 0, &.{desc_sets[iter & 1]}, &.{});
vk.cmdDispatchIndirect(cmd, metadata_buffer, iter * 12);
// Barrier between iterations
vk.cmdPipelineBarrier(cmd, .compute_shader_bit, .compute_shader_bit | .draw_indirect_bit, ...);
}
// Option 2: Dynamic offsets (if using single descriptor set)
// Use push constants or dynamic buffer offsets to swap source/sink// Generate mipmap pyramid from depth buffer
glBindTexture(GL_TEXTURE_2D, hizTexture);
glGenerateMipmap(GL_TEXTURE_2D);
// Or use compute shader for max reduction// Vulkan doesn't have glGenerateMipmap - must do manually
// Option 1: Blit chain (simpler but uses max filter)
for (1..mip_levels) |mip| {
const src_region = vk.ImageBlit{ .src_subresource = .{ .mip_level = mip - 1, ... }, ... };
const dst_region = vk.ImageBlit{ .dst_subresource = .{ .mip_level = mip, ... }, ... };
vk.cmdBlitImage(cmd, depth_image, .transfer_src_optimal,
hiz_image, .transfer_dst_optimal, &.{blit}, .nearest);
// Transition mip for next iteration
}
// Option 2: Compute shader (better - proper max reduction)
// hiz_downsample.comp:
// Read 2x2 texels from mip N
// Write max(4 texels) to mip N+1Compute shader for HiZ:
// hiz_downsample.comp
layout(set = 0, binding = 0) uniform sampler2D srcMip;
layout(set = 0, binding = 1, r32f) uniform writeonly image2D dstMip;
layout(local_size_x = 8, local_size_y = 8) in;
void main() {
ivec2 dstCoord = ivec2(gl_GlobalInvocationID.xy);
ivec2 srcCoord = dstCoord * 2;
float d00 = texelFetch(srcMip, srcCoord + ivec2(0,0), 0).r;
float d10 = texelFetch(srcMip, srcCoord + ivec2(1,0), 0).r;
float d01 = texelFetch(srcMip, srcCoord + ivec2(0,1), 0).r;
float d11 = texelFetch(srcMip, srcCoord + ivec2(1,1), 0).r;
// For reversed-Z: use min. For normal Z: use max.
float maxDepth = max(max(d00, d10), max(d01, d11));
imageStore(dstMip, dstCoord, vec4(maxDepth));
}// GPU writes draw commands to buffer
// GPU also writes count to separate location
glMultiDrawElementsIndirectCountARB(
GL_TRIANGLES,
GL_UNSIGNED_INT,
drawCommandBuffer, // Commands
countOffset, // Offset to count
maxDrawCount, // Upper limit
sizeof(DrawCommand) // Stride
);// Requires VK_KHR_draw_indirect_count extension (core in Vulkan 1.2)
// Check feature support
if (physical_device_features.draw_indirect_count) {
vk.cmdDrawIndexedIndirectCount(
cmd,
draw_command_buffer,
0, // Command buffer offset
count_buffer,
count_offset, // Offset to u32 count
max_draw_count,
@sizeOf(VkDrawIndexedIndirectCommand) // 20 bytes
);
} else {
// Fallback: read count on CPU (slower)
// Or use indirect dispatch + compute to filter
}VkDrawIndexedIndirectCommand structure:
const VkDrawIndexedIndirectCommand = extern struct {
index_count: u32,
instance_count: u32,
first_index: u32,
vertex_offset: i32,
first_instance: u32,
};// Implicit synchronization with barriers
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_COMMAND_BARRIER_BIT);// Explicit pipeline barriers required
// After compute writes to render queue:
const compute_to_draw_barrier = vk.BufferMemoryBarrier{
.src_access_mask = .{ .shader_write_bit = true },
.dst_access_mask = .{ .indirect_command_read_bit = true, .vertex_attribute_read_bit = true },
.buffer = render_queue_buffer,
.offset = 0,
.size = vk.WHOLE_SIZE,
};
vk.cmdPipelineBarrier(
cmd,
.{ .compute_shader_bit = true }, // srcStageMask
.{ .draw_indirect_bit = true, .vertex_input_bit = true }, // dstStageMask
.{}, // dependency flags
&.{}, // memory barriers
&.{compute_to_draw_barrier}, // buffer barriers
&.{}, // image barriers
);// Map buffer to read GPU-generated requests
ByteBuffer mapped = glMapBufferRange(GL_SHADER_STORAGE_BUFFER,
0, requestQueueSize, GL_MAP_READ_BIT);
int count = mapped.getInt(0);
for (int i = 0; i < count; i++) {
long position = mapped.getLong(8 + i * 8);
processRequest(position);
}
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);// Option 1: Staging buffer + transfer (non-blocking)
vk.cmdCopyBuffer(cmd, request_buffer, staging_buffer, &.{copy_region});
// Submit, wait for fence, then read from mapped staging buffer
// Option 2: HOST_VISIBLE request buffer (simpler but may be slower)
const mapped = @ptrCast([*]u32, vk.mapMemory(device, request_memory, 0, size, .{}));
const count = mapped[0];
for (0..count) |i| {
const position = @as(u64, mapped[2 + i*2]) << 32 | mapped[2 + i*2 + 1];
processRequest(position);
}
vk.unmapMemory(device, request_memory);Frame N:
┌─────────────────────────────────────────────────────────────────┐
│ 1. DEPTH PREPASS (render near chunks, write depth) │
│ └─ Output: depth buffer │
├─────────────────────────────────────────────────────────────────┤
│ 2. HIZ GENERATION (compute shader chain) │
│ └─ Input: depth buffer │
│ └─ Output: HiZ mipmap pyramid │
├─────────────────────────────────────────────────────────────────┤
│ 3. LOD TRAVERSAL (compute shader, N iterations) │
│ └─ Input: node buffer, HiZ, camera matrices │
│ └─ Output: render queue, request queue │
│ └─ Barrier: compute → indirect + vertex │
├─────────────────────────────────────────────────────────────────┤
│ 4. LOD RENDERING (graphics, indirect count) │
│ └─ Input: render queue, geometry buffers │
│ └─ vkCmdDrawIndexedIndirectCount(...) │
├─────────────────────────────────────────────────────────────────┤
│ 5. REQUEST READBACK (async, for next frame) │
│ └─ Copy request queue → staging │
│ └─ CPU processes requests on separate thread │
└─────────────────────────────────────────────────────────────────┘
Recommended: Use 2 descriptor sets for queue flip-flop:
- Set A: source=queueA, sink=queueB
- Set B: source=queueB, sink=queueA
Bind alternating sets each iteration.
const TraversalPushConstants = extern struct {
iteration: u32,
minSSS: f32,
frameId: u32,
_padding: u32,
};
vk.cmdPushConstants(cmd, layout, .{ .compute_bit = true }, 0, @sizeOf(TraversalPushConstants), &push_constants);Vulkan 1.1+ supports subgroup operations for efficient reductions:
// In traversal shader
uint visibleCount = subgroupAdd(isVisible ? 1 : 0);
if (subgroupElect()) {
atomicAdd(renderQueueCount, visibleCount);
}// Frame N compute submits with timeline semaphore value N
// Async thread waits for value N, processes requests
// Frame N+1 can use new nodes
const wait_info = vk.SemaphoreWaitInfo{
.semaphore_count = 1,
.p_semaphores = &timeline_semaphore,
.p_values = &@as(u64, frame_number),
};
vk.waitSemaphores(device, &wait_info, timeout);Here's a minimal skeleton for the traversal system in Zig:
const LodTraverser = struct {
device: vk.Device,
// Pipelines
traversal_pipeline: vk.Pipeline,
hiz_pipeline: vk.Pipeline,
// Buffers
node_buffer: GpuBuffer,
queue_a: GpuBuffer,
queue_b: GpuBuffer,
render_queue: GpuBuffer,
request_queue: GpuBuffer,
dispatch_metadata: GpuBuffer,
// Descriptor sets (for flip-flop)
desc_sets: [2]vk.DescriptorSet,
// HiZ
hiz_image: vk.Image,
hiz_views: []vk.ImageView, // One per mip level
const MAX_ITERATIONS = 5;
pub fn recordTraversal(self: *LodTraverser, cmd: vk.CommandBuffer, viewport: Viewport) void {
// 1. Generate HiZ
self.recordHizGeneration(cmd);
// 2. Reset dispatch metadata
vk.cmdFillBuffer(cmd, self.dispatch_metadata.handle, 0, @sizeOf(u32) * 4 * MAX_ITERATIONS, 0);
// Set initial dispatch size for iteration 0
// ...
// 3. Traversal iterations
vk.cmdBindPipeline(cmd, .compute, self.traversal_pipeline);
for (0..MAX_ITERATIONS) |iter| {
// Bind appropriate descriptor set for flip-flop
vk.cmdBindDescriptorSets(cmd, .compute, self.pipeline_layout, 0,
&.{self.desc_sets[iter & 1]}, &.{});
// Push iteration-specific constants
const push = TraversalPushConstants{
.iteration = @intCast(iter),
.minSSS = viewport.minScreenSpaceSize(),
.frameId = viewport.frameId,
};
vk.cmdPushConstants(cmd, self.pipeline_layout, .{ .compute_bit = true },
0, @sizeOf(TraversalPushConstants), std.mem.asBytes(&push));
// Indirect dispatch
vk.cmdDispatchIndirect(cmd, self.dispatch_metadata.handle, iter * 12);
// Barrier for next iteration
self.recordIterationBarrier(cmd);
}
// 4. Barrier before rendering
self.recordComputeToDrawBarrier(cmd);
}
pub fn recordLodRendering(self: *LodTraverser, cmd: vk.CommandBuffer,
geometry_buffer: GpuBuffer, index_buffer: GpuBuffer) void {
vk.cmdBindPipeline(cmd, .graphics, self.render_pipeline);
vk.cmdBindVertexBuffers(cmd, 0, &.{geometry_buffer.handle}, &.{0});
vk.cmdBindIndexBuffer(cmd, index_buffer.handle, 0, .uint32);
// Draw with GPU-determined count
vk.cmdDrawIndexedIndirectCount(
cmd,
self.render_queue.handle, // Draw commands
@sizeOf(u32), // Skip count at offset 0
self.render_queue.handle, // Count buffer (same buffer)
0, // Count at offset 0
MAX_DRAW_COMMANDS,
@sizeOf(vk.DrawIndexedIndirectCommand),
);
}
};This gives you a complete mapping from voxy's OpenGL approach to Vulkan patterns that will work with FarHorizons' existing architecture.
Block storage: 16 bits per block (8-bit ID + 8-bit state)
Chunk size: 16×16×16 = 4,096 blocks
Per chunk: 4,096 × 2 bytes = 8 KB
View distance 16 (diameter 33):
Horizontal: 33 × 33 = 1,089 columns
Vertical: 9 sections (assuming y range)
Total chunks: 1,089 × 9 = 9,801 chunks
Block data: 9,801 × 8 KB = 78.4 MB
Block storage: 64 bits per voxel (block ID + biome + light)
Section size: 32×32×32 = 32,768 voxels
Per section: 32,768 × 8 bytes = 256 KB
BUT: Voxy uses LOD, so not all sections are at full detail.
Typical loaded sections (estimated for 1km render distance):
Level 0 (near): ~500 sections × 256 KB = 128 MB
Level 1: ~200 sections × 256 KB = 51 MB
Level 2: ~100 sections × 256 KB = 26 MB
Level 3: ~50 sections × 256 KB = 13 MB
Level 4 (far): ~25 sections × 256 KB = 6 MB
Total block data: ~224 MB
Voxy also uses compression (ZSTD) for storage, reducing disk I/O.
Vertex format: 36 bytes
- pos: 3 × f32 = 12 bytes
- color: 3 × f32 = 12 bytes
- uv: 2 × f32 = 8 bytes
- tex_index: u32 = 4 bytes
Per quad: 4 vertices × 36 bytes = 144 bytes
Per face: ~144 bytes + 6 indices × 4 bytes = 168 bytes
Estimated geometry per chunk (average terrain):
~1,000 visible faces × 168 bytes = 168 KB
Total GPU geometry (9,801 chunks):
9,801 × 168 KB = 1.6 GB (!!!)
FarHorizons config:
vertex_arena_size: 256 MB (default)
index_arena_size: 128 MB (default)
Total configured: 384 MB
Quad format: 8 bytes (packed)
- position (5+5+5 bits): 15 bits
- size (4+4 bits): 8 bits
- face direction: 3 bits
- block state ID: 16 bits
- biome ID: 9 bits
- light: 8 bits
Total: 64 bits = 8 bytes per quad
Greedy meshing reduces quad count by ~10-50×
Average terrain: ~50-200 quads per section (vs 1,000+ faces per chunk)
Estimated geometry per section:
~100 quads × 8 bytes = 800 bytes
Total GPU geometry (estimated 875 sections):
875 × 800 bytes = 700 KB (!!!)
Plus metadata per section: 32 bytes × 875 = 28 KB
Voxy geometry buffer config: 256 MB - 4 GB (configurable)
Actual usage: typically < 100 MB for massive render distances
Node storage: 16 bytes per node
- position: 8 bytes
- mesh ptr + child ptr + flags: 8 bytes
Max nodes: 2^21 = 2 million nodes
Total node buffer: 2M × 16 bytes = 32 MB
Typical active nodes: ~10,000
Active node memory: 10K × 16 bytes = 160 KB
Queue buffers (×2 for flip-flop):
MAX_QUEUE_SIZE = 200,000 nodes
Per queue: 200K × 4 bytes = 800 KB
Both queues: 1.6 MB
Render queue:
MAX_RENDER_QUEUE = 200,000 entries
Per entry: 4 bytes (mesh ID)
Total: 800 KB
Request queue:
MAX_REQUESTS = 50 entries
Per entry: 8 bytes (position)
Total: 400 bytes
Dispatch metadata:
16 iterations × 16 bytes = 256 bytes
Total traversal buffers: ~3.2 MB
| Component | FarHorizons | Voxy |
|---|---|---|
| Block data (CPU) | 78 MB | ~50 MB (fewer sections needed) |
| Geometry (GPU) | 1.6 GB | < 10 MB |
| Node data | N/A | 160 KB |
| Traversal buffers | N/A | 3.2 MB |
| Total GPU | ~1.6 GB | ~15 MB |
| Component | FarHorizons | Voxy |
|---|---|---|
| Block data (CPU) | Would require ~2 GB+ | ~224 MB |
| Geometry (GPU) | Would require ~20 GB+ | ~100 MB |
| Node data | N/A | 32 MB |
| Traversal buffers | N/A | 3.2 MB |
| Total GPU | Impossible | ~135 MB |
FarHorizons: 144 bytes per quad (4 × 36-byte vertices)
Voxy: 8 bytes per quad (packed format)
Ratio: 18×
Voxy achieves this by:
- No per-vertex position (computed from quad data in shader)
- No per-vertex color (computed from block state)
- No per-vertex UV (computed from quad position + texture atlas)
- Single 64-bit value encodes entire quad
FarHorizons: 1 quad per visible block face
Voxy: 1 quad per contiguous region of same material
Example - flat grass field 32×32:
FarHorizons: 1,024 quads (32×32 individual faces)
Voxy: 4 quads (32×32 merged into 4 × 16×16 quads due to max size limit)
Ratio: 256×
1km view distance:
Without LOD: Must render ~15,000 chunks at full detail
With LOD: Render ~500 chunks at full detail, rest at reduced resolution
Even with same vertex format, LOD alone provides 30× reduction.
Current: 144 bytes/quad
Packed: 8 bytes/quad
Savings: 18×
Result: 1.6 GB → ~90 MB for current view distance
Could potentially double view distance to 32 chunks (~512 blocks)
Current: ~1,000 faces/chunk
Greedy: ~100 faces/chunk
Savings: 10×
Result: 1.6 GB → ~160 MB
Modest improvement, still limited by chunk count
Current: 9,801 chunks at full detail
With LOD: ~500 chunks at full detail, ~400 at reduced
But without vertex optimization, each LOD section still uses
significant geometry memory. Savings: ~5-10×
Result: 1.6 GB → ~200-300 MB
Enables larger distances but still memory-heavy
Packed vertices (18×) × Greedy meshing (10×) × LOD (10×) = 1800×
Result: 1.6 GB → < 1 MB for same area
Enables 1km+ render distances with < 100 MB GPU memory
-
Packed Vertex Format - Highest impact, moderate effort
- Define 8-byte quad structure
- Modify vertex shader to unpack
- Immediate 18× GPU memory savings
-
Greedy Meshing - High impact, moderate effort
- Implement ScanMesher2D algorithm
- 10× quad count reduction
- Combined with #1: 180× savings
-
LOD System - Highest impact for distance, high effort
- Multi-level section storage
- Hierarchical traversal
- Enables unlimited render distance
For a modern gaming PC with 8GB VRAM:
Target GPU memory budget: 512 MB for terrain
With voxy-style optimization:
512 MB ÷ 8 bytes/quad ÷ 100 quads/section = 640,000 sections
At LOD 0: 640K sections × 32³ blocks = 21 trillion blocks
Practically: Multi-km render distances easily achievable
Without optimization (FarHorizons current):
512 MB ÷ 168 KB/chunk = 3,048 chunks
View distance: ~17 chunks (current limit)
Voxy packs all quad data into a single 64-bit value:
Bits 0-4: X position within section (0-31)
Bits 5-9: Y position within section (0-31)
Bits 10-14: Z position within section (0-31)
Bits 15-18: Width - 1 (0-15, so 1-16 blocks)
Bits 19-22: Height - 1 (0-15, so 1-16 blocks)
Bits 23-25: Face direction (0-5: -X, +X, -Y, +Y, -Z, +Z)
Bits 26-41: Block state ID (0-65535)
Bits 42-50: Biome ID (0-511)
Bits 51-58: Light level (0-255, packed sky+block)
Bits 59-63: Reserved/flags
#version 450
// ============ INPUTS ============
// Quad data stored in SSBO (8 bytes per quad)
layout(set = 0, binding = 0, std430) readonly buffer QuadData {
uvec2 quads[]; // Each quad is uvec2 (64 bits)
};
// Section metadata (position, etc.)
layout(set = 0, binding = 1, std430) readonly buffer SectionData {
ivec4 sections[]; // xyz = section world pos, w = LOD level
};
// Indirect draw provides section index via gl_DrawID (Vulkan 1.2+)
// Or use gl_InstanceIndex if batching sections
// ============ UNIFORMS ============
layout(set = 1, binding = 0) uniform CameraUBO {
mat4 viewProj;
vec3 cameraPos;
float time;
};
// Texture atlas info
layout(set = 1, binding = 1) uniform AtlasUBO {
vec2 atlasSize; // e.g., 256x256 tiles
vec2 tileSize; // e.g., 16x16 pixels per tile
};
// ============ OUTPUTS ============
layout(location = 0) out vec3 fragColor;
layout(location = 1) out vec2 fragUV;
layout(location = 2) flat out uint fragTexIndex;
layout(location = 3) out float fragAO;
// ============ CONSTANTS ============
// Face normals (indexed by face direction)
const vec3 FACE_NORMALS[6] = vec3[6](
vec3(-1, 0, 0), // -X
vec3( 1, 0, 0), // +X
vec3( 0,-1, 0), // -Y
vec3( 0, 1, 0), // +Y
vec3( 0, 0,-1), // -Z
vec3( 0, 0, 1) // +Z
);
// Face tangent/bitangent for UV generation
const vec3 FACE_TANGENT[6] = vec3[6](
vec3( 0, 0, 1), // -X: Z is U
vec3( 0, 0,-1), // +X: -Z is U
vec3( 1, 0, 0), // -Y: X is U
vec3( 1, 0, 0), // +Y: X is U
vec3(-1, 0, 0), // -Z: -X is U
vec3( 1, 0, 0) // +Z: X is U
);
const vec3 FACE_BITANGENT[6] = vec3[6](
vec3( 0, 1, 0), // -X: Y is V
vec3( 0, 1, 0), // +X: Y is V
vec3( 0, 0, 1), // -Y: Z is V
vec3( 0, 0,-1), // +Y: -Z is V
vec3( 0, 1, 0), // -Z: Y is V
vec3( 0, 1, 0) // +Z: Y is V
);
// Vertex positions within a quad (0-3)
// Triangles: 0-1-2, 2-3-0 (or use index buffer: 0,1,2,2,3,0)
const vec2 QUAD_VERTICES[4] = vec2[4](
vec2(0, 0), // Bottom-left
vec2(1, 0), // Bottom-right
vec2(1, 1), // Top-right
vec2(0, 1) // Top-left
);
// Minecraft-style face shading
const float FACE_SHADING[6] = float[6](
0.6, // -X (west)
0.6, // +X (east)
0.5, // -Y (bottom)
1.0, // +Y (top)
0.8, // -Z (north)
0.8 // +Z (south)
);
// ============ HELPER FUNCTIONS ============
// Unpack quad data from 64-bit uvec2
void unpackQuad(uvec2 packed, out vec3 pos, out vec2 size, out uint face,
out uint blockState, out uint biome, out uint light) {
uint low = packed.x;
uint high = packed.y;
// Position (bits 0-14 of low)
pos.x = float(low & 0x1Fu);
pos.y = float((low >> 5) & 0x1Fu);
pos.z = float((low >> 10) & 0x1Fu);
// Size (bits 15-22 of low)
size.x = float(((low >> 15) & 0xFu) + 1);
size.y = float(((low >> 19) & 0xFu) + 1);
// Face direction (bits 23-25 of low)
face = (low >> 23) & 0x7u;
// Block state (bits 26-31 of low + bits 0-9 of high)
blockState = ((low >> 26) & 0x3Fu) | ((high & 0x3FFu) << 6);
// Biome (bits 10-18 of high)
biome = (high >> 10) & 0x1FFu;
// Light (bits 19-26 of high)
light = (high >> 19) & 0xFFu;
}
// Compute ambient occlusion from light value
float computeAO(uint light, uint vertexIndex) {
// Simple: use light level directly
// Advanced: encode per-vertex AO in additional bits
float skyLight = float((light >> 4) & 0xFu) / 15.0;
float blockLight = float(light & 0xFu) / 15.0;
return max(skyLight, blockLight) * 0.5 + 0.5;
}
// Get texture index from block state (simplified - real impl uses lookup table)
uint getTextureIndex(uint blockState, uint face) {
// For a real implementation, use a SSBO lookup table:
// return textureIndices[blockState * 6 + face];
// Simplified: assume blockState IS the texture index
return blockState;
}
// ============ MAIN ============
void main() {
// Determine which quad and which vertex within quad
uint quadIndex = gl_VertexIndex / 4;
uint vertexInQuad = gl_VertexIndex % 4;
// Alternative: use index buffer with 6 indices per quad (0,1,2,2,3,0)
// uint quadIndex = gl_VertexIndex / 6;
// uint indexInQuad = gl_VertexIndex % 6;
// uint vertexInQuad = uint[6](0,1,2,2,3,0)[indexInQuad];
// Get section info (from gl_DrawID or gl_InstanceIndex)
ivec4 section = sections[gl_InstanceIndex];
vec3 sectionWorldPos = vec3(section.xyz) * 32.0; // Section origin in world
uint lodLevel = uint(section.w);
// Load and unpack quad data
uvec2 packedQuad = quads[quadIndex];
vec3 quadPos;
vec2 quadSize;
uint face, blockState, biome, light;
unpackQuad(packedQuad, quadPos, quadSize, face, blockState, biome, light);
// Get vertex position within quad (0-1 range)
vec2 vertexOffset = QUAD_VERTICES[vertexInQuad];
// Scale by quad size
vec2 scaledOffset = vertexOffset * quadSize;
// Compute 3D offset using face tangent/bitangent
vec3 tangent = FACE_TANGENT[face];
vec3 bitangent = FACE_BITANGENT[face];
vec3 normal = FACE_NORMALS[face];
// Local position within section
vec3 localPos = quadPos;
localPos += tangent * scaledOffset.x;
localPos += bitangent * scaledOffset.y;
// Offset by half block in normal direction (face is on block surface)
localPos += normal * 0.5;
// Scale by LOD level (higher LOD = larger blocks)
float lodScale = float(1u << lodLevel);
localPos *= lodScale;
// World position
vec3 worldPos = sectionWorldPos * lodScale + localPos;
// Transform to clip space
gl_Position = viewProj * vec4(worldPos, 1.0);
// ============ COMPUTE FRAGMENT OUTPUTS ============
// UV coordinates (for texture sampling)
fragUV = vertexOffset * quadSize; // Tile UV (will repeat)
// Texture index from block state
fragTexIndex = getTextureIndex(blockState, face);
// Base color from face shading
float shade = FACE_SHADING[face];
fragColor = vec3(shade);
// Ambient occlusion
fragAO = computeAO(light, vertexInQuad);
}#version 450
layout(location = 0) in vec3 fragColor;
layout(location = 1) in vec2 fragUV;
layout(location = 2) flat in uint fragTexIndex;
layout(location = 3) in float fragAO;
layout(location = 0) out vec4 outColor;
// Texture array (all block textures in one array)
layout(set = 2, binding = 0) uniform sampler2DArray textureSampler;
void main() {
// Sample texture (UV + texture index as array layer)
vec4 texColor = texture(textureSampler, vec3(fragUV, float(fragTexIndex)));
// Alpha test for cutout blocks
if (texColor.a < 0.5) {
discard;
}
// Apply face shading and ambient occlusion
vec3 color = texColor.rgb * fragColor * fragAO;
outColor = vec4(color, texColor.a);
}const PackedQuad = packed struct {
x: u5, // 0-31
y: u5, // 0-31
z: u5, // 0-31
width_m1: u4, // 0-15 (actual 1-16)
height_m1: u4, // 0-15 (actual 1-16)
face: u3, // 0-5
block_state: u16,
biome: u9,
light: u8,
_reserved: u5,
};
fn packQuad(x: u5, y: u5, z: u5, w: u4, h: u4, face: u3,
block: u16, biome: u9, light: u8) u64 {
return @as(u64, x)
| (@as(u64, y) << 5)
| (@as(u64, z) << 10)
| (@as(u64, w) << 15)
| (@as(u64, h) << 19)
| (@as(u64, face) << 23)
| (@as(u64, block) << 26)
| (@as(u64, biome) << 42)
| (@as(u64, light) << 51);
}
// Example: create a stone block face at (5, 10, 3) facing +Y
const quad = packQuad(5, 10, 3, 0, 0, 3, STONE_TEXTURE_ID, 0, 255);// Each section gets one indirect draw command
const DrawCommand = extern struct {
vertex_count: u32, // quads * 4 (or quads * 6 with index buffer)
instance_count: u32, // 1
first_vertex: u32, // Offset into quad buffer
first_instance: u32, // Section index (passed as gl_InstanceIndex)
};
fn createDrawCommand(section_index: u32, quad_offset: u32, quad_count: u32) DrawCommand {
return .{
.vertex_count = quad_count * 4, // 4 vertices per quad
.instance_count = 1,
.first_vertex = quad_offset * 4,
.first_instance = section_index,
};
}QUAD BUFFER (SSBO, binding 0):
┌────────────────────────────────────────┐
│ Section 0 quads (N₀ × 8 bytes) │
├────────────────────────────────────────┤
│ Section 1 quads (N₁ × 8 bytes) │
├────────────────────────────────────────┤
│ ... │
└────────────────────────────────────────┘
SECTION BUFFER (SSBO, binding 1):
┌─────────────────────────────┐
│ Section 0: ivec4(x,y,z,lod) │
├─────────────────────────────┤
│ Section 1: ivec4(x,y,z,lod) │
├─────────────────────────────┤
│ ... │
└─────────────────────────────┘
DRAW COMMAND BUFFER:
┌─────────────────────────────────────────┐
│ DrawCommand { vertex_count, 1, 0, 0 } │ ← Section 0
├─────────────────────────────────────────┤
│ DrawCommand { vertex_count, 1, N₀, 1 } │ ← Section 1
├─────────────────────────────────────────┤
│ ... │
└─────────────────────────────────────────┘
// Bind pipeline and descriptor sets
vk.cmdBindPipeline(cmd, .graphics, packed_quad_pipeline);
vk.cmdBindDescriptorSets(cmd, .graphics, layout, 0, &.{
quad_descriptor, // Set 0: quad + section SSBOs
camera_descriptor, // Set 1: camera UBO, atlas UBO
texture_descriptor, // Set 2: texture array
}, &.{});
// Multi-draw indirect (all sections in one call)
vk.cmdDrawIndirect(
cmd,
draw_command_buffer,
0, // Offset
section_count, // Draw count
@sizeOf(DrawCommand), // Stride
);
// Or with GPU-determined count (after LOD traversal):
vk.cmdDrawIndirectCount(
cmd,
draw_command_buffer,
@sizeOf(u32), // Commands start after count
count_buffer,
0, // Count at offset 0
max_sections,
@sizeOf(DrawCommand),
);const Vertex = extern struct {
pos: [3]f32, // 12 bytes
color: [3]f32, // 12 bytes
uv: [2]f32, // 8 bytes
tex_index: u32, // 4 bytes
}; // Total: 36 bytes × 4 vertices = 144 bytes per quadconst PackedQuad = u64; // 8 bytes per quad (entire quad, not per vertex!)Savings: 144 → 8 = 18× less GPU memory
Instead of 4 vertices per quad, use 4 vertices + 6 indices:
// Global index buffer (reusable for all quads)
const QUAD_INDICES = [6]u16{ 0, 1, 2, 2, 3, 0 };
// Create index buffer once at startup
fn createQuadIndexBuffer(max_quads: u32) GpuBuffer {
var indices: []u16 = allocator.alloc(u16, max_quads * 6);
for (0..max_quads) |q| {
const base = @intCast(u16, q * 4);
indices[q*6 + 0] = base + 0;
indices[q*6 + 1] = base + 1;
indices[q*6 + 2] = base + 2;
indices[q*6 + 3] = base + 2;
indices[q*6 + 4] = base + 3;
indices[q*6 + 5] = base + 0;
}
return uploadToGpu(indices);
}Then use vkCmdDrawIndexedIndirect instead of vkCmdDrawIndirect.
- gl_VertexIndex math is free - Modern GPUs handle integer division/modulo efficiently
- Texture array sampling - Same performance as individual textures, better batching
- SSBO access - Cached well when accessed sequentially (quad buffer is sequential)
- gl_InstanceIndex for section - Allows different LOD levels in same draw call
Naive meshing creates 1 quad per visible block face:
- A 32×32 flat surface = 1,024 quads
- Massive waste: all same material, could be 1 quad
Greedy meshing merges adjacent same-material faces:
- A 32×32 flat surface = 1-4 quads (limited by max quad size)
Process each face layer (e.g., all +Y faces at y=10) as a 2D grid:
- Horizontal scan: Merge consecutive same-material cells into runs
- Vertical merge: Combine identical runs from adjacent rows
Input (32×32 grid, S=stone, D=dirt, .=air):
SSSSDDDD....
SSSSDDDD....
SSSS........
SSSS........
Output quads:
- Quad 1: (0,0) size 4×4 = stone
- Quad 2: (4,0) size 4×2 = dirt
const ScanMesher2D = struct {
const MAX_SIZE = 16; // Max quad dimension
// Per-row tracking
row_data: [32]u64, // Material ID for each row's current run
row_length: [32]u8, // Horizontal length of current run
row_depth: [32]u8, // Vertical depth (merged rows)
row_start_x: [32]u8, // Starting X of current run
row_bitset: u32, // Which rows have active runs
// Current position
current_x: u8,
current_y: u8,
current_material: u64,
current_run_length: u8,
// Output
quads: std.ArrayList(PackedQuad),
};pub fn process(self: *ScanMesher2D, grid: *const [32][32]u64) void {
// Process row by row (Y direction)
for (0..32) |y| {
self.current_y = @intCast(y);
self.processRow(&grid[y]);
}
// Flush any remaining runs
self.flushAllRuns();
}
fn processRow(self: *ScanMesher2D, row: *const [32]u64) void {
self.current_run_length = 0;
self.current_material = 0; // 0 = air/empty
for (0..32) |x| {
const material = row[x];
self.processCell(@intCast(x), material);
}
// End of row: flush current run
if (self.current_run_length > 0) {
self.endRun();
}
}
fn processCell(self: *ScanMesher2D, x: u8, material: u64) void {
// Same material: extend current run
if (material == self.current_material and self.current_run_length < MAX_SIZE) {
self.current_run_length += 1;
return;
}
// Different material or max size: end previous run
if (self.current_run_length > 0) {
self.endRun();
}
// Start new run
self.current_material = material;
self.current_run_length = if (material != 0) 1 else 0;
self.row_start_x[self.current_y] = x;
}
fn endRun(self: *ScanMesher2D) void {
if (self.current_material == 0) return; // Skip air
const y = self.current_y;
const x = self.row_start_x[y];
const length = self.current_run_length;
// Try to merge with run from previous row
if (self.canMergeWithPreviousRow(y, x, length)) {
self.mergeWithPreviousRow(y);
} else {
// Can't merge: emit previous row's run if exists
if (y > 0 and (self.row_bitset & (@as(u32, 1) << (y - 1))) != 0) {
self.emitRun(y - 1);
}
// Start new run at this row
self.startNewRun(y, x, length);
}
self.current_run_length = 0;
}
fn canMergeWithPreviousRow(self: *ScanMesher2D, y: u8, x: u8, length: u8) bool {
if (y == 0) return false;
const prev_y = y - 1;
// Check if previous row has an active run
if ((self.row_bitset & (@as(u32, 1) << prev_y)) == 0) return false;
// Check if same position, length, and material
return self.row_start_x[prev_y] == x
and self.row_length[prev_y] == length
and self.row_data[prev_y] == self.current_material
and self.row_depth[prev_y] < MAX_SIZE;
}
fn mergeWithPreviousRow(self: *ScanMesher2D, y: u8) void {
const prev_y = y - 1;
// Extend the run from previous row
self.row_depth[prev_y] += 1;
// If max depth reached, emit the quad
if (self.row_depth[prev_y] >= MAX_SIZE) {
self.emitRun(prev_y);
}
}
fn startNewRun(self: *ScanMesher2D, y: u8, x: u8, length: u8) void {
self.row_data[y] = self.current_material;
self.row_length[y] = length;
self.row_depth[y] = 1;
self.row_start_x[y] = x;
self.row_bitset |= (@as(u32, 1) << y);
}
fn emitRun(self: *ScanMesher2D, y: u8) void {
const x = self.row_start_x[y];
const width = self.row_length[y];
const height = self.row_depth[y];
const material = self.row_data[y];
// Create packed quad
self.quads.append(.{
.x = x,
.y = y - height + 1, // Start Y (run grew downward)
.width = width,
.height = height,
.material = material,
});
// Clear the run
self.row_bitset &= ~(@as(u32, 1) << y);
}
fn flushAllRuns(self: *ScanMesher2D) void {
// Emit all remaining active runs
var bitset = self.row_bitset;
while (bitset != 0) {
const y = @ctz(bitset);
self.emitRun(@intCast(y));
bitset &= bitset - 1; // Clear lowest bit
}
}Row 0: SSSSDDDD
Row 1: SSSSDDDD
Row 2: SSSSSSSS
Row 3: SSSSSSSS
Row 4: DDDDDDDD
Row 5: DDDDDDDD
Row 6: ........
Row 7: ........
Row 0: Scan left-to-right
- Cells 0-3: SSSS → Run(x=0, len=4, mat=S)
- Cells 4-7: DDDD → End S run, start Run(x=4, len=4, mat=D)
After row 0:
row_data[0] = S, row_length[0] = 4, row_start_x[0] = 0, row_depth[0] = 1
(D run also stored)
Row 1: Scan left-to-right
- Cells 0-3: SSSS → Same as row 0? Yes! Merge: row_depth[0] = 2
- Cells 4-7: DDDD → Same as row 0 D? Yes! Merge: depth = 2
Row 2: Scan left-to-right
- Cells 0-7: SSSSSSSS → Run(x=0, len=8, mat=S)
- Can merge with row 1? No! (different length)
- Emit row 1 stone: Quad(0, 0, 4, 2, S)
- Emit row 1 dirt: Quad(4, 0, 4, 2, D)
- Start new: row_data[2] = S, len=8, depth=1
Row 3: SSSSSSSS
- Merge with row 2: depth = 2
Row 4: DDDDDDDD
- Can't merge (different material)
- Emit row 3: Quad(0, 2, 8, 2, S)
- Start new dirt run
Row 5: DDDDDDDD
- Merge: depth = 2
Rows 6-7: Empty
- Emit row 5: Quad(0, 4, 8, 2, D)
Final flush: (nothing remaining)
Quad 1: (0, 0) size 4×2 = stone (was 8 blocks)
Quad 2: (4, 0) size 4×2 = dirt (was 8 blocks)
Quad 3: (0, 2) size 8×2 = stone (was 16 blocks)
Quad 4: (0, 4) size 8×2 = dirt (was 16 blocks)
For a 3D chunk, run greedy meshing on each face layer:
pub fn meshChunk(chunk: *const Chunk) ChunkMesh {
var quads = std.ArrayList(PackedQuad).init(allocator);
// For each face direction
for (0..6) |face| {
// For each slice perpendicular to that face
const axis = face / 2; // 0=X, 1=Y, 2=Z
const slices = if (axis == 1) CHUNK_HEIGHT else CHUNK_SIZE;
for (0..slices) |slice| {
// Extract 2D grid of visible faces for this slice
var grid: [32][32]u64 = undefined;
extractFaceGrid(chunk, face, slice, &grid);
// Run greedy meshing
var mesher = ScanMesher2D.init();
mesher.process(&grid);
// Convert to 3D quads and add to output
for (mesher.quads.items) |quad2d| {
const quad3d = convert2Dto3D(quad2d, face, slice);
quads.append(quad3d);
}
}
}
return ChunkMesh{ .quads = quads };
}
fn extractFaceGrid(chunk: *const Chunk, face: u3, slice: u8, grid: *[32][32]u64) void {
// face: 0=-X, 1=+X, 2=-Y, 3=+Y, 4=-Z, 5=+Z
for (0..32) |a| {
for (0..32) |b| {
// Convert 2D (a, b) to 3D based on face
const pos = faceToChunkPos(face, slice, a, b);
// Check if face is visible (neighbor is air or transparent)
const block = chunk.getBlock(pos);
const neighbor_pos = pos + FACE_NORMALS[face];
if (block.isAir()) {
grid[a][b] = 0; // Air = no face
} else if (neighborIsOccluding(chunk, neighbor_pos)) {
grid[a][b] = 0; // Occluded = no face
} else {
// Encode material: block ID + face for texture lookup
grid[a][b] = encodeBlockFace(block, face);
}
}
}
}Before running greedy meshing, use bitmasks to skip empty slices:
// Pre-compute which slices have any visible faces
pub fn computeSliceMasks(chunk: *const Chunk) [6][32]u32 {
var masks: [6][32]u32 = undefined;
for (chunk.blocks, 0..) |block, i| {
if (block.isAir()) continue;
const x = i % 16;
const z = (i / 16) % 16;
const y = i / 256;
// Check each face
for (0..6) |face| {
const neighbor = getNeighborPos(x, y, z, face);
if (!chunk.isOccluded(neighbor)) {
// Mark this slice as having visible faces
const slice = getSliceForFace(face, x, y, z);
const bit = getSliceBit(face, x, y, z);
masks[face][slice] |= (@as(u32, 1) << bit);
}
}
}
return masks;
}
// Skip empty slices entirely
for (0..6) |face| {
for (0..32) |slice| {
if (slice_masks[face][slice] == 0) continue; // No visible faces
// ... run greedy meshing
}
}Naive meshing:
- Top faces: 16×16 = 256 quads
- Side faces: 4 sides × 16×4 = 256 quads
- Bottom faces: 16×16 = 256 quads (if visible)
- Total: ~768 quads
Greedy meshing:
- Top: 1 quad (16×16 merged)
- Each side: 1 quad (16×4 merged)
- Bottom: 1 quad (16×16 merged)
- Total: 6 quads
Reduction: 128×
Naive meshing:
- Every other block = 50% faces visible
- ~2,000 quads
Greedy meshing:
- No merging possible (alternating materials)
- ~2,000 quads
Reduction: 1× (worst case)
- Naive: ~1,500 quads per chunk
- Greedy: ~100-300 quads per chunk
- Typical reduction: 5-15×
Naive: 1,500 quads × 144 bytes = 216 KB per chunk
Greedy: 150 quads × 8 bytes = 1.2 KB per chunk
Reduction: 180× (combined with packed quads)
10,000 chunks:
Naive: 2.16 GB
Greedy: 12 MB
- Process layers in parallel - Each face direction is independent
- Reuse mesher instances - Avoid allocations per chunk
- Cache neighbor lookups - Neighbor chunk data accessed repeatedly
- Use bitmasks for empty detection - Skip slices with no visible faces
- Limit max quad size - 16×16 prevents visual artifacts from large quads
- Separate fluid layer - Water meshed separately (overlays on blocks)
- Non-opaque handling - Transparent blocks (glass) meshed independently
- AO baking - Ambient occlusion computed during meshing, stored in quad
- Biome blending - Biome ID stored per quad for grass/foliage tinting