This repository was archived by the owner on Dec 25, 2023. It is now read-only.

Commit a3178f4

Updated Readme. Fixed (hopefully) failure to exit due to race in FileStreamerReference.cpp between tile updates and packed mip loads. Added to UI: Demo button and adapter description. Frame timing includes everything except feedback resolves.
1 parent e06b6ec commit a3178f4

17 files changed

+438
-265
lines changed

README.md

Lines changed: 123 additions & 2 deletions
@@ -1,6 +1,6 @@
11
# Sampler Feedback Streaming
22

3-
This repository contains a demo of `DirectX12 Sampler Feedback Streaming`, a technique using [DirectX12 Sampler Feedback](https://microsoft.github.io/DirectX-Specs/d3d/SamplerFeedback.html) to guide continuous loading and eviction of reserved resource tiles. Sampler Feedback Streaming allows scenes containing 100s of gigabytes of resources to be drawn on GPUs containing much less physical memory. The scene below uses just ~200MB of a 1GB heap, despite over 350GB of total texture resources.
3+
This repository contains a demo of `DirectX12 Sampler Feedback Streaming`, a technique using [DirectX12 Sampler Feedback](https://microsoft.github.io/DirectX-Specs/d3d/SamplerFeedback.html) to guide continuous loading and eviction of small portions (tiles) of assets. Sampler Feedback Streaming allows scenes consisting of 100s of gigabytes of resources to be drawn on GPUs containing much less physical memory. The scene below uses just ~200MB of a 1GB heap, despite over 350GB of total texture resources.
44

55
The demo requires **`Windows 10 20H1 (aka May 2020 Update, build 19041)`** or later and a GPU with Sampler Feedback Support.
66

@@ -145,4 +145,125 @@ In this case, the hardware sampler is reaching across tile boundaries to perform
145145

146146
There are also a few known bugs:
147147
* entering full screen in a multi-gpu system moves the window to a monitor attached to the GPU by design. However, if the window starts on a different monitor, it "disappears" on the first maximization. Hit *escape* then maximize again, and it should work fine.
148-
* full-screen while remote desktop is broken *again*. Will likely fix soon.
148+
* full-screen while remote desktop is not borderless.
149+
150+
## How It Works
151+
152+
This implementation of Sampler Feedback Streaming uses DX12 Sampler Feedback in combination with DX12 Reserved Resources, aka Tiled Resources. A multi-threaded CPU library processes feedback from the GPU, makes decisions about which tiles to load and evict, loads data from disk storage, and submits mapping and uploading requests via GPU copy queues. There is no explicit GPU-side synchronization between the queues, so rendering frame rate is not dependent on completion of copy commands (on GPUs that support concurrent multi-queue operation). The CPU threads run continuously and asynchronously from the GPU (pausing when there's no work to do), polling fence completion states to determine when feedback is ready to process or when copies and memory mapping have completed.
153+
154+
All the magic can be found in the **TileUpdateManager** library (see TileUpdateManager.h), which abstracts the creation of StreamingResources and heaps while internally managing feedback resources, file I/O, and GPU memory mapping.
155+
156+
The technique works as follows:
157+
158+
### 1. Create a Texture to be Streamed
159+
160+
The streaming textures are allocated as DX12 [Reserved Resources](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-createreservedresource), which behave like [VirtualAlloc](https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc) in C. Each resource takes no physical GPU memory until 64KB regions of the resource are committed in 1 or more GPU [heaps](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-createheap). The x/y dimensions of a reserved resource tile are a function of the texture format, such that each tile fills a 64KB GPU memory page. For example, BC7 textures have 256x256 tiles, while BC1 textures have 512x256 tiles.
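
The tile dimensions quoted above can be reproduced with a little arithmetic. The following is a minimal sketch, not code from this repository: `TileShape` and `TileShapeForBlockCompressed` are hypothetical names, and the rule of splitting the block count evenly with any leftover factor of two going to the width is an assumption that happens to match the BC7 and BC1 figures. In a real application the authoritative values come from GetResourceTiling.

```cpp
#include <cassert>
#include <cstdint>

// Shape of one 64KB reserved-resource tile, in texels (hypothetical helper type).
struct TileShape { uint32_t width; uint32_t height; };

// Derive the tile shape for a block-compressed format from the size of one
// 4x4 texel block: 16 bytes for BC7, 8 bytes for BC1.
// Assumption: the number of blocks per tile is a power of two.
TileShape TileShapeForBlockCompressed(uint32_t bytesPerBlock)
{
    constexpr uint32_t tileBytes = 64 * 1024;
    assert(bytesPerBlock != 0 && tileBytes % bytesPerBlock == 0);

    const uint32_t blocksPerTile = tileBytes / bytesPerBlock;
    uint32_t blocksWide = 1;
    uint32_t blocksHigh = 1;

    // Hand out factors of two alternately, width first, until the tile is full.
    while (blocksWide * blocksHigh < blocksPerTile)
    {
        if (blocksWide <= blocksHigh) { blocksWide *= 2; }
        else { blocksHigh *= 2; }
    }

    // Each block covers 4x4 texels.
    return { blocksWide * 4, blocksHigh * 4 };
}
```

With 16-byte blocks a tile holds 4096 blocks (64x64), and with 8-byte blocks it holds 8192 blocks (128x64), which is where the two shapes in the paragraph above come from.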
161+
162+
In Expanse, each tiled resource corresponds to a single .XeT file on a hard drive (though multiple resources can point to the same file). The file contains dimensions and format, but also information about how to access the tiles within the file.
163+
164+
### 2. Create and Pair a Min-Mip Feedback Map
165+
166+
To use sampler feedback, we create a feedback resource with identical dimensions to record information about which texels were sampled.
167+
168+
For this streaming usage, we use the min mip feedback feature by [creating the resource](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device8-createcommittedresource2) with the format set to DXGI_FORMAT_SAMPLER_FEEDBACK_MIN_MIP_OPAQUE. We set the region size of the feedback to match the tile dimensions through the SamplerFeedbackRegion member of [D3D12_RESOURCE_DESC1](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_resource_desc1).
169+
170+
For the feedback to be written by GPU shaders (in this case, pixel shaders) the texture and feedback resources must be paired through a view created with [CreateSamplerFeedbackUnorderedAccessView](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device8-createsamplerfeedbackunorderedaccessview).
171+
172+
### 3. Determine Resident Tiles
173+
174+
Because textures are only partially resident, we only want the pixel shader to sample resident portions. Sampling texels that are not physically mapped returns 0s, resulting in undesirable visual artifacts. To prevent this, we clamp all sampling operations based on a **residency map**. The residency map is relatively tiny: for a 16k x 16k BC7 texture, which would take 350MB of GPU memory, we only need a 4KB residency map. Note that the lowest-resolution "packed" mips are loaded for all objects, so there is always something available to sample. See also [GetResourceTiling](https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-getresourcetiling).
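
The sizes in the paragraph above can be checked with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the residency map is assumed to store one byte per tile-sized region, and BC7 is taken as 16 bytes per 4x4 block.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in a residency map, assuming one byte per tile-sized region.
uint64_t ResidencyMapBytes(uint32_t texWidth, uint32_t texHeight,
                           uint32_t tileWidth, uint32_t tileHeight)
{
    assert(tileWidth != 0 && tileHeight != 0);
    const uint64_t regionsWide = (texWidth + tileWidth - 1) / tileWidth;
    const uint64_t regionsHigh = (texHeight + tileHeight - 1) / tileHeight;
    return regionsWide * regionsHigh;
}

// Bytes in a full BC7 mip chain (16 bytes per 4x4 block), down to 1x1.
uint64_t Bc7MipChainBytes(uint32_t width, uint32_t height)
{
    assert(width != 0 && height != 0);
    uint64_t total = 0;
    for (;;)
    {
        const uint64_t blocksWide = (width + 3) / 4;
        const uint64_t blocksHigh = (height + 3) / 4;
        total += blocksWide * blocksHigh * 16;
        if (width == 1 && height == 1) { break; }
        width = (width > 1) ? width / 2 : 1;
        height = (height > 1) ? height / 2 : 1;
    }
    return total;
}
```

For a 16384x16384 texture with 256x256 tiles this gives 64x64 = 4096 regions (4KB), while the full mip chain comes to roughly 358 million bytes, in line with the approximate 350MB figure.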
175+
176+
When a texture tile has been loaded or evicted by TileUpdateManager, it updates the corresponding residency map. The residency map is an application-generated representation of the minimum mip available for each region in the texture, and is described in the [Sampler Feedback spec](https://microsoft.github.io/DirectX-Specs/d3d/SamplerFeedback.html) as follows:
177+
178+
```
The MinMip map represents per-region mip level clamping values for the tiled texture; it represents what is actually loaded.
```
181+
182+
Below, the Visualization mode was set to "Color = Mip" and labels were added. TileUpdateManager processes the Min Mip Feedback (left window in top right), uploads and evicts tiles to form a Residency map, which is a proper min-mip-map (right window in top right). The contents of memory can be seen in the partially resident mips along the bottom (black is not resident). The last 3 mip levels are never evicted because they are packed mips (all fit within a 64KB tile). In this visualization mode, the colors of the texture on the bottom correspond to the colors of the visualization windows in the top right. Notice how the resident tiles do not exactly match what feedback says is required.
183+
![Expanse UI showing feedback and residency maps](./readme-images/labels.jpg "Expanse UI showing Min Mip Feedback, Residency Map, and Texture Mips (labels added)")
184+
185+
To reduce GPU memory, a single combined buffer contains all the residency maps for all the resources. The pixel shader samples the corresponding residency map to clamp the sampling function to the minimum mip level that is resident, thereby avoiding sampling tiles that have not been mapped.
186+
187+
We can see this in the shader "terrainPS.hlsl". Resources are defined at the top of the shader, including the reserved (streaming) texture, the residency buffer, and the sampler:
188+
189+
```cpp
Texture2D g_streamingTexture : register(t0);
Buffer<uint> g_minmipmap: register(t1);
SamplerState g_sampler : register(s0);
```
194+
195+
The shader offsets into its region of the residency buffer (g_minmipmapOffset) and loads the minimum mip value for the region to be sampled.
196+
```cpp
int2 uv = input.tex * g_minmipmapDim;
uint index = g_minmipmapOffset + uv.x + (uv.y * g_minmipmapDim.x);
uint mipLevel = g_minmipmap.Load(index);
```
201+
The sampling operation is clamped to the minimum mip resident (mipLevel).
202+
```cpp
float3 color = g_streamingTexture.Sample(g_sampler, input.tex, 0, mipLevel).rgb;
```
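
To make the indexing concrete, here is a CPU-side mirror of the shader lookup above. It is only an illustration: `LookupMinMip` and `ClampLod` are hypothetical names, the combined residency buffer is modeled as a `std::vector<uint8_t>`, and texture coordinates are assumed to lie in [0, 1].

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors: index = g_minmipmapOffset + uv.x + (uv.y * g_minmipmapDim.x)
// residency: combined buffer holding the residency maps of all resources
// offset:    start of this resource's map within the combined buffer
uint32_t LookupMinMip(const std::vector<uint8_t>& residency, uint32_t offset,
                      uint32_t dimX, uint32_t dimY, float u, float v)
{
    assert(dimX != 0 && dimY != 0);
    assert(u >= 0.0f && v >= 0.0f);

    // Truncate to a region coordinate; the min() keeps u == 1.0 in range.
    const uint32_t x = std::min(static_cast<uint32_t>(u * dimX), dimX - 1);
    const uint32_t y = std::min(static_cast<uint32_t>(v * dimY), dimY - 1);

    const size_t index = static_cast<size_t>(offset) + x + static_cast<size_t>(y) * dimX;
    assert(index < residency.size());
    return residency[index];
}

// The clamp argument of Sample() acts as a lower bound on the LOD:
// the sampler never reads a mip more detailed than minResidentMip.
float ClampLod(float requestedLod, uint32_t minResidentMip)
{
    return std::max(requestedLod, static_cast<float>(minResidentMip));
}
```

Because every resource only adds an offset into one shared buffer, a single buffer binding can serve all streaming textures in the scene.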
205+
206+
### 4. Draw Objects While Recording Feedback
207+
208+
For Expanse, there is a "normal" non-feedback shader named terrainPS.hlsl and a "feedback-enabled" version of the same shader, terrainPS-FB.hlsl. The latter simply writes feedback using the [WriteSamplerFeedback](https://microsoft.github.io/DirectX-Specs/d3d/SamplerFeedback.html) HLSL intrinsic, using the same sampler and texture coordinates, then calls the prior shader. Compare the WriteSamplerFeedback() call below to the Sample() call above.
209+
210+
Include the normal pixel shader:
211+
```cpp
#include "terrainPS.hlsl"
FeedbackTexture2D<SAMPLER_FEEDBACK_MIN_MIP> g_feedback : register(u0);

float4 psFB(VS_OUT input) : SV_TARGET0
{
    g_feedback.WriteSamplerFeedback(g_streamingTexture, g_sampler, input.tex.xy);

    return ps(input);
}
```
223+
224+
Resolving feedback for one resource is inexpensive, but adds up when there are 1000 objects. Expanse has a configurable time limit for the amount of feedback resolved each frame. The "FB" shaders are only used for a subset of resources such that the amount of feedback produced can be resolved within the time limit. The time limit is managed by the application, not by the TileUpdateManager library, by keeping a running average of resolve time as reported by GPU timers.
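
One way such a budget could be tracked is sketched below. This is not the code in Scene.cpp: `ResolveBudget` is a hypothetical class, and it uses an exponential moving average as a stand-in for whatever running average the application actually keeps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Estimates how many feedback resolves fit in a per-frame GPU time limit.
class ResolveBudget
{
public:
    // timeLimitMs: GPU time allowed for resolves each frame
    // smoothing:   weight of the newest sample in the moving average (0..1]
    explicit ResolveBudget(float timeLimitMs, float smoothing = 0.125f)
        : m_timeLimitMs(timeLimitMs), m_smoothing(smoothing)
    {
        assert(timeLimitMs > 0.0f);
        assert(smoothing > 0.0f && smoothing <= 1.0f);
    }

    // gpuMs: time reported by GPU timers for resolving numResolves resources
    void AddSample(float gpuMs, uint32_t numResolves)
    {
        if (numResolves == 0) { return; }
        const float msPerResolve = gpuMs / static_cast<float>(numResolves);
        m_avgMsPerResolve = m_hasSample
            ? m_avgMsPerResolve * (1.0f - m_smoothing) + msPerResolve * m_smoothing
            : msPerResolve;
        m_hasSample = true;
    }

    // Always allow at least one resolve so the estimate can keep updating.
    uint32_t MaxResolves() const
    {
        if (!m_hasSample || m_avgMsPerResolve <= 0.0f) { return 1; }
        const uint32_t n = static_cast<uint32_t>(m_timeLimitMs / m_avgMsPerResolve);
        return std::max<uint32_t>(1, n);
    }

private:
    float m_timeLimitMs;
    float m_smoothing;
    float m_avgMsPerResolve{ 0.0f };
    bool m_hasSample{ false };
};
```

Each frame the application would feed in the measured resolve time, then enable feedback on at most `MaxResolves()` resources for the next frame.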
225+
226+
As an optimization, Expanse tells streaming resources to evict all tiles if they are behind the camera. This could potentially be improved to include any object not in the view frustum.
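
A behind-the-camera test of this kind might look like the sketch below. The names are hypothetical and it assumes each object has a bounding sphere and that the camera's forward vector is normalized; the actual test in Scene.cpp may differ.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// True if the bounding sphere lies entirely behind the plane that passes
// through the camera position and is perpendicular to the view direction.
// Assumption: cameraForward has unit length.
bool IsBehindCamera(const Vec3& cameraPos, const Vec3& cameraForward,
                    const Vec3& sphereCenter, float sphereRadius)
{
    assert(sphereRadius >= 0.0f);
    const Vec3 toObject{ sphereCenter.x - cameraPos.x,
                         sphereCenter.y - cameraPos.y,
                         sphereCenter.z - cameraPos.z };
    return Dot(toObject, cameraForward) < -sphereRadius;
}
```

An object that straddles the camera plane is not reported as behind, so its tiles stay resident.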
227+
228+
You can find the time limit estimation, the eviction optimization, and the request to gather sampler feedback by searching Scene.cpp for the following:
229+
230+
* DetermineMaxNumFeedbackResolves
231+
* QueueEviction
232+
* SetFeedbackEnabled
233+
234+
### 5. Determine Which Tiles to Load & Evict
235+
236+
Once the draw command is complete, the feedback is ready to read on the CPU - either by copying the feedback to a readback resource, or by resolving directly to a readback resource.
237+
238+
Min mip feedback tells us the minimum mip tile that should be loaded. The min mip feedback is traversed, updating an internal reference count for each tile. If a tile previously was unused (ref count = 0), it is queued for loading from the bottom (highest mip) up. If a tile is not needed for a particular region, its ref count is decreased (from the top down). When its ref count reaches 0, it might be ready to evict.
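
The reference counting described above can be sketched as follows. This is a simplified illustration with hypothetical names (`RefCountSketch`, `TileCoord`), not the TileMappingState implementation: it assumes one feedback region per mip 0 tile and that the tile covering region (x, y) in a given mip is (x >> mip, y >> mip).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TileCoord { uint32_t x, y, mip; };

class RefCountSketch
{
public:
    // width/height: size of mip 0 in tiles; numMips: number of non-packed mips
    RefCountSketch(uint32_t width, uint32_t height, uint32_t numMips)
        : m_width(width), m_height(height),
          m_minMip(static_cast<size_t>(width) * height, numMips)
    {
        for (uint32_t mip = 0; mip < numMips; ++mip)
        {
            m_refs.emplace_back(static_cast<size_t>(DimOf(width, mip)) * DimOf(height, mip), 0u);
        }
    }

    // Apply the min mip that feedback reports for region (x, y).
    void SetMinMip(uint32_t x, uint32_t y, uint32_t desiredMip)
    {
        assert(x < m_width && y < m_height && desiredMip <= m_refs.size());
        uint32_t& current = m_minMip[static_cast<size_t>(y) * m_width + x];

        // More detail wanted: add references from the bottom (highest mip) up.
        while (current > desiredMip)
        {
            --current;
            if (Ref(x >> current, y >> current, current)++ == 0)
            {
                loads.push_back({ x >> current, y >> current, current });
            }
        }
        // Less detail wanted: release references from the top (mip 0) down.
        while (current < desiredMip)
        {
            uint32_t& ref = Ref(x >> current, y >> current, current);
            assert(ref > 0);
            if (--ref == 0)
            {
                evictions.push_back({ x >> current, y >> current, current });
            }
            ++current;
        }
    }

    std::vector<TileCoord> loads;      // tiles whose count went 0 -> 1
    std::vector<TileCoord> evictions;  // tiles whose count went 1 -> 0

private:
    static uint32_t DimOf(uint32_t dim, uint32_t mip)
    {
        return std::max<uint32_t>(1, (dim + (1u << mip) - 1) >> mip);
    }
    uint32_t& Ref(uint32_t x, uint32_t y, uint32_t mip)
    {
        return m_refs[mip][static_cast<size_t>(y) * DimOf(m_width, mip) + x];
    }

    uint32_t m_width;
    uint32_t m_height;
    std::vector<uint32_t> m_minMip;             // last min mip per region
    std::vector<std::vector<uint32_t>> m_refs;  // ref counts per mip
};
```

Since a coarse tile is shared by several finer regions, its count only drops to zero once every region above it has stopped needing it, which is the dependency that eviction relies on.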
239+
240+
Data structures for tracking reference count, residency state, and heap usage can be found in StreamingResource.cpp/h; look for TileMappingState. This class also has methods for interpreting the feedback buffer (ProcessFeedback) and updating the residency map (UpdateMinMipMap).
241+
```cpp
class TileMappingState
{
public:
    // see file for method declarations
private:
    TileLayer<BYTE> m_resident;
    TileLayer<UINT32> m_refcounts;
    TileLayer<UINT32> m_heapIndices;
};
TileMappingState m_tileMappingState;
```
253+
254+
Tiles can only be evicted if there are no lower-mip-level tiles that depend on them, e.g. a mip 1 tile may have 4 mip 0 tiles "above" it in the mip hierarchy, and may only be evicted if all 4 of those tiles have also been evicted. The ref count helps us determine this dependency.
255+
256+
A tile also cannot be evicted if it is being used by an outstanding draw command. We prevent this by delaying evictions a frame or two depending on double or triple buffering of the swap chain. If a tile is needed before the delay completes, the tile is simply rescued from the pending eviction data structure instead of being re-loaded.
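
A delayed-eviction structure with rescue could be sketched like this. The class and its names are hypothetical and much simpler than the library's own pending-eviction data structure; it assumes evictions are aged once per frame.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct PendingTile
{
    uint32_t x, y, mip;
    bool operator==(const PendingTile& o) const
    {
        return x == o.x && y == o.y && mip == o.mip;
    }
};

class EvictionDelaySketch
{
public:
    // numFramesDelay: e.g. 2 for double buffering, 3 for triple buffering
    explicit EvictionDelaySketch(uint32_t numFramesDelay) : m_pending(numFramesDelay)
    {
        assert(numFramesDelay > 0);
    }

    // Tile is no longer referenced, but may still be used by draws in flight.
    void Queue(const PendingTile& tile) { m_pending.front().push_back(tile); }

    // Tile is needed again: remove it from the pending lists instead of reloading.
    // Returns false if it was not pending (the caller must load it normally).
    bool Rescue(const PendingTile& tile)
    {
        for (auto& frame : m_pending)
        {
            auto it = std::find(frame.begin(), frame.end(), tile);
            if (it != frame.end()) { frame.erase(it); return true; }
        }
        return false;
    }

    // Call once per frame. Returns the tiles that have waited long enough
    // to be unmapped, then ages the remaining lists by one frame.
    std::vector<PendingTile> NextFrame()
    {
        std::vector<PendingTile> ready = std::move(m_pending.back());
        for (size_t i = m_pending.size() - 1; i > 0; --i)
        {
            m_pending[i] = std::move(m_pending[i - 1]);
        }
        m_pending.front().clear();
        return ready;
    }

private:
    std::vector<std::vector<PendingTile>> m_pending; // index 0 = newest
};
```

A tile queued during a frame is returned by the `numFramesDelay`-th subsequent call to `NextFrame()`, unless it is rescued first.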
257+
258+
The mechanics of loading, mapping, and unmapping tiles are all contained within the DataUploader class, which depends on a FileStreamer class to do the actual tile loads. The latter implementation (FileStreamerReference) can easily be exchanged with DirectStorage for Windows.
259+
260+
### 6. Putting it all Together
261+
262+
There is some work that needs to be done before drawing objects that use feedback (clearing feedback resources), and some work that needs to be done after (resolving feedback resources). TileUpdateManager creates these commands, but does not execute them. Each frame, these command lists must be built and submitted with application draw commands, which you can find just before the call to Present() as follows:
263+
264+
```cpp
auto commandLists = m_pTileUpdateManager->EndFrame();

ID3D12CommandList* pCommandLists[] = { commandLists.m_beforeDrawCommands, m_commandList.Get(), commandLists.m_afterDrawCommands };
m_commandQueue->ExecuteCommandLists(_countof(pCommandLists), pCommandLists);
```

TileUpdateManager/DataUploader.cpp

Lines changed: 15 additions & 25 deletions
@@ -162,17 +162,19 @@ void Streaming::DataUploader::StopThreads()
         m_submitFlag.Set();
         m_monitorFenceFlag.Set();

-        if (m_fenceMonitorThread.joinable())
-        {
-            m_fenceMonitorThread.join();
-            DebugPrint(L"JOINED Fence Monitor Thread\n");
-        }
-
+        // stop submitting new work
         if (m_submitThread.joinable())
         {
             m_submitThread.join();
             DebugPrint(L"JOINED Submit Thread\n");
         }
+
+        // finish up any remaining work
+        if (m_fenceMonitorThread.joinable())
+        {
+            m_fenceMonitorThread.join();
+            DebugPrint(L"JOINED Fence Monitor Thread\n");
+        }
     }
 }

@@ -182,7 +184,7 @@ void Streaming::DataUploader::StopThreads()
 void Streaming::DataUploader::FlushCommands()
 {
     DebugPrint(m_updateListFreeCount.load(), " DU flush\n");
-    while (m_updateListFreeCount < m_updateLists.size())
+    while (m_updateListFreeCount.load() < m_updateLists.size())
     {
         _mm_pause();
     }
@@ -205,8 +207,11 @@ Streaming::UpdateList* Streaming::DataUploader::AllocateUpdateList(StreamingReso
     UpdateList* pUpdateList = nullptr;

     // early out if there are none available
-    if (m_updateListFreeCount)
+    if (m_updateListFreeCount.load() > 0)
     {
+        // there is definitely at least one updatelist that is STATE_FREE
+        m_updateListFreeCount.fetch_sub(1);
+
         // Idea: consider allocating in order, that is index 0, then 1, etc.
         // eventually will loop around. the most likely available index after the last index is index 0.
         // that is, the next index is likely available because has had the longest time to execute
@@ -223,8 +228,6 @@ Streaming::UpdateList* Streaming::DataUploader::AllocateUpdateList(StreamingReso
                 pUpdateList = &p;
                 // it is only safe to clear the state within the allocating thread
                 p.Reset((Streaming::StreamingResourceDU*)in_pStreamingResource);
-                ASSERT(m_updateListFreeCount);
-                m_updateListFreeCount--;

                 // start fence polling thread now
                 m_monitorFenceFlag.Set();
@@ -244,7 +247,8 @@ void Streaming::DataUploader::FreeUpdateList(Streaming::UpdateList& in_updateLis
     // NOTE: updatelist is deliberately not cleared until after allocation
     // otherwise there can be a race with the mapping thread
     in_updateList.m_executionState = UpdateList::State::STATE_FREE;
-    m_updateListFreeCount++;
+    m_updateListFreeCount.fetch_add(1);
+    ASSERT(m_updateListFreeCount.load() <= m_updateLists.size());
 }

 //-----------------------------------------------------------------------------
@@ -263,20 +267,6 @@ void Streaming::DataUploader::SubmitUpdateList(Streaming::UpdateList& in_updateL
     m_submitFlag.Set();
 }

-//-----------------------------------------------------------------------------
-// Allow StreamingResource to free empty update lists that it allocates
-//-----------------------------------------------------------------------------
-void Streaming::DataUploader::FreeEmptyUpdateList(Streaming::UpdateList& in_updateList)
-{
-    ASSERT(0 == in_updateList.GetNumStandardUpdates());
-    ASSERT(0 == in_updateList.GetNumPackedUpdates());
-    ASSERT(0 == in_updateList.m_evictCoords.size());
-
-    in_updateList.m_executionState = UpdateList::State::STATE_FREE;
-    m_updateListFreeCount++;
-    ASSERT(m_updateListFreeCount.load() <= m_updateLists.size());
-}
-
 //-----------------------------------------------------------------------------
 // check necessary fences to determine completion status
 // possibilities:

TileUpdateManager/DataUploader.h

Lines changed: 4 additions & 5 deletions
@@ -69,8 +69,9 @@ namespace Streaming

         void SubmitUpdateList(Streaming::UpdateList& in_updateList);

-        // Streaming resource may find it can't use an updatelist
-        void FreeEmptyUpdateList(Streaming::UpdateList& in_updateList);
+        // free updatelist after processing
+        // Streaming resource may call this (via TUM) if it allocates but doesn't use an updatelist
+        void FreeUpdateList(Streaming::UpdateList& in_updateList);

         enum class StreamerType
         {
@@ -91,9 +92,6 @@ namespace Streaming

         void SetVisualizationMode(UINT in_mode) { m_pFileStreamer->SetVisualizationMode(in_mode); }
     private:
-        // free updatelist after processing
-        void FreeUpdateList(Streaming::UpdateList& in_updateList);
-
         // affects upload buffer size. 1024 would become a 64MB upload buffer
         const UINT m_maxTileCopiesInFlight{ 0 };
         const UINT m_maxBatchSize{ 0 };
@@ -149,5 +147,6 @@ namespace Streaming
         //-------------------------------------------
         std::atomic<UINT> m_numTotalEvictions{ 0 };
         std::atomic<UINT> m_numTotalUploads{ 0 };
+        std::atomic<UINT> m_numTotalUpdateListsProcessed{ 0 };
     };
 }

TileUpdateManager/FileStreamerReference.cpp

Lines changed: 13 additions & 9 deletions
@@ -172,20 +172,26 @@ Streaming::FileStreamer::FileHandle* Streaming::FileStreamerReference::OpenFile(
 // Best guess is OS pauses the thread delaying when the copybatch is released
 // very rarely, the result is a (very) long delay waiting for an available batch
 //-----------------------------------------------------------------------------
-Streaming::FileStreamerReference::CopyBatch& Streaming::FileStreamerReference::AllocateCopyBatch(Streaming::UpdateList& in_updateList)
+void Streaming::FileStreamerReference::AllocateCopyBatch(Streaming::UpdateList& in_updateList, CopyBatch::State in_desiredState)
 {
-    UINT numBatches = (UINT)m_copyBatches.size();
+    const UINT numBatches = (UINT)m_copyBatches.size();

     while (1)
     {
         // by allocating the least-recently-used, measured ~100% success on first try
+        // might have a race here with multiple threads, but it'll never read out-of-bounds
         auto& batch = m_copyBatches[m_batchAllocIndex];
         m_batchAllocIndex = (m_batchAllocIndex + 1) % numBatches;

-        if (CopyBatch::State::FREE == batch.m_state)
+        // multiple threads may be trying to allocate a CopyBatch
+        CopyBatch::State expected = CopyBatch::State::FREE;
+        if (batch.m_state.compare_exchange_weak(expected, CopyBatch::State::ALLOCATED))
         {
+            // set the update list while the CopyBatch is in the "allocated" state
             batch.m_pUpdateList = &in_updateList;
-            return batch;
+            // as soon as this state changes, the
+            batch.m_state = in_desiredState;
+            break;
         }
     }
 }
@@ -197,8 +203,7 @@ void Streaming::FileStreamerReference::StreamPackedMips(Streaming::UpdateList& i
     ASSERT(in_updateList.GetNumPackedUpdates());
     ASSERT(0 == in_updateList.GetNumStandardUpdates());

-    CopyBatch& batch = AllocateCopyBatch(in_updateList);
-    batch.m_state = CopyBatch::State::LOAD_PACKEDMIPS;
+    AllocateCopyBatch(in_updateList, CopyBatch::State::LOAD_PACKEDMIPS);
 }

 //-----------------------------------------------------------------------------
@@ -208,9 +213,7 @@ void Streaming::FileStreamerReference::StreamTexture(Streaming::UpdateList& in_u
     ASSERT(0 == in_updateList.GetNumPackedUpdates());
     ASSERT(in_updateList.GetNumStandardUpdates());

-    CopyBatch& batch = AllocateCopyBatch(in_updateList);
-
-    batch.m_state = CopyBatch::State::LOAD_TILES;
+    AllocateCopyBatch(in_updateList, CopyBatch::State::LOAD_TILES);
 }

 //-----------------------------------------------------------------------------
@@ -404,6 +407,7 @@ void Streaming::FileStreamerReference::CopyThread()
             break;

         case CopyBatch::State::WAIT_COMPLETE:
+            // can't recycle this command allocator until the corresponding fence has completed
             if (c.m_copyFenceValue <= m_copyFence->GetCompletedValue())
             {
                 m_uploadAllocator.Free(c.m_uploadIndices);
