[viostor] WIP-phase1 - Various fixes and enhancements #1499
benyamin-codez wants to merge 1 commit into virtio-win:master
Conversation
[Jenkins CI]: Can one of the admins verify this patch?
Summary of Changes
Hello @benyamin-codez, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces several fixes and enhancements to the viostor driver.
Code Review
This pull request introduces significant fixes and enhancements to the viostor driver, focusing on request handling, locking, interrupts, and tracing, with a commendable refactoring of I/O request submission into a centralized SendSRB function. However, it also introduces several critical and medium severity buffer overflow vulnerabilities: a kernel-mode out-of-bounds write in SendSRB when handling UNMAP requests, which could lead to system instability or privilege escalation, and buffer management issues in newly added registry helper functions that cause stack-based and array-based overflows. Other issues also require attention: a critical bug in a new registry helper function (assignment instead of comparison), a potentially breaking change to a device-facing data structure, a function in a header that should be static inline, and an overly complex code block.
for (ULONG BlockDescrIndex = 0; BlockDescrIndex < BlockDescrCount; BlockDescrIndex++)
{
    ULONGLONG blockDescrStartingLba;
    ULONG blockDescrLbaCount;
    REVERSE_BYTES_QUAD(&blockDescrStartingLba, BlockDescriptors[BlockDescrIndex].StartingLba);
    REVERSE_BYTES(&blockDescrLbaCount, BlockDescriptors[BlockDescrIndex].LbaCount);
    RhelDbgPrint(TRACE_UNMAP,
                 " Index %d BlockDescrCount = %d"
                 " blockDescrStartingLba = %llu"
                 " blockDescrLbaCount = %lu\n",
                 BlockDescrIndex,
                 BlockDescrCount,
                 blockDescrStartingLba,
                 blockDescrLbaCount);
    adaptExt->blk_discard[BlockDescrIndex].sector = blockDescrStartingLba *
                                                    (adaptExt->info.blk_size / SECTOR_SIZE);
    adaptExt->blk_discard[BlockDescrIndex].num_sectors = blockDescrLbaCount *
                                                         (adaptExt->info.blk_size / SECTOR_SIZE);
    adaptExt->blk_discard[BlockDescrIndex].flags = 0;
}
In the SEND_SRB_UNMAP case, the driver processes a list of unmap block descriptors provided in the Srb data buffer. The number of descriptors, BlockDescrCount, is calculated from the BlockDescrDataLength field of the unmapList header, which is controlled by the caller. The driver then iterates through these descriptors and writes them into the adaptExt->blk_discard array. However, adaptExt->blk_discard is a fixed-size array of 16 elements (defined in viostor/virtio_stor.h), and there is no validation to ensure that BlockDescrCount does not exceed this limit. An attacker providing a large BlockDescrDataLength can cause an out-of-bounds write, corrupting the ADAPTER_EXTENSION structure, which contains sensitive data including function pointers and list entries. This is a critical kernel-mode buffer overflow.
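A minimal sketch of the missing validation, as user-mode C. The macro name and the value 16 mirror the fixed-size blk_discard array described above; the real constant in viostor/virtio_stor.h may be named or sized differently.

```c
#include <assert.h>

/* Assumed to mirror the 16-element blk_discard array in
 * viostor/virtio_stor.h; the real constant may differ. */
#define VIRTIO_BLK_MAX_DISCARD_SEG 16

/* Reject a caller-controlled descriptor count before the copy loop
 * writes into the fixed-size array. */
static int unmap_descr_count_valid(unsigned long blockDescrCount)
{
    return blockDescrCount != 0 && blockDescrCount <= VIRTIO_BLK_MAX_DISCARD_SEG;
}
```

The check would run before the loop shown above, failing the SRB (or truncating the request) instead of writing past the array.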
This was lifted from RhelDoUnMap().
Will address in child / separately...
Hanna, did you already catch this one in your review of master...?
cc: @kevmw @YanVugenfirer
Yes: 7ab33cf
(Longer list of potential issues here: https://github.com/mstsirkin/kvm-guest-drivers-windows/commits/review1, and here: https://github.com/mstsirkin/kvm-guest-drivers-windows/pull/1/commits)
if (shAdapterId / 100)
{
    j = 0;
    hba_id_as_str[j] = (UCHAR)(shAdapterId / 100) + 48;
}
else
{
    hba_id_as_str[0] = 48;
    if (shAdapterId / 10)
    {
        j = 1;
        hba_id_as_str[j] = (UCHAR)(shAdapterId / 10) + 48;
    }
    else
    {
        hba_id_as_str[1] = 48;
        j = 2;
        hba_id_as_str[j] = (UCHAR)shAdapterId + 48;
    }
}
if ((j < 1) && (shAdapterId / 10))
{
    j = 1;
    hba_id_as_str[j] = (UCHAR)(((shAdapterId - ((shAdapterId / 100) * 100)) / 10) + 48);
}
else if ((j < 2) && (shAdapterId > 9))
{
    j = 2;
    hba_id_as_str[j] = (UCHAR)((shAdapterId - ((shAdapterId / 10) * 10)) + 48);
}
else
{
    j = 1;
    hba_id_as_str[j] = 48;
}
if ((j < 2) && (shAdapterId > 0))
{
    j = 2;
    hba_id_as_str[j] = (UCHAR)((shAdapterId - ((shAdapterId / 10) * 10)) + 48);
}
else if (j < 2)
{
    j = 2;
    hba_id_as_str[j] = 48;
}
/* NULL-terminate the string. */
hba_id_as_str[3] = '\0';
This block of code for converting an integer to a zero-padded 3-digit string is overly complex and hard to maintain. It can be significantly simplified by using a standard and safe string formatting function like RtlStringCbPrintfA. You will need to include <ntstrsafe.h> to use this function.
RtlStringCbPrintfA(hba_id_as_str, sizeof(hba_id_as_str), "%03u", shAdapterId);

 * NULL terminator wraps things up. Also used in TRACING.
 */
CopyBufferToAnsiString(&valname_as_str, ValueName, '\0', 60);
CopyBufferToAnsiString(&hba_id_as_str, &shAdapterId, '\0', 4);
The function VioStorReadRegistryParameter calls CopyBufferToAnsiString to copy shAdapterId into hba_id_as_str. The _maxlength parameter is set to 4, which is the exact size of the hba_id_as_str buffer. The CopyBufferToAnsiString function increments the destination pointer _maxlength times if the delimiter is not found, and then writes a null terminator at the resulting address. This results in a 1-byte out-of-bounds write (hba_id_as_str[4]) on the stack. Additionally, shAdapterId is a USHORT (2 bytes), so reading 4 bytes from its address results in an out-of-bounds read of stack data.
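A user-mode sketch of the safer approach: format the USHORT adapter id directly as a zero-padded 3-digit string into the 4-byte buffer, so no byte-wise copy from &shAdapterId (and no write past hba_id_as_str[3]) is needed. In the driver this role would be played by RtlStringCbPrintfA; snprintf stands in for it here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: converts the adapter id without reading past the
 * 2-byte USHORT or writing past the 4-byte destination. */
static void format_hba_id(char hba_id_as_str[4], unsigned short shAdapterId)
{
    /* "%03u" plus the terminator fills exactly 4 bytes for ids 0..999. */
    snprintf(hba_id_as_str, 4, "%03u", (unsigned)(shAdapterId % 1000));
}
```

This sidesteps both the 1-byte out-of-bounds write and the out-of-bounds read of stack data described above.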
for (j = 0; j < 4; ++j)
{
    valname_as_str[i + j + 1] = hba_id_as_str[j];
}
When constructing ValueNamePerHba, the code appends hba_id_as_str to valname_as_str. valname_as_str is defined as CHAR[64]. The initial part of the name is copied with a limit of 60 characters. If the initial string is 60 characters long, appending the underscore (1 byte) and the ID (up to 4 bytes including null terminator) will exceed the 64-byte boundary of the array. Specifically, valname_as_str[64] is written when j=3 in the loop at line 160, leading to a kernel-mode out-of-bounds write.
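A sketch of a bounds-checked append, using the sizes described above (valname_as_str is CHAR[64]); the helper name and signature are hypothetical. It returns 0 instead of writing past the end of the buffer when the base name leaves no room for the '_' separator, the id, and the NUL.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: append "_<hba_id>" to dst at offset `used`,
 * refusing to write past dst_size bytes. */
static int append_hba_suffix(char *dst, size_t dst_size, size_t used, const char *hba_id)
{
    size_t id_len = strlen(hba_id);
    if (used + 1 + id_len + 1 > dst_size)
    {
        return 0; /* would overflow: truncate or fail instead of writing */
    }
    dst[used] = '_';
    memcpy(dst + used + 1, hba_id, id_len + 1); /* copies the NUL too */
    return 1;
}
```

With a 60-character base name the append is rejected rather than writing valname_as_str[64].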
Force-pushed from a6080e8 to 90ab182
Force-pushed from cef2b05 to c9e0fd9
@benyamin-codez I expected a series of smaller cleanups to wait for or rebase on, but I didn't realise that you have effectively a whole driver rewrite consisting of unreviewed code. I'm afraid you have manoeuvred yourself into a dead end here; having so much out-of-tree code is bound to end in pain. There is no way we can wait for all of this to land before we touch the driver again, and even if we waited, review of each step would probably cause changes and constant rebasing pains, too. So I think we should go ahead with all the other fixes and cleanups and basically treat your tree as a source of ideas, but not as something to be merged as-is. And then we can work together on the current state in master, with small incremental self-contained changes.
Yeah, that's fine. That was the expectation.
I still think it worthwhile grabbing the MSI-X parts sooner rather than later. In particular, I see some value in retaining the type of mechanism in:

#define QUEUE_TO_MESSAGE(QueueId)            ((QueueId) + VIRTIO_BLK_MSIX_VQ_OFFSET)
#define MESSAGE_TO_QUEUE(MessageId)          ((MessageId) - VIRTIO_BLK_MSIX_VQ_OFFSET)
#define QUEUE_TO_MESSAGE_1_VECTOR(QueueId)   ((QueueId) + VIRTIO_BLK_MSIX_VQ_1_VCTR_OFFSET)
#define MESSAGE_TO_QUEUE_1_VECTOR(MessageId) ((MessageId) - VIRTIO_BLK_MSIX_VQ_1_VCTR_OFFSET)

This is one reason why I've been developing a clearer PR for your consideration. I've progressed work on this but as you can imagine the number of commits I understand you would prefer me to provide isn't insignificant. I'm quickly running out of time before I must attend to other matters for about 24 hours. Perhaps on the other side of the weekend I will have something more appropriate for you to review... Does this work for you, @XanClic and anyone else interested..?
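A round-trip sketch of the queue-to-message mapping above. The offset value of 1 is an assumption (message 0 reserved, e.g. for configuration changes); the real offsets come from the driver headers.

```c
#include <assert.h>

/* Assumed offset: one message slot reserved ahead of the VQ messages. */
#define VIRTIO_BLK_MSIX_VQ_OFFSET 1
#define QUEUE_TO_MESSAGE(QueueId)   ((QueueId) + VIRTIO_BLK_MSIX_VQ_OFFSET)
#define MESSAGE_TO_QUEUE(MessageId) ((MessageId) - VIRTIO_BLK_MSIX_VQ_OFFSET)
```

The two macros are inverses, so a MessageId arriving at the interrupt routine maps back to exactly the queue that raised it.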
Looks like I've got a little cleanup of the registry stuff... 8^d
Force-pushed from c9e0fd9 to 4df1016
See PR for more info

Signed-off-by: benyamin-codez <115509179+benyamin-codez@users.noreply.github.com>
Force-pushed from 4df1016 to 7229eab
else
{
    // Unused legacy pathway for when adaptExt->msix_sync_mode == InterruptSynchronizeAll
    ULONG QueueNumber;
Don't go through QueueNumbers here, DPCs are primarily about messages, not queues (and yes, the code is currently a bit confused about this). After #1503, this should just call MessageToDpcIdx() instead.
Yeah, I think it's a legacy holdover.
if (adaptExt->num_queues > 1)
    RhelDbgPrint(TRACE_LOCKS, " Processing StorPortAcquireMSISpinLock() for MessageId : %d \n", MessageId);
ULONG oldIrql = 0;
StorPortAcquireMSISpinLock(DeviceExtension, MessageId, &oldIrql);
This is also a question about the preexisting code: what is the advantage of StorPortAcquireMSISpinLock() in our case? Generally speaking, the advantage seems to be that you can selectively lock only for a specific message, but the way we use the adaptExt->dpc array per message with StorPortAcquireSpinLock() effectively achieves the same thing.
Could we simplify and just use the method that is always available?
The advantage is primarily improved SMP performance due to the way StorPort calls the interrupt callback.
So even if we call StorPortAcquireSpinLock() with DpcType, StorPort still acquires an interrupt spin lock on ALL messages and then calls the HW_MESSAGE_SIGNALED_INTERRUPT_ROUTINE at DIRQL. It is not sufficient to set the InterruptSynchronizationMode to InterruptSynchronizePerMessage because even in this mode, if we call StorPortAcquireSpinLock(), StorPort will still acquire an interrupt lock on all messages. This is the mechanism used to synchronise with all messages when the need arises, even in InterruptSynchronizePerMessage mode.
To take advantage of the performance benefit of InterruptSynchronizePerMessage mode, we need to exclusively call spinlocks using StorPortAcquireMSISpinLock(). In the edge case when we need to synchronise with all messages - and I'm not aware of any in the current implementation because we now always address the DPC context anyway - we would then use StorPortAcquireSpinLock() directly rather than via VioStorVQLock().
I included the StorPortAcquireSpinLock() conditional fork for completeness and in case we wanted to use the INTERRUPT_SYNCHRONIZATION_MODE enumerator down the track, by, say, exposing it to a registry switch.
As it stands I just did this:
(kvm-guest-drivers-windows/viostor/virtio_stor.c, lines 413 to 429, at 7229eab)
...where the only alternative is for our InterruptLock type fallback.
It's possible we originally implemented InterruptSynchronizePerMessage mode hoping to get the performance gains but then stumbled. It could date back to early MSI-X implementation. I think it was the right choice but we didn't nail the implementation.
I included the StorPortAcquireSpinLock() conditional fork for completeness and in case we wanted to use the INTERRUPT_SYNCHRONIZATION_MODE enumerator, down the track, by say exposing it to a registry switch.
Kevin, I should also mention that I retained the pathway because I also wanted to experiment with spinlocks again in vioscsi. My previous attempts to get StorPortAcquireMSISpinLock() working there were not terribly successful, but I now suspect that this was more about msix_one_vector and the MessageNumber maybe being off...
It works quite well here in viostor - in combination with other changes in this PR - using 4 vCPUs the average latency for 4KiB random reads reduced by over 50% to about 1ms, and 64KiB went down to a hair over 6ms.
cc: @YanVugenfirer
It is not sufficient to set the InterruptSynchronizationMode to InterruptSynchronizePerMessage because even in this mode, if we call StorPortAcquireSpinLock(), StorPort will still acquire an interrupt lock on all messages.
If it does indeed acquire an interrupt lock on all messages, that sounds significant. I tried to find this behaviour in the documentation, but couldn't find it. Instead, StorPortAcquireSpinLock() documents that you can explicitly take an InterruptLock after acquiring a DpcLock, which implies that it won't already be held. Do you have a pointer for me where this is explained?
Apologies, Kevin. I wasn't especially clear and conflated my response.
With regard to the HW_MESSAGE_SIGNALED_INTERRUPT_ROUTINE callback, which is HwMSIInterruptRoutine and VirtIoMSInterruptRoutine() in our viostor driver: in this routine we don't actually call any spin locks, but the InterruptSynchronizationMode affects how it operates.
If you look at the documentation for StorPortAcquireSpinLock(), in the tables you will see for HwMSIInterruptRoutine that the port driver (StorPort) acquires and holds an InterruptLock type spin lock and that you cannot acquire a spin lock in the miniport driver (in this case viostor), i.e. Allowed spin locks = None.
The HW_MESSAGE_SIGNALED_INTERRUPT_ROUTINE documentation gives a more concise overview too.
Generally, when we call StorPortAcquireSpinLock() we must honour the tables per the documentation, but if you look at the documentation for StorPortAcquireMSISpinLock() you will see that it can be called at any IRQL. This is because StorPort will manage the level hierarchy.
Again, the advantage is better SMP performance.
I should mention that, in any case, it is better to test for:
if (adaptExt->msix_sync_mode == InterruptSynchronizePerMessage)

...or similar, rather than:

if (adaptExt->num_queues <= 1)

...in deciding whether to use StorPortAcquireMSISpinLock()...
It seems quite counter-intuitive the current way, yes?
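The suggestion above can be sketched as a small predicate: select the per-message MSI lock based on the negotiated synchronization mode rather than the queue count. The enum values mirror StorPort's INTERRUPT_SYNCHRONIZATION_MODE but are declared locally here so the sketch stands alone; the helper name is hypothetical.

```c
#include <assert.h>

/* Local stand-in for StorPort's INTERRUPT_SYNCHRONIZATION_MODE. */
typedef enum
{
    InterruptSupportNone,
    InterruptSynchronizeAll,
    InterruptSynchronizePerMessage
} SYNC_MODE;

/* Use StorPortAcquireMSISpinLock() only when per-message
 * synchronization was actually negotiated. */
static int UseMsiSpinLock(SYNC_MODE msix_sync_mode)
{
    return msix_sync_mode == InterruptSynchronizePerMessage;
}
```

Keying the decision off the mode keeps the single-queue and InterruptSynchronizeAll paths on the fallback lock without overloading num_queues as a proxy.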
Is that true? I thought it was the opposite, i.e. that VirtIoMSInterruptRoutine() doesn't get called while you hold StorPortAcquireSpinLock(), and moreover that VirtIoMSInterruptRoutine() can be called for other vectors when you use StorPortAcquireMSISpinLock(). Also virtqueue_disable_cb() is per virtqueue, so should only block for the vector currently under lock.
Yes, I'm always talking only about the same vector here.
VirtIoMSInterruptRoutine() doesn't get called for the same vector while you hold an MSISpinLock for that message (either explicitly in VioStorVQLock() or automatically around VirtIoMSInterruptRoutine() with InterruptSynchronizePerMessage).
It also doesn't get called while you hold an InterruptLock (which may again be explicitly through the non-MSI path in VioStorVQLock(), or automatically around VirtIoMSInterruptRoutine() with InterruptSynchronizeAll, or automatically around VirtIoInterrupt()).
I don't see DpcLock documented as interacting with any of this, and it would also be unexpected to me if it did interact. So my understanding is that holding that one will still allow the interrupt handler to run, and only running the DPC will be affected. (If I'm not misreading, this actually means that in the case when CompleteDPC() returns FALSE, we're ignoring the DpcLock that the submission code may be holding.)
It seems quite counter-intuitive the current way, yes?
Sure, my intuition says that we should use StorPortAcquireMSISpinLock() whenever we can and a global InterruptLock as a fallback when it's not available. But obviously someone (in fact @vrozenfe) wrote the code with an additional case using DpcLock, so I assume that my intuition might be off and they had reasons for it.
This is why I'm asking all of these questions, because while intuitively always using StorPortAcquireMSISpinLock() seems simpler, I also don't really understand why it should improve performance so much compared to DpcLock. We don't seem to have a good theory for this yet, and I like to fully understand changes that I support.
But given that you do see the performance improvements, we should probably do a PR with just this one change and properly benchmark it.
Ok, so I think we (mostly) are on the same page.
iirc, the historical decisions were made during the initial MSI-X implementation, moving away from InterruptLock, and because msix_one_vector would introduce the ASSERT and BSOD when using StorPortAcquireMSISpinLock().
I'm fairly certain the reason why StorPortAcquireMSISpinLock() performs better in SMP systems is because VirtIoMSInterruptRoutine() can run for other vectors if StorPortAcquireSpinLock() isn't holding a lock.
In any case, I think your proposal for a separate PR and further benchmarking is a good idea. As I previously mentioned, I only saw the performance increase along with other changes, so it would be best to separate them first to have confidence in the claim.
It would also be prudent to rebase on PRs #1493 and #1503 first methinks...
I'm fairly certain the reason why StorPortAcquireMSISpinLock() performs better in SMP systems is because VirtIoMSInterruptRoutine() can run for other vectors if StorPortAcquireSpinLock() isn't holding a lock.
When I said above that my understanding is that DpcLock doesn't interact with interrupts, this scenario is precisely what I had in mind. Yes, holding an InterruptLock would definitely prevent VirtIoMSInterruptRoutine() from being called for another message. But holding a DpcLock shouldn't - and it shouldn't even prevent the DPC for another message from running, because the DpcLock is not global, but per DPC (and therefore per message in our design).
It would also be prudent to rebase on PRs #1493 and #1503 first methinks...
I agree.
Yep, I get what you're saying. That being said, it's somewhat hard to test whether VirtIoMSInterruptRoutine() is blocked (as in StorPort just doesn't call it) when using the DpcLock type (we both agree the InterruptLock type is a problem, but in that case we don't use MSI-X anyway) without stepping through in the debugger, and in this instance that would take some time - perhaps too long (before timeouts occur). Testing for this isn't especially clear via tracing either.
Performance testing should reveal something, but in the end you could well be right about DpcLock type and any improvement could be due to some other reason.
    return FALSE;
}
-if ((lba + blocks) > adaptExt->lastLBA)
+if ((lba + blocks) > adaptExt->lastLBA - 1)
I note 86394da (Issue B) touched this part.
I included this part as reported by @EnergyFaith in commentary on 67f64d0 for issue #829.
Did you want to include it in your lists?
I mean, as long as it’s fixed, it doesn’t need to be part of the list :)
But it would be a separate issue from issue B, right?
Sure, it's the boundary check itself rather than the unit mismatch.
My thoughts were that if this part is being refactored to address the unit mismatch, then maybe this off-by-one issue could be addressed at the same time.
Refactoring might also consolidate the two tests to:
if (lba >= adaptExt->lastLBA || (lba + blocks) >= adaptExt->lastLBA)

Or even eliminate the first test...
...because lba + xfer_length (blocks) will always be > lastLBA if lba > lastLBA...
lastLBA is actually the capacity (i.e. first LBA after the end of the disk), so it's a bit of a misnomer. Should it be renamed?
The current condition seems right to me (apart from the unit confusion): lba + blocks is the first LBA after the end of the request. It is allowed for this to be the first LBA after the end of the disk (lba + blocks == adaptExt->lastLBA), this happens when the request accesses the very end of the disk. Am I missing something?
What you could additionally take into account is integer overflows, if Windows doesn't already catch them before calling us. if (lba >= adaptExt->lastLBA || blocks > adaptExt->lastLBA - lba) should take care of that.
lastLBA is actually the capacity (i.e. first LBA after the end of the disk), so it's a bit of a misnomer. Should it be renamed?
Maybe totalLBAs...?
iinm, blk_config.capacity would be a count...
The current condition seems right to me (apart from the unit confusion):
lba + blocks is the first LBA after the end of the request. It is allowed for this to be the first LBA after the end of the disk (lba + blocks == adaptExt->lastLBA); this happens when the request accesses the very end of the disk. Am I missing something?
Likely not.
That makes sense.
This is probably why the comment by @EnergyFaith went unanswered...
What you could additionally take into account is integer overflows, if Windows doesn't already catch them before calling us.
if (lba >= adaptExt->lastLBA || blocks > adaptExt->lastLBA - lba) should take care of that.
Looks good to me.
If someone is addressing the unit mismatch, can they pick up this consolidation along with it?
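The consolidated, overflow-safe range check discussed above can be sketched as follows, with lastLBA holding the capacity (the first LBA past the end of the disk). Checking lba < lastLBA first means lastLBA - lba cannot underflow, and blocks is never added to lba, so nothing can wrap.

```c
#include <assert.h>

/* Returns nonzero when the request [lba, lba + blocks) lies within a
 * disk of `lastLBA` sectors, without any risk of integer overflow. */
static int LbaRangeValid(unsigned long long lba,
                         unsigned long long blocks,
                         unsigned long long lastLBA)
{
    return lba < lastLBA && blocks <= lastLBA - lba;
}
```

A request ending exactly at the capacity (lba + blocks == lastLBA) is accepted, matching the semantics described above, while a huge blocks value that would wrap lba + blocks is rejected.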








Here we have the first phase of my current Work In Progress on viostor for your consideration.
I will update this post with child PRs as they are raised or updated as the case may be.
Please consider keeping review and discussion in these child PRs where possible.
Please also give me a couple of days to raise and update same.
Best regards,
Ben
Child PRs:
PR [viostor] Fix performance issue (regression) #1289 (Ready)
PR [viostor] Introduce Registry read capability #1297 (To Be Updated)
PR [CI-NO-BUILD] [viostor] Programmatically determine max_segments #1298 (To Be Updated)
PR [CI-NO-BUILD] [viostor] Align memory allocations #1299 (To Be Updated)
PR [CI-NO-BUILD] [viostor] Refactor VirtIoHwInitialize() #1300 (To Be Updated)
PR [viostor] Refactor RhelGetDiskGeometry() #1301 (To Be Updated)
PR [CI-NO-BUILD] [viostor] Refactor RhelSetGuestFeatures() #1302 (To Be Updated)
PR [viostor] Zero-out adaptExt in VirtIoFindAdapter() #1304 (Ready)
PR [CI-NO-BUILD] [viostor] Increase MAX_PHYS_SEGMENTS limit #1315 (Ready)
adaptExt->num_queues as ULONG
struct member alignment
Locking
Interrupts
Tracing
More to follow... 8^{d>~