Xilinx ZynqMP CSUDMA problem workaround #7603
```diff
@@ -10,9 +10,11 @@
 #include <drivers/zynqmp_huk.h>
 #include <drivers/zynqmp_pm.h>
 #include <io.h>
+#include <kernel/delay.h>
 #include <kernel/tee_common_otp.h>
 #include <mm/core_memprot.h>
 #include <string_ext.h>
+#include <tee/cache.h>
 #include <tee/tee_cryp_utl.h>
 #include <trace.h>
 #include <utee_defines.h>
@@ -114,6 +116,7 @@ TEE_Result tee_otp_get_hw_unique_key(struct tee_hw_unique_key *hwkey)
 	uint8_t dst[ZYNQMP_CSU_AES_DST_LEN(sizeof(src))]
 		__aligned_csuaes = { 0 };
 	TEE_Result ret = TEE_ERROR_GENERIC;
+	uint32_t retry_counter = 0;
 	uint32_t status = 0;

 	static_assert(sizeof(device_dna) == ZYNQMP_GCM_IV_SIZE);
```
```diff
@@ -194,19 +197,54 @@ TEE_Result tee_otp_get_hw_unique_key(struct tee_hw_unique_key *hwkey)
 	memset(src, 0, sizeof(src));
 	/* Ignore the tag data from the dst buffer - pass a smaller size */
 	ret = zynqmp_csu_aes_decrypt_data(dst, sizeof(src), src, sizeof(src),
-					  tag, sizeof(tag), iv,
-					  ZYNQMP_EFUSE_LEN(DNA),
+					  tag, sizeof(tag), iv, sizeof(iv),
 					  ZYNQMP_CSU_AES_KEY_SRC_DEV);
 	if (ret) {
 		EMSG("Can't decrypt DNA, please make sure PUF was registered");
 		goto cleanup;
 	}

-	if (memcmp(src, sha, sizeof(sha))) {
-		EMSG("PUF not ready, can't create HUK");
-		ret = TEE_ERROR_GENERIC;
-		goto cleanup;
-	}
+	/*
+	 * This retry loop here is a workaround for hardware issue where
+	 * CSU-DMA transfer is not yet delivered to DDR memory (in variable
+	 * 'src').
+	 *
+	 * CSU-AES/CSU-DMA code is implemented in a way that it waits for DONE
+	 * signal which indicates that memory transfer has left the DMA engine.
+	 */
+	retry_counter = 0;
+	do {
+		if (memcmp(src, sha, sizeof(sha)) == 0)
+			break;
+
+		/*
+		 * If the transfer is not completed within 1 second then there
+		 * is a real failure.
+		 */
+		if (retry_counter >= 1000) {
```
**Contributor:** Suggestion: move this to

**Contributor (Author):** I am OK to change the construct.

**Contributor:** please could you extract the workaround section into a workaround function? makes it easier to understand
```diff
+			EMSG("PUF not ready, can't create HUK");
+			ret = TEE_ERROR_GENERIC;
+			goto cleanup;
+		}
+
+		/*
+		 * Give CPU's internal fifos some time to transfer data to the
```
**Contributor:** Isn't this the CSUDMA transfer to RAM that is incorrectly signaled as completed?

**Contributor (Author):** Since this is more or less a black box for us, we are somewhat guessing here :| The current theory is that there are several internal FIFOs between the domains, and those "split" the bus ack so that the sender (CSUDMA) thinks the transfer is complete. Now some internal timing on the larger chip (or similar) has changed, so the delay is slightly longer before the result arrives in DDR. In our testing one round, i.e. 1 ms, was enough, but we went for a larger retry count. That being said, other CSUDMA operations are subject to the same issue. I would really like to hear from the AMD/Xilinx folks what the right thing to do here is. We had a similar issue with the FSBL, where we decided to disable CSUDMA usage altogether for the "memcpy" operation optimization. That happened on smaller chips as well.

**Contributor (Author):** Btw. we have also observed a similar issue when transferring data from FPGA to DDR: some data might take a bit longer to arrive after we get the IRQ, so we had to implement some waiting logic plus magic and counter checks to confirm that the data has fully arrived before starting to work with it. That one we concluded to be normal operation. It could be the same FIFO issue in both cases.

**Contributor (Author):** I prepared a different kind of change that still needs testing on secured hardware. More or less, I decided that 1 ms should be plenty for the internal FIFOs to work their way through, and I relocated the fixup into the CSU DMA driver:

```diff
diff --git a/core/drivers/zynqmp_csudma.c b/core/drivers/zynqmp_csudma.c
index 4e3b59c9c..e20a2798e 100644
--- a/core/drivers/zynqmp_csudma.c
+++ b/core/drivers/zynqmp_csudma.c
@@ -74,6 +74,28 @@ TEE_Result zynqmp_csudma_sync(enum zynqmp_csudma_channel channel)
 		status = io_read32(dma + CSUDMA_I_STS_OFFSET);
 		if ((status & CSUDMA_IXR_DONE_MASK) == CSUDMA_IXR_DONE_MASK) {
 			csudma_clear_intr(channel, CSUDMA_IXR_DONE_MASK);
+
+			if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
+				/*
+				 * This retry extra delay here is a workaround
+				 * for hardware issue where CSU-DMA transfer is
+				 * not yet delivered to DDR memory.
+				 *
+				 * Even thou there is now a signal that transfer
+				 * has left the DMA engine there are still other
+				 * CPU's internal FIFOs that needs some time to
+				 * transfer data to the final destination.
+				 *
+				 * 1 millisecond delay should be more than
+				 * enough to transfer the remaining data.
+				 *
+				 * Technically this should only be needed on DDR
+				 * destination but as operations in OP-TEE are
+				 * few, the extra delay for crypto destination
+				 * should not cause much an issue. "Guarantee"
+				 * that the data is in destination before
+				 * returning to caller is the driver here.
+				 */
+				mdelay(1);
+			}
 			return TEE_SUCCESS;
 		}
 	}
```

What do you think -- should we actually
**Contributor:** I think that it makes sense. As long as the CSUDMA is not used excessively on the OP-TEE side, this should not impact performance. I agree that getting some feedback from Xilinx/AMD would be great here as well.

**Contributor (Author):** Unfortunately, that alternative change did not work in the hardware tests.

**Contributor:** I based the csudma driver on https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/csudma . Could you check if there is any fix for the issue you are reporting there?

**Contributor (Author):** I can check -- their CSU-DMA memcpy() has been broken for some releases; we have a patch to work around that.

**Contributor (Author):** Will retry this with the cache line fix that had an insufficient number of bytes for the size.
```diff
+		 * DDR memory.
+		 */
+		mdelay(1);
+
+		/*
+		 * Invalidate src buffer from caches as memcmp above has
+		 * already loaded data from DDR to cache so that next compare
+		 * will be with fresh data.
+		 */
+		ret = cache_operation(TEE_CACHEINVALIDATE, src, sizeof(src));
+		if (ret) {
+			EMSG("Failed to invalidate cache");
+			goto cleanup;
+		}
+		retry_counter++;
+	} while (1);

 	IMSG("HUK ready");
```
**Contributor:** are you sure this handles words and not bytes?
**Contributor:** also please add some details - this is not fixing anything, just introducing a regression.
**Contributor (Author):** With cache functions you should never go beyond the range of expected bytes, or you can get random problems: you might inadvertently touch memory that is in use by some other code and cause unexpected behavior for it.

`cache_operation()` is used in many places in OP-TEE to do these operations. The unit of the length is not specified as such, but all calling code tends to assume a number of bytes. `dcache_inv_range()` is no longer used directly in OP-TEE.

There is a helper allocator, `alloc_cache_aligned()`, that would normally allocate cache-aligned memory, but here we have variable sizes and alignment magic on the stack, so cache-line alignment happens internally.
**Contributor:** Yes, it is. And I'd say that I tend to prefer it over `cache_operation()`.

**Contributor:** @vesajaaskelainen I don't see what magic you are referring to, and I don't think anything needs fixing here. The alignment is checked in the function for these sizes.
**Contributor (Author):** `dcache_inv_range()` takes its size in bytes. Now the size is given as `SHIFT_U64(len, CSUDMA_SIZE_SHIFT)`, which is the same as multiplying by 4. But what I did not spot on the first round is that in the original code the length is computed with a division by `sizeof(uint32_t)` (== 4), so we have arithmetic operations that cancel each other out. Sorry about that.

Let me test the `msleep(1)` workaround again with the point that was raised in mind.
**Contributor (Author):** And the magic I referred to is: with `__aligned()` we align the memory to start on the boundary of the CSUDMA alignment requirement.
**Contributor (Author):** I was looking with `git grep`, and by that comparison `cache_operation()` is used a lot in drivers. This made me believe that `cache_operation()` is the "public API" and the way to go.
**Contributor (Author):** Please correct me if I am wrong, but isn't `dcache_inv_range()` working on the CPU's L1 cache only?

If we look at the Zynq UltraScale+ Device TRM:

- From CSU Processor -> DDR:
- From A53 Core X -> DDR:

The shared parts here are the CCI and DDR. If the CCI just knows that everything is correct, that should be good enough; I don't see the need for any CCI operations. The SCU should help for requests that it does see, but the CSU processor flow does not go through it. What we do have, though, is the L2 cache, which is handled by `cache_operation()`.
**Contributor:** It operates on all internal caches to the Point of Coherency.

But I was a bit quick to dismiss `cache_operation()`; it's the right thing to use from a driver. `cache_operation()` also handles an eventual external cache. The `dcache_*()` functions are sometimes more precise, so there are situations where those are needed.