In the C# documentation there is an interesting comment concerning IOBinding:
// This model has input and output of the same shape, so we can easily feed
// output to input using binding, or not using one. The example makes use of
// the binding to demonstrate the circular feeding.
// With the OrtValue API exposed, one can create OrtValues over arbitrary buffers
// and feed them to the model using the OrtValue-based Run APIs. Thus, the
// binding is no longer necessary.
Could you provide an example of doing this in C#, i.e. creating OrtValues over arbitrary buffers and feeding them to the model via the OrtValue-based Run APIs, without a binding?
I am basically trying to reproduce the llama example available here in C#. I use both OrtValue and the IoBinding object, but I am seeing that the DefaultInstance of the allocator is always the CPU, while my model is correctly on the GPU. When I feed the bound output back into the model's input it works, but inference is slower, so I might be doing something wrong.
From the above comment, I think I could instantiate an OrtValue directly on the correct device (GPU 0 for now) without needing IoBinding at all. Could you create or point me to an example of such a thing?
My end goal is to create only one cache per LLM inference, as the "cache" output by an LLM is usually re-used in the next call as part of the input.
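For reference, here is roughly what I imagine it would look like. This is only a rough, untested sketch: the model path, the cache shape, and the "past_cache"/"present_cache" tensor names are made up for illustration, and I am assuming `OrtValue.CreateAllocatedTensorValue` plus the `Run` overload that takes preallocated output OrtValues are the right entry points.

```csharp
// Rough sketch, not tested. Assumes the Microsoft.ML.OnnxRuntime.Gpu package
// and a model whose "past" input and "present" output share one shape.
using Microsoft.ML.OnnxRuntime;

using var sessionOptions = SessionOptions.MakeSessionOptionWithCudaProvider(0);
using var session = new InferenceSession("model.onnx", sessionOptions);

// Build a device allocator for GPU 0; OrtAllocator.DefaultInstance is always CPU.
using var cudaMemInfo = new OrtMemoryInfo(OrtMemoryInfo.allocatorCUDA,
    OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default);
using var cudaAllocator = new OrtAllocator(session, cudaMemInfo);

// Allocate the cache tensors once, directly in GPU memory (shape is illustrative).
long[] cacheShape = { 1, 32, 128, 64 };
using var pastCache = OrtValue.CreateAllocatedTensorValue(
    cudaAllocator, TensorElementType.Float, cacheShape);
using var presentCache = OrtValue.CreateAllocatedTensorValue(
    cudaAllocator, TensorElementType.Float, cacheShape);

using var runOptions = new RunOptions();

// Run with a preallocated device output: the "present" cache stays on the GPU
// and can be fed back as the "past" cache on the next call, no IoBinding needed.
session.Run(runOptions,
    new[] { "past_cache" }, new[] { pastCache },
    new[] { "present_cache" }, new[] { presentCache });

// Next step would swap the roles of the two OrtValues and call Run again
// (circular feeding), so no device-to-host copy ever happens.
```

Is this the intended pattern, or is there a better way to keep the cache on the device between calls?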
Maybe @yuslepukhin can help since he wrote the comment ?
Thank you very much