There are 2 distinct scenarios.
- Someone downloads a model from HF or OCI in onnx format, runs onnxruntime to compile it to a `.bundle`, saves it to the local cache, and then runs it.
- Someone downloads a model from HF or OCI in `.bundle` format and runs onnxruntime to run it; no compilation is necessary.
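
A rough sketch of those two flows as they stand today; the image names, model references, and subcommands are hypothetical placeholders, not the actual nekko or onnxruntime CLI:

```sh
# Scenario 1: model is distributed as onnx; compile to .bundle, cache it, then run it.
docker run -v ~/.cache/models:/cache example/onnxruntime \
  compile hf://some-org/some-model -o /cache/some-model.bundle
docker run -v ~/.cache/models:/cache example/onnxruntime \
  run /cache/some-model.bundle

# Scenario 2: model is already distributed as a .bundle; no compilation needed,
# yet it still goes through the same toolchain-heavy onnxruntime image.
docker run example/onnxruntime run oci://some-org/some-model.bundle
```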
There are two image-related issues with the second scenario.
First, the image is much larger than it needs to be. It still needs the inference server, and maybe neuralizer, but it does not need glow, gcc, llvm, etc. This makes the image harder to use and a worse user experience.
Second, the image name is confusing. It is no longer an _onnx_ runner; it is a _bundle_ runner. This is equivalent to using gcc tools to run an executable. Sure, developers can (and sometimes do) use gdb to run something, but the vast majority of users do not need or want the toolchain; they just want the compiled binary.
We probably need to separate the image into two:
- bundlerunner - runs a compiled `.bundle`
- onnxruntime - compiles an onnx model into a `.bundle` and then optionally runs it
We can discuss whether onnxruntime should also be able to run the compiled bundle, or whether the correct process is to say "run this" and then "run that". Since it is all wrapped in nekko, I would favour keeping them separate.
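
For comparison, a sketch of the separated flow, again with hypothetical image names and subcommands; the idea is that onnxruntime only compiles, bundlerunner only runs, and nekko sequences the two steps:

```sh
# Compile step: toolchain image (glow, gcc, llvm) produces the .bundle.
docker run -v ~/.cache/models:/cache example/onnxruntime \
  compile hf://some-org/some-model -o /cache/some-model.bundle

# Run step: slim image with only the inference server (and maybe neuralizer).
docker run -v ~/.cache/models:/cache example/bundlerunner \
  run /cache/some-model.bundle
```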
cc @jerenkrantz @rvs