@mmankal mmankal commented Jun 20, 2025

Make sure to read the contributing guidelines before submitting a PR

Anoop Kapoor and others added 30 commits May 23, 2025 22:13
@FIR-702 - llama.cpp: Sync with latest open source
This change has the following:
1. Move to the new SDK 0.1.2
2. Remove the requirement for libgomp in the FPGA build
@FIR-707: Fix requirement for libgomp and move to new sdk 0.1.2
The changes have the following:
1. Enable profiling for the Tsavorite backend for TXE
2. Add std=c++20 for compiling the profiler
The test results are as follows:
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin# ./run_platform_test.sh
Check if tnApcMgr is running; if it is not, uncomment below line and execute the run_platform_test.sh script.
Running on v0.1.1.tsv30_05_24_2025
[2018-03-09 13:52:26.300409] 271:272 [ info]  :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
[2018-03-09 13:52:27.339] [info] [llama.cpp:56] Execution time: 1019 ms
[2018-03-09 13:52:27.347638] 2909:2909 [ info] [LlamaForCausalLM_Random v. 2] TestBase.h:154: Model executed successfully. Validating result...
[2018-03-09 13:52:27.380511] 2909:2909 [ info] [LlamaForCausalLM_Random v. 2] TestBase.h:193: PASS [relative err=0.000000, relTol=1.000000e-05]
[2018-03-09 13:52:27.405665] 271:272 [ info]  :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:145> TXE resource release request processed successfully.

Profiling Results (LlamaForCausalLM_Random):
------------------------------------------------------------------------------------------------------------------------
Calls  Total(ms)    T/call  Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
  243    498.000     2.049     0.000  [45%] RuntimeHostShim::awaitCommandListCompletion
   84    200.688     2.389   200.688  └─ [18%] [ txe_blob_1 ]
   32     76.626     2.395    76.626  └─ [ 7%] [ txe_blob_6 ]
   16     55.493     3.468    55.493  └─ [ 5%] [ txe_blob_12 ]
    8     31.821     3.978    31.821  └─ [ 3%] [ txe_blob_10 ]
    8     31.322     3.915    31.322  └─ [ 3%] [ txe_blob_7 ]
    8     31.152     3.894    31.152  └─ [ 3%] [ txe_blob_8 ]
    8     27.693     3.462    27.693  └─ [ 2%] [ txe_blob_9 ]
   17     26.019     1.531    26.019  └─ [ 2%] [ txe_blob_2 ]
   17     25.906     1.524    25.906  └─ [ 2%] [ txe_blob_5 ]
   17     25.899     1.523    25.899  └─ [ 2%] [ txe_blob_3 ]
   17     25.833     1.520    25.833  └─ [ 2%] [ txe_blob_4 ]
    8     23.993     2.999    23.993  └─ [ 2%] [ txe_blob_11 ]
    3      6.002     2.001     6.002  └─ [ 1%] [ txe_blob_0 ]
    1     35.000    35.000    35.000  [ 3%] RuntimeHostShim::finalize
  188     33.000     0.176    33.000  [ 3%] RuntimeHostShim::copy
    1     16.000    16.000    16.000  [ 1%] RuntimeHostShim::initialize
   13      1.000     0.077     1.000  [ 0%] RuntimeHostShim::loadBlob
  573      0.000     0.000     0.000  [ 0%] RuntimeHostShim::allocate
  573      0.000     0.000     0.000  [ 0%] RuntimeHostShim::deallocate
  243      0.000     0.000     0.000  [ 0%] RuntimeHostShim::createCommandList
  922      0.000     0.000     0.000  [ 0%] RuntimeHostShim::getShmemManager
  243      0.000     0.000     0.000  [ 0%] RuntimeHostShim::launchBlob
  243      0.000     0.000     0.000  [ 0%] RuntimeHostShim::addCommandToList
  243      0.000     0.000     0.000  [ 0%] RuntimeHostShim::finalizeCommandList
   13      0.000     0.000     0.000  [ 0%] RuntimeHostShim::unloadBlob
   33      0.000     0.000     0.000  [ 0%] RuntimeHostShim::stridedCopy
========================================================================================================================
 3532   1116.000     0.316  1116.000  [100%] TOTAL
========================================================================================================================

register_backend: registered backend Tsavorite (1 devices)
register_device: registered device Tsavorite (txe)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin/tsi-ggml/libggml-tsavorite.so
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin/tsi-ggml/libggml-cpu.so
build: 5464 (194fbaa) with gcc (GCC) 13.3.0 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any

 TXE Device MEMORY Summary total 134217728 and free 134217728
llama_model_load_from_file_impl: using device Tsavorite (txe) - 128 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 75 tensors from /tsi/anoop_feb26/tinyllama-vo-5m-para.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Vicuna Hf
llama_model_loader: - kv   3:                         general.size_label str              = 4.6M
llama_model_loader: - kv   4:                            general.license str              = apache-2.0
llama_model_loader: - kv   5:                          llama.block_count u32              = 8
llama_model_loader: - kv   6:                       llama.context_length u32              = 2048
llama_model_loader: - kv   7:                     llama.embedding_length u32              = 64
llama_model_loader: - kv   8:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv   9:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                          general.file_type u32              = 32
llama_model_loader: - kv  12:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  13:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   17 tensors
llama_model_loader: - type bf16:   58 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 8.82 MiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1914 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 2048
print_info: n_embd           = 64
print_info: n_layer          = 8
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 4
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 4
print_info: n_embd_head_v    = 4
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 64
print_info: n_embd_v_gqa     = 64
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 256
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 2048
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 4.62 M
print_info: general.name     = Vicuna Hf
print_info: vocab type       = SPM
print_info: n_vocab          = 32000
print_info: n_merges         = 0
print_info: BOS token        = 1 '<s>'
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 0 '<unk>'
print_info: PAD token        = 0 '<unk>'
print_info: LF token         = 13 '<0x0A>'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 18
load_tensors: loading model tensors, this can take a while... (mmap = true)

 TXE Device MEMORY Summary total 134217728 and free 134217728
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/9 layers to GPU
load_tensors:   CPU_Mapped model buffer size =     8.82 MiB
..............
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 12288
llama_context: n_ctx_per_seq = 12288
llama_context: n_batch       = 1024
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (12288) > n_ctx_train (2048) -- possible training context overflow
[2018-03-09 13:52:28.706203] 271:272 [ info]  :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
llama_context:        CPU  output buffer size =     0.12 MiB
llama_kv_cache_unified:        CPU KV buffer size =    24.00 MiB
llama_kv_cache_unified: size =   24.00 MiB ( 12288 cells,   8 layers,  1 seqs), K (f16):   12.00 MiB, V (f16):   12.00 MiB
ggml_backend_tsavorite_buffer_type_alloc_buffer is called from llama data Loader

 ANoop Allocating memory from tsi_alloc with size  266240

 Allocating memory from tsi_alloc with size  266240 starting memory 0xffff93e00080

Address of Newly Created BUffer 0xffff93e00080 and size 266240
llama_context:  tsavorite compute buffer size =     0.25 MiB
llama_context:        CPU compute buffer size =   408.51 MiB
llama_context: graph nodes  = 294
llama_context: graph splits = 67 (with bs=512), 37 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 12288
main: llama threadpool init, n_threads = 4
main: model was trained on only 2048 context tokens (12288 specified)

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |

sampler seed: 177927434
sampler params:
	repeat_last_n = 5, repeat_penalty = 1.500, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 12288
	top_k = 50, top_p = 0.900, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 12288, n_batch = 1024, n_predict = 10, n_keep = 1

 my cat's name was Tim. He loved to play with his toy

llama_perf_sampler_print:    sampling time =     195.98 ms /    16 runs   (   12.25 ms per token,    81.64 tokens per second)
llama_perf_context_print:        load time =    1577.27 ms
llama_perf_context_print: prompt eval time =     305.19 ms /     6 tokens (   50.86 ms per token,    19.66 tokens per second)
llama_perf_context_print:        eval time =     803.59 ms /     9 runs   (   89.29 ms per token,    11.20 tokens per second)
llama_perf_context_print:       total time =    2628.44 ms /    15 tokens
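As a sanity check on the perf lines above, the per-token and tokens-per-second figures follow directly from the reported totals. A small sketch (the numbers are taken from the eval line in the log above):

```python
# Cross-check the llama_perf_context_print eval line:
#   eval time = 803.59 ms / 9 runs (89.29 ms per token, 11.20 tokens per second)
eval_ms = 803.59
runs = 9

ms_per_token = eval_ms / runs           # total time divided by runs
tokens_per_sec = 1000.0 / ms_per_token  # invert and convert ms -> s

print(round(ms_per_token, 2))    # -> 89.29
print(round(tokens_per_sec, 2))  # -> 11.2
```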

 TXE_ADD Operation, total tensor: 10  Number of Kernel Call: 10  Number of tensor got spilt: 0 Min Num of Elem 64 Max Num of Elem 64

 TXE_SUB Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_MULT Operation, total tensor: 170  Number of Kernel Call: 245  Number of tensor got spilt: 0 Min Num of Elem 64 Max Num of Elem 384

 TXE_DIV Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_SQRT Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_NEG Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_ABS Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_SIN Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0

 TXE_SIGMOID Operation, total tensor: 0  Number of Kernel Call: 0  Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
[2018-03-09 13:52:32.222949] 271:272 [ info]  :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:145> TXE resource release request processed successfully.

GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
Calls  Total(ms)    T/call  Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
  255    255.000     1.000     0.000  [ 7%] RuntimeHostShim::awaitCommandListCompletion
  245    379.466     1.549   379.466  └─ [11%] [ txe_mult_blob ]
   10     15.443     1.544    15.443  └─ [ 0%] [ txe_add_blob ]
    1     35.000    35.000    35.000  [ 1%] RuntimeHostShim::finalize
    1     19.000    19.000     2.000  [ 1%] GGML Tsavorite
    1     17.000    17.000    17.000  └─ [ 0%] RuntimeHostShim::initialize
  256      0.000     0.000     0.000  [ 0%] RuntimeHostShim::allocate
 1020      0.000     0.000     0.000  [ 0%] RuntimeHostShim::getShmemManager
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::createCommandList
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::loadBlob
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::launchBlob
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::addCommandToList
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::finalizeCommandList
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::unloadBlob
  255      0.000     0.000     0.000  [ 0%] RuntimeHostShim::deallocate
========================================================================================================================
 3318   3529.000     1.064  3529.000  [100%] TOTAL
========================================================================================================================

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin#
FIR-709 - GGML: Adding SILU kernel
The linker errors were as follows

/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: /proj/work/atrivedi/workspace/06_02_2025/llama.cpp/ggml-tsi-kernel/fpga/host/host_abs.o: in function `txe_abs_host':
LLVMDialectModule:(.text+0x18): undefined reference to `tsi_alloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x24): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x30): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x3c): undefined reference to `tsi_create_command_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x58): undefined reference to `tsi_load_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x64): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x70): undefined reference to `tsi_launch_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x7c): undefined reference to `tsi_add_command_to_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x84): undefined reference to `tsi_finalize_command_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x8c): undefined reference to `tsi_wait'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x94): undefined reference to `tsi_unload_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0xa0): undefined reference to `tsi_dealloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: /proj/work/atrivedi/workspace/06_02_2025/llama.cpp/ggml-tsi-kernel/fpga/host/host_add.o: in function `txe_add_host':
LLVMDialectModule:(.text+0x20): undefined reference to `tsi_alloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x2c): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x38): undefined reference to `tsi_shmem_handle_from_ptr'
FIR-714: Updated the SDK Release r0.1.3
FIR-722 - ggml-tsi-kernel: latest changes updated
This is a first version of the FlaskInterface tool with the following:
1. Xterm interface in the browser via the /terminal endpoint
2. Serial console interface in the browser via the /serial endpoint
@FIR-715: Added FlaskInterface tool for serial port
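The endpoint layout described above can be sketched as below. This is a hedged sketch, not the actual FlaskInterface code: the real tool uses Flask, but for a self-contained example this uses the stdlib `http.server`, and the route names are the only detail taken from the PR — the page bodies and class name are placeholders.

```python
# Sketch of the two-endpoint layout: /terminal for a browser xterm,
# /serial for the serial console. Bodies are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

ROUTES = {
    "/terminal": "<html><!-- xterm.js terminal attached here --></html>",
    "/serial":   "<html><!-- serial console output streamed here --></html>",
}

def dispatch(path: str) -> tuple:
    """Map a request path to (status, body); 404 for unknown paths."""
    body = ROUTES.get(path)
    return (200, body) if body is not None else (404, "not found")

class FlaskInterfaceSketch(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = dispatch(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # To actually serve (port 5003 matches the URL used later in this PR):
    # HTTPServer(("", 5003), FlaskInterfaceSketch).serve_forever()
    print(dispatch("/terminal")[0], dispatch("/serial")[0])  # -> 200 200
```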
Llama.cpp: Webserver & HTML pages support
akapoor3518 and others added 19 commits June 11, 2025 21:52
@FIR-733 - llama.cpp: Webserver, add job status support for the model
@FIR-731 - serial_script.py changes to identify end of output
This commit has two changes:
1. Added another endpoint, llama-cli, to invoke run_platform_test.sh
   directly
2. Updated reading of the output to byte-by-byte, to identify the
   marker prompt and exit when the marker is seen
@FIR-737: Added another endpoint, llama-cli, to invoke run_platform_test.sh directly from the URL
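The byte-by-byte marker scan described above can be sketched like this. It is an illustrative sketch, not the actual serial_script.py: the function name, the marker value, and the fake in-memory stream are all assumptions; a real caller would pass the serial port object instead.

```python
import io

def read_until_marker(stream, marker: bytes) -> bytes:
    """Read one byte at a time, returning everything seen up to and
    including the first occurrence of `marker` (e.g. the shell prompt
    printed when run_platform_test.sh finishes)."""
    buf = bytearray()
    while True:
        b = stream.read(1)
        if not b:              # EOF before the marker appeared
            break
        buf += b
        if buf.endswith(marker):
            break
    return bytes(buf)

# Usage with a fake serial stream (a real script would pass the port):
fake = io.BytesIO(b"...token output...\nroot@agilex7# ")
out = read_until_marker(fake, b"root@agilex7# ")
print(out.endswith(b"root@agilex7# "))  # -> True
```

Reading a single byte at a time is slower than chunked reads, but it guarantees the loop stops exactly at the marker instead of consuming output past it.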

Co-authored-by: Ashish Trivedi <[email protected]>
@FIR-736 - llama.cpp: Disable all logs except the token generation log
…path (#16)

The changes are as follows:
1. Change directory to the right folder before running the commands
2. Add system-info and txe-restart functionality

Co-authored-by: Ashish Trivedi <[email protected]>
@FIR-720 - GGML: Add TMU (MAT_MUL) kernel
* @FIR-754: Added all parameter parsing for the llama-cli

The test results are as follows:
Model Response
cd /usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin/; ./run_llama_cli.sh "My cat's name"
" 50 tinyllama-vo-5m-para.gguf tSavorite 1.5 1024 50 0.9 5 12288 0.0
[2018-03-09 13:03:17.788243] 271:272 [ info]  :: </proj/work/mmankali/bld-setuptest/tsirel-31/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
 My cat's name was Tim. He loved to play with his toy car. He would run and jump in the park, making loud noises. Tim was very happy with his new toy car.
One day, Tim's mom said, "Tim. You

llama_perf_sampler_print:    sampling time =     999.96 ms /    56 runs   (   17.86 ms per token,    56.00 tokens per second)
llama_perf_context_print:        load time =    1713.55 ms
llama_perf_context_print: prompt eval time =     603.51 ms /     6 tokens (  100.58 ms per token,     9.94 tokens per second)
llama_perf_context_print:        eval time =    7069.36 ms /    49 runs   (  144.27 ms per token,     6.93 tokens per second)
llama_perf_context_print:       total time =   10046.17 ms /    55 tokens
[2018-03-09 13:03:28.875126] 271:272 [ info]  :: </proj/work/mmankali/bld-setuptest/tsirel-31/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:145> TXE resource release request processed successfully.

GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
Calls  Total(ms)    T/call  Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
 2715   2720.000     1.002     0.000  [25%] RuntimeHostShim::awaitCommandListCompletion
 1740   2635.984     1.515  2635.984  └─ [24%] [ txe_silu ]
  925   1379.715     1.492  1379.715  └─ [12%] [ txe_mult ]
   50     74.450     1.489    74.450  └─ [ 1%] [ txe_add ]
 2715      0.448     0.000     0.448  └─ [ 0%] TXE 0 Idle
    1     34.000    34.000    34.000  [ 0%] RuntimeHostShim::finalize
    1     16.000    16.000     1.000  [ 0%] GGML Tsavorite
    1     15.000    15.000    15.000  └─ [ 0%] RuntimeHostShim::initialize
 2716      0.000     0.000     0.000  [ 0%] RuntimeHostShim::allocate
 9120      0.000     0.000     0.000  [ 0%] RuntimeHostShim::getShmemManager
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::createCommandList
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::loadBlob
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::launchBlob
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::addCommandToList
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::finalizeCommandList
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::unloadBlob
 2715      0.000     0.000     0.000  [ 0%] RuntimeHostShim::deallocate
========================================================================================================================
33558  11098.000     0.331 11098.000  [100%] TOTAL
========================================================================================================================

⟵ Back to Form

The URL used is as follows
http://10.50.0.124:5003/llama-cli?model=tiny-llama&backend=tSavorite&tokens=10&prompt=My+cat%27s+name&repeat-penalty=1.5&batch-size=1024&top-k=50&top-p=0.9&last-n=5&context-length=12288&temp=0.0
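The query string above carries the llama-cli parameters as web parameters. A rough sketch of how such a query might be mapped onto CLI flags — the helper and the mapping table are illustrative guesses, not the endpoint's actual code, and the `backend` parameter is deliberately left unmapped since it selects the device rather than a CLI flag:

```python
from urllib.parse import urlparse, parse_qs

# Illustrative mapping from web query parameters to llama-cli flags.
PARAM_TO_FLAG = {
    "model": "--model",
    "tokens": "--n-predict",
    "prompt": "--prompt",
    "repeat-penalty": "--repeat-penalty",
    "batch-size": "--batch-size",
    "top-k": "--top-k",
    "top-p": "--top-p",
    "last-n": "--repeat-last-n",
    "context-length": "--ctx-size",
    "temp": "--temp",
}

def build_cli_args(url: str) -> list:
    """Turn a /llama-cli request URL into an argument list."""
    qs = parse_qs(urlparse(url).query)  # decodes %27 and '+' for us
    args = []
    for param, flag in PARAM_TO_FLAG.items():
        if param in qs:
            args += [flag, qs[param][0]]
    return args

url = ("http://10.50.0.124:5003/llama-cli?model=tiny-llama&backend=tSavorite"
       "&tokens=10&prompt=My+cat%27s+name&repeat-penalty=1.5&batch-size=1024"
       "&top-k=50&top-p=0.9&last-n=5&context-length=12288&temp=0.0")
print(build_cli_args(url))
```

Note how `parse_qs` already URL-decodes the prompt, so `My+cat%27s+name` arrives as `My cat's name` before it is handed to the CLI.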

* @FIR-754: Addressed review comments.

---------

Co-authored-by: Ashish Trivedi <[email protected]>
#20)

The test results with ./run_llama_cli.sh with 5 tokens are as follows:

+++
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin# ./run_llama_cli.sh
 my cat's name is Max. He'

llama_perf_sampler_print:    sampling time =     111.70 ms /    11 runs   (   10.15 ms per token,    98.47 tokens per second)
llama_perf_context_print:        load time =  132926.48 ms
llama_perf_context_print: prompt eval time =  109957.33 ms /     6 tokens (18326.22 ms per token,     0.05 tokens per second)
llama_perf_context_print:        eval time =  195682.91 ms /     4 runs   (48920.73 ms per token,     0.02 tokens per second)
llama_perf_context_print:       total time =  328764.01 ms /    10 tokens

GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
Calls  Total(ms)    T/call  Self(ms)  Function
------------------------------------------------------------------------------------------------------------------------
33160 100086.000     3.018 47907.157  [32%] RuntimeHostShim::awaitCommandListCompletion
18920  29912.952     1.581 29912.952  └─ [10%] [ txe_silu ]
14080  22010.102     1.563 22010.102  └─ [ 7%] [ txe_mult ]
  160    253.071     1.582   253.071  └─ [ 0%] [ txe_add ]
33160      1.178     0.000     1.178  └─ [ 0%] TXE 0 Idle
    1    114.000   114.000    18.000  [ 0%] GGML Tsavorite
    1     96.000    96.000    96.000  └─ [ 0%] RuntimeHostShim::initialize
    1     52.000    52.000    52.000  [ 0%] RuntimeHostShim::finalize
33160     26.000     0.001    26.000  [ 0%] RuntimeHostShim::loadBlob
33160     23.000     0.001    23.000  [ 0%] RuntimeHostShim::finalizeCommandList
33160      5.000     0.000     5.000  [ 0%] RuntimeHostShim::addCommandToList
33161      3.000     0.000     3.000  [ 0%] RuntimeHostShim::allocate
33160      3.000     0.000     3.000  [ 0%] RuntimeHostShim::createCommandList
113720      0.000     0.000     0.000  [ 0%] RuntimeHostShim::getShmemManager
33160      0.000     0.000     0.000  [ 0%] RuntimeHostShim::launchBlob
33160      0.000     0.000     0.000  [ 0%] RuntimeHostShim::unloadBlob
33160      0.000     0.000     0.000  [ 0%] RuntimeHostShim::deallocate
========================================================================================================================
412163 308849.000     0.749 308849.000  [100%] TOTAL
========================================================================================================================

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv31_06_06_2025/bin#
+++
@mmankal mmankal closed this Jun 20, 2025
@mmankal mmankal deleted the integrate-copy2fpga-filetransfer branch June 20, 2025 15:08
@mmankal mmankal restored the integrate-copy2fpga-filetransfer branch June 20, 2025 15:11
@github-actions github-actions bot added documentation Improvements or additions to documentation build Compilation issues testing Everything test related examples python python script changes ggml changes relating to the ggml tensor library for machine learning labels Jun 20, 2025