Feature/txe sqr #15567
Closed
Conversation
@FR-702 @FIR-702 - llama.cpp: Sync with latest open source
This change includes the following: 1. Move to the new SDK 0.1.2. 2. Remove the requirement for libgomp in the FPGA build.
@FIR-707: Fix requirement for libgomp and move to new sdk 0.1.2
The changes include the following:
1. Enable profiling of the Tsavorite backend for TXE
2. Add -std=c++20 for compiling the profiler
The test results are as follows
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin# ./run_platform_test.sh
Check if tnApcMgr is running; if it is not, uncomment below line and execute the run_platform_test.sh script.
Running on v0.1.1.tsv30_05_24_2025
[2018-03-09 13:52:26.300409] 271:272 [ info] :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
[2018-03-09 13:52:27.339] [info] [llama.cpp:56] Execution time: 1019 ms
[2018-03-09 13:52:27.347638] 2909:2909 [ info] [LlamaForCausalLM_Random v. 2] TestBase.h:154: Model executed successfully. Validating result...
[2018-03-09 13:52:27.380511] 2909:2909 [ info] [LlamaForCausalLM_Random v. 2] TestBase.h:193: PASS [relative err=0.000000, relTol=1.000000e-05]
[2018-03-09 13:52:27.405665] 271:272 [ info] :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:145> TXE resource release request processed successfully.
Profiling Results (LlamaForCausalLM_Random):
------------------------------------------------------------------------------------------------------------------------
Calls Total(ms) T/call Self(ms) Function
------------------------------------------------------------------------------------------------------------------------
243 498.000 2.049 0.000 [45%] RuntimeHostShim::awaitCommandListCompletion
84 200.688 2.389 200.688 └─ [18%] [ txe_blob_1 ]
32 76.626 2.395 76.626 └─ [ 7%] [ txe_blob_6 ]
16 55.493 3.468 55.493 └─ [ 5%] [ txe_blob_12 ]
8 31.821 3.978 31.821 └─ [ 3%] [ txe_blob_10 ]
8 31.322 3.915 31.322 └─ [ 3%] [ txe_blob_7 ]
8 31.152 3.894 31.152 └─ [ 3%] [ txe_blob_8 ]
8 27.693 3.462 27.693 └─ [ 2%] [ txe_blob_9 ]
17 26.019 1.531 26.019 └─ [ 2%] [ txe_blob_2 ]
17 25.906 1.524 25.906 └─ [ 2%] [ txe_blob_5 ]
17 25.899 1.523 25.899 └─ [ 2%] [ txe_blob_3 ]
17 25.833 1.520 25.833 └─ [ 2%] [ txe_blob_4 ]
8 23.993 2.999 23.993 └─ [ 2%] [ txe_blob_11 ]
3 6.002 2.001 6.002 └─ [ 1%] [ txe_blob_0 ]
1 35.000 35.000 35.000 [ 3%] RuntimeHostShim::finalize
188 33.000 0.176 33.000 [ 3%] RuntimeHostShim::copy
1 16.000 16.000 16.000 [ 1%] RuntimeHostShim::initialize
13 1.000 0.077 1.000 [ 0%] RuntimeHostShim::loadBlob
573 0.000 0.000 0.000 [ 0%] RuntimeHostShim::allocate
573 0.000 0.000 0.000 [ 0%] RuntimeHostShim::deallocate
243 0.000 0.000 0.000 [ 0%] RuntimeHostShim::createCommandList
922 0.000 0.000 0.000 [ 0%] RuntimeHostShim::getShmemManager
243 0.000 0.000 0.000 [ 0%] RuntimeHostShim::launchBlob
243 0.000 0.000 0.000 [ 0%] RuntimeHostShim::addCommandToList
243 0.000 0.000 0.000 [ 0%] RuntimeHostShim::finalizeCommandList
13 0.000 0.000 0.000 [ 0%] RuntimeHostShim::unloadBlob
33 0.000 0.000 0.000 [ 0%] RuntimeHostShim::stridedCopy
========================================================================================================================
3532 1116.000 0.316 1116.000 [100%] TOTAL
========================================================================================================================
register_backend: registered backend Tsavorite (1 devices)
register_device: registered device Tsavorite (txe)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (CPU)
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin/tsi-ggml/libggml-tsavorite.so
load_backend: failed to find ggml_backend_init in /usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin/tsi-ggml/libggml-cpu.so
build: 5464 (194fbaa9) with gcc (GCC) 13.3.0 for x86_64-pc-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
TXE Device MEMORY Summary total 134217728 and free 134217728
llama_model_load_from_file_impl: using device Tsavorite (txe) - 128 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 75 tensors from /tsi/anoop_feb26/tinyllama-vo-5m-para.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Vicuna Hf
llama_model_loader: - kv 3: general.size_label str = 4.6M
llama_model_loader: - kv 4: general.license str = apache-2.0
llama_model_loader: - kv 5: llama.block_count u32 = 8
llama_model_loader: - kv 6: llama.context_length u32 = 2048
llama_model_loader: - kv 7: llama.embedding_length u32 = 64
llama_model_loader: - kv 8: llama.feed_forward_length u32 = 256
llama_model_loader: - kv 9: llama.attention.head_count u32 = 16
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 11: general.file_type u32 = 32
llama_model_loader: - kv 12: llama.vocab_size u32 = 32000
llama_model_loader: - kv 13: llama.rope.dimension_count u32 = 4
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 23: general.quantization_version u32 = 2
llama_model_loader: - type f32: 17 tensors
llama_model_loader: - type bf16: 58 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = BF16
print_info: file size = 8.82 MiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 0.1914 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 64
print_info: n_layer = 8
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 4
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 4
print_info: n_embd_head_v = 4
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 64
print_info: n_embd_v_gqa = 64
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 256
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = ?B
print_info: model params = 4.62 M
print_info: general.name = Vicuna Hf
print_info: vocab type = SPM
print_info: n_vocab = 32000
print_info: n_merges = 0
print_info: BOS token = 1 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 0 '<unk>'
print_info: PAD token = 0 '<unk>'
print_info: LF token = 13 '<0x0A>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 18
load_tensors: loading model tensors, this can take a while... (mmap = true)
TXE Device MEMORY Summary total 134217728 and free 134217728
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/9 layers to GPU
load_tensors: CPU_Mapped model buffer size = 8.82 MiB
..............
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 12288
llama_context: n_ctx_per_seq = 12288
llama_context: n_batch = 1024
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (12288) > n_ctx_train (2048) -- possible training context overflow
[2018-03-09 13:52:28.706203] 271:272 [ info] :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:129> TXE resource allocation request processed successfully.
llama_context: CPU output buffer size = 0.12 MiB
llama_kv_cache_unified: CPU KV buffer size = 24.00 MiB
llama_kv_cache_unified: size = 24.00 MiB ( 12288 cells, 8 layers, 1 seqs), K (f16): 12.00 MiB, V (f16): 12.00 MiB
ggml_backend_tsavorite_buffer_type_alloc_buffer is called from llama data Loader
ANoop Allocating memory from tsi_alloc with size 266240
Allocating memory from tsi_alloc with size 266240 starting memory 0xffff93e00080
Address of Newly Created BUffer 0xffff93e00080 and size 266240
llama_context: tsavorite compute buffer size = 0.25 MiB
llama_context: CPU compute buffer size = 408.51 MiB
llama_context: graph nodes = 294
llama_context: graph splits = 67 (with bs=512), 37 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 12288
main: llama threadpool init, n_threads = 4
main: model was trained on only 2048 context tokens (12288 specified)
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 |
sampler seed: 177927434
sampler params:
repeat_last_n = 5, repeat_penalty = 1.500, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 12288
top_k = 50, top_p = 0.900, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 12288, n_batch = 1024, n_predict = 10, n_keep = 1
my cat's name was Tim. He loved to play with his toy
llama_perf_sampler_print: sampling time = 195.98 ms / 16 runs ( 12.25 ms per token, 81.64 tokens per second)
llama_perf_context_print: load time = 1577.27 ms
llama_perf_context_print: prompt eval time = 305.19 ms / 6 tokens ( 50.86 ms per token, 19.66 tokens per second)
llama_perf_context_print: eval time = 803.59 ms / 9 runs ( 89.29 ms per token, 11.20 tokens per second)
llama_perf_context_print: total time = 2628.44 ms / 15 tokens
TXE_ADD Operation, total tensor: 10 Number of Kernel Call: 10 Number of tensor got spilt: 0 Min Num of Elem 64 Max Num of Elem 64
TXE_SUB Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_MULT Operation, total tensor: 170 Number of Kernel Call: 245 Number of tensor got spilt: 0 Min Num of Elem 64 Max Num of Elem 384
TXE_DIV Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SQRT Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_NEG Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_ABS Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIN Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
TXE_SIGMOID Operation, total tensor: 0 Number of Kernel Call: 0 Number of tensor got spilt: 0 Min Num of Elem 0 Max Num of Elem 0
[2018-03-09 13:52:32.222949] 271:272 [ info] :: </proj/work/atrivedi/workspace/05_25_2025/tsi_yocto_workspace/tsi-apc-manager/platform/rsm_mgr/rsm_process_req.c:145> TXE resource release request processed successfully.
GGML Tsavorite Profiling Results:
------------------------------------------------------------------------------------------------------------------------
Calls Total(ms) T/call Self(ms) Function
------------------------------------------------------------------------------------------------------------------------
255 255.000 1.000 0.000 [ 7%] RuntimeHostShim::awaitCommandListCompletion
245 379.466 1.549 379.466 └─ [11%] [ txe_mult_blob ]
10 15.443 1.544 15.443 └─ [ 0%] [ txe_add_blob ]
1 35.000 35.000 35.000 [ 1%] RuntimeHostShim::finalize
1 19.000 19.000 2.000 [ 1%] GGML Tsavorite
1 17.000 17.000 17.000 └─ [ 0%] RuntimeHostShim::initialize
256 0.000 0.000 0.000 [ 0%] RuntimeHostShim::allocate
1020 0.000 0.000 0.000 [ 0%] RuntimeHostShim::getShmemManager
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::createCommandList
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::loadBlob
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::launchBlob
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::addCommandToList
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::finalizeCommandList
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::unloadBlob
255 0.000 0.000 0.000 [ 0%] RuntimeHostShim::deallocate
========================================================================================================================
3318 3529.000 1.064 3529.000 [100%] TOTAL
========================================================================================================================
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv30_05_24_2025/bin#
FIR-709 - GGML: Adding SILU kernel
as follows:

/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: /proj/work/atrivedi/workspace/06_02_2025/llama.cpp/ggml-tsi-kernel/fpga/host/host_abs.o: in function `txe_abs_host': LLVMDialectModule:(.text+0x18): undefined reference to `tsi_alloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x24): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x30): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x3c): undefined reference to `tsi_create_command_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x58): undefined reference to `tsi_load_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x64): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x70): undefined reference to `tsi_launch_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x7c): undefined reference to `tsi_add_command_to_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x84): undefined reference to `tsi_finalize_command_list'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x8c): undefined reference to `tsi_wait'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x94): undefined reference to `tsi_unload_blob'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0xa0): undefined reference to `tsi_dealloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: /proj/work/atrivedi/workspace/06_02_2025/llama.cpp/ggml-tsi-kernel/fpga/host/host_add.o: in function `txe_add_host': LLVMDialectModule:(.text+0x20): undefined reference to `tsi_alloc'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x2c): undefined reference to `tsi_shmem_handle_from_ptr'
/proj/rel/sw/arm-gnu-toolchain-14.2.rel1-x86_64-aarch64-none-linux-gnu/bin/../lib/gcc/aarch64-none-linux-gnu/14.2.1/../../../../aarch64-none-linux-gnu/bin/ld: LLVMDialectModule:(.text+0x38): undefined reference to `tsi_shmem_handle_from_ptr'
runtime/utils/lib/ path
FIR-714: Updated the SDK Release r0.1.3
FIR-722 - ggml-tsi-kernel: latest changes updated
This is the first version of the FlaskInterface tool, with the following: 1. Xterm interface in the browser via the /terminal endpoint. 2. Serial console interface in the browser via the /serial endpoint.
@FIR-715: Added FlaskInterface tool for serial port
Just testing my first git pull
Llama.cpp: Webserver & HTML pages support
@FIR-781 - llama.cpp ggml stats: Adding backend and unary op detail
@FIR-782 Llama.cpp: Partial Offloading of Tsavorite Operations
* Added ls -l so Karrar can see files
* Resolved comments
* Uncommented a print statement
* Fixed the custom method I made

Co-authored-by: Lewis Lui <[email protected]>
@FIR-783 - llama.cpp: Update to new release version
@FIR-787 - llama.cpp: Fixed AWS linking issue for tsi-ggml-aws-latest.tz
@FIR-790 - llama.cpp: Add all src tensor shapes & sizes
@FIR-827 - llama.cpp: Python script to run the model with different prompts to measure performance
@FIR-895 - llama.cpp: updating the MLIR SDK Version to 1.8
Signed-off-by: Dinesh Reddy <[email protected]>
[dreddy@wssw01 llama.cpp]$ ./build-posix/bin/simple-backend-tsi "sqr"
load_model: using TSavorite backend
Calculating mem_size 384 1 and creating ggml context
Creating input Tensor
Creating Backend Buffer
Loading Input Tensor Data to Backend Buffer
Bringing tensor data from Backend buffer and printing 32 tensor data:
[ 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00 11.00 12.00 13.00 14.00 15.00 16.00 17.00 18.00 19.00 20.00 21.00 22.00 23.00 24.00 25.00 26.00 27.00 28.00 29.00 30.00 31.00 32.00 ]
main: compute buffer size: 0.2500 KB
Under Test case for compute API creating build_graph
Compute Done
operation type: 5, num of elements 32
compute is also done
TEST CASE PASSED
GGML Tsavorite Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
========================================================================================================================
- 2025.8510 0.0000 2025.8510 [100.00%] TOTAL
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.8333
[dreddy@wssw01 llama.cpp]$