Conversation

@IanWood1 (Contributor) commented Jun 10, 2025

Export command:

python3 -m sharktank.examples.export_paged_llm_v1 \
	--irpa-file=/shark-dev/data/llama3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa \
	--output-mlir=model.mlir --output-config=/dev/null --bs-prefill=4 \
	--bs-decode=4  --attention-dtype=float16 --activation-dtype=float16 \
	--use-attention-mask --use-hf --kv-cache-dtype=float16
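
(For reference, this exports the fp16 Llama 3.1 8B weights to model.mlir with prefill and decode batch sizes of 4, float16 activations, attention, and KV cache, an explicit attention mask, and the HF layout.)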

@IanWood1 marked this pull request as ready for review June 10, 2025 21:56
@kuhar (Member) left a comment

Do we know what changed?

(Maybe should be fine to update the IR regardless?)

@IanWood1 (Contributor, Author) commented Jun 11, 2025

> Do we know what changed?
>
> (Maybe should be fine to update the IR regardless?)

Looks like there are some reshapes that aren't getting folded, causing iree_linalg_ext.gather to fail to fuse. llvm/llvm-project#142827 fixes this for llama and mistral. With that change, the before vs. after decode times are about equal.
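
For illustration, here is a minimal sketch (hypothetical names and shapes, not the actual sharktank code) of the kind of pattern involved: a gather whose indices and result pass through reshapes. If those reshapes are not folded away before dispatch formation, the gather they wrap can fail to fuse with its neighbors.

```python
# Hypothetical sketch of a gather-through-reshape pattern; names and
# shapes are illustrative, not taken from sharktank.
import torch

def paged_kv_lookup(cache: torch.Tensor, page_ids: torch.Tensor) -> torch.Tensor:
    # cache:    [num_pages, row_width]  flattened paged KV storage
    # page_ids: [batch, pages_per_seq]  page-table entries per sequence
    flat_ids = page_ids.reshape(-1)                # reshape feeding the gather
    rows = torch.index_select(cache, 0, flat_ids)  # lowers to a gather
    # reshape consuming the gather's result
    return rows.reshape(*page_ids.shape, cache.shape[-1])

# Tiny usage example: 8 pages, 2 sequences of 3 pages each.
cache = torch.randn(8, 16, dtype=torch.float16)
page_ids = torch.tensor([[0, 2, 4], [1, 3, 5]])
assert paged_kv_lookup(cache, page_ids).shape == (2, 3, 16)
```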

@kuhar (Member) commented Jun 12, 2025

> Looks like there are some reshapes that aren't getting folded, causing iree_linalg_ext.gather to fail to fuse.

@Groverkss

@IanWood1 (Contributor, Author) commented Jun 20, 2025

Updating both instead in #16

@IanWood1 closed this Jun 20, 2025