
Conversation

@digantdesai
Contributor

Summary:
This changes the default behavior. It improves prefill by ~20% and regresses decode by ~7%.

As a next step, I will dig further into the decode perf regression and see whether we can gain more on prefill by tuning the XNNPACK thread dispatcher for gemm, gemv, mul, add, sigmoid, and sub.
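
For context, here is a minimal sketch of the usual ExecuTorch-to-XNNPACK lowering flow in which this partitioning default applies. A toy model stands in for llama; this only illustrates the public executorch Python API, not the actual export script or flags used to produce the PTEs below:

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge


# Toy stand-in for the llama model; the real export path adds quantization
# (e.g. gs=32) and KV-cache handling that is omitted here.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.sigmoid(x @ x) * x + x


example_inputs = (torch.randn(4, 4),)
edge = to_edge(torch.export.export(Toy(), example_inputs))

# The XnnpackPartitioner decides which subgraphs are delegated to XNNPACK.
# The default behavior changed by this PR lives inside the partitioner, so
# the caller-side flow is the same for the "vanilla" and "2part" PTEs.
edge = edge.to_backend(XnnpackPartitioner())

with open("toy_xnnpack.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```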

On my local (unreliable) S23:

  • Vanilla:
dm1q:/data/local/tmp/llama $ ./llama_main_release  \
--model_path ./llama_gs32_vanilla.pte  \
--tokenizer_path ./tokenizer.bin \
--seq_len=128 \
--prompt="${prompt}

[...]
I 00:00:22.188618 executorch:stats.h:84]        Prompt Tokens: 44    Generated Tokens: 83
I 00:00:22.188621 executorch:stats.h:90]        Model Load Time:                12.922000 (seconds)
I 00:00:22.188624 executorch:stats.h:100]       Total inference time:           9.252000 (seconds)               Rate:  8.971033 (tokens/second)
I 00:00:22.188627 executorch:stats.h:108]               Prompt evaluation:      1.740000 (seconds)               Rate:  25.287356 (tokens/second)
I 00:00:22.188630 executorch:stats.h:119]               Generated 83 tokens:    7.512000 (seconds)               Rate:  11.048988 (tokens/second)
I 00:00:22.188632 executorch:stats.h:127]       Time to first generated token:  1.740000 (seconds)
I 00:00:22.188634 executorch:stats.h:134]       Sampling time over 127 tokens:  0.015000 (seconds)
[...]
  • Two partition (2part), with the new PTE:
dm1q:/data/local/tmp/llama $ ./llama_main_release \
--model_path ./llama_gs32_2part.pte \
--tokenizer_path ./tokenizer.bin \
--seq_len=128 \
--prompt="${prompt}"

[...]
I 00:00:22.205058 executorch:stats.h:84]        Prompt Tokens: 44    Generated Tokens: 83
I 00:00:22.205061 executorch:stats.h:90]        Model Load Time:                12.876000 (seconds)
I 00:00:22.205063 executorch:stats.h:100]       Total inference time:           9.323000 (seconds)               Rate:  8.902714 (tokens/second)
I 00:00:22.205067 executorch:stats.h:108]               Prompt evaluation:      1.549000 (seconds)               Rate:  28.405423 (tokens/second)
I 00:00:22.205070 executorch:stats.h:119]               Generated 83 tokens:    7.774000 (seconds)               Rate:  10.676614 (tokens/second)
I 00:00:22.205073 executorch:stats.h:127]       Time to first generated token:  1.549000 (seconds)
I 00:00:22.205075 executorch:stats.h:134]       Sampling time over 127 tokens:  0.029000 (seconds)
[...]
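
Net of the two S23 runs above: prefill improves from ~25.3 to ~28.4 tokens/second (about +12%), while decode drops from ~11.05 to ~10.68 tokens/second (about -3%).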

Similar results on AiBench OnePlus12 (gs=32 runs shown below):

  • Vanilla. AiBench links: gs=32 (https://www.internalfb.com/intern/aibench/details/114258284562772), gs=256 (https://www.internalfb.com/intern/aibench/details/438103192423336)

# gs=32

I 00:00:21.792659 executorch:stats.h:84] 	Prompt Tokens: 5    Generated Tokens: 118
I 00:00:21.792721 executorch:stats.h:90] 	Model Load Time:		11.666000 (seconds)
I 00:00:21.792754 executorch:stats.h:100] 	Total inference time:		10.109000 (seconds)		 Rate: 	11.672767 (tokens/second)
I 00:00:21.792778 executorch:stats.h:108] 		Prompt evaluation:	0.365000 (seconds)		 Rate: 	13.698630 (tokens/second)
I 00:00:21.792799 executorch:stats.h:119] 		Generated 118 tokens:	9.744000 (seconds)		 Rate: 	12.110016 (tokens/second)
I 00:00:21.792818 executorch:stats.h:127] 	Time to first generated token:	0.365000 (seconds)
I 00:00:21.792837 executorch:stats.h:134] 	Sampling time over 123 tokens:	0.008000 (seconds)
  • Two partition. AiBench links: gs=32 (https://www.internalfb.com/intern/aibench/details/852029802754424), gs=256 (https://www.internalfb.com/intern/aibench/details/491722732991273)

# gs=32

I 00:00:22.584271 executorch:stats.h:84] 	Prompt Tokens: 5    Generated Tokens: 118
I 00:00:22.584336 executorch:stats.h:90] 	Model Load Time:		11.610000 (seconds)
I 00:00:22.584367 executorch:stats.h:100] 	Total inference time:		10.960000 (seconds)		 Rate: 	10.766423 (tokens/second)
I 00:00:22.584389 executorch:stats.h:108] 		Prompt evaluation:	0.286000 (seconds)		 Rate: 	17.482517 (tokens/second)
I 00:00:22.584409 executorch:stats.h:119] 		Generated 118 tokens:	10.674000 (seconds)		 Rate: 	11.054900 (tokens/second)
I 00:00:22.584428 executorch:stats.h:127] 	Time to first generated token:	0.286000 (seconds)
I 00:00:22.584446 executorch:stats.h:134] 	Sampling time over 123 tokens:	0.013000 (seconds)
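
Net of the two OnePlus12 runs above: prefill improves from ~13.7 to ~17.5 tokens/second (about +28%), while decode drops from ~12.1 to ~11.1 tokens/second (about -9%).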

Differential Revision: D63271101

@pytorch-bot

pytorch-bot bot commented Sep 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5573

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 5901df3 with merge base 3f04c3c:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Sep 24, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D63271101

digantdesai added a commit to digantdesai/executorch-1 that referenced this pull request Sep 24, 2024
digantdesai added a commit to digantdesai/executorch-1 that referenced this pull request Sep 24, 2024
digantdesai added a commit to digantdesai/executorch-1 that referenced this pull request Sep 27, 2024
Reviewed By: mcr229

Differential Revision: D63271101
@facebook-github-bot
Contributor

This pull request has been merged in 55cc430.
