
Commit f385280

astancovTT authored and ign-febin committed
Removed enable_split_reader from Conv2dConfig (tenstorrent#27731)
### Ticket
tenstorrent#26430

### Problem description
After removing split_reader's impact on the `out_subblock_h/w` choice in tenstorrent#26220, split_reader can be enabled in the cases where it is beneficial, and the parameter can be removed.

Currently split_reader is supported only if:
- the convolution is height sharded
- it is not a 1D depthwise convolution
- the activation block height [tiles] > 1 (so there is something for each reader to read)

Enabling should be chosen by a heuristic that takes the following into account:
- local L1 transfer speed (activations are transferred locally L1->L1; it depends on the chunk sizes being transferred and differs between dilated and non-dilated convs)
- NoC link bandwidth (split reader currently works only for height-sharded convolutions, and in HS convs only 1 core reads weights from DRAM)
- the speed of reading weights from DRAM on 1 core and the speed of that same core multicasting them to the rest of the cores are approximately the same
- the size of the activation block being read
- the size of the weights block being read
- transferring activations takes about 2x as long as transferring weights (exact number subject to change down the line)

### What's changed
- removed `enable_split_reader` from `ttnn.Conv2dConfig`
- added said heuristic (a rough illustrative sketch is given after the checklist below)

## Model Performance Comparison: AVG DEVICE KERNEL SAMPLES/S

[main](https://github.com/tenstorrent/tt-metal/actions/runs/17569945562/job/49904184081) vs [this_branch](https://github.com/tenstorrent/tt-metal/actions/runs/17557035559)

| Model | Main Branch | Current Branch | Difference | % Change |
|-------|-------------|----------------|------------|----------|
| falcon7b_decode_bfloat16-l1_sharded_seq1_kv_cache_len1024 | 578.1555 | 580.5385 | +2.3830 | 🟢 +0.41% |
| falcon7b_decode_bfloat16-l1_sharded_seq1_kv_cache_len128 | 640.3621 | 643.1583 | +2.7962 | 🟢 +0.44% |
| falcon7b_decode_bfloat16-l1_sharded_seq1_kv_cache_len2047 | 541.8032 | 544.0483 | +2.2451 | 🟢 +0.41% |
| falcon7b_prefill_bfloat16-dram_seq1024_kv_cache_len0 | 3112.8762 | 3109.6271 | -3.2491 | 🔴 -0.10% |
| falcon7b_prefill_bfloat16-dram_seq128_kv_cache_len0 | 2113.3771 | 2119.5114 | +6.1343 | 🟢 +0.29% |
| falcon7b_prefill_bfloat16-dram_seq2048_kv_cache_len0 | 2857.7001 | 2854.8933 | -2.8068 | 🔴 -0.10% |
| segformer_for_semantic_segmentation | 108.9917 | 109.1203 | +0.1286 | 🟢 +0.12% |
| tt_mnist128 | 909142.5649 | 910481.2035 | +1338.6386 | 🟢 +0.15% |
| sdxl_unet | 4.9903 | 4.9749 | -0.0154 | 🔴 -0.31% |
| sdxl_vae | 0.8346 | 0.8339 | -0.0007 | ⚪ -0.08% |
| stable_diffusion_1batch | 13.0289 | 13.0202 | -0.0087 | ⚪ -0.07% |
| ttnn_functional_Swin_s_1 (first) | 6.5262 | 6.5224 | -0.0038 | ⚪ -0.06% |
| ttnn_functional_Swin_v2_1 (first) | 7.4043 | 7.4038 | -0.0005 | ⚪ -0.01% |
| ttnn_functional_ttnn_vgg11_1 | 436.2223 | 438.8588 | +2.6365 | 🟢 +0.60% |
| ttnn_functional_ttnn_vgg16_1 | 352.6563 | 355.9618 | +3.3055 | 🟢 +0.94% |
| ttnn_functional_vgg_unet1 | 280.8845 | 280.2404 | -0.6441 | 🔴 -0.23% |
| ttnn_resnet50_batch_size16 | 5699.5534 | 5665.1853 | -34.3681 | 🔴 -0.60% |
| ttnn_roberta_8 | 180.1718 | 180.9504 | +0.7786 | 🟢 +0.43% |
| ttnn_sentence_bert8 | 461.7476 | 461.5519 | -0.1957 | ⚪ -0.04% |
| ttnn_ufld_v21 | 345.7357 | 344.7538 | -0.9819 | 🔴 -0.28% |
| ttnn_yolov10x1 | 50.6070 | 50.6093 | +0.0023 | ⚪ +0.00% |
| ttnn_yolov111 | 213.4715 | 217.6344 | +4.1629 | 🟢 +1.95% |
| unet-shallow_batch-1_groups-4 | 1585.2156 | 1585.7685 | +0.5529 | ⚪ +0.03% |
| mamba-2.8b_batch_32 | 19861.4416 | 19863.3156 | +1.8740 | ⚪ +0.01% |
| ttnn_efficientnetb01 | 94.8605 | 95.2415 | +0.3810 | 🟢 +0.40% |
| ttnn_functional_Swin_s_1 (second) | 6.5240 | 6.5243 | +0.0003 | ⚪ +0.00% |
| ttnn_functional_Swin_v2_1 (second) | 7.4053 | 7.4027 | -0.0026 | ⚪ -0.04% |
| ttnn_functional_mobilenetv210 | 3497.8757 | 3506.8598 | +8.9841 | 🟢 +0.26% |
| ttnn_functional_yolov4_1 | 98.4996 | 98.3081 | -0.1915 | 🔴 -0.19% |
| ttnn_functional_yolov71 | 121.9927 | 122.1801 | +0.1874 | 🟢 +0.15% |
| ttnn_functional_yolov8x1 | 69.7647 | 69.8535 | +0.0888 | 🟢 +0.13% |
| ttnn_functional_yolov9c1 | 31.9248 | 32.0006 | +0.0758 | 🟢 +0.24% |
| ttnn_vanilla_unet1 | 62.1010 | 62.1471 | +0.0461 | ⚪ +0.07% |
| ttnn_vovnet1 | 121.5876 | 121.5079 | -0.0797 | ⚪ -0.07% |
| ttnn_yolov12x1 | 14.3381 | 14.3413 | +0.0032 | ⚪ +0.02% |
| ttnn_yolov5x1 | 54.9410 | 54.9104 | -0.0306 | ⚪ -0.06% |
| ttnn_yolov6l1 | 76.9110 | 76.7928 | -0.1182 | 🔴 -0.15% |
| ttnn_yolov8s1 | 239.5090 | 240.5884 | +1.0794 | 🟢 +0.45% |
| vit-8 | 1538.0972 | 1539.6779 | +1.5807 | 🟢 +0.10% |

### Checklist
- [X] [All post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml) CI [passed](https://github.com/tenstorrent/tt-metal/actions/runs/17713433723)
- [X] [Blackhole Post commit](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml) CI [passed](https://github.com/tenstorrent/tt-metal/actions/runs/17713434918)
- [X] [Model regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) CI [same as main](https://github.com/tenstorrent/tt-metal/actions/runs/17609536091)
- [X] [Device performance regression](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) CI [passes](https://github.com/tenstorrent/tt-metal/actions/runs/17676787104)
- [X] [(Single-card) Frequent model and ttnn tests](https://github.com/tenstorrent/tt-metal/actions/workflows/fast-dispatch-full-regressions-and-models.yaml) CI [passes](https://github.com/tenstorrent/tt-metal/actions/runs/17676793012)
- [X] [Nightly tt-metal L2 tests](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml) CI [passes](https://github.com/tenstorrent/tt-metal/actions/runs/17609520723)
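For downstream users, the practical change is dropping `enable_split_reader` from any `ttnn.Conv2dConfig` construction; the op now decides on its own. The sketch below is illustrative only: the config kwargs mirror call sites updated in this commit, and the helper function merely restates the support conditions from the problem description (the real heuristic additionally weighs activation/weight block sizes and L1/NoC/DRAM transfer costs inside the op).

```python
import ttnn

# After this change, Conv2dConfig is built without the removed flag.
# The kwargs shown here are taken from call sites touched in this commit.
conv_config = ttnn.Conv2dConfig(
    shard_layout=ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
    enable_act_double_buffer=True,
    reshard_if_not_optimal=True,
)


def may_use_split_reader(is_height_sharded: bool, is_1d_depthwise: bool, act_block_h_ntiles: int) -> bool:
    # Illustrative gate only, not the actual tt-metal implementation: split reader is
    # even considered only for height-sharded, non-1D-depthwise convolutions whose
    # activation block is taller than one tile, so each reader has something to read.
    return is_height_sharded and not is_1d_depthwise and act_block_h_ntiles > 1
```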
1 parent 3b27d35 commit f385280


49 files changed: +492 -345 lines changed

models/demos/mobilenetv2/tests/perf/test_perf_mobilenetv2.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 @pytest.mark.parametrize(
     "batch_size, expected_perf",
     [
-        [MOBILENETV2_BATCH_SIZE, 3436],
+        [MOBILENETV2_BATCH_SIZE, 3445],
     ],
 )
 def test_perf_device_mobilenetv2(batch_size, expected_perf):

models/demos/mobilenetv2/tt/common.py

Lines changed: 0 additions & 5 deletions
@@ -23,7 +23,6 @@ def __init__(
         width_shard=False,
         act_blocks=32,
         enable_act_double_buffer=False,
-        enable_split_reader=False,
         reshard_if_not_optimal=True,
         activation_dtype=ttnn.bfloat8_b,
         shard_layout=ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
@@ -42,7 +41,6 @@ def __init__(
         self.width_shard = width_shard
         self.act_blocks = act_blocks
         self.enable_act_double_buffer = enable_act_double_buffer
-        self.enable_split_reader = enable_split_reader
         self.reshard_if_not_optimal = reshard_if_not_optimal
         self.batch_size = batch_size
         self.shard_layout = shard_layout
@@ -64,9 +62,6 @@ def _initialize_conv_config(self):
             act_block_w_div=1,
             deallocate_activation=self.deallocate_activation,
             enable_act_double_buffer=self.enable_act_double_buffer,
-            enable_split_reader=True
-            if self.shard_layout == ttnn.TensorMemoryLayout.HEIGHT_SHARDED
-            else self.enable_split_reader,
             output_layout=self.output_layout,
             reallocate_halo_output=False,
             reshard_if_not_optimal=self.reshard_if_not_optimal,

models/demos/segformer/tt/common.py

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ def __call__(self, device, input_tensor):
             deallocate_activation=self.deallocate,
             reallocate_halo_output=True,
             enable_act_double_buffer=True,
-            enable_split_reader=False,
             output_layout=self.output_layout,
         )
         compute_config = ttnn.init_device_compute_kernel_config(

models/demos/ttnn_resnet/tt/ttnn_functional_resnet50.py

Lines changed: 0 additions & 23 deletions
@@ -150,7 +150,6 @@ def run_downsample_if_req(
         height_sharding=None,
         packer_l1_accum_enabled=True,
         enable_act_double_buffer=False,
-        enable_split_reader=False,
     ):
         if self.downsample:
             logger.debug(f"Running downsample")
@@ -180,7 +179,6 @@ def run_downsample_if_req(
                 if input_width < 56
                 else False,
                 enable_weights_double_buffer=True if input_width < 56 else False,
-                enable_split_reader=enable_split_reader,
                 full_inner_dim=True,
             ),
         }
@@ -217,7 +215,6 @@ def __call__(
         eltwise_binary_out_in_place=True,
         packer_l1_acc=True,
         enable_act_double_buffer=False,
-        enable_split_reader=False,
         ops_parallel_config=None,
         layer_module=None,
     ):
@@ -287,7 +284,6 @@ def __call__(
            height_sharding,
            packer_l1_accum_enabled=packer_l1_acc,
            enable_act_double_buffer=False,
-           enable_split_reader=enable_split_reader,
        )
        if layer_module and layer_module == "layer4_module1":
            if ops_parallel_config and "layer4_module1_downsample" not in ops_parallel_config:
@@ -331,7 +327,6 @@ def __call__(
                reshard_if_not_optimal=reshard_if_not_optimal,
                enable_act_double_buffer=enable_act_double_buffer,
                enable_weights_double_buffer=True,
-               enable_split_reader=enable_split_reader,
                full_inner_dim=True,
            ),
        }
@@ -439,7 +434,6 @@ def __call__(
            height_sharding,
            packer_l1_accum_enabled=packer_l1_acc,
            enable_act_double_buffer=enable_act_double_buffer,
-           enable_split_reader=enable_split_reader,
        )

        assert ds_out is not None, "ds_out is None"
@@ -578,7 +572,6 @@ def __init__(
            deallocate_activation=dealloc_input,
            act_block_h_override=act_block_h_override,
            enable_act_double_buffer=is_wormhole_b0() or is_blackhole(),
-           enable_split_reader=True,
            shard_layout=ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
            reshard_if_not_optimal=False,
            # otherwise act block h is not big enough for the reuse
@@ -812,7 +805,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            reshard_if_not_optimal=reshard,
            height_sharding=height_shard,
            enable_act_double_buffer=True,
-           enable_split_reader=True,
        )

        if is_first_run:
@@ -833,7 +825,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=False,
-           enable_split_reader=True,
            layer_module="layer1_module2",
        )

@@ -845,7 +836,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=False,
-           enable_split_reader=True,
            layer_module="layer1_module3",
        )

@@ -864,7 +854,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            reshard_if_not_optimal=reshard,
            height_sharding=height_shard,
            enable_act_double_buffer=True,
-           enable_split_reader=True,
            layer_module="layer2_module1",
        )

@@ -886,7 +875,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=True,
            layer_module="layer2_module2",
        )

@@ -898,7 +886,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=True,
            layer_module="layer2_module3",
        )

@@ -910,7 +897,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=True,
            layer_module="layer2_module4",
        )

@@ -931,7 +917,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            reshard_if_not_optimal=reshard,
            height_sharding=height_shard,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
        )

        if is_first_run:
@@ -952,7 +937,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
        )

        logger.debug(f"==== Running layer 3 module 3")
@@ -963,7 +947,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            layer_module="layer3_module3",
        )

@@ -975,7 +958,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            layer_module="layer3_module4",
        )

@@ -987,7 +969,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            layer_module="layer3_module5",
        )

@@ -1000,7 +981,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_width,
            eltwise_binary_out_in_place=True,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
        )

        reshard = is_blackhole() and self.batch_size == 20
@@ -1031,7 +1011,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            reshard_if_not_optimal=reshard,
            height_sharding=height_shard,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            ops_parallel_config=ops_parallel_config,
            layer_module="layer4_module1",
        )
@@ -1044,7 +1023,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            layer_module="layer4_module2",
        )

@@ -1056,7 +1034,6 @@ def run(self, input_tensor, device, ops_parallel_config) -> ttnn.Tensor:
            x_height,
            x_width,
            enable_act_double_buffer=True,
-           enable_split_reader=False,
            layer_module="layer4_module3",
        )

models/demos/ufld_v2/ttnn/common.py

Lines changed: 0 additions & 1 deletion
@@ -46,7 +46,6 @@ def __init__(
             deallocate_activation=dealloc_act,
             enable_act_double_buffer=True if is_blk else False,
             enable_weights_double_buffer=True if is_blk else False,
-            enable_split_reader=True if not is_blk else False,
             reshard_if_not_optimal=True,
             activation=activation,
         )

models/demos/vanilla_unet/common.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 
 from models.demos.vanilla_unet.reference.unet import UNet
 
-VANILLA_UNET_L1_SMALL_SIZE = (7 * 8192) + 1730
+VANILLA_UNET_L1_SMALL_SIZE = (7 * 8192) + 2592
 
 
 def load_torch_model(model_location_generator=None):
