refactor commonly used toy model #2729

Open

wants to merge 4 commits into base: main
Changes from 1 commit
27 changes: 1 addition & 26 deletions benchmarks/benchmark_aq.py
@@ -20,6 +20,7 @@
Int4WeightOnlyQuantizedLinearWeight,
Int8WeightOnlyQuantizedLinearWeight,
)
from torchao.testing.model_architectures import ToyLinearModel
from torchao.utils import (
TORCH_VERSION_AT_LEAST_2_4,
TORCH_VERSION_AT_LEAST_2_5,
@@ -62,32 +63,6 @@ def _int4wo_api(mod, **kwargs):
change_linear_weights_to_int4_woqtensors(mod, **kwargs)


class ToyLinearModel(torch.nn.Module):
"""Single linear for m * k * n problem size"""

def __init__(
self, m=64, n=32, k=64, has_bias=False, dtype=torch.float, device="cuda"
):
super().__init__()
self.m = m
self.dtype = dtype
self.device = device
self.linear = torch.nn.Linear(k, n, bias=has_bias).to(
dtype=self.dtype, device=self.device
)

def example_inputs(self):
return (
torch.randn(
self.m, self.linear.in_features, dtype=self.dtype, device=self.device
),
)

def forward(self, x):
x = self.linear(x)
return x


def _ref_change_linear_weights_to_int8_dqtensors(model, filter_fn=None, **kwargs):
"""
The deprecated implementation for int8 dynamic quant API, used as a reference for
12 changes: 1 addition & 11 deletions docs/source/quick_start.rst
@@ -29,17 +29,7 @@ First, let's set up our toy model:

import copy
import torch

class ToyLinearModel(torch.nn.Module):
def __init__(self, m: int, n: int, k: int):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x
from torchao.testing.model_architectures import ToyLinearModel

model = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")

23 changes: 4 additions & 19 deletions docs/source/serialization.rst
@@ -7,7 +7,7 @@ Serialization and deserialization flow
======================================

Here is the serialization and deserialization flow::

import copy
import tempfile
import torch
@@ -16,20 +16,7 @@ Here is the serialization and deserialization flow::
quantize_,
Int4WeightOnlyConfig,
)

class ToyLinearModel(torch.nn.Module):
def __init__(self, m=64, n=32, k=64):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)

def example_inputs(self, batch_size=1, dtype=torch.float32, device="cpu"):
return (torch.randn(batch_size, self.linear1.in_features, dtype=dtype, device=device),)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x
from torchao.testing.model_architectures import ToyLinearModel

dtype = torch.bfloat16
m = ToyLinearModel(1024, 1024, 1024).eval().to(dtype).to("cuda")
@@ -62,7 +49,7 @@ What happens when serializing an optimized model?
To serialize an optimized model, we just need to call ``torch.save(m.state_dict(), f)``: in torchao we use tensor subclasses to represent different dtypes and to support optimization techniques like quantization and sparsity, so after optimization the only change is that the weight Tensor is replaced by an optimized weight Tensor; the model structure itself does not change. For example:

original floating point model ``state_dict``::

{"linear1.weight": float_weight1, "linear2.weight": float_weight2}

quantized model ``state_dict``::
@@ -75,7 +62,7 @@ The size of the quantized model is typically going to be smaller than the original floating point model::
original model size: 4.0 MB
quantized model size: 1.0625 MB
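
This comparison can be reproduced by serializing each ``state_dict`` and checking the file size on disk. A minimal sketch (the names ``m_float`` for a copy of the original floating point model and ``m`` for the quantized model are illustrative, not part of the original example)::

    import os
    import tempfile

    def serialized_size_mb(model):
        # Serialize only the state_dict and measure the resulting file size.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "model.pt")
            torch.save(model.state_dict(), path)
            return os.path.getsize(path) / 1e6

    print(f"original model size: {serialized_size_mb(m_float):.4f} MB")
    print(f"quantized model size: {serialized_size_mb(m):.4f} MB")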


What happens when deserializing an optimized model?
===================================================
To deserialize an optimized model, we can initialize the floating point model on the `meta <https://pytorch.org/docs/stable/meta.html>`__ device and then load the optimized ``state_dict`` with ``assign=True`` using `model.load_state_dict <https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.load_state_dict>`__::
@@ -97,5 +84,3 @@ We can also verify that the weight is properly loaded by checking the type of the weight::

type of weight before loading: (<class 'torch.Tensor'>, <class 'torch.Tensor'>)
type of weight after loading: (<class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>, <class 'torchao.dtypes.affine_quantized_tensor.AffineQuantizedTensor'>)
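
Putting these pieces together, here is a minimal end-to-end sketch of the meta-device loading flow described above (assuming ``m`` is the quantized ``ToyLinearModel`` and ``dtype`` is defined as in the serialization flow; ``weights_only=False`` is used because the checkpoint contains tensor subclasses)::

    import tempfile

    with tempfile.NamedTemporaryFile() as f:
        torch.save(m.state_dict(), f)
        f.seek(0)

        # Build the model skeleton on the meta device; no real storage is allocated.
        with torch.device("meta"):
            m_loaded = ToyLinearModel(1024, 1024, 1024).eval().to(dtype)

        # assign=True hands the deserialized quantized tensors to the module directly
        # instead of copying them into the meta parameters.
        state_dict = torch.load(f, weights_only=False)
        m_loaded.load_state_dict(state_dict, assign=True)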


14 changes: 1 addition & 13 deletions scripts/quick_start.py
@@ -8,6 +8,7 @@
import torch

from torchao.quantization import Int4WeightOnlyConfig, quantize_
from torchao.testing.model_architectures import ToyLinearModel
from torchao.utils import (
TORCH_VERSION_AT_LEAST_2_5,
benchmark_model,
@@ -18,19 +19,6 @@
# | Set up model |
# ================


class ToyLinearModel(torch.nn.Module):
def __init__(self, m: int, n: int, k: int):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x


model = ToyLinearModel(1024, 1024, 1024).eval().to(torch.bfloat16).to("cuda")

# Optional: compile model for faster inference and generation
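
# The collapsed lines below compile the model before running it; a minimal
# sketch of that optional step (mode="max-autotune" is an illustrative choice,
# not necessarily what the script uses):
model = torch.compile(model, mode="max-autotune")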
13 changes: 1 addition & 12 deletions test/dtypes/test_affine_quantized_float.py
@@ -45,6 +45,7 @@
_quantize_affine_float8,
choose_qparams_affine,
)
from torchao.testing.model_architectures import ToyLinearModel
from torchao.utils import (
is_sm_at_least_89,
is_sm_at_least_90,
@@ -55,18 +56,6 @@
torch.manual_seed(0)


class ToyLinearModel(torch.nn.Module):
def __init__(self, in_features, out_features):
super().__init__()
self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)
self.linear2 = torch.nn.Linear(out_features, in_features, bias=False)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x


class TestAffineQuantizedFloat8Compile(InductorTestCase):
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
@unittest.skipIf(
18 changes: 1 addition & 17 deletions test/integration/test_integration.py
@@ -2134,23 +2134,7 @@ def test_get_model_size_aqt(self, api, test_device, test_dtype):


class TestBenchmarkModel(unittest.TestCase):
class ToyLinearModel(torch.nn.Module):
def __init__(self, m=64, n=32, k=64):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)

def example_inputs(self, batch_size=1, dtype=torch.float32, device="cpu"):
return (
torch.randn(
batch_size, self.linear1.in_features, dtype=dtype, device=device
),
)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x
from torchao.testing.model_architectures import ToyLinearModel

def run_benchmark_model(self, device):
# params
25 changes: 1 addition & 24 deletions test/prototype/test_awq.py
@@ -15,36 +15,13 @@

from torchao.prototype.awq import AWQConfig, AWQStep
from torchao.quantization import FbgemmConfig, Int4WeightOnlyConfig, quantize_
from torchao.testing.model_architectures import ToyLinearModel
from torchao.utils import (
TORCH_VERSION_AT_LEAST_2_6,
_is_fbgemm_genai_gpu_available,
)


class ToyLinearModel(torch.nn.Module):
def __init__(self, m=512, n=256, k=128):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)
self.linear3 = torch.nn.Linear(k, 64, bias=False)

def example_inputs(
self, batch_size, sequence_length=10, dtype=torch.bfloat16, device="cuda"
):
return [
torch.randn(
1, sequence_length, self.linear1.in_features, dtype=dtype, device=device
)
for j in range(batch_size)
]

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
x = self.linear3(x)
return x


@unittest.skipIf(not torch.cuda.is_available(), reason="CUDA not available")
@unittest.skipIf(
not _is_fbgemm_genai_gpu_available(),
@@ -31,23 +31,12 @@
is_sm_at_least_89,
is_sm_at_least_90,
)
from torchao.testing.model_architectures import ToyLinearModel

# Needed since changing args to function causes recompiles
torch._dynamo.config.cache_size_limit = 128


class ToyLinearModel(torch.nn.Module):
def __init__(self, in_features, out_features):
super().__init__()
self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)
self.linear2 = torch.nn.Linear(out_features, in_features, bias=False)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x


# TODO: move tests in test_affine_quantized_float.py here after we migrated all implementations
@unittest.skipIf(not TORCH_VERSION_AT_LEAST_2_8, "Need pytorch 2.8+")
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
20 changes: 1 addition & 19 deletions test/quantization/test_quant_api.py
@@ -64,6 +64,7 @@
Int8WeightOnlyQuantizedLinearWeight,
)
from torchao.quantization.utils import compute_error
from torchao.testing.model_architectures import ToyLinearModel
from torchao.testing.utils import skip_if_rocm
from torchao.utils import (
TORCH_VERSION_AT_LEAST_2_3,
@@ -131,25 +132,6 @@ def quantize(self, model: torch.nn.Module) -> torch.nn.Module:
return model


class ToyLinearModel(torch.nn.Module):
def __init__(self, m=64, n=32, k=64, bias=False):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=bias).to(torch.float)
self.linear2 = torch.nn.Linear(n, k, bias=bias).to(torch.float)

def example_inputs(self, batch_size=1, dtype=torch.float, device="cpu"):
return (
torch.randn(
batch_size, self.linear1.in_features, dtype=dtype, device=device
),
)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x


def _ref_change_linear_weights_to_int8_dqtensors(model, filter_fn=None, **kwargs):
"""
The deprecated implementation for int8 dynamic quant API, used as a reference for
18 changes: 3 additions & 15 deletions test/sparsity/test_fast_sparse_training.py
@@ -15,22 +15,10 @@
swap_linear_with_semi_sparse_linear,
swap_semi_sparse_linear_with_linear,
)
from torchao.testing.model_architectures import ToyLinearModel
from torchao.utils import TORCH_VERSION_AT_LEAST_2_4, is_fbcode


class ToyModel(nn.Module):
def __init__(self):
super().__init__()
self.linear1 = nn.Linear(128, 256, bias=False)
self.linear2 = nn.Linear(256, 128, bias=False)

def forward(self, x):
x = self.linear1(x)
x = torch.nn.functional.relu(x)
x = self.linear2(x)
return x


class TestRuntimeSemiStructuredSparsity(TestCase):
@unittest.skipIf(not TORCH_VERSION_AT_LEAST_2_4, "pytorch 2.4+ feature")
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
@@ -42,7 +30,7 @@ def test_runtime_weight_sparsification(self):

input = torch.rand((128, 128)).half().cuda()
grad = torch.rand((128, 128)).half().cuda()
model = ToyModel().half().cuda()
model = ToyLinearModel().half().cuda()
model_c = copy.deepcopy(model)

for name, mod in model.named_modules():
@@ -91,7 +79,7 @@ def test_runtime_weight_sparsification_compile(self):

input = torch.rand((128, 128)).half().cuda()
grad = torch.rand((128, 128)).half().cuda()
model = ToyModel().half().cuda()
model = ToyLinearModel().half().cuda()
model_c = copy.deepcopy(model)

for name, mod in model.named_modules():
15 changes: 1 addition & 14 deletions torchao/quantization/README.md
@@ -276,20 +276,7 @@ from torchao.quantization.quant_api import (
quantize_,
Int4WeightOnlyConfig,
)

class ToyLinearModel(torch.nn.Module):
def __init__(self, m=64, n=32, k=64):
super().__init__()
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)

def example_inputs(self, batch_size=1, dtype=torch.float32, device="cpu"):
return (torch.randn(batch_size, self.linear1.in_features, dtype=dtype, device=device),)

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
return x
from torchao.testing.model_architectures import ToyLinearModel

dtype = torch.bfloat16
m = ToyLinearModel(1024, 1024, 1024).eval().to(dtype).to("cuda")
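
The collapsed remainder of this example applies the config to the model; a minimal sketch of that step (group_size=32 and the input shape are illustrative choices, not taken from the README):

# Replace each linear weight with an int4 weight-only quantized tensor, in place.
quantize_(m, Int4WeightOnlyConfig(group_size=32))

# Run the quantized model on a sample input.
x = torch.randn(1, 1024, dtype=dtype, device="cuda")
out = m(x)
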
19 changes: 16 additions & 3 deletions torchao/testing/model_architectures.py
@@ -11,14 +11,27 @@
import torch.nn.functional as F


# TODO: Refactor torchao and tests to use these models
class ToyLinearModel(torch.nn.Module):
def __init__(self, k=64, n=32, dtype=torch.bfloat16):
def __init__(self, m=512, n=256, k=128):
super().__init__()
self.linear1 = torch.nn.Linear(k, n, bias=False).to(dtype)
self.linear1 = torch.nn.Linear(m, n, bias=False)
self.linear2 = torch.nn.Linear(n, k, bias=False)
self.linear3 = torch.nn.Linear(k, 1, bias=False)
Contributor:

Please create a separate model for two linear layers. The single-linear-layer model is used in the benchmarking run on CI.

Contributor Author (@namgyu-youn, Aug 11, 2025):

@jainapurva I would prefer to define ToySingleLinearModel and ToyMultiLinearModel in a future update, as you mentioned, but how about reverting benchmark_aq.py?

Unit tests (e.g., test_quant_api.py, test_awq.py) use single and multiple layers in a mixed manner, so switching them all to multiple layers would itself be an update. If that makes sense, benchmark_aq.py would be the only case using a single layer. Let me know which option aligns better.

Contributor:

ToySingleLinearModel and ToyMultiLinearModel sound good. Please ensure all the tests run smoothly with them.
For benchmark_aq.py you can add the bias parameter as the last argument in __init__ and set it to False by default. In addition, ToySingleLinearModel is used by .github/workflows/run_microbenchmarks.yml, which relies on create_model_and_input_data; please ensure that method keeps running smoothly and is updated for the new toy models.

Contributor Author (@namgyu-youn):

Sorry for opening the PR without checking that. I will follow your suggestion; thanks for the guidance.

(A sketch of the proposed ToySingleLinearModel / ToyMultiLinearModel split appears after this file's diff.)


def example_inputs(
self, batch_size, sequence_length=10, dtype=torch.bfloat16, device="cuda"
):
return [
torch.randn(
1, sequence_length, self.linear1.in_features, dtype=dtype, device=device
)
for j in range(batch_size)
]

def forward(self, x):
x = self.linear1(x)
x = self.linear2(x)
x = self.linear3(x)
return x


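
Based on the review discussion above, one possible shape for the proposed split is sketched below. This is a minimal sketch only: the names ToySingleLinearModel and ToyMultiLinearModel, the default sizes, and the bias argument come from the review comments, not from merged code.

import torch


class ToySingleLinearModel(torch.nn.Module):
    """Single linear layer for an m * k * n problem size (the shape used by benchmark_aq.py)."""

    def __init__(self, m=64, n=32, k=64, bias=False, dtype=torch.float, device="cuda"):
        super().__init__()
        self.m = m
        self.dtype = dtype
        self.device = device
        self.linear = torch.nn.Linear(k, n, bias=bias).to(dtype=dtype, device=device)

    def example_inputs(self):
        return (
            torch.randn(
                self.m, self.linear.in_features, dtype=self.dtype, device=self.device
            ),
        )

    def forward(self, x):
        return self.linear(x)


class ToyMultiLinearModel(torch.nn.Module):
    """Two stacked linear layers, matching the toy model used in the unit tests and docs."""

    def __init__(self, m=64, n=32, k=64, bias=False):
        super().__init__()
        self.linear1 = torch.nn.Linear(m, n, bias=bias)
        self.linear2 = torch.nn.Linear(n, k, bias=bias)

    def example_inputs(self, batch_size=1, dtype=torch.float32, device="cpu"):
        return (
            torch.randn(
                batch_size, self.linear1.in_features, dtype=dtype, device=device
            ),
        )

    def forward(self, x):
        return self.linear2(self.linear1(x))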