pytorch new tdnnf structure #3923
```diff
@@ -1,6 +1,7 @@
 #!/usr/bin/env python3

 # Copyright 2019-2020 Mobvoi AI Lab, Beijing, China (author: Fangjun Kuang)
+# Copyright 2019-2020 JD AI, Beijing, China (author: Lu Fan)
 # Apache 2.0

 import logging
```
```diff
@@ -20,14 +21,13 @@ def get_chain_model(feat_dim,
                     hidden_dim,
                     bottleneck_dim,
                     time_stride_list,
-                    conv_stride_list,
                     lda_mat_filename=None):
     model = ChainModel(feat_dim=feat_dim,
                        output_dim=output_dim,
                        lda_mat_filename=lda_mat_filename,
                        hidden_dim=hidden_dim,
-                       time_stride_list=time_stride_list,
-                       conv_stride_list=conv_stride_list)
+                       bottleneck_dim=bottleneck_dim,
+                       time_stride_list=time_stride_list)
     return model
```
```diff
@@ -72,15 +72,14 @@ def __init__(self,
                  lda_mat_filename=None,
                  hidden_dim=1024,
                  bottleneck_dim=128,
-                 time_stride_list=[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
-                 conv_stride_list=[1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1],
+                 time_stride_list=[1, 1, 1, 0, 3, 3, 3, 3, 3, 3, 3, 3],
                  frame_subsampling_factor=3):
         super().__init__()

         # at present, we support only frame_subsampling_factor to be 3
         assert frame_subsampling_factor == 3

-        assert len(time_stride_list) == len(conv_stride_list)
         self.frame_subsampling_factor = frame_subsampling_factor
+        self.time_stride_list = time_stride_list
         num_layers = len(time_stride_list)

         # tdnn1_affine requires [N, T, C]
```
```diff
@@ -93,20 +92,17 @@ def __init__(self,
         tdnnfs = []
         for i in range(num_layers):
             time_stride = time_stride_list[i]
-            conv_stride = conv_stride_list[i]
             layer = FactorizedTDNN(dim=hidden_dim,
                                    bottleneck_dim=bottleneck_dim,
-                                   time_stride=time_stride,
-                                   conv_stride=conv_stride)
+                                   time_stride=time_stride)
             tdnnfs.append(layer)

         # tdnnfs requires [N, C, T]
         self.tdnnfs = nn.ModuleList(tdnnfs)

         # prefinal_l affine requires [N, C, T]
         self.prefinal_l = OrthonormalLinear(dim=hidden_dim,
-                                            bottleneck_dim=bottleneck_dim * 2,
-                                            time_stride=0)
+                                            bottleneck_dim=bottleneck_dim * 2)

         # prefinal_chain requires [N, C, T]
         self.prefinal_chain = PrefinalLayer(big_dim=hidden_dim,
```
```diff
@@ -174,6 +170,13 @@ def forward(self, x):
         # tdnnf requires input of shape [N, C, T]
         for i in range(len(self.tdnnfs)):
             x = self.tdnnfs[i](x)
+            # stride manually, do not stride context
+            if self.tdnnfs[i].time_stride == 0:
+                cur_context = sum(self.time_stride_list[i:])
+                x_left = x[:, :, :cur_context]
+                x_mid = x[:, :, cur_context:-cur_context:self.frame_subsampling_factor]
+                x_right = x[:, :, -cur_context:]
+                x = torch.cat([x_left, x_mid, x_right], dim=2)
```
**Contributor:** I'm surprised that you are doing this manually rather than using a 1d convolution. This could be quite slow.

**Author:** I want to subsample only the length of the window, not the left_context and right_context. Training is slower than before, but it works. Please help me write this as a 1d convolution.

**Contributor:** What might have happened here is that you tripled the dimension in the middle of the network.

**Author:** It just subsamples t_out_length from (24+150+24) to (24+50+24) manually; the number of parameters will not increase compared with the stride kernel (2,2) version. I explained this code's behaviour in the picture below.
```diff
         # at this point, x is [N, C, T]
```
---
```diff
@@ -53,24 +53,16 @@ def _constrain_orthonormal_internal(M):

 class OrthonormalLinear(nn.Module):

-    def __init__(self, dim, bottleneck_dim, time_stride):
+    def __init__(self, dim, bottleneck_dim, kernel_size=1, dilation=1):
         super().__init__()
-        assert time_stride in [0, 1]
-        # WARNING(fangjun): kaldi uses [-1, 0] for the first linear layer
-        # and [0, 1] for the second affine layer;
-        # we use [-1, 0, 1] for the first linear layer if time_stride == 1
-
-        if time_stride == 0:
-            kernel_size = 1
-        else:
-            kernel_size = 3
-
         self.kernel_size = kernel_size
+        self.dilation = dilation

         # conv requires [N, C, T]
         self.conv = nn.Conv1d(in_channels=dim,
                               out_channels=bottleneck_dim,
                               kernel_size=kernel_size,
+                              dilation=dilation,
```
**Contributor:** You should never need the dilation parameter. I think we discussed this before.

**Contributor:** ... instead of using dilation, do a 3-fold subsampling after the last layer that had stride=1. Please don't argue about this! I remember last time was quite painful.

**Author:** hah, I found the earlier discussion. I was just trying to make the length of the output equal to the supervision.

**Author:** Are the shapes of the outputs generated by the tdnnf layers correct?
```diff
                               bias=False)

     def forward(self, x):
```
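The kernel_size/dilation discussion is easier to follow with the standard Conv1d output-length formula (quoted in this PR's own test comments; stride 1 and no padding here). A small helper, assuming the formula from the PyTorch Conv1d documentation:

```python
# Standard Conv1d output length (see the PyTorch Conv1d docs):
# T_out = floor((T + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1
def conv1d_out_len(t, kernel_size=1, dilation=1, stride=1, padding=0):
    return (t + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv1d_out_len(10, kernel_size=1))              # 10: pointwise, T preserved
print(conv1d_out_len(10, kernel_size=2, dilation=1))  # 9: loses 1 frame
print(conv1d_out_len(10, kernel_size=2, dilation=3))  # 7: loses `dilation` frames
```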
```diff
@@ -116,7 +108,7 @@ def __init__(self, big_dim, small_dim):
         self.batchnorm1 = nn.BatchNorm1d(num_features=big_dim)
         self.linear = OrthonormalLinear(dim=big_dim,
                                         bottleneck_dim=small_dim,
-                                        time_stride=0)
+                                        kernel_size=1)
         self.batchnorm2 = nn.BatchNorm1d(num_features=small_dim)

     def forward(self, x):
```
**Contributor:** I'm surprised you didn't implement the TDNN_F layer in the "obvious" way with 1-d convolution.

**Contributor:** Inside

**Contributor:** kernel_size=1 doesn't look right. Some extremely weird stuff is going on in this PR.

**Author:** Hi, Dan, this code's behaviour is no different from before; only the parameter changed. The original stride version also used kernel_size=1 as the default.

**Contributor:** It's getting less clear to me, not more clear, and in any case the code is not right.

**Author:** sorry about this
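For context on the `kernel_size=1` exchange: a Conv1d with `kernel_size=1` applies the same linear map independently at every time step, so it preserves T and adds no temporal context. A dependency-free sketch (hypothetical helper in plain Python rather than torch):

```python
# A pointwise (kernel_size=1) 1-d convolution is a per-frame matrix multiply:
# weight is [out_channels][in_channels], x is [in_channels][T].
def pointwise_conv(weight, x):
    T = len(x[0])
    return [[sum(w[c] * x[c][t] for c in range(len(x))) for t in range(T)]
            for w in weight]

w = [[1.0, 2.0]]            # one output channel, two input channels
x = [[1.0, 2.0, 3.0],       # channel 0 over T=3 frames
     [4.0, 5.0, 6.0]]       # channel 1
print(pointwise_conv(w, x)) # [[9.0, 12.0, 15.0]] -- T unchanged
```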
```diff
@@ -161,29 +153,32 @@ def __init__(self,
                  dim,
                  bottleneck_dim,
                  time_stride,
-                 conv_stride,
                  bypass_scale=0.66):
         super().__init__()

-        assert conv_stride in [1, 3]
         assert abs(bypass_scale) <= 1

         self.bypass_scale = bypass_scale
+        self.time_stride = time_stride

-        self.conv_stride = conv_stride
+        if time_stride > 0:
+            kernel_size, dilation = 2, time_stride
+        else:
+            kernel_size, dilation = 1, 1

         # linear requires [N, C, T]
         self.linear = OrthonormalLinear(dim=dim,
                                         bottleneck_dim=bottleneck_dim,
-                                        time_stride=time_stride)
+                                        kernel_size=kernel_size,
+                                        dilation=dilation)

         # affine requires [N, C, T]
         # WARNING(fangjun): we do not use nn.Linear here
         # since we want to use `stride`
         self.affine = nn.Conv1d(in_channels=bottleneck_dim,
                                 out_channels=dim,
-                                kernel_size=1,
-                                stride=conv_stride)
+                                kernel_size=kernel_size,
+                                dilation=dilation)

         # batchnorm requires [N, C, T]
         self.batchnorm = nn.BatchNorm1d(num_features=dim)
```
```diff
@@ -213,10 +208,11 @@ def forward(self, x):

         # TODO(fangjun): implement GeneralDropoutComponent in PyTorch

-        if self.linear.kernel_size == 3:
-            x = self.bypass_scale * input_x[:, :, 1:-1:self.conv_stride] + x
+        # at this point, x is [N, C, T]
+        if self.linear.kernel_size == 2:
+            x = self.bypass_scale * input_x[:, :, self.linear.dilation:-self.linear.dilation:1] + x
         else:
-            x = self.bypass_scale * input_x[:, :, ::self.conv_stride] + x
+            x = self.bypass_scale * input_x[:, :, ::1] + x
         return x

     def constrain_orthonormal(self):
```
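The bypass slice `input_x[:, :, dilation:-dilation]` drops `dilation` frames at each end. A quick way to see why that is the right amount (assuming, as in this PR, that both the linear and the affine convolution use `kernel_size=2` with the same dilation, stride 1, no padding): each convolution shortens T by `dilation * (kernel_size - 1)` frames. A minimal pure-Python check:

```python
# A kernel_size=2, dilation=d conv (stride 1, no padding) shortens T by d.
# FactorizedTDNN applies two such convs (linear, then affine), so the
# residual branch must drop d frames at each end: input_x[:, :, d:-d].
def after_two_convs(t, dilation, kernel_size=2):
    loss_per_conv = dilation * (kernel_size - 1)
    return t - 2 * loss_per_conv

t, d = 150, 3
assert after_two_convs(t, d) == len(list(range(t))[d:-d])
print(after_two_convs(t, d))  # 144
```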
```diff
@@ -257,8 +253,7 @@ def compute_loss(M):

     model = FactorizedTDNN(dim=1024,
                            bottleneck_dim=128,
-                           time_stride=1,
-                           conv_stride=3)
+                           time_stride=1)
     loss = []
     model.constrain_orthonormal()
     loss.append(
```
```diff
@@ -278,40 +273,29 @@ def _test_factorized_tdnn():
     N = 1
     T = 10
     C = 4

-    # case 0: time_stride == 1, conv_stride == 1
     # https://pytorch.org/docs/stable/nn.html?highlight=conv1d#torch.nn.Conv1d
     # T_out = math.ceil((T + 2 * padding - dilation * (kernel_size - 1) - 1) / stride)
+    # case 0: time_stride == 1, kernel_size == 2, dilation == 1
     model = FactorizedTDNN(dim=C,
                            bottleneck_dim=2,
-                           time_stride=1,
-                           conv_stride=1)
+                           time_stride=1)
     x = torch.arange(N * T * C).reshape(N, C, T).float()
     y = model(x)
     assert y.size(2) == T - 2

-    # case 1: time_stride == 0, conv_stride == 1
+    # case 1: time_stride == 0, kernel_size == 1, dilation == 1
     model = FactorizedTDNN(dim=C,
                            bottleneck_dim=2,
-                           time_stride=0,
-                           conv_stride=1)
+                           time_stride=0)
     y = model(x)
     assert y.size(2) == T

-    # case 2: time_stride == 1, conv_stride == 3
+    # case 2: time_stride == 3, kernel_size == 2, dilation == 3
     model = FactorizedTDNN(dim=C,
                            bottleneck_dim=2,
-                           time_stride=1,
-                           conv_stride=3)
+                           time_stride=3)
     y = model(x)
-    assert y.size(2) == math.ceil((T - 2) / 3)
-
-    # case 3: time_stride == 0, conv_stride == 3
-    model = FactorizedTDNN(dim=C,
-                           bottleneck_dim=2,
-                           time_stride=0,
-                           conv_stride=3)
-    y = model(x)
-    assert y.size(2) == math.ceil(T / 3)
+    assert y.size(2) == math.ceil(math.ceil((T - 3)) - 3)

 if __name__ == '__main__':
     torch.manual_seed(20200130)
```
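The somewhat opaque expression `math.ceil(math.ceil((T - 3)) - 3)` in case 2 can be cross-checked without torch. This hypothetical helper (not part of the PR) assumes stride 1, no padding, and both convolutions in the layer sharing the same kernel/dilation, as in this diff:

```python
# Reproduce the expected output lengths from _test_factorized_tdnn.
def tdnnf_out_len(t, time_stride):
    if time_stride > 0:
        kernel_size, dilation = 2, time_stride
    else:
        kernel_size, dilation = 1, 1
    # the layer applies two convolutions (linear, then affine),
    # each shrinking T by dilation * (kernel_size - 1)
    for _ in range(2):
        t = t - dilation * (kernel_size - 1)
    return t

T = 10
print(tdnnf_out_len(T, 1))  # 8  (case 0: T - 2)
print(tdnnf_out_len(T, 0))  # 10 (case 1: T unchanged)
print(tdnnf_out_len(T, 3))  # 4  (case 2: equals ceil(ceil(T - 3) - 3))
```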
**Contributor:** mm.. I'm a bit surprised this `* 2` is here?

**Contributor:** mm.. I think you were assuming that the final layer's bottleneck is always twice the TDNN-F layers' bottleneck. In fact we generally leave the final layer's bottleneck at 256, which for some reason seems to work across a range of conditions. You could make that a separate configuration value.

**Author:** When I checked the parameter shapes of Kaldi's model, I didn't find the difference between the final layer and the previous layers that you describe.

**Author:** `* 2` is used here to follow what Kaldi does. I've changed it to be configurable in this pull request: #3925
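Following the reviewer's suggestion, the prefinal bottleneck could be its own configuration value instead of being hard-wired to `bottleneck_dim * 2`. A hedged sketch; `prefinal_bottleneck_dim` is a hypothetical parameter name, and the actual change is made in #3925:

```python
# Hypothetical helper: decouple the prefinal bottleneck from the
# TDNN-F bottleneck, defaulting to the PR's current "* 2" behaviour.
def prefinal_bottleneck(bottleneck_dim, prefinal_bottleneck_dim=None):
    if prefinal_bottleneck_dim is None:
        return bottleneck_dim * 2        # current behaviour in this PR
    return prefinal_bottleneck_dim       # e.g. 256, as the reviewer suggests

print(prefinal_bottleneck(128))          # 256 (implicit "* 2")
print(prefinal_bottleneck(128, 256))     # 256, now explicit and configurable
```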