Possible implementation errors with the AudioEncoder class in whisper/model.py #2217

milkyfun0 · 2024-06-10T09:54:11Z

The Problems

class AudioEncoder(nn.Module):
    def __init__(
        self, n_mels: int, n_ctx: int, n_state: int, n_head: int, n_layer: int
    ):
        super().__init__()
        self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1)
        self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1)
        self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))

        self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList(
            [ResidualAttentionBlock(n_state, n_head) for _ in range(n_layer)]
        )
        self.ln_post = LayerNorm(n_state)

    def forward(self, x: Tensor):
        """
        x : torch.Tensor, shape = (batch_size, n_mels, n_ctx)
            the mel spectrogram of the audio
        """
        x = F.gelu(self.conv1(x))  # batch_size, n_state, n_ctx
        x = F.gelu(self.conv2(x))  # batch_size, n_state, n_ctx // 2
        x = x.permute(0, 2, 1)  # batch_size, n_ctx // 2, n_state

        assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
        x = (x + self.positional_embedding).to(x.dtype)

        for block in self.blocks:
            x = block(x)

        x = self.ln_post(x)
        return x

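self.conv2 has a stride of 2, so in forward() the n_ctx dimension of x is halved. With an input of shape (batch_size, n_mels, n_ctx), as the docstring describes, the assertion should therefore always fail with "incorrect audio shape".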

In the pre-trained Whisper weights, the shape of self.positional_embedding is still (n_ctx, n_state), not (n_ctx // 2, n_state). Therefore, I suspect there may be an error in the code, but I am not certain.
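
The reasoning above can be checked with plain arithmetic, without loading the model. The sketch below applies the standard Conv1d output-length formula to the two convolutions and mirrors only the time-dimension part of the assert. The helper names are hypothetical, and the values n_ctx = 1500 and a 3000-frame mel input are my assumption about the released Whisper configuration, not something stated in this post.

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution with dilation 1 (PyTorch Conv1d formula)."""
    return (length + 2 * padding - (kernel_size - 1) - 1) // stride + 1


def encoder_seq_len(n_frames: int) -> int:
    """Time dimension after conv1 (stride 1) and conv2 (stride 2) in AudioEncoder."""
    after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
    return conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)


def assertion_passes(n_frames: int, n_ctx: int) -> bool:
    """Hypothetical helper: mirrors the time-dimension check of the assert in forward()."""
    return encoder_seq_len(n_frames) == n_ctx


# Assumed values (not from the post): n_ctx = 1500, mel input of 3000 frames.
n_ctx = 1500
print(assertion_passes(n_frames=n_ctx, n_ctx=n_ctx))      # False: 1500 frames -> 750
print(assertion_passes(n_frames=2 * n_ctx, n_ctx=n_ctx))  # True: 3000 frames -> 1500
```

If those assumed values are right, the assert only passes when the input has 2 * n_ctx frames. That would suggest the n_ctx in the docstring is what is misleading, rather than the shape of the positional embedding.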

I am a beginner, so I cannot tell whether this is really an error. Please point out and correct any mistakes in my reasoning.

Replies: 0 comments
