
Why use Audio2Mel's method for extracting the mel spectrogram? #36

shawnbzhang opened this issue Nov 24, 2020 · 1 comment

@shawnbzhang

Audio2Mel does the following to extract the mel spectrogram:

    data, sampling_rate = load(full_path, sr=self.sampling_rate)
    data = 0.95 * normalize(data)

    if self.augment:
        amplitude = np.random.uniform(low=0.3, high=1.0)
        data = data * amplitude

    return torch.from_numpy(data).float(), sampling_rate

which is then passed through forward:

    def forward(self, audio):
        p = (self.n_fft - self.hop_length) // 2
        audio = F.pad(audio, (p, p), "reflect").squeeze(1)
        fft = torch.stft(
            audio,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            win_length=self.win_length,
            window=self.window,
            center=False,
        )
        real_part, imag_part = fft.unbind(-1)
        magnitude = torch.sqrt(real_part ** 2 + imag_part ** 2)
        mel_output = torch.matmul(self.mel_basis, magnitude)
        log_mel_spec = torch.log10(torch.clamp(mel_output, min=1e-5))
        return log_mel_spec
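(A side note on the snippet itself: `fft.unbind(-1)` assumes the old real/imaginary-stacked `torch.stft` output, which newer PyTorch versions deprecate. With `return_complex=True` the magnitude step collapses to `.abs()`. A minimal sketch with made-up shapes, not the repo's actual parameters:)

```python
import torch

# Placeholder STFT parameters for illustration (not the repo's exact values).
n_fft, hop_length, win_length = 1024, 256, 1024
window = torch.hann_window(win_length)
audio = torch.randn(1, 4096)  # dummy mono batch

# Modern torch.stft returns a complex tensor when return_complex=True.
fft = torch.stft(
    audio,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=win_length,
    window=window,
    center=False,
    return_complex=True,
)

# .abs() is equivalent to sqrt(real**2 + imag**2) in the original code.
magnitude = fft.abs()
manual = torch.sqrt(fft.real ** 2 + fft.imag ** 2)
assert torch.allclose(magnitude, manual, atol=1e-6)
```

The mel projection and log-clamp would then proceed exactly as in the quoted forward.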

Is there a benefit to doing this over Torchaudio's mel spectrogram function, e.g.:

    data, sampling_rate = torchaudio.load(full_path)
    melspec_ops = torchaudio.transforms.MelSpectrogram(sample_rate=sampling_rate,
        n_fft=self.n_fft,
        win_length=self.win_length,
        hop_length=self.hop_length,
        f_min=0,
        f_max=None,
        n_mels=self.n_mel_channels)

    mel_spec = melspec_ops(data)

    log_mel_spec = torch.log10(mel_spec + 1e-9)
    return log_mel_spec

I'm just curious about this design choice, since it wasn't really touched on in the paper.

Side question: why multiply the normalized waveform by 0.95 in the original method?
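If it helps frame the question: my understanding is that librosa's normalize with default arguments is peak normalization (divide by the maximum absolute value), so the 0.95 just leaves a little headroom below full scale. A tiny NumPy check of that reading (normalize is reimplemented by hand here, not imported):

```python
import numpy as np

# Hand-rolled stand-in for librosa.util.normalize with its default norm:
# divide by the peak absolute value so the signal spans [-1, 1].
data = np.array([0.1, -0.5, 0.25], dtype=np.float32)
normalized = data / np.max(np.abs(data))
scaled = 0.95 * normalized

# After scaling, the waveform peaks at exactly 0.95.
assert np.isclose(np.max(np.abs(scaled)), 0.95)
```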

@J0shuaFernandes

Can you share the full Audio2Mel code?
