
Conversation

@whettenr
Contributor

What does this PR do?

Adds an implementation of BEST-RQ.
Adds a layer-dropout interface for the transformer classes.
Adds hidden-state outputs and wrappers so that the MP3S benchmarks can be run.
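
For illustration, consuming the new hidden-state output could look roughly like the sketch below. It mirrors the hyperparameters shared later in this thread; the shapes and values are illustrative, not prescriptive.

import torch
from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR, EncoderWrapper

# Configuration mirrors the YAML discussed further down in this thread.
transformer = TransformerASR(
    input_size=640,
    tgt_vocab=5000,
    d_model=512,
    nhead=8,
    num_encoder_layers=12,
    num_decoder_layers=0,
    d_ffn=2048,
    dropout=0.1,
    activation=torch.nn.GELU,
    encoder_module="branchformer",
    attention_type="hypermixing",
    normalize_before=True,
    causal=False,
    output_hidden_states=True,  # option added in this PR
)
encoder = EncoderWrapper(transformer=transformer)

feats = torch.randn(4, 100, 640)  # (batch, frames, features) from the CNN front-end
lengths = torch.ones(4)           # relative lengths in [0, 1]
enc_out, hidden_states = encoder(feats, lengths)
print(len(hidden_states), enc_out.shape)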

Fixes #<issue_number>

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@whettenr whettenr changed the title from Sync deletion to BEST-RQ implementation Dec 19, 2023
@Adel-Moumen
Collaborator

Hello @whettenr,

Thanks for this PR! We've recently merged a lot of new PRs into develop as part of SpeechBrain 1.0. Unfortunately, this PR has many conflicts due to the latest merges. Would you mind updating your fork and fixing the conflicts? Thanks.

Best,
Adel

@whettenr
Contributor Author

Hey @Adel-Moumen,
Yes I will work on this! Thanks!
Ryan

@TParcollet
Collaborator

Hello @whettenr, a few major changes before we go into the details of the code. As you know, SB follows a dataset-task-oriented structure, so you will need to refactor your code layout. BEST-RQ is pretrained on LibriSpeech, so it should live in the LibriSpeech self-supervised-learning folder, alongside wav2vec2. The fine-tuning recipes should go in the corresponding task folders, e.g. ASR/CTC. I also think you have too many hparams files; let's keep only the most meaningful ones.

@AmirHussein96

AmirHussein96 commented Jul 10, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue?
I also tried HuBERT in the same setup and the resulting PER is 7.
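
For context, the "linear combination" of hidden representations in this SUPERB-style setup is typically a learned softmax-weighted sum over layers; a generic sketch of that idea (not the exact s3prl/MP3S code) would be:

import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learned softmax-weighted sum over the hidden states of a frozen upstream model."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)        # (layers, batch, time, dim)
        norm_weights = torch.softmax(self.weights, dim=0)  # one scalar weight per layer
        return (norm_weights[:, None, None, None] * stacked).sum(dim=0)

# Example: 12 layers of 512-dim features
combiner = WeightedLayerSum(num_layers=12)
feats = combiner([torch.randn(2, 50, 512) for _ in range(12)])
print(feats.shape)  # torch.Size([2, 50, 512])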


This is my expert.py which shows how I loaded and used the pretrained BestRQ model in s3prl:

from collections import OrderedDict
from typing import Dict, List, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from audiotools import AudioSignal

# Load the SpeechBrain BEST-RQ recipe code
import sys
sys.path.insert(0, '/mm2/ahussein/speechbrain/recipes/BEST-RQ')
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

HIDDEN_DIM = 8

hyperparams = """
output_folder: /mm2/scratch/ahussein/speechbrain/results/best_hyperbranchconformer/2000
save_folder: !ref <output_folder>/save
pt_model_hub: /mm2/scratch/ahussein/speechbrain/results/best_hyperbranchconformer/2000/save/CKPT+2024-07-10+16-14-27+00
pt_model_output_dim: 512
sample_rate: 16000

####################### Model parameters ###########################
# Transformer
d_model: 512
nhead: 8 # table 1 https://arxiv.org/pdf/2010.10504.pdf
num_encoder_layers: 12 # section 4.1.1
num_decoder_layers: 0
d_ffn: 2048
transformer_dropout: 0.1
activation: !name:torch.nn.GELU
output_neurons: 5000


# Feature parameters
n_fft: 400
n_mels: 80
hop_length: 10
pad_to_divisible_by: 4

# quantizer parameters (not used for fine-tuning)
p_input: 320
cb_dim: 16
cb_vocab: 8192


compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>
    
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
    update_until_epoch: 4

CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
    input_shape: (8, 10, 80)
    num_blocks: 2
    num_layers_per_block: 1
    out_channels: (128, 32)
    kernel_sizes: (3, 3)
    strides: (2, 2)
    residuals: (False, False)

Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
    input_size: 640
    tgt_vocab: !ref <output_neurons>
    d_model: !ref <d_model>
    nhead: !ref <nhead>
    num_encoder_layers: !ref <num_encoder_layers>
    num_decoder_layers: !ref <num_decoder_layers>
    d_ffn: !ref <d_ffn>
    dropout: !ref <transformer_dropout>
    activation: !ref <activation>
    encoder_module: branchformer
    attention_type: hypermixing
    normalize_before: True
    causal: False
    output_hidden_states: True
    

enc: !new:speechbrain.lobes.models.transformer.TransformerASR.EncoderWrapper
   transformer: !ref <Transformer>

# Quantizer: !new:.quantiser.RandomProjectionQuantizer
Quantizer: !new:speechbrain.nnet.quantisers.RandomProjectionQuantizer
    # projection
    input_dim: !ref <p_input>
    # codebook
    cb_dim: !ref <cb_dim>    
    cb_vocab: !ref <cb_vocab>

linear: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <d_model>
    n_neurons: !ref <cb_vocab>

pt_model: !new:torch.nn.ModuleList
    - [!ref <CNN>, !ref <enc>, !ref <Quantizer>, !ref <linear>]

modules:
   normalize: !ref <normalize>
   pt_model: !ref <pt_model>
   
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
   collect_in: !ref <save_folder>
   loadables:
      pt_model: !ref <pt_model>
      normalize: !ref <normalize>

   paths:
      pt_model: !ref <pt_model_hub>/model.ckpt
      normalize: !ref <pt_model_hub>/normalizer.ckpt
"""



class UpstreamExpert(nn.Module):
    def __init__(self, ckpt: str = None, model_config: str = None, **kwargs):
        """
        Args:
            ckpt:
                The checkpoint path for loading your pretrained weights.
                Can be assigned by the -k option in run_downstream.py

            model_config:
                The config path for constructing your model.
                Can be assigned by the -g option in run_downstream.py
        """
        super().__init__()
        loaded_hparams = load_hyperpyyaml(hyperparams)
        self.name = "[BestRQ]"
        
        self.model = loaded_hparams



    def get_downsample_rates(self, key: str) -> int:
        """
        Downsample rate of the upstream features relative to the raw waveform:
        160 samples per 10 ms Fbank frame at 16 kHz, times 4 from the CNN
        front-end strides of (2, 2).
        """
        return 160 * 4

    def get_output_lengths(self, input_lengths):
        """Number of feature frames produced by the feature extractor; this is
        necessary to compute the masks of BestRQ.
        """
        sr = self.model["sample_rate"]
        hop_length = self.model["hop_length"]

        # hop_length is in ms, so sr * hop_length / 1000 samples per frame (160 at 16 kHz).
        return (input_lengths // (sr * hop_length / 1000) + 1).to(torch.long)
    
    def forward(self, wavs: List[Tensor]) -> Dict[str, Union[Tensor, List[Tensor]]]:
        """
        When the returned Dict contains a List with more than one Tensor,
        those Tensors should have the same shape so a weighted sum can be
        trained on them.
        """
        device = wavs[0].device
        length = torch.tensor([data.shape[0] for data in wavs], dtype=torch.int32, device=device)

        # Convert sample counts to feature-frame counts, then to relative
        # lengths in [0, 1] as expected by the SpeechBrain modules below
        # (integer division here would zero out all but the longest item).
        length = self.get_output_lengths(length)
        max_len = torch.max(length)
        length = length / max_len
        wavs = pad_sequence(wavs, batch_first=True)
        feats = self.model['compute_features'](wavs)
        feats = self.model['modules']['normalize'](feats, length, epoch=10)
        # pt_model[0] is the CNN front-end, pt_model[1] the wrapped Transformer encoder.
        self.model['modules']['pt_model'][0] = self.model['modules']['pt_model'][0].to(device)
        src = self.model['modules']['pt_model'][0](feats)
        self.model['modules']['pt_model'][1] = self.model['modules']['pt_model'][1].to(device)
        enc_out, hidden_activations = self.model['modules']['pt_model'][1](src, length)
        return {
            "hidden_states": hidden_activations[1:],
        }
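
A minimal smoke test of this wrapper with random audio, assuming a SpeechBrain build that includes this PR (the printed shapes are only for sanity checking):

if __name__ == "__main__":
    # Random 3 s and 2 s waveforms at 16 kHz; no claim about feature quality here.
    expert = UpstreamExpert()
    wavs = [torch.randn(3 * 16000), torch.randn(2 * 16000)]
    out = expert(wavs)
    hidden = out["hidden_states"]
    # Note: the Pretrainer in the YAML above is only instantiated in __init__;
    # restoring the pretrained weights would still require calling its
    # collect_files() / load_collected() methods.
    print(len(hidden), hidden[0].shape)  # roughly one tensor per encoder layer: (batch, frames, d_model)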

@whettenr
Contributor Author

whettenr commented Jul 11, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

I'm not sure exactly what could be causing it, but there are a few key differences that could lead to worse performance.

  • Changing from Conformer to Branchformer: I have not experimented thoroughly with Branchformer, so you might need to adjust other hyperparameters to make it work well.
  • I'm not sure how well Branchformer and HyperMixing work together. Maybe you could try the original Branchformer. I have experimented with Conformer and HyperMixing and got pretty good results, but I'm not sure about HyperMixing with Branchformer.
  • Changing the warmup steps: I'm not sure this makes a big difference, but it is a difference; a slow warmup can help prevent getting stuck in a poor local minimum.
  • Batch size: in my opinion this probably has the biggest effect. The batch size, mask ratio, and learning rate are all very important. Remember that the total batch size is the max number of seconds * number of GPUs * grad accumulation (see the quick arithmetic sketch below). It looks like you increased the batch size a lot (you might need to increase the number of buckets a bit too if you haven't). Given that, it might be worth trying to change the mask ratio and/or learning rate.
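
A quick back-of-the-envelope example of that total batch size computation (the numbers below are made up for illustration, not the recipe's actual values):

num_gpus = 4                 # GPUs used for pretraining
seconds_per_gpu = 100        # max batch length in seconds on a single GPU
grad_accumulation = 4        # gradient accumulation factor
effective_seconds = seconds_per_gpu * num_gpus * grad_accumulation
print(f"effective batch: {effective_seconds} s ≈ {effective_seconds / 3600:.2f} h per optimizer step")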

My advice:
If you are OK with not using Branchformer with HyperMixing and just want to reproduce results, I would try the Conformer (I have trained the Conformer with a bigger batch size of around 1.7 hours and got better results than https://arxiv.org/pdf/2405.04296). If you have to use Branchformer, I would first try the standard Branchformer just to see if you can get good results with that combination of hyperparameters. If not, I would experiment with the hyperparameters (mask size, learning rate, warmup, ...).

Hope this helps

@AmirHussein96

AmirHussein96 commented Jul 11, 2024

Thanks, @whettenr, for the quick response. I followed your recommendation and used the Conformer setup. However, I noticed that although you have defined the Noam scheduler in the Conformer config file, it is not used in your train.py file (https://github.com/whettenr/speechbrain/blob/sync-deletion/recipes/BEST-RQ/train.py). During training, the logs showed the same value for the learning rate throughout. Was this intentional or an oversight?

I assumed it might have been an oversight, so I modified train.py slightly to incorporate the Noam scheduler. Here is the change I made:

    def on_fit_batch_end(self, batch, outputs, loss, should_step):
        """ Called after fit_batch(), updates learning rate and does per-step logging. """
        if should_step:
            if isinstance(
                self.hparams.noam_annealing, sb.nnet.schedulers.NoamScheduler
            ) or isinstance(
                self.hparams.noam_annealing,
                sb.nnet.schedulers.CyclicCosineScheduler,
            ):
                self.hparams.noam_annealing(self.optimizer)
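
As a sanity check, stepping SpeechBrain's NoamScheduler should change the optimizer's learning rate on every call during warmup; a standalone sketch (the lr and warmup values below are illustrative, not the recipe's):

import torch
from speechbrain.nnet.schedulers import NoamScheduler

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-8)  # placeholder lr; the scheduler overwrites it
sched = NoamScheduler(lr_initial=8e-4, n_warmup_steps=25000)  # illustrative values

for step in range(3):
    old_lr, new_lr = sched(opt)
    print(step, old_lr, new_lr)  # lr should increase step by step during warmup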

I also added TensorBoard to visualize training and validation losses:
[TensorBoard screenshot: bestrq]

The current setup is exactly the same as yours, except I am using 5 GPUs with a batch size of 1000 samples per GPU, buckets=100, and warmup=700, masking 10%. Let me know if the values I obtained for training and validation are close to the ranges of your values or if they are off. By the way, the accuracy on validation is around 20%.

@Adel-Moumen
Collaborator

Adel-Moumen commented Jul 12, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

If you want to reproduce the results, I believe @whettenr used the MP3S benchmark available here: https://github.com/speechbrain/benchmarks/tree/main/benchmarks/MP3S. As pointed out by Ryan, increasing the batch size and reducing the warmup might negatively affect the downstream results.

BTW, I just saw on your GitHub profile that you were at JHU. I am currently working there as part of the JSALT workshop so if you want we can have a one-to-one discussion about speechbrain and your issue :)

@whettenr
Contributor Author

Thanks, @whettenr, for the quick response. I followed your recommendation and used the Conformer setup. However, I noticed that although you have defined the Noam scheduler in the Conformer config file, it is not used in your train.py file (https://github.com/whettenr/speechbrain/blob/sync-deletion/recipes/BEST-RQ/train.py). During training, the logs showed the same value for the learning rate throughout. Was this intentional or an oversight?

I assumed it might have been an oversight, so I modified train.py slightly to incorporate the Noam scheduler. Here is the change I made:

    def on_fit_batch_end(self, batch, outputs, loss, should_step):
        """ Called after fit_batch(), updates learning rate and does per-step logging. """
        if should_step:
            if isinstance(
                self.hparams.noam_annealing, sb.nnet.schedulers.NoamScheduler
            ) or isinstance(
                self.hparams.noam_annealing,
                sb.nnet.schedulers.CyclicCosineScheduler,
            ):
                self.hparams.noam_annealing(self.optimizer)

I also added TensorBoard to visualize training and validation losses: [TensorBoard screenshot: bestrq]

The current setup is exactly the same as yours, except I am using 5 GPUs with a batch size of 1000 samples per GPU, buckets=100, and warmup=700, masking 10%. Let me know if the values I obtained for training and validation are close to the ranges of your values or if they are off. By the way, the accuracy on validation is around 20%.

I believe a validation accuracy of around 20% is pretty good! Have you tried that model on the downstream tasks? And yes, I did use the MP3S benchmark.

@AmirHussein96

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

If you want to reproduce the results, I believe @whettenr used the MP3S benchmark available here: https://github.com/speechbrain/benchmarks/tree/main/benchmarks/MP3S. As pointed out by Ryan, increasing the batch size and reducing the warmup might negatively affect the downstream results.

BTW, I just saw on your GitHub profile that you were at JHU. I am currently working there as part of the JSALT workshop so if you want we can have a one-to-one discussion about speechbrain and your issue :)

Thank you, @Adel-Moumen! I will definitely try the MP3S benchmark.

It's such a coincidence that you're at JSALT. I hope you're enjoying your time there. Unfortunately, I'm currently in Boston, but I look forward to meeting you in person someday.

@AmirHussein96

AmirHussein96 commented Jul 12, 2024

I believe a validation accuracy of around 20% is pretty good! Have you tried that model on the downstream tasks? And yes, I did use the MP3S benchmark.

@whettenr I saw your logs here https://github.com/whettenr/bestrq/blob/main/results/best_hyperconformer/2000/log.txt and the numbers look close to mine. Do you mind sharing the train_brq.py and ssl_brq_hyperconformer.yaml from https://github.com/whettenr/bestrq/blob/main/run/downstream/run_base_hyperconf_v100.sh?

@TParcollet TParcollet added ready to review Waiting on reviewer to provide feedback enhancement New feature or request recipes Changes to recipes only (add/edit) labels Sep 2, 2024
@TParcollet
Collaborator

@asumagic I took care of cleaning, fixing, and documenting the code, and even retrained the models. Could you please have a brief look at the code to see if anything seems crazy to you? I also added the return type documentation for the new hidden_states return.

The models are missing from the readme, but we can do another PR for that imho.

If @asumagic is happy, I'll merge.

Titouan Parcollet (Embedded AI/SRUK/Engineer/Samsung Electronics) added 3 commits September 2, 2024 14:05
@asumagic asumagic self-assigned this Sep 2, 2024
@asumagic asumagic added this to the v1.1.0 milestone Sep 2, 2024
Collaborator

@asumagic asumagic left a comment


LGTM, nice! I've only made a superficial review, though. With those points fixed, if the recipe tests pass, it's OK with me.

@xxchauncey

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

hi @AmirHussein96 ,

Have you tried fine-tuning with only a linear layer and CTC loss, like HuBERT? Does it work?

@TParcollet
Collaborator

@xxchauncey it does work; there is a recipe for that in this PR, which will get merged soon. We will train the models and share the results soon-ish as well.
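
For reference, a frozen-features linear probe trained with CTC generally looks like the sketch below (the feature dimension and token count are illustrative assumptions, not the recipe's actual values):

import torch
import torch.nn as nn

# Generic sketch of a CTC linear probe on frozen upstream features
# (feature dim 512 and 31 output tokens incl. blank are assumptions).
feat_dim, num_tokens = 512, 31
probe = nn.Linear(feat_dim, num_tokens)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 120, feat_dim)                       # frozen upstream features (B, T, D)
logits = probe(feats).log_softmax(dim=-1).transpose(0, 1)   # CTC expects (T, B, C)
targets = torch.randint(1, num_tokens, (4, 20))
input_lens = torch.full((4,), 120, dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc_loss(logits, targets, input_lens, target_lens)
loss.backward()  # only the linear probe receives gradients when the upstream is frozen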

@TParcollet
Collaborator

@asumagic I added a few docstring tests for the hidden_states thing, but I guess you are right: we don't even have a single unit test for the TransformerASR models outside of docstring tests... maybe we should have a PR to fix that.

@TParcollet TParcollet merged commit c2a5b7d into speechbrain:develop Sep 3, 2024
@karynaur

yayyy
