
Conversation

@whettenr
Contributor

What does this PR do?

Adds an implementation of BEST-RQ.
Adds a layer-dropout interface for the transformer classes.
Adds hidden-state outputs and wrappers so that the MP3S benchmarks can be run.
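
For illustration, consuming the new hidden-state output could look roughly like the sketch below. It mirrors the hyperparameters shared later in this thread; the shapes and values are illustrative, not prescriptive.

import torch
from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR, EncoderWrapper

# Configuration mirrors the YAML discussed further down in this thread.
transformer = TransformerASR(
    input_size=640,
    tgt_vocab=5000,
    d_model=512,
    nhead=8,
    num_encoder_layers=12,
    num_decoder_layers=0,
    d_ffn=2048,
    dropout=0.1,
    activation=torch.nn.GELU,
    encoder_module="branchformer",
    attention_type="hypermixing",
    normalize_before=True,
    causal=False,
    output_hidden_states=True,  # option added in this PR
)
encoder = EncoderWrapper(transformer=transformer)

feats = torch.randn(4, 100, 640)  # (batch, frames, features) from the CNN front-end
lengths = torch.ones(4)           # relative lengths in [0, 1]
enc_out, hidden_states = encoder(feats, lengths)
print(len(hidden_states), enc_out.shape)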

Fixes #<issue_number>

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@whettenr whettenr changed the title from Sync deletion to BEST-RQ implementation Dec 19, 2023
@Adel-Moumen
Collaborator

Hello @whettenr,

Thanks for this PR! We've recently merged a lot of new PRs into develop as part of SpeechBrain 1.0. Unfortunately, this PR has many conflicts due to the latest merges. Would you mind updating your fork and fixing the conflicts? Thanks.

Best,
Adel

@whettenr
Contributor Author

Hey @Adel-Moumen,
Yes I will work on this! Thanks!
Ryan

@TParcollet
Collaborator

Hello @whettenr, a few major changes before we go into the details of the code. As you know, SB follows a dataset-task-oriented structure, so you will need to refactor your code layout. BEST-RQ is pretrained on LibriSpeech, so it should live in the LibriSpeech self-supervised-learning folder, alongside wav2vec2. The fine-tuning recipes should go in the corresponding task folders, e.g. ASR/CTC. I also think you have too many hparams files; let's keep only the most meaningful ones.

@AmirHussein96

AmirHussein96 commented Jul 10, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue?
I also tried HuBERT in the same setup and the resulting PER is 7.
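
For context, the "linear combination" of hidden representations in this SUPERB-style setup is typically a learned softmax-weighted sum over layers; a generic sketch of that idea (not the exact s3prl/MP3S code) would be:

import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Learned softmax-weighted sum over the hidden states of a frozen upstream model."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)        # (layers, batch, time, dim)
        norm_weights = torch.softmax(self.weights, dim=0)  # one scalar weight per layer
        return (norm_weights[:, None, None, None] * stacked).sum(dim=0)

# Example: 12 layers of 512-dim features
combiner = WeightedLayerSum(num_layers=12)
feats = combiner([torch.randn(2, 50, 512) for _ in range(12)])
print(feats.shape)  # torch.Size([2, 50, 512])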


This is my expert.py which shows how I loaded and used the pretrained BestRQ model in s3prl:

from collections import OrderedDict
from typing import Dict, List, Union

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn.utils.rnn import pad_sequence
from audiotools import AudioSignal

# Load the SpeechBrain BEST-RQ recipe code
import sys
sys.path.insert(0, '/mm2/ahussein/speechbrain/recipes/BEST-RQ')
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml

HIDDEN_DIM = 8

hyperparams = """
output_folder: /mm2/scratch/ahussein/speechbrain/results/best_hyperbranchconformer/2000
save_folder: !ref <output_folder>/save
pt_model_hub: /mm2/scratch/ahussein/speechbrain/results/best_hyperbranchconformer/2000/save/CKPT+2024-07-10+16-14-27+00
pt_model_output_dim: 512
sample_rate: 16000

####################### Model parameters ###########################
# Transformer
d_model: 512
nhead: 8 # table 1 https://arxiv.org/pdf/2010.10504.pdf
num_encoder_layers: 12 # section 4.1.1
num_decoder_layers: 0
d_ffn: 2048
transformer_dropout: 0.1
activation: !name:torch.nn.GELU
output_neurons: 5000


# Feature parameters
n_fft: 400
n_mels: 80
hop_length: 10
pad_to_divisible_by: 4

# quantizer parameters (not used for fine-tuning)
p_input: 320
cb_dim: 16
cb_vocab: 8192


compute_features: !new:speechbrain.lobes.features.Fbank
    sample_rate: !ref <sample_rate>
    n_fft: !ref <n_fft>
    n_mels: !ref <n_mels>
    
normalize: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
    update_until_epoch: 4

CNN: !new:speechbrain.lobes.models.convolution.ConvolutionFrontEnd
    input_shape: (8, 10, 80)
    num_blocks: 2
    num_layers_per_block: 1
    out_channels: (128, 32)
    kernel_sizes: (3, 3)
    strides: (2, 2)
    residuals: (False, False)

Transformer: !new:speechbrain.lobes.models.transformer.TransformerASR.TransformerASR # yamllint disable-line rule:line-length
    input_size: 640
    tgt_vocab: !ref <output_neurons>
    d_model: !ref <d_model>
    nhead: !ref <nhead>
    num_encoder_layers: !ref <num_encoder_layers>
    num_decoder_layers: !ref <num_decoder_layers>
    d_ffn: !ref <d_ffn>
    dropout: !ref <transformer_dropout>
    activation: !ref <activation>
    encoder_module: branchformer
    attention_type: hypermixing
    normalize_before: True
    causal: False
    output_hidden_states: True
    

enc: !new:speechbrain.lobes.models.transformer.TransformerASR.EncoderWrapper
   transformer: !ref <Transformer>

# Quantizer: !new:.quantiser.RandomProjectionQuantizer
Quantizer: !new:speechbrain.nnet.quantisers.RandomProjectionQuantizer
    # projection
    input_dim: !ref <p_input>
    # codebook
    cb_dim: !ref <cb_dim>    
    cb_vocab: !ref <cb_vocab>

linear: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <d_model>
    n_neurons: !ref <cb_vocab>

pt_model: !new:torch.nn.ModuleList
    - [!ref <CNN>, !ref <enc>, !ref <Quantizer>, !ref <linear>]

modules:
   normalize: !ref <normalize>
   pt_model: !ref <pt_model>
   
pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
   collect_in: !ref <save_folder>
   loadables:
      pt_model: !ref <pt_model>
      normalize: !ref <normalize>

   paths:
      pt_model: !ref <pt_model_hub>/model.ckpt
      normalize: !ref <pt_model_hub>/normalizer.ckpt
"""



class UpstreamExpert(nn.Module):
    def __init__(self, ckpt: str = None, model_config: str = None, **kwargs):
        """
        Args:
            ckpt:
                The checkpoint path for loading your pretrained weights.
                Can be assigned by the -k option in run_downstream.py

            model_config:
                The config path for constructing your model.
                Can be assigned by the -g option in run_downstream.py
        """
        super().__init__()
        loaded_hparams = load_hyperpyyaml(hyperparams)
        self.name = "[BestRQ]"
        
        self.model = loaded_hparams



    def get_downsample_rates(self, key: str) -> int:
        """
        Downsample rate of the upstream features relative to the raw waveform:
        160 samples per 10 ms Fbank frame at 16 kHz, times 4 from the CNN
        front-end strides of (2, 2).
        """
        return 160 * 4

    def get_output_lengths(self, input_lengths):
        """Number of feature frames produced by the feature extractor; this is
        necessary to compute the masks of BestRQ.
        """
        sr = self.model["sample_rate"]
        hop_length = self.model["hop_length"]

        # hop_length is in ms, so sr * hop_length / 1000 samples per frame (160 at 16 kHz).
        return (input_lengths // (sr * hop_length / 1000) + 1).to(torch.long)
    
    def forward(self, wavs: List[Tensor]) -> Dict[str, Union[Tensor, List[Tensor]]]:
        """
        When the returned Dict contains a List with more than one Tensor,
        those Tensors should have the same shape so a weighted sum can be
        trained on them.
        """
        device = wavs[0].device
        length = torch.tensor([data.shape[0] for data in wavs], dtype=torch.int32, device=device)

        # Convert sample counts to feature-frame counts, then to relative
        # lengths in [0, 1] as expected by the SpeechBrain modules below
        # (integer division here would zero out all but the longest item).
        length = self.get_output_lengths(length)
        max_len = torch.max(length)
        length = length / max_len
        wavs = pad_sequence(wavs, batch_first=True)
        feats = self.model['compute_features'](wavs)
        feats = self.model['modules']['normalize'](feats, length, epoch=10)
        # pt_model[0] is the CNN front-end, pt_model[1] the wrapped Transformer encoder.
        self.model['modules']['pt_model'][0] = self.model['modules']['pt_model'][0].to(device)
        src = self.model['modules']['pt_model'][0](feats)
        self.model['modules']['pt_model'][1] = self.model['modules']['pt_model'][1].to(device)
        enc_out, hidden_activations = self.model['modules']['pt_model'][1](src, length)
        return {
            "hidden_states": hidden_activations[1:],
        }
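
A minimal smoke test of this wrapper with random audio, assuming a SpeechBrain build that includes this PR (the printed shapes are only for sanity checking):

if __name__ == "__main__":
    # Random 3 s and 2 s waveforms at 16 kHz; no claim about feature quality here.
    expert = UpstreamExpert()
    wavs = [torch.randn(3 * 16000), torch.randn(2 * 16000)]
    out = expert(wavs)
    hidden = out["hidden_states"]
    # Note: the Pretrainer in the YAML above is only instantiated in __init__;
    # restoring the pretrained weights would still require calling its
    # collect_files() / load_collected() methods.
    print(len(hidden), hidden[0].shape)  # roughly one tensor per encoder layer: (batch, frames, d_model)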

@whettenr
Contributor Author

whettenr commented Jul 11, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

I'm not sure exactly what could be causing it, but there are a few key differences that could lead to worse performance.

  • Changing from Conformer to Branchformer: I have not experimented thoroughly with Branchformer, so you might need to adjust other hyperparameters to make it work well.
  • I'm not sure how well Branchformer and HyperMixing work together. Maybe you could try the original Branchformer. I have experimented with Conformer and HyperMixing and got pretty good results, but I'm not sure about HyperMixing with Branchformer.
  • Changing the warmup steps: I'm not sure this makes a big difference, but it is a difference; a slow warmup can help prevent getting stuck in a poor local minimum.
  • Batch size: in my opinion this probably has the biggest effect. The batch size, mask ratio, and learning rate are all very important. Remember that the total batch size is the max number of seconds * number of GPUs * grad accumulation (see the quick arithmetic sketch below). It looks like you increased the batch size a lot (you might need to increase the number of buckets a bit too if you haven't). Given that, it might be worth trying to change the mask ratio and/or learning rate.
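
A quick back-of-the-envelope example of that total batch size computation (the numbers below are made up for illustration, not the recipe's actual values):

num_gpus = 4                 # GPUs used for pretraining
seconds_per_gpu = 100        # max batch length in seconds on a single GPU
grad_accumulation = 4        # gradient accumulation factor
effective_seconds = seconds_per_gpu * num_gpus * grad_accumulation
print(f"effective batch: {effective_seconds} s ≈ {effective_seconds / 3600:.2f} h per optimizer step")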

My advice:
If you are OK with not using Branchformer with HyperMixing and just want to reproduce results, I would try the Conformer (I have trained the Conformer with a bigger batch size of around 1.7 hours and got better results than https://arxiv.org/pdf/2405.04296). If you have to use Branchformer, I would first try the standard Branchformer just to see if you can get good results with that combination of hyperparameters. If not, I would experiment with the hyperparameters (mask size, learning rate, warmup, ...).

Hope this helps

@AmirHussein96

AmirHussein96 commented Jul 11, 2024

Thanks, @whettenr, for the quick response. I followed your recommendation and used the Conformer setup. However, I noticed that although you have defined the Noam scheduler in the Conformer config file, it is not used in your train.py file (https://github.com/whettenr/speechbrain/blob/sync-deletion/recipes/BEST-RQ/train.py). During training, the logs showed the same value for the learning rate throughout. Was this intentional or an oversight?

I assumed it might have been an oversight, so I modified train.py slightly to incorporate the Noam scheduler. Here is the change I made:

    def on_fit_batch_end(self, batch, outputs, loss, should_step):
        """ Called after fit_batch(), updates learning rate and does per-step logging. """
        if should_step:
            if isinstance(
                self.hparams.noam_annealing, sb.nnet.schedulers.NoamScheduler
            ) or isinstance(
                self.hparams.noam_annealing,
                sb.nnet.schedulers.CyclicCosineScheduler,
            ):
                self.hparams.noam_annealing(self.optimizer)
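
As a sanity check, stepping SpeechBrain's NoamScheduler should change the optimizer's learning rate on every call during warmup; a standalone sketch (the lr and warmup values below are illustrative, not the recipe's):

import torch
from speechbrain.nnet.schedulers import NoamScheduler

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-8)  # placeholder lr; the scheduler overwrites it
sched = NoamScheduler(lr_initial=8e-4, n_warmup_steps=25000)  # illustrative values

for step in range(3):
    old_lr, new_lr = sched(opt)
    print(step, old_lr, new_lr)  # lr should increase step by step during warmup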

I also added TensorBoard to visualize training and validation losses:
[TensorBoard screenshot: bestrq]

The current setup is exactly the same as yours, except I am using 5 GPUs with a batch size of 1000 samples per GPU, buckets=100, and warmup=700, masking 10%. Let me know if the values I obtained for training and validation are close to the ranges of your values or if they are off. By the way, the accuracy on validation is around 20%.

@Adel-Moumen
Collaborator

Adel-Moumen commented Jul 12, 2024

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

If you want to reproduce the results, I believe @whettenr used the MP3S benchmark available here: https://github.com/speechbrain/benchmarks/tree/main/benchmarks/MP3S. As pointed out by Ryan, increasing the batch size and reducing the warmup might negatively affect the downstream results.

BTW, I just saw on your GitHub profile that you were at JHU. I am currently working there as part of the JSALT workshop so if you want we can have a one-to-one discussion about speechbrain and your issue :)

@whettenr
Contributor Author

Thanks, @whettenr, for the quick response. I followed your recommendation and used the Conformer setup. However, I noticed that although you have defined the Noam scheduler in the Conformer config file, it is not used in your train.py file (https://github.com/whettenr/speechbrain/blob/sync-deletion/recipes/BEST-RQ/train.py). During training, the logs showed the same value for the learning rate throughout. Was this intentional or an oversight?

I assumed it might have been an oversight, so I modified train.py slightly to incorporate the Noam scheduler. Here is the change I made:

    def on_fit_batch_end(self, batch, outputs, loss, should_step):
        """ Called after fit_batch(), updates learning rate and does per-step logging. """
        if should_step:
            if isinstance(
                self.hparams.noam_annealing, sb.nnet.schedulers.NoamScheduler
            ) or isinstance(
                self.hparams.noam_annealing,
                sb.nnet.schedulers.CyclicCosineScheduler,
            ):
                self.hparams.noam_annealing(self.optimizer)

I also added TensorBoard to visualize training and validation losses: [TensorBoard screenshot: bestrq]

The current setup is exactly the same as yours, except I am using 5 GPUs with a batch size of 1000 samples per GPU, buckets=100, and warmup=700, masking 10%. Let me know if the values I obtained for training and validation are close to the ranges of your values or if they are off. By the way, the accuracy on validation is around 20%.

I believe a validation accuracy of around 20% is pretty good! Have you tried that model on the downstream tasks? And yes, I did use the MP3S benchmark.

@AmirHussein96

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

If you want to reproduce the results, I believe @whettenr used the MP3S benchmark available here: https://github.com/speechbrain/benchmarks/tree/main/benchmarks/MP3S. As pointed out by Ryan, increasing the batch size and reducing the warmup might negatively affect the downstream results.

BTW, I just saw on your GitHub profile that you were at JHU. I am currently working there as part of the JSALT workshop so if you want we can have a one-to-one discussion about speechbrain and your issue :)

Thank you, @Adel-Moumen! I will definitely try the MP3S benchmark.

It's such a coincidence that you're at JSALT. I hope you're enjoying your time there. Unfortunately, I'm currently in Boston, but I look forward to meeting you in person someday.

@AmirHussein96

AmirHussein96 commented Jul 12, 2024

I believe a validation accuracy of around 20% is pretty good! Have you tried that model on the downstream tasks? And yes, I did use the MP3S benchmark.

@whettenr I saw your logs here https://github.com/whettenr/bestrq/blob/main/results/best_hyperconformer/2000/log.txt and the numbers look close to mine. Do you mind sharing the train_brq.py and ssl_brq_hyperconformer.yaml from https://github.com/whettenr/bestrq/blob/main/run/downstream/run_base_hyperconf_v100.sh?

@TParcollet TParcollet added ready to review Waiting on reviewer to provide feedback enhancement New feature or request recipes Changes to recipes only (add/edit) labels Sep 2, 2024
@TParcollet
Collaborator

@asumagic I took care of cleaning, fixing, and documenting the code, and even retrained the models. Could you please have a brief look at the code to see if anything seems crazy to you? I also added the return type documentation for the new hidden_states return.

The models are missing from the readme, but we can do another PR for that imho.

If @asumagic is happy, I'll merge.

Titouan Parcollet (Embedded AI/SRUK/Engineer/Samsung Electronics) added 3 commits September 2, 2024 14:05
@asumagic asumagic self-assigned this Sep 2, 2024
@asumagic asumagic added this to the v1.1.0 milestone Sep 2, 2024
Collaborator

@asumagic asumagic left a comment


LGTM, nice! I've only made a superficial review, though. With those points fixed, if the recipe tests pass, it's OK with me.

@xxchauncey

@whettenr thank you for your contribution. I tried reproducing your results from this paper. I followed your pretraining implementation from here using the Branchformer configuration. The only differences in my setup were increasing the batch size from 100 to 1000 and, to match your setup, reducing the warmup steps from 25000 to 2500. After pretraining, I froze the model and used all hidden representations with a learned linear combination, following the SUPERB CTC downstream task, and trained a single-layer BiLSTM on the 100 h LibriSpeech subset, but I got a PER of 65 on test-clean, which is significantly worse than the WER reported in the paper. Do you have any ideas on what might be causing this issue? I also tried HuBERT in the same setup and the resulting PER is 7.

hi @AmirHussein96 ,

Have you tried fine-tuning with only a linear layer and CTC loss, like HuBERT? Does it work?

@TParcollet
Collaborator

@xxchauncey it does work; there is a recipe for that in this PR, which will get merged soon. We will train the models and share the results soon-ish as well.
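
For reference, a frozen-features linear probe trained with CTC generally looks like the sketch below (the feature dimension and token count are illustrative assumptions, not the recipe's actual values):

import torch
import torch.nn as nn

# Generic sketch of a CTC linear probe on frozen upstream features
# (feature dim 512 and 31 output tokens incl. blank are assumptions).
feat_dim, num_tokens = 512, 31
probe = nn.Linear(feat_dim, num_tokens)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 120, feat_dim)                       # frozen upstream features (B, T, D)
logits = probe(feats).log_softmax(dim=-1).transpose(0, 1)   # CTC expects (T, B, C)
targets = torch.randint(1, num_tokens, (4, 20))
input_lens = torch.full((4,), 120, dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)
loss = ctc_loss(logits, targets, input_lens, target_lens)
loss.backward()  # only the linear probe receives gradients when the upstream is frozen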

@TParcollet
Collaborator

@asumagic I added a few docstring tests for the hidden_states thing, but I guess you are right: we don't even have a single unit test for the TransformerASR models outside of docstring tests... maybe we should have a PR to fix that.

@TParcollet TParcollet merged commit c2a5b7d into speechbrain:develop Sep 3, 2024
@karynaur

yayyy
