Merged

31 commits:
e983409  added downsampling code (salah-zaiem, Mar 16, 2023)
c5b7fe4  corrected macs bugs (salah-zaiem, Mar 16, 2023)
309e503  changed flake8 errors (salah-zaiem, Mar 16, 2023)
3309910  fixed too big spaces (salah-zaiem, Mar 16, 2023)
a3b5e99  added recipe in teste (salah-zaiem, Mar 17, 2023)
6c1165e  put in the CTC code (salah-zaiem, Mar 23, 2023)
696f164  added downsampling recipes in test (salah-zaiem, Mar 23, 2023)
1050519  reformatted downsampler file (salah-zaiem, Mar 23, 2023)
5cc02d3  fixed links to readme.md (salah-zaiem, Mar 23, 2023)
5c187f3  removed bad drive link (salah-zaiem, Mar 23, 2023)
7c4c27e  added docstrings (salah-zaiem, Mar 23, 2023)
62c47b4  fixed path in recieps (salah-zaiem, Mar 23, 2023)
d4fd29f  fixed README (salah-zaiem, Mar 23, 2023)
11bfec6  added docstring to downsampler wrapper (salah-zaiem, Mar 23, 2023)
f3a5f97  docstring to forward function (salah-zaiem, Mar 23, 2023)
51e6bc4  fixed expected shapes (salah-zaiem, Mar 23, 2023)
862b652  black fix on downsampling.py (salah-zaiem, Mar 23, 2023)
eafc7da  removed trailing whitespaces from readme (salah-zaiem, Mar 23, 2023)
72ddb5b  fixed white space in yaml (salah-zaiem, Mar 23, 2023)
8fc2539  removed white line (salah-zaiem, Mar 23, 2023)
a8d41cc  added recipes and check language modelling (salah-zaiem, Mar 23, 2023)
5d92fd6  Update extra_requirements.txt (Mar 24, 2023)
1ae7e99  quick tests (TParcollet, Mar 24, 2023)
605a803  update (TParcollet, Mar 24, 2023)
bb46fb0  update readme (TParcollet, Mar 24, 2023)
0d0e8cb  update readme (TParcollet, Mar 24, 2023)
d00803e  Merge branch 'develop' of https://github.com/speechbrain/speechbrain … (TParcollet, Mar 24, 2023)
2f33f0c  fix mixed precision (TParcollet, Mar 24, 2023)
b021a2f  fix yaml (TParcollet, Mar 24, 2023)
4b7c7c4  fixing import (TParcollet, Mar 24, 2023)
1947061  update extra requirement (TParcollet, Mar 24, 2023)
55 changes: 51 additions & 4 deletions recipes/LibriSpeech/ASR/CTC/README.md
@@ -4,13 +4,24 @@
You can download LibriSpeech at http://www.openslr.org/12.

**Supported pre-trained wav2vec2:** [SpeechBrain](https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/self-supervised-learning/wav2vec2) and [HuggingFace](https://github.com/speechbrain/speechbrain/tree/develop/recipes/CommonVoice/self-supervised-learning/wav2vec2)

**If using a HuggingFace pre-trained model, please make sure you have "transformers"
installed in your environment (see extra-requirements.txt)**

# How to run
```
python train_with_wav2vec.py hparams/file.yaml
```
```
python train_with_whisper.py hparams/file.yaml
```
To fine-tune WavLM with downsampled input signals (for faster training and inference), run:
```
python train_with_wav2vec.py hparams/downsampled/train_hf_wavlm_signal_downsampling.yaml --downsampling_factor 2
```
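The trailing `--downsampling_factor 2` is a standard SpeechBrain command-line override: it replaces the value of `downsampling_factor` defined in the YAML file, so the factors reported in the table below (2 or 3) can be tried without editing the hparams file.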

# KenLM n-gram CTC rescoring
To enable n-gram rescoring during decoding, you can download the official LibriSpeech LM from [here](https://www.openslr.org/11/). Please make sure to install the extra dependencies first. Any KenLM language model may be used with this rescoring technique. Results are reported without rescoring.
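The recipe wires this up for you, but as a rough standalone illustration, KenLM-rescored CTC decoding with `pyctcdecode` (one of the extra dependencies) looks something like the sketch below. The label list, logits, and LM path are illustrative placeholders, not values taken from this recipe.

```python
# Hypothetical sketch: KenLM-rescored CTC decoding with pyctcdecode.
# Labels, logits, and the LM path are placeholders for illustration only.
import numpy as np
from pyctcdecode import build_ctcdecoder

# CTC output alphabet; the empty string at index 0 is the blank token,
# matching blank_index: 0 in the hparams. Truncated here for brevity.
labels = ["", " ", "'", "a", "b", "c"]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="4-gram.arpa",  # e.g., an LM downloaded from openslr.org/11
)

# Per-frame scores from the acoustic model, shape (time, vocab).
logits = np.random.randn(100, len(labels)).astype(np.float32)

print(decoder.decode(logits))
```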

# Results

@@ -20,12 +31,36 @@
| Release | Hyperparams file | Finetuning split | Test-clean WER | HuggingFace link | Model link | GPUs |
|:-------:|:----------------:|:----------------:|:--------------:|:----------------:|:----------:|:----:|
| 22-09-22 | train_sb_wav2vec.yaml | 960h | 4.2 | Not Avail. | Not Avail. | 2xTesla V100 32GB |
| 06-12-23 | train_hf_whisper.yaml (small) | 960h | 4.89 | Not Avail. | Not Avail. | 4xRTX 2080 Ti |

# Downsampling inputs for faster fine-tuning and inference using SSL models
This repository contains the code needed to reproduce part of the results of the paper "Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study".
The experiments reported here are those that yield the largest inference-time reductions while keeping error rates low, by downsampling the input sequences. You can download LibriSpeech at http://www.openslr.org/12.

### Downsampling results with the LibriSpeech train-clean-100 split
Inference times, given in seconds, are for decoding the whole LibriSpeech test-clean split. MACs are the mean MACs over a test batch.
These results were obtained with WavLM Large fine-tuned only on the train-clean-100 split of LibriSpeech (100 hours of speech).

| Name | Factor | WER | GPU Inference Time (s) | CPU Inference Time (s) | WER with LM | GPU Inference Time with LM (s) | CPU Inference Time with LM (s) | MACs (G) |
|------|--------|-----|------------------------|------------------------|-------------|--------------------------------|--------------------------------|----------|
| No SD | 1 | 4.09 | 134 | 1121 | 3.31 | 152 | 1128 | 386.538 |
| CL2 | 2 | 4.61 | 84 | 582 | 3.48 | 98 | 600 | 192.97 |
| CL3 | 3 | 5.47 | 69 | 414 | 4.12 | 91 | 436 | 134.864 |
| AV2 | 2 | 4.93 | 80 | 570 | 3.66 | 98 | 578 | 192.97 |
| AV3 | 3 | 6.01 | 64 | 406 | 4.27 | 90 | 422 | 134.864 |
| SD2 | 2 | 4.85 | 86 | 569 | 3.58 | 97 | 575 | 192.97 |
| SD3 | 3 | 5.83 | 72 | 427 | 4.08 | 89 | 458 | 134.864 |

- CL: learned convolutional downsampling
- SD: signal downsampling
- AV: averaging window
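Reading the table for a factor of 2: per-batch MACs are roughly halved (386.5 G to 193.0 G) while GPU inference time drops from 134 s to 80-86 s depending on the variant. For intuition, the three variants can be sketched in plain PyTorch as below. This is an illustrative approximation with arbitrary padding choices, not the actual `speechbrain.lobes.downsampling` implementation.

```python
# Illustrative stand-ins for the three downsampling variants (assumption:
# simplified sketches, not the real speechbrain.lobes.downsampling code).
import torch
import torch.nn as nn
import torchaudio

factor = 2
wav = torch.randn(1, 1, 16000)  # (batch, channel, time): 1 s of audio at 16 kHz

# CL: learned convolutional downsampling, a strided Conv1d whose weights
# are trained jointly with the rest of the model.
conv_down = nn.Conv1d(1, 1, kernel_size=21, stride=factor, padding=10)
cl_out = conv_down(wav)

# AV: averaging window, fixed average pooling with stride = factor.
av_out = nn.functional.avg_pool1d(wav, kernel_size=factor, stride=factor)

# SD: signal downsampling, proper resampling (here 16 kHz -> 8 kHz).
sd_out = torchaudio.functional.resample(wav, orig_freq=16000, new_freq=8000)

print(cl_out.shape, av_out.shape, sd_out.shape)  # all roughly (1, 1, 8000)
```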

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/

# **Citing**
Please cite SpeechBrain if you use it for your research or business.

```bibtex
@@ -39,3 +74,15 @@
note={arXiv:2106.04624}
}
```
If you use the downsampling approach, please cite:

```bibtex
@article{zaiem2023fine,
  title={Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study},
  author={Zaiem, Salah and Algayres, Robin and Parcollet, Titouan and Essid, Slim and Ravanelli, Mirco},
  journal={arXiv preprint arXiv:2303.06740},
  year={2023}
}
```


3 changes: 2 additions & 1 deletion recipes/LibriSpeech/ASR/CTC/extra_requirements.txt
@@ -1,2 +1,3 @@
# For wav2vec recipe (HuggingFace)
kenlm
pyctcdecode
transformers
@@ -0,0 +1,175 @@
# ################################
# Model: downsampling + wavlm + DNN + CTC
# Augmentation: SpecAugment
# Authors: Sung-Lin Yeh 2021
# Salah Zaiem 2023
# ################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/train_wav2vec2_char/<seed>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# HuggingFace hub ID of the pre-trained WavLM Large model.
wav2vec2_hub: microsoft/wavlm-large
wav2vec2_folder: !ref <save_folder>/wav2vec2_checkpoint

# Data files
data_folder: !PLACEHOLDER # e.g., /path/to/LibriSpeech
# noise/rir dataset will automatically be downloaded
# data_folder_rirs: !ref <data_folder>
train_splits: ["train-clean-100", "train-clean-360", "train-other-500"]
dev_splits: ["dev-clean"]
test_splits: ["test-clean", "test-other"]
skip_prep: False
ckpt_interval_minutes: 25 # save checkpoint every N min
train_csv: !ref <output_folder>/train.csv
valid_csv: !ref <output_folder>/dev-clean.csv
test_csv:
- !ref <output_folder>/test-clean.csv
- !ref <output_folder>/test-other.csv

# Training parameters
number_of_epochs: 1
lr: 0.9
lr_wav2vec: 0.0001
sorting: ascending
auto_mix_prec: False
sample_rate: 16000

# Downsampling parameters
downsampling_factor: 2
downsampling_kernel_size: 21
upsampling: False
use_language_modelling: True
ngram_lm_path: !PLACEHOLDER
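# Added note (rough intuition from the results table in the README above):
# a factor of 2 halves the number of samples the frozen WavLM encoder must
# process, roughly halving per-batch MACs (386.5 G -> 193.0 G) for a
# moderate increase in WER.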

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
# Must be 3 per GPU to fit 32GB of VRAM
batch_size: 6
test_batch_size: 8

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>

valid_dataloader_opts:
    batch_size: !ref <batch_size>

test_dataloader_opts:
    batch_size: !ref <test_batch_size>

# Model parameters
activation: !name:torch.nn.LeakyReLU
dnn_layers: 2
dnn_neurons: 1024
freeze_wav2vec: True

# Outputs
ctc_neurons: 29
output_neurons: 29 # Characters size, index(blank/eos/bos) = 0

# Decoding parameters
blank_index: 0

#
# Functions and classes
#
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]

enc: !new:speechbrain.lobes.models.VanillaNN.VanillaNN
    input_shape: [null, null, 1024]
    activation: !ref <activation>
    dnn_blocks: !ref <dnn_layers>
    dnn_neurons: !ref <dnn_neurons>

wav2vec2: !new:speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2
    source: !ref <wav2vec2_hub>
    output_norm: True
    freeze_feature_extractor: True
    freeze: !ref <freeze_wav2vec>
    save_path: !ref <wav2vec2_folder>

downsampler: !new:speechbrain.lobes.downsampling.PoolingDownsampler
    downsampling_factor: !ref <downsampling_factor>
    kernel_size: !ref <downsampling_kernel_size>
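# Assumed behavior (added note, not from the original file): PoolingDownsampler
# corresponds to the "AV" averaging-window variant in the README, averaging over
# windows of <downsampling_kernel_size> samples with a stride of
# <downsampling_factor>.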

#####
# Uncomment this block if you prefer to use a Fairseq pretrained model instead
# of a HuggingFace one. Here, we provide a URL obtained from the Fairseq
# GitHub repository.
#
#wav2vec2_url: https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt
#wav2vec2: !new:speechbrain.lobes.models.fairseq_wav2vec.FairseqWav2Vec2
#    pretrained_path: !ref <wav2vec2_url>
#    output_norm: True
#    freeze: False
#    save_path: !ref <save_folder>/wav2vec2_checkpoint/model.pt

ctc_lin: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <dnn_neurons>
    n_neurons: !ref <ctc_neurons>

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

ctc_cost: !name:speechbrain.nnet.losses.ctc_loss
    blank_index: !ref <blank_index>

modules:
    wav2vec2: !ref <wav2vec2>
    enc: !ref <enc>
    ctc_lin: !ref <ctc_lin>
    downsampler: !ref <downsampler>

model: !new:torch.nn.ModuleList
    - [!ref <enc>, !ref <ctc_lin>, !ref <downsampler>]

model_opt_class: !name:torch.optim.Adadelta
    lr: !ref <lr>
    rho: 0.95
    eps: 1.e-8

wav2vec_opt_class: !name:torch.optim.Adam
    lr: !ref <lr_wav2vec>

lr_annealing_model: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr>
    improvement_threshold: 0.0025
    annealing_factor: 0.8
    patient: 0

lr_annealing_wav2vec: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr_wav2vec>
    improvement_threshold: 0.0025
    annealing_factor: 0.9
    patient: 0

label_encoder: !new:speechbrain.dataio.encoder.CTCTextEncoder

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        wav2vec2: !ref <wav2vec2>
        model: !ref <model>
        scheduler_model: !ref <lr_annealing_model>
        scheduler_wav2vec: !ref <lr_annealing_wav2vec>
        counter: !ref <epoch_counter>
        tokenizer: !ref <label_encoder>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True
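To make the data flow concrete, here is a minimal sketch of the forward pass implied by the `modules:` list above: waveform -> downsampler -> wav2vec2 (WavLM) -> enc -> ctc_lin -> log-probabilities. It is an illustrative assumption, not the recipe's actual `train_with_wav2vec.py` code, and it uses small stand-in modules rather than the real WavLM.

```python
# Hypothetical sketch of the forward pass implied by the YAML above.
# All modules are small stand-ins, not the real recipe components.
import torch
import torch.nn as nn

downsampler = nn.AvgPool1d(kernel_size=21, stride=2, padding=10)  # stand-in pooling downsampler
wav2vec2 = nn.Conv1d(1, 1024, kernel_size=400, stride=320)        # stand-in for WavLM (1024-dim features)
enc = nn.Sequential(nn.Linear(1024, 1024), nn.LeakyReLU())        # stand-in for the VanillaNN encoder
ctc_lin = nn.Linear(1024, 29)                                     # 29 output_neurons, blank at index 0

wavs = torch.randn(6, 16000)              # (batch, time): one second of audio per item
x = downsampler(wavs.unsqueeze(1))        # (batch, 1, time / downsampling_factor)
feats = wav2vec2(x).transpose(1, 2)       # (batch, frames, 1024)
logits = ctc_lin(enc(feats))              # (batch, frames, 29)
log_probs = logits.log_softmax(dim=-1)    # fed to the CTC loss / decoder
print(log_probs.shape)
```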