Merged

51 commits
3ef35ae
add recipe for whisper finetuning on common-voice data
poonehmousavi Dec 7, 2022
7125dc6
add encoder-freeze option to hparams + add extra dependencies
poonehmousavi Dec 7, 2022
536de68
minor bug
poonehmousavi Dec 12, 2022
e6faeaf
set accented_letters to True for arabic and french
poonehmousavi Dec 13, 2022
7f921bc
minor fix
poonehmousavi Dec 13, 2022
ec1ecec
fix reading audio bug
poonehmousavi Dec 13, 2022
2ea8dc1
remove extra files
poonehmousavi Dec 13, 2022
581011a
fix loss
poonehmousavi Dec 13, 2022
6a73603
fix loss in ar and fr hparams
poonehmousavi Dec 13, 2022
6fd540a
add environment
poonehmousavi Dec 14, 2022
7ebf8da
change test to greedy search instead of beam-search to solve memory …
poonehmousavi Jan 6, 2023
f9e5a9b
add hparams for mongolian, spanish, hindi, serbian, german
poonehmousavi Jan 7, 2023
f3158c8
fix
poonehmousavi Jan 7, 2023
b67c8e3
fix memory issue+ add ja and fa
poonehmousavi Jan 8, 2023
5a3b864
add whisper-encoder_only for common_voice, fix minor bugs
poonehmousavi Jan 9, 2023
ab32fdd
fix bug for es
poonehmousavi Jan 9, 2023
2b5e79f
add weighted sum version
poonehmousavi Jan 11, 2023
cb90d19
update readme file
poonehmousavi Jan 11, 2023
27edc2c
modify en hparams - set accented_letters to False
poonehmousavi Jan 13, 2023
7253544
add test_only option
poonehmousavi Jan 16, 2023
6af3bdd
add final result table - final cleaning
poonehmousavi Jan 20, 2023
e917485
minor change
poonehmousavi Jan 20, 2023
523bd2e
minor change
poonehmousavi Jan 20, 2023
add221d
fix typo
poonehmousavi Jan 20, 2023
8e64623
fix requested change in review
poonehmousavi Jan 25, 2023
d92896a
remove environment file
poonehmousavi Jan 25, 2023
0ca9ea6
fix flag checking for test_only
poonehmousavi Jan 25, 2023
3f16d23
bug fix for ignoring padded tokens for loss calculation
poonehmousavi Jan 28, 2023
a4b2390
loss func
poonehmousavi Jan 28, 2023
fcb7617
add comments
poonehmousavi Jan 29, 2023
51b19c5
final refactoring
poonehmousavi Feb 5, 2023
d1d3042
final refactoring
poonehmousavi Feb 5, 2023
9b0a4da
minor refactoring (removing blank lines, ...)
poonehmousavi Feb 12, 2023
a6ba1b8
remove blank lines
poonehmousavi Feb 12, 2023
f71c2b5
minor refactor
poonehmousavi Feb 12, 2023
73e22c3
apply pre-commit changes
poonehmousavi Feb 12, 2023
2f4d83d
fix pre-commit bugs
poonehmousavi Feb 12, 2023
b7815e4
Merge branch 'speechbrain:develop' into whisper-finetunng-common-voice
poonehmousavi Feb 12, 2023
46a41b6
add test, fix pre-commit errors
poonehmousavi Feb 13, 2023
4503074
fix CL test errors and pre-commit error for complicated method
poonehmousavi Feb 14, 2023
0622477
fix link issue for CL workflow
poonehmousavi Feb 14, 2023
bdca54f
test
poonehmousavi Feb 15, 2023
a136f63
fix cl symlink bug
poonehmousavi Feb 17, 2023
67008c7
Merge branch 'whisper-finetunng-common-voice' of https://github.com/p…
poonehmousavi Feb 17, 2023
b997e6f
Merge branch 'speechbrain:develop' into whisper-finetunng-common-voice
poonehmousavi Feb 17, 2023
2034e26
remove whitespace
poonehmousavi Feb 17, 2023
fe0af9b
fix for CL
poonehmousavi Feb 17, 2023
43ed4d1
remove doc_str example for whisper interface
poonehmousavi Feb 17, 2023
a80c979
fix readme file problem
poonehmousavi Feb 17, 2023
afbd5af
remove HF link from readme file
poonehmousavi Feb 17, 2023
4198fff
remove datasets from dependencies
poonehmousavi Feb 17, 2023
25 changes: 25 additions & 0 deletions recipes/CommonVoice/ASR/transformer/README.md
@@ -4,6 +4,12 @@ This folder contains scripts necessary to run an ASR experiment with the CommonVoice
# How to run
python train.py hparams/{hparam_file}.py

## For Whisper finetuning:

python train_with_whisper.py hparams/train_<locale>_hf_whisper.yaml (e.g., hparams/train_fr_hf_whisper.yaml)

Note: When using the Whisper large model, you can reduce memory usage during model recovery by using the approach introduced in https://github.com/speechbrain/speechbrain/pull/1743.

# Data preparation
It is important to note that CommonVoice initially offers mp3 audio files at 48 kHz. Hence, audio files are downsampled on the fly within the dataio function of the training script.
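For illustration, the resampling step can be sketched as follows (a minimal example assuming `torchaudio`; the actual dataio code in the recipe may differ):

```python
import torchaudio

# Load a CommonVoice mp3 clip and resample it to the 16 kHz
# sample rate expected by the models ("clip.mp3" is a placeholder path).
sig, sr = torchaudio.load("clip.mp3")
resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
sig_16k = resampler(sig)
```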

@@ -12,12 +18,31 @@ Here is a list of the different languages that we tested within the CommonVoice
with our transformers:
- French

For Whisper-large-v2 finetuning, here is the list of languages that we tested within the CommonVoice 10.0 dataset:
- Hindi
- Arabic
- Persian
- Serbian
- Mongolian
- French


# Results

| Language | Release | hyperparams file | LM | Val. CER | Val. WER | Test CER | Test WER | Model link | GPUs |
| ------------- |:-------------:|:---------------------------:| -----:| -----:| -----:| -----:| -----:| :-----------:| :-----------:|
| French | 2020-06-22 | train_fr.yaml | No | 5.15 | 17.80 | 6.01 | 19.21 | [model](https://drive.google.com/drive/folders/12ny6daoz1Ze1MmgLrsqf352AXvhwob6d?usp=sharing) | 1xV100 16GB |

## Whisper Finetuning Results:
The following table contains the Whisper finetuning results for 1 epoch using the whisper-large-v2 model, freezing the encoder and finetuning the decoder (a minimal sketch of this freezing pattern is given below the table).
| Language | Release | hyperparams file | LM | Val. CER | Val. WER | Test CER | Test WER | Model link | GPUs |
| ------------- |:-------------:|:---------------------------:| -----:| -----:| -----:| -----:| -----:| :-----------:| :-----------:|
| Arabic | 2023-01-10 | train_ar_hf_whisper.yaml | No | 4.02 | 12.47 | 5.20 | 16.96 | [model](https://drive.google.com/drive/folders/10mYPYfj9NpDNAa0nO16Zd_K1bIEUOIpx?usp=sharing) | 1xV100 16GB |
| Persian | 2023-01-10 | train_fa_hf_whisper.yaml | No | 6.91 | 25.30 | 9.38 | 31.75 | [model](https://drive.google.com/drive/folders/1nzMMYmB5SxMKsFUk-rM9_ijcqzia8pX7?usp=sharing) | 1xV100 16GB |
| Mongolian | 2023-01-10 | train_mn_hf_whisper.yaml | No | 24.05 | 62.37 | 25.73 | 64.92 | [model](https://drive.google.com/drive/folders/10E2xclgNx_6BFxNmv9i1HorBNnsMveP_?usp=sharing) | 1xV100 16GB |
| Hindi | 2023-01-10 | train_hi_hf_whisper.yaml | No | 4.54 | 10.46 | 7.00 | 15.27 | [model](https://drive.google.com/drive/folders/11PKCsyIE703mmDv6n6n_UnD0bUgMPbg_?usp=sharing) | 1xV100 16GB |
| Serbian | 2023-01-10 | train_sr_hf_whisper.yaml | No | 8.92 | 27.12 | 7.60 | 23.63 | [model](https://drive.google.com/drive/folders/1QG67qoekEB29jBd9knt8stLJD4T_xgG7?usp=sharing) | 1xV100 16GB |
| French | 2023-01-10 | train_fr_hf_whisper.yaml | No | 3.00 | 8.95 | 3.83 | 10.62 | [model](https://drive.google.com/drive/folders/1_iI_G-pMYNeyLsvmHPgNR6gPi8zazkF4?usp=sharing) | 1xV100 16GB |

The output folders with checkpoints and logs can be found [here](https://drive.google.com/drive/folders/11NMzY0zV-NqJmPMyZfC3RtT64bYe-G_O?usp=sharing).
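For reference, the `freeze_encoder: True` setting in the hparams corresponds to the standard PyTorch freezing pattern sketched below (a toy illustration, not the exact SpeechBrain internals):

```python
import torch.nn as nn

def freeze_module(module: nn.Module) -> None:
    # Disable gradient updates: the module's weights stay fixed
    # while the rest of the model is finetuned.
    for p in module.parameters():
        p.requires_grad = False

# Toy encoder-decoder: only the decoder remains trainable,
# mirroring the encoder-frozen Whisper finetuning above.
model = nn.ModuleDict({"encoder": nn.Linear(80, 80), "decoder": nn.Linear(80, 80)})
freeze_module(model["encoder"])
print([n for n, p in model.named_parameters() if p.requires_grad])
# -> ['decoder.weight', 'decoder.bias']
```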

1 change: 1 addition & 0 deletions recipes/CommonVoice/ASR/transformer/extra_requirements.txt
@@ -0,0 +1 @@
transformers
142 changes: 142 additions & 0 deletions recipes/CommonVoice/ASR/transformer/hparams/train_ar_hf_whisper.yaml
@@ -0,0 +1,142 @@
# ################################
# Model: Whisper (Encoder-Decoder) + NLL
# Augmentation: TimeDomainSpecAugment
# Authors: Pooneh Mousavi 2022
# ################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/train_whisper/<seed>/<locale>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# HuggingFace hub name for the Whisper model (use openai/whisper-large-v2 to reproduce the README results).
whisper_hub: openai/whisper-tiny
test_only: False # Set it to True if you only want to do the evaluation

# Normalize inputs with the same normalization done in the paper (https://cdn.openai.com/papers/whisper.pdf). Refer to Appendix C for further information.
normalized_transcripts: True

# Data files
locale: ar # Language of the CommonVoice data (e.g., 'it' for Italian, 'fr' for French, 'en' for English).
data_folder: !PLACEHOLDER
train_tsv_file: !ref <data_folder>/train.tsv # Standard CommonVoice .tsv files
dev_tsv_file: !ref <data_folder>/dev.tsv # Standard CommonVoice .tsv files
test_tsv_file: !ref <data_folder>/test.tsv # Standard CommonVoice .tsv files
accented_letters: True
train_csv: !ref <save_folder>/train.csv
valid_csv: !ref <save_folder>/dev.csv
test_csv: !ref <save_folder>/test.csv
skip_prep: False # Skip data preparation

# We remove utterances longer than 10s in the train/dev/test sets, as
# longer sentences certainly correspond to "open microphones".
avoid_if_longer_than: 10.0

ckpt_interval_minutes: 30 # save checkpoint every N min

# Training parameters
number_of_epochs: 1
lr_whisper: 0.00003
sorting: ascending
auto_mix_prec: False
sample_rate: 16000

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
batch_size: 12
test_batch_size: 8

# These values are only used for the searchers.
# They need to be hardcoded and should not be changed with Whisper.
# They are used as part of the searching process.
# The bos token of the searcher will be timestamp_index
# and will be concatenated with the bos, language and task tokens.
timestamp_index: 50363
eos_index: 50257
bos_index: 50258
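# In the multilingual Whisper vocabulary: 50258 is <|startoftranscript|>,
# 50257 is <|endoftext|>, and 50363 is <|notimestamps|>.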

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 0.1
test_beam_size: 8

# Model parameters
freeze_whisper: False
freeze_encoder: True

train_loader_kwargs:
batch_size: !ref <batch_size>

valid_loader_kwargs:
batch_size: !ref <batch_size>

test_loader_kwargs:
batch_size: !ref <test_batch_size>

#
# Functions and classes
#
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <number_of_epochs>

augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
sample_rate: !ref <sample_rate>
speeds: [95, 100, 105]

whisper: !new:speechbrain.lobes.models.huggingface_whisper.HuggingFaceWhisper
source: !ref <whisper_hub>
freeze: !ref <freeze_whisper>
freeze_encoder: !ref <freeze_encoder>
save_path: !ref <save_folder>/whisper_checkpoint
encoder_only: False

log_softmax: !new:speechbrain.nnet.activations.Softmax
apply_log: True

nll_loss: !name:speechbrain.nnet.losses.nll_loss
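# Note: the training script is expected to pass the relative target lengths
# to this loss so that padded tokens are ignored in the NLL computation.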

modules:
whisper: !ref <whisper>

whisper_opt_class: !name:torch.optim.AdamW
lr: !ref <lr_whisper>
weight_decay: 0.000000001

valid_greedy_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
model: !ref <whisper>
bos_index: !ref <timestamp_index>
eos_index: !ref <eos_index>
min_decode_ratio: !ref <min_decode_ratio>
max_decode_ratio: !ref <max_decode_ratio>

test_beam_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearch
module: [!ref <whisper>]
bos_index: !ref <timestamp_index>
eos_index: !ref <eos_index>
min_decode_ratio: !ref <min_decode_ratio>
max_decode_ratio: !ref <max_decode_ratio>
beam_size: !ref <test_beam_size>

lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NewBobScheduler
initial_value: !ref <lr_whisper>
improvement_threshold: 0.0025
annealing_factor: 0.9
patient: 0

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
whisper: !ref <whisper>
scheduler_whisper: !ref <lr_annealing_whisper>
counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
split_tokens: True
142 changes: 142 additions & 0 deletions recipes/CommonVoice/ASR/transformer/hparams/train_fa_hf_whisper.yaml
@@ -0,0 +1,142 @@
# ################################
# Model: Whisper (Encoder-Decoder) + NLL
# Augmentation: TimeDomainSpecAugment
# Authors: Pooneh Mousavi 2022
# ################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/train_whisper/<seed>/<locale>
wer_file: !ref <output_folder>/wer.txt
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# HuggingFace hub name for the Whisper model (use openai/whisper-large-v2 to reproduce the README results).
whisper_hub: openai/whisper-tiny
test_only: False # Set it to True if you only want to do the evaluation

# Normalize inputs with the same normalization done in the paper (https://cdn.openai.com/papers/whisper.pdf). Refer to Appendix C for further information.
normalized_transcripts: True

# Data files
locale: fa # Language of the CommonVoice data (e.g., 'it' for Italian, 'fr' for French, 'en' for English).
data_folder: !PLACEHOLDER
train_tsv_file: !ref <data_folder>/train.tsv # Standard CommonVoice .tsv files
dev_tsv_file: !ref <data_folder>/dev.tsv # Standard CommonVoice .tsv files
test_tsv_file: !ref <data_folder>/test.tsv # Standard CommonVoice .tsv files
accented_letters: True
train_csv: !ref <save_folder>/train.csv
valid_csv: !ref <save_folder>/dev.csv
test_csv: !ref <save_folder>/test.csv
skip_prep: False # Skip data preparation

# We remove utterances longer than 10s in the train/dev/test sets, as
# longer sentences certainly correspond to "open microphones".
avoid_if_longer_than: 10.0

ckpt_interval_minutes: 30 # save checkpoint every N min

# Training parameters
number_of_epochs: 1
lr_whisper: 0.00003
sorting: ascending
auto_mix_prec: False
sample_rate: 16000

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
batch_size: 12
test_batch_size: 8

# These values are only used for the searchers.
# They need to be hardcoded and should not be changed with Whisper.
# They are used as part of the searching process.
# The bos token of the searcher will be timestamp_index
# and will be concatenated with the bos, language and task tokens.
timestamp_index: 50363
eos_index: 50257
bos_index: 50258
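# In the multilingual Whisper vocabulary: 50258 is <|startoftranscript|>,
# 50257 is <|endoftext|>, and 50363 is <|notimestamps|>.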

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 0.1
test_beam_size: 8

# Model parameters
freeze_whisper: False
freeze_encoder: True

train_loader_kwargs:
batch_size: !ref <batch_size>

valid_loader_kwargs:
batch_size: !ref <batch_size>

test_loader_kwargs:
batch_size: !ref <test_batch_size>

#
# Functions and classes
#
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <number_of_epochs>

augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
sample_rate: !ref <sample_rate>
speeds: [95, 100, 105]

whisper: !new:speechbrain.lobes.models.huggingface_whisper.HuggingFaceWhisper
source: !ref <whisper_hub>
freeze: !ref <freeze_whisper>
freeze_encoder: !ref <freeze_encoder>
save_path: !ref <save_folder>/whisper_checkpoint
encoder_only: False

log_softmax: !new:speechbrain.nnet.activations.Softmax
apply_log: True

nll_loss: !name:speechbrain.nnet.losses.nll_loss
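# Note: the training script is expected to pass the relative target lengths
# to this loss so that padded tokens are ignored in the NLL computation.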

modules:
whisper: !ref <whisper>

whisper_opt_class: !name:torch.optim.AdamW
lr: !ref <lr_whisper>
weight_decay: 0.000000001

valid_greedy_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
model: !ref <whisper>
bos_index: !ref <timestamp_index>
eos_index: !ref <eos_index>
min_decode_ratio: !ref <min_decode_ratio>
max_decode_ratio: !ref <max_decode_ratio>

test_beam_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearch
module: [!ref <whisper>]
bos_index: !ref <timestamp_index>
eos_index: !ref <eos_index>
min_decode_ratio: !ref <min_decode_ratio>
max_decode_ratio: !ref <max_decode_ratio>
beam_size: !ref <test_beam_size>

lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NewBobScheduler
initial_value: !ref <lr_whisper>
improvement_threshold: 0.0025
annealing_factor: 0.9
patient: 0

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
whisper: !ref <whisper>
scheduler_whisper: !ref <lr_annealing_whisper>
counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
split_tokens: True