Change needed in Whisper fine-tuning recipe to accommodate transformers 4.30.0 #2016
Conversation
Hello @sangeet2020, thanks for the PR. Is it backwards compatible with torch 1.13?
Hello @Adel-Moumen, after quite a lot of experiments, it seems that the issue is not with torch, but with transformers. Here are my observations:
I would request you to try these settings once, so we can be sure. Let me know if you need the log files for each. Thank you
```diff
  filters = self._mel_filters
- mel_spec = filters @ magnitudes
+ mel_spec = filters.transpose(0, 1) @ magnitudes
```
I don't understand why it would come from the transformers library... To me, it seems to be related to the torch library...
What torch and transformers versions are you using? I'll try the exact same versions.
Can you try installing torch 2.0 with transformers 4.26 vs. transformers 4.32? I can confirm that I kept the torch version fixed throughout my experiments while varying the transformers version, and the error vanishes as per the table above.
I attach the log files for your reference:
- log_transformer==4.30.2_script_changed.txt
- log_transformer==4.30.2_script_notchanged.txt
- log_transformer==4.26.0_script_notchanged.txt
In transformers >= 4.30, they changed how they compute feature_extractor.mel_filters. As a result, in transformers >= 4.30 the shape of feature_extractor.mel_filters is (201, 80), while in previous versions it was (80, 201). This causes a problem in our _log_mel_spectrogram function (which we copied from OpenAI) when we calculate mel_spec = filters @ magnitudes: it expects the filters to be (80, 201).
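To make the mismatch concrete, here is a minimal NumPy sketch (random filter values, standard Whisper dimensions n_mels=80, n_freqs=201, n_frames=3000) of why the old matmul fails with the new filter layout:

```python
import numpy as np

n_mels, n_freqs, n_frames = 80, 201, 3000  # standard Whisper dimensions
magnitudes = np.random.rand(n_freqs, n_frames)

# transformers < 4.30: mel_filters has shape (n_mels, n_freqs) = (80, 201)
filters_old = np.random.rand(n_mels, n_freqs)
mel_spec = filters_old @ magnitudes  # (80, 201) @ (201, 3000) -> (80, 3000), OK

# transformers >= 4.30: mel_filters has shape (n_freqs, n_mels) = (201, 80),
# so the old matmul would raise a shape mismatch; transposing first
# (the fix in this PR) restores the expected (80, 3000) result
filters_new = filters_old.T
mel_spec_new = filters_new.T @ magnitudes
```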
Thanks Pooneh, I see. Do you know why they changed their feature_extractor.mel_filters?
It is related to this PR on transformers:
huggingface/transformers#21998
Basically, they replace the hand-rolled STFT in the different models, including Whisper, with the one from audio_utils.
In transformers <= 4.28:
To get self.mel_filters, they use a Whisper-specific function, get_mel_filters in transformers/models/whisper/feature_extraction_whisper.py, which returns a (n_mels, n_freqs) = (80, 201) array.
Then, to calculate the log-Mel spectrogram, they call their own stft function in transformers/models/whisper/feature_extraction_whisper.py and finally _np_extract_fbank_features, which is basically the same as our _log_mel_spectrogram function in speechbrain/lobes/models/huggingface_whisper.py (itself the same as the OpenAI function).
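For reference, the OpenAI-style computation that both of those functions implement looks roughly like this sketch (simplified names; filters assumed to be in the old (n_mels, n_freqs) layout):

```python
import numpy as np

def log_mel_spectrogram(magnitudes: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Sketch of the OpenAI-style log-Mel computation.

    magnitudes: (n_freqs, n_frames) power spectrogram
    filters: (n_mels, n_freqs) Mel filter bank (old transformers layout)
    """
    mel_spec = filters @ magnitudes                        # (n_mels, n_frames)
    log_spec = np.log10(np.clip(mel_spec, 1e-10, None))    # floor to avoid log(0)
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)  # compress dynamic range
    return (log_spec + 4.0) / 4.0                          # normalize
```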
In transformers >= 4.29:
To get self.mel_filters, they use the mel_filter_bank function in transformers/audio_utils.py.
It is equivalent to the following torchaudio function:
https://pytorch.org/audio/main/generated/torchaudio.functional.melscale_fbanks.html
This returns a (n_freqs, n_mels) = (201, 80) array.
Then they call the spectrogram function in audio_utils.py to generate the log-Mel spectrogram, transposing mel_filters:
spectrogram = np.maximum(mel_floor, np.dot(mel_filters.T, spectrogram))
This function is quite similar to MelScale in torchaudio:
https://pytorch.org/audio/main/generated/torchaudio.transforms.MelScale.html
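A small NumPy sketch of that newer path (random data; the mel_floor value is an assumption), showing how the (n_freqs, n_mels) layout gets transposed inside the spectrogram step:

```python
import numpy as np

n_freqs, n_mels, n_frames = 201, 80, 10
mel_filters = np.random.rand(n_freqs, n_mels)  # new (n_freqs, n_mels) layout
spec = np.random.rand(n_freqs, n_frames)       # power spectrogram

mel_floor = 1e-10  # assumed floor value
# The transpose happens inside audio_utils.spectrogram:
mel_spec = np.maximum(mel_floor, np.dot(mel_filters.T, spec))  # (n_mels, n_frames)
log_spec = np.log10(mel_spec)
```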
Final points:
- It seems they are trying to unify all audio functions across the different models and use the same implementation as torchaudio and librosa. These models are affected by this change:
- CLAP
- M-CTC-T
- SpeechT5
- TVLT
- Whisper
- Based on the implementation, I think the fix suggested by @sangeet2020 makes sense.
- I am going to run all CommonVoice recipes for the major release. Maybe it would be better to merge this change so I can test the Whisper recipes and make sure no bugs are introduced by it.
@Adel-Moumen and @sangeet2020, I am thinking about applying the transpose only if the shape is not (n_mels, n_freqs), so it would also work with older versions of transformers.
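A sketch of that idea (hypothetical helper name, NumPy for brevity): transpose only when the filters do not already come in (n_mels, n_freqs) layout, so both old and new transformers versions are handled:

```python
import numpy as np

def to_mels_first(mel_filters: np.ndarray, n_mels: int = 80) -> np.ndarray:
    # Hypothetical helper: normalize the filter layout to (n_mels, n_freqs),
    # regardless of which transformers version produced the filters.
    if mel_filters.shape[0] != n_mels:
        mel_filters = mel_filters.T  # new layout (n_freqs, n_mels) -> transpose
    return mel_filters
```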
I think we need to proceed with that and merge it after doing some final tests. My understanding is that the problem is not PyTorch 2.0 but the version of transformers used. In that case, I think we should also change the extra-requirements file in the recipes that use Whisper (e.g., https://github.com/speechbrain/speechbrain/blob/develop/recipes/CommonVoice/ASR/transformer/extra_requirements.txt). @poonehmousavi, could you please proceed with it and do the last tests? When you think everything is fine, we can merge it (we need it for CL-MASR as well).
@mravanelli Yes, the problem is transformers; with the proposed solution, it works with the new version of transformers. I have tested it on a small dataset. I will test it more thoroughly while running the Whisper experiments for CommonVoice.
Also, it looks like there are many places where transformers is added as an additional dependency (e.g., https://github.com/speechbrain/speechbrain/blob/develop/recipes/LibriSpeech/ASR/CTC/extra_requirements.txt). Do you think all these recipes might be affected somehow by the change made in transformers >= 4.30?
These transformer models are affected by this change: CLAP, M-CTC-T, SpeechT5, TVLT, and Whisper.
Hello @poonehmousavi, the reason was that it was extremely slow. This is why we decided to use OpenAI's implementation: it was way faster.
They changed their function and basically use the torchaudio one now, so I guess it might be faster. Another solution is to use their own function instead of transposing.
Yes, I agree. Could you please try and compare?
Sure. I will try that.
I think the proposed solution is not the best one. When trying to run inference with a Whisper model trained on SpeechBrain using transformers 4.30.2, I get a size mismatch while loading the state dictionary for the HuggingFaceWhisper module of the SB pre-trained model. This error comes up when doing parameter transfer. The error is:
@poonehmousavi |
For sure it won't work, because it tries to load a Whisper checkpoint trained with an older version of transformers, which saved the _mel_filters, and that causes the mismatch. I am going to update the checkpoints soon using the new versions.
No, even a Whisper checkpoint trained on the latest transformers version doesn't work for me.
I pushed the solution implemented by @lucadellalib in the CL_MASR benchmark. This solution is more backward compatible.
I would like to take this opportunity to bring attention to the poor performance I observed while using our Whisper interface. For example, when using the French model (https://huggingface.co/speechbrain/asr-whisper-large-v2-commonvoice-fr), the code snippet I ran demonstrates the issue: the transcription output is clearly wrong. The same issue occurs with the German language model: once again, the transcription is incorrect. These observations lead me to suspect that we may have a consistent issue across all the Whisper HF models we release. I believe there might be a problem with our interface that prevents it from performing the same computations as during training and validation in the fine-tuning recipes. @poonehmousavi, any idea? This is pretty crucial, as our Whisper interfaces appear to be functioning incorrectly.
@mravanelli @poonehmousavi I just pushed a better fix that should solve the remaining backward-compatibility issues, even with previously trained models.
I have checked the proposed solution. It works with the new version and is backward compatible with previous versions. Previously trained models can be used with this change as well; therefore, I think we can merge this pull request.
I did some final tests and I confirm that everything is working and backward compatible. Thank you all for contributing to this fix!
Thanks a lot @lucadellalib for making the fix and @poonehmousavi for doing the needed tests. |

Change needed in Whisper fine-tuning recipe to accommodate the latest release of transformers > 4.30

Hi SB team,
I opened a PR with minor fixes to the main Whisper fine-tuning script to accommodate changes in the latest PyTorch release, 2.0. After the most recent pull of the develop branch with the latest torch version, I have been experiencing this issue.
To reproduce this issue (choose any dataset), use torch==2.0.0; the detailed env.log is attached.
NOTE: This issue does not occur (hence, is not reproducible) with torch==1.11.0+cu113.
The changes I propose have been tested with the exact versions mentioned above in the env.log. Following these changes, the error mentioned above vanishes.
Thank you