wav2vec2 pretraining implemented with speechbrain #1312
Conversation
(force-pushed from f93e06d to ed3c983)
Latest commit is the result of merging mine and guille's implementations. Best to ignore the CommonVoice files as those are not up to date; I'd like to focus on getting the LibriSpeech implementation done before we update CommonVoice.
TParcollet left a comment:
Thanks for the huge and awesome work! Please find my major comments in this review :-)
(force-pushed from 117e6e9 to 4caef3c, then from 5d9c6cf to e4d33d6)
I've refactored a lot, cleaned things up, and added more docstrings; I think it's worth going through another review. Some things I still plan to do:
TParcollet left a comment:
Thanks again @RuABraun for this amazing job! Here is my review. I think after that, I can start a big training and see. We are close to something.
Also, please resolve the conflict so that I can run the tests :p
There are definitely some changes needed for
Hi @RuABraun, what are the next steps to follow in your view?
PR #1449 needs to be merged first. I recently figured out one thing that was causing bad performance, but there's still one more issue remaining (I get significantly better performance training with 2 GPUs than with 4). Hope to figure that out in the next ~2 weeks (I was busy with interviews in the last ~6 weeks, but thankfully that's over now). Got a few interesting results from trying out different things; I'll post about them in Slack soon.
(force-pushed from 019733b to b1d5c15)
Waiting on #1518.
@TParcollet Noting two things down here to look at in the future for better performance:
(this is not meant for merge yet but to share the code)
This allows one to pretrain a wav2vec2 model without relying on fairseq or HuggingFace. It follows the fairseq implementation, though there are various differences.
Here is a plot showing WER after finetuning vs. the number of pretraining steps, comparing this implementation to fairseq (both with and without quantisation). This is on Italian CommonVoice 7.0, using the validated set for pretraining and the train set for finetuning.
I made a wrapper around the whole w2v2 object to make it easy to hold connecting objects like the masking tensor and various projection layers. The feature extractor and encoder are arguments that can be overridden. Using vector quantisation is toggleable. The implementation follows the most recent fairseq implementation (but not the very recent conformer stuff), so it uses normalise_before=True and layer norm instead of group norm.
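To illustrate the shape of such a wrapper, here is a minimal PyTorch sketch. The class and argument names (`W2V2Pretrainer`, `mask_prob`, `proj_dim`, `use_quantiser`) are assumptions for this sketch and not the actual names in the PR, and span masking plus the Gumbel-softmax quantiser are omitted for brevity.

```python
import torch
import torch.nn as nn


class W2V2Pretrainer(nn.Module):
    """Sketch of a wrapper holding the feature extractor, context encoder,
    mask embedding and the projection layers used by the contrastive
    pretraining objective (illustrative, not the PR's actual class)."""

    def __init__(self, feature_extractor, encoder, feat_dim, enc_dim,
                 proj_dim=256, mask_prob=0.065, use_quantiser=False):
        super().__init__()
        self.feature_extractor = feature_extractor  # e.g. a CNN over raw audio -> (B, T, feat_dim)
        self.encoder = encoder                      # e.g. a transformer encoder -> (B, T, enc_dim)
        # learned embedding that replaces the features at masked positions
        self.mask_emb = nn.Parameter(torch.empty(feat_dim).uniform_())
        # project context outputs and targets into a shared space for the contrastive loss
        self.out_proj = nn.Linear(enc_dim, proj_dim)
        self.target_proj = nn.Linear(feat_dim, proj_dim)
        self.use_quantiser = use_quantiser  # the full implementation uses a Gumbel-softmax codebook
        self.mask_prob = mask_prob

    def _sample_mask(self, batch, steps, device):
        # independent Bernoulli masking per frame; the real recipe masks
        # contiguous spans as in fairseq, which is omitted here
        return torch.rand(batch, steps, device=device) < self.mask_prob

    def forward(self, wav):
        feats = self.feature_extractor(wav)          # (B, T, feat_dim)
        targets = feats.detach()                     # targets come from the unmasked features
        mask = self._sample_mask(feats.size(0), feats.size(1), feats.device)
        feats = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(feats), feats)
        context = self.encoder(feats)                # (B, T, enc_dim)
        preds = self.out_proj(context)
        if self.use_quantiser:
            # placeholder: quantise `targets` with the codebook before projecting
            pass
        targets = self.target_proj(targets)
        return preds, targets, mask
```

The contrastive loss would then compare `preds` against `targets` at the masked positions, using sampled negatives; that part is left out of the sketch.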
One can use the existing finetuning script inside `/ASR/CTC` to train the pretrained model (a rough sketch of that hand-off is shown below).
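As an illustration only (not the actual recipe wiring; the function name and checkpoint path here are hypothetical), one way to hand the pretrained encoder over to the CTC finetuning setup is to export its weights and reload them into the encoder wrapped by the CTC model:

```python
import torch
import torch.nn as nn


def transfer_pretrained_encoder(pretrained_encoder: nn.Module,
                                finetune_encoder: nn.Module,
                                ckpt_path: str = "w2v2_encoder.ckpt") -> None:
    """Save the pretrained encoder weights and load them into the
    (architecturally identical) encoder used for CTC finetuning."""
    torch.save(pretrained_encoder.state_dict(), ckpt_path)
    state = torch.load(ckpt_path, map_location="cpu")
    finetune_encoder.load_state_dict(state)
```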
Minor notes and TODOs: