Conversation

@pradnya-git-dev
Collaborator

@pradnya-git-dev pradnya-git-dev commented Aug 13, 2023

Contribution in a nutshell

Hey, this PR could help our community work with zero-shot multi-speaker text-to-speech (TTS).

Scope

  • Extending single-speaker Tacotron2 with speaker encoding capabilities to generate speech based on input text and speaker identity (embedding)

@mravanelli mravanelli added the enhancement New feature or request label Aug 13, 2023
@mravanelli mravanelli self-requested a review August 15, 2023 14:35
@mravanelli mravanelli marked this pull request as ready for review September 22, 2023 13:35
@mravanelli
Collaborator

Thank you @pradnya-git-dev for submitting this PR! Your contribution is greatly appreciated as it adds a valuable feature to SpeechBrain. Below are my comments and suggestions:

Readme Updates:

  1. Training Time and GPU Requirements: Please consider adding information to the README about the expected training time and GPU requirements for both Zero-Shot Multi-Speaker Tacotron2 and the HiFi-GAN vocoder. This will help users plan their compute before running the recipes.

  2. Best Model in SpeechBrain HF Repo: It would be beneficial to place the current best model in the SpeechBrain Hugging Face (HF) repository and mark it as a work in progress. Having it there will help with pretraining in our future work. Additionally, please upload the current best model to the SpeechBrain Dropbox. For detailed instructions on this, please contact me privately.

Recipe Test Failures:

  1. Test Failures: There are test failures in the recipe tests. Specifically, there's an issue with the LibriTTS recipe test, which is failing due to a KeyError ('LJ050-0131'). Please investigate and resolve this issue.
     python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Dataset"], filters=[["LibriTTS"]], do_checks=True, run_opts="--device=cuda")) else print("TEST PASSED")'

     ERROR: Error in LibriTTS_row_03 (recipes/LibriTTS/TTS/mstacotron2/hparams/train.yaml). Check tests/tmp/LibriTTS_row_03/stderr.txt and tests/tmp/LibriTTS_row_03/stdout.txt for more info.
     TEST FAILED!

     spk_emb = speaker_embeddings[raw_batch[idx]["uttid"]]
     KeyError: 'LJ050-0131'
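
One way to narrow down a failure like this is to check, before training, that every utterance ID in the data has a precomputed speaker embedding. The sketch below is only an illustration under stated assumptions: `find_missing_embeddings` and `lookup_embedding` are hypothetical helpers rather than SpeechBrain functions, the embeddings are assumed to be a plain dict keyed by utterance ID (as the failing line suggests), and the LibriTTS-style ID is made up. The missing key resembles an LJSpeech utterance ID, so a mismatch between the test data and the embeddings file is one possible cause, though the log alone does not confirm it.

```python
def find_missing_embeddings(utt_ids, speaker_embeddings):
    """Return the utterance IDs that have no precomputed speaker embedding."""
    return [utt_id for utt_id in utt_ids if utt_id not in speaker_embeddings]


def lookup_embedding(utt_id, speaker_embeddings):
    """Fetch one embedding, failing with a message that hints at the likely cause."""
    try:
        return speaker_embeddings[utt_id]
    except KeyError:
        raise KeyError(
            f"No speaker embedding for utterance '{utt_id}'. "
            "The embeddings may have been computed for a different dataset "
            "or split than the one being loaded."
        ) from None


# Toy data (assumption): embeddings keyed by a made-up LibriTTS-style ID,
# while the batch also contains the ID from the failing test.
speaker_embeddings = {"1034_121119_000001_000001": [0.1, 0.2]}
batch_ids = ["1034_121119_000001_000001", "LJ050-0131"]

missing = find_missing_embeddings(batch_ids, speaker_embeddings)
print(missing)  # ['LJ050-0131']
```

Running a check of this kind over the whole manifest once, before the first batch is built, reports every mismatched ID at the same time instead of stopping at the first `KeyError`.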

Script Redundancy:

  1. Redundancy in Training Scripts: I'm curious why we need to redefine the training script for VoxCeleb/SpeakerRec (e.g., train_ecapa_tdnn_mel_spec.yaml and train_speaker_embeddings_mel_spec.py). If possible, the best option would be to reuse the existing train_speaker_embeddings.py script with minor modifications. This would reduce code duplication and maintenance effort.

Code Optimization:

  1. Minimize Code Redundancy: There might be a significant overlap between speechbrain/lobes/models/MSTacotron2.py and Tacotron2.py. If not already done, consider redefining in speechbrain/lobes/models/MSTacotron2.py only the classes that need to be modified for the injection of speaker embeddings. This approach will help us minimize code redundancy and maintain cleaner code.
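
The suggestion above can be sketched with plain Python classes. This is a toy illustration, not the SpeechBrain code: `Tacotron2Like` and `MultiSpeakerTacotron2Like` are hypothetical stand-ins, the "encoder" and "decoder" are trivial list operations, and concatenating the speaker embedding onto each encoder step is only one possible injection strategy, assumed here for illustration. The point is that the subclass overrides only the step that changes and inherits everything else.

```python
class Tacotron2Like:
    """Hypothetical stand-in for a single-speaker model (not the SpeechBrain class)."""

    def encode(self, text_ids):
        # Toy "encoder": one 2-dimensional vector per input token.
        return [[float(t), 0.0] for t in text_ids]

    def condition(self, encoder_outputs, **kwargs):
        # Single-speaker model: nothing to inject.
        return encoder_outputs

    def decode(self, encoder_outputs):
        # Toy "decoder": collapse each vector to one value per step.
        return [sum(vec) for vec in encoder_outputs]

    def forward(self, text_ids, **kwargs):
        return self.decode(self.condition(self.encode(text_ids), **kwargs))


class MultiSpeakerTacotron2Like(Tacotron2Like):
    """Overrides only the conditioning step; encode/decode/forward are inherited."""

    def condition(self, encoder_outputs, spk_emb=None, **kwargs):
        if spk_emb is None:
            return encoder_outputs
        # Assumed strategy: append the same speaker embedding to every step.
        return [vec + list(spk_emb) for vec in encoder_outputs]


single = Tacotron2Like().forward([1, 2])
multi = MultiSpeakerTacotron2Like().forward([1, 2], spk_emb=[0.5, 0.5])
print(single)  # [1.0, 2.0]
print(multi)   # [2.0, 3.0]
```

With this layout, a fix to the shared `encode` or `decode` logic is made once in the base class and picked up by the multi-speaker variant automatically.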

@mravanelli
Collaborator

Thank you @pradnya-git-dev for working on this PR! It is an important first step toward zero-shot TTS in SpeechBrain. The quality of the generated speech can be improved, but we will do that in a follow-up PR.

@mravanelli mravanelli merged commit e0d43d9 into speechbrain:develop Oct 19, 2023