Conversation

@Adel-Moumen
Collaborator

Hello,

In the file distributed.py, some error messages suggest using --distributed_launch=True, which is misleading because that is not what the user is supposed to do. According to the official SpeechBrain documentation, anyone who wants to use DDP needs to pass --distributed_launch instead of --distributed_launch=True -> https://speechbrain.readthedocs.io/en/v0.5.8/multigpu.html

Moreover, --distributed_launch=True simply does not work. I ran a small experiment on my GPU cluster, training an LSTM on the CommonVoice French recipe of SpeechBrain using 2x V100s:

with --distributed_launch=True I got an error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError ......
whereas --distributed_launch does everything as expected, i.e. it trains the LSTM.
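The observed behavior is consistent with how boolean CLI switches are usually declared in Python. A minimal sketch (this is not SpeechBrain's actual parser; the assumption here is that the flag is declared with argparse's `store_true` action) shows why the bare flag works while `--distributed_launch=True` is rejected:

```python
import argparse

# Hypothetical parser mirroring a store_true boolean switch.
parser = argparse.ArgumentParser()
parser.add_argument("--distributed_launch", action="store_true")

# Bare flag: its mere presence sets the value to True.
args = parser.parse_args(["--distributed_launch"])
print(args.distributed_launch)  # True

# Explicit value: store_true takes no argument, so argparse
# errors out ("ignored explicit argument 'True'") and exits.
try:
    parser.parse_args(["--distributed_launch=True"])
except SystemExit:
    print("argparse rejected --distributed_launch=True")
```

So any error message or recipe comment suggesting `=True` points users toward a form the parser cannot accept.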

I got the same report from a colleague who tried to use DDP and was stuck because the error messages suggested setting the flag to True.

P.S.: Every comment in the recipes that suggests this also needs to be changed, e.g. # If distributed_launch=True in the CommonVoice ASR seq2seq train.py, but I'm waiting for an official reviewer to confirm the problem first.

Best,

@mravanelli mravanelli requested a review from TParcollet July 19, 2022 13:52
@TParcollet
Collaborator

I am aware of that :-( I will fix the doc, but for the recipes ... well ...

@TParcollet left a comment

Approved.
