Fix multi-node DDP training #2101
Merged
NOTE: requires further testing.
`speechbrain.utils.distributed.if_main_process` defines the main process as the one whose global rank (the `RANK` environment variable) equals 0. This works for DDP on a single node, because each process's global rank coincides with its local rank within the node. It fails with multiple nodes, however: I/O operations such as data preparation and fitting the SentencePiece tokenizer run only on the master node (where the process with global rank 0 lives), but not on the worker nodes (where every process has global rank > 0). Intermediate artifacts such as the data manifest files and the SentencePiece checkpoint are therefore created only on the master node, and the worker-node processes fail (e.g. with a `FileNotFoundError`). Checking against the local rank (the `LOCAL_RANK` environment variable) should fix the issue, so that I/O operations run on the main process of each node. See also YunchaoYang/Blogs#3.
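
For illustration, a minimal sketch of the two checks (the helper name `if_local_main_process` is hypothetical, not the actual SpeechBrain API; `RANK` and `LOCAL_RANK` are the environment variables set by launchers such as `torchrun`):

```python
import os

def if_main_process() -> bool:
    # Global rank 0: exactly one process in the whole job.
    # Correct for job-wide work, wrong for per-node I/O.
    return int(os.environ.get("RANK", "0")) == 0

def if_local_main_process() -> bool:
    # Local rank 0: one process *per node*, so node-local I/O
    # (data prep, tokenizer fitting, ...) runs once on every node.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```

On a single node the two checks select the same process, which is why the bug only surfaces in multi-node runs.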