Fix multi-node DDP training #2101
Merged
NOTE: requires further testing.
`speechbrain.utils.distributed.if_main_process` defines the main process as the one whose global rank (the `RANK` environment variable) equals 0. This works for DDP on a single node, because each process's global rank coincides with its local rank within the node. It fails with multiple nodes, however: I/O operations such as data preparation and fitting the SentencePiece tokenizer run only on the master node (where the process with global rank 0 lives), but not on the worker nodes (where every process has global rank > 0). Intermediate artifacts such as the data manifest files and the SentencePiece checkpoint are therefore created only on the master node, and the worker-node processes fail (e.g. with a `FileNotFoundError`). Checking against the local rank (the `LOCAL_RANK` environment variable) should fix the issue, so that I/O operations run on the main process of each node. See also YunchaoYang/Blogs#3.
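
For illustration, a minimal sketch of the two checks (the helper name `if_local_main_process` is hypothetical, not the actual SpeechBrain API; `RANK` and `LOCAL_RANK` are the environment variables set by launchers such as `torchrun`):

```python
import os

def if_main_process() -> bool:
    # Global rank 0: exactly one process in the whole job.
    # Correct for job-wide work, wrong for per-node I/O.
    return int(os.environ.get("RANK", "0")) == 0

def if_local_main_process() -> bool:
    # Local rank 0: one process *per node*, so node-local I/O
    # (data prep, tokenizer fitting, ...) runs once on every node.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```

On a single node the two checks select the same process, which is why the bug only surfaces in multi-node runs.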