fix sorting bug #1730
Conversation
To what is this variable connected? A YAML parameter?
Yeah, kinda - but it might conflict with some other concept beneath. Well, actually, is the sorting flag the flag?
Just set it to "False" instead; this simply reverts the other PR that caused the bug w/o needing to rethink recipe & hparam logic.
Added a test and an initial fix outline. The test uses a CPU DDP backend; I run into these nearly-sorted orderings w/ sorting & DDP. Edit: also, only 38/100 examples are sampled when DDP uses sorting, which should not happen. W/o DDP (and with shuffling DDP), all works fine.
The test needed to be adjusted to DDP & pytorch's DistributedSampler. The test works for the proposed fix on DDP & sorting. Because of drop_last & the replicas, the final elements of the last batch won't be sorted within that batch; yet DDP does seem to distribute the batches w/ replicas.
The sorting asserts are thus tailored to the dummy data (and a bit lazy, with "[:-1]" subset testing across all batches instead of the final batch only). Ready for review.
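A minimal, standalone sketch (not the PR's actual test; the dataset and replica counts are illustrative) of why the tail breaks sortedness: with drop_last=False, pytorch's DistributedSampler pads the index list by wrapping around to the first indices, so the last element a padded rank receives can be smaller than its predecessors - which is what the lazy "[:-1]" assert works around.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# 10 sorted "durations"; 10 is not divisible by num_replicas=4, so the
# sampler pads with wrapped-around indices (0 and 1) to even things out.
dataset = TensorDataset(torch.arange(10))

for rank in range(4):
    sampler = DistributedSampler(
        dataset, num_replicas=4, rank=rank, shuffle=False, drop_last=False
    )
    indices = list(sampler)  # e.g. rank 2 gets [2, 6, 0]: unsorted tail
    # Lazy assert in the spirit of the "[:-1]" subset check: everything
    # except the final (possibly padded) element must still be sorted.
    assert indices[:-1] == sorted(indices[:-1])
    print(rank, indices)
```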
TParcollet
left a comment
See comments.
| "Killing process " + str() + "\n" | ||
| "Not enough GPUs available!" | ||
| ) | ||
| if not run_opts["distributed_backend"] == "gloo": |
Again, why is gloo treated like that? :p
CPU vs GPU - there is no GPU testing on GitHub, or you'd have to pay for it :p
The way it is implemented for GPU is incompatible with CPU: gloo will crash when run CPU-only.
This pytorch example for gloo is incompatible with SB, as it was before the contributed fix:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
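For context, a rough sketch of the distinction being discussed (the helper name is hypothetical, not SpeechBrain's actual launcher code): gloo runs on CPU, while GPU backends such as nccl need one device per process, so a "not enough GPUs" check only makes sense when the backend isn't gloo.

```python
import os
import torch
import torch.distributed as dist

def init_distributed(backend: str, rank: int, world_size: int):
    # Hypothetical helper, just to illustrate the gloo special case.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if backend != "gloo":
        # GPU backends (e.g. nccl) need a device per process; gloo does not.
        if torch.cuda.device_count() < world_size:
            raise RuntimeError("Not enough GPUs available!")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```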
For this part: if there is only a CPU, ofc there are not enough GPUs ;)
TParcollet
left a comment
LGTM
As described in #1722, there is an issue with sorting and DDP.
This opens up a PR. The goal is not to fix all possibilities - the goal is to restore functionality.
Testing is not yet implemented (it needs a DDP set-up; that might be a different sort of GitHub workflow altogether).
speechbrain/speechbrain/core.py (line 726 in c229dbc) defines a shuffle variable that is simply re-used for the flags that were redefined in #1518; it defaults to False, such that sorting is kept, if wanted, throughout.
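As a rough sketch of the idea (simplified; the helper below is hypothetical, not core.py's actual code), the flag can gate the DistributedSampler so a recipe-sorted dataset keeps its order under DDP:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def make_ddp_dataloader(dataset, loader_kwargs):
    # Assumes dist.init_process_group has already been called.
    # Defaulting shuffle to False keeps sorted data sorted under DDP;
    # recipes that want shuffling can opt in explicitly.
    shuffle = loader_kwargs.pop("shuffle", False)
    sampler = DistributedSampler(dataset, shuffle=shuffle)
    return DataLoader(dataset, sampler=sampler, **loader_kwargs)
```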