Conversation

@anautsch
Collaborator

As described in #1722, there is an issue with sorting and DDP.

This PR's goal is not to fix all possibilities - the goal is to restore functionality.
A test is not yet implemented (it needs a DDP set-up; that might require a different sort of GitHub workflow altogether).

    shuffle = loader_kwargs.get("shuffle", False)

defines a shuffle variable that is simply re-used for the flags that were redefined in #1518; it defaults to False, so that sorting is kept throughout if wanted.
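
A minimal sketch of the intended wiring (assuming a torch DistributedSampler underneath; not the exact SpeechBrain code, names are illustrative):

    # Hypothetical sketch: pass the dataloader's own "shuffle" kwarg through to
    # the DistributedSampler instead of hard-coding it, so a pre-sorted dataset
    # keeps its order across DDP replicas.
    from torch.utils.data import DistributedSampler

    def make_sampler(dataset, loader_kwargs, rank, world_size):
        shuffle = loader_kwargs.get("shuffle", False)  # default False -> keep sorting
        return DistributedSampler(
            dataset,
            num_replicas=world_size,
            rank=rank,
            shuffle=shuffle,
            drop_last=loader_kwargs.get("drop_last", False),
        )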

@TParcollet
Collaborator

To what is this variable connected? A YAML parameter?

@anautsch
Collaborator Author

To what is this variable connected? A YAML parameter?

Yeah, kinda - but it might conflict with some other concept underneath.
Maybe the idea is bogus. Need to test - let's use this as a playground to get something working.

Well, actually, the sorting flag is the flag?

@anautsch
Collaborator Author

Just set it to "False" instead; this simply reverts the other PR that caused the bug, without needing to rethink the recipe & hparam logic.

@anautsch
Collaborator Author

anautsch commented Nov 30, 2022

Added a test and an initial fix outline. The test uses a CPU DDP backend, gloo.

I run into these nearly-sorted orderings with sorting & DDP:

tensor([2.2050, 2.3400, 2.6000, 2.8700, 3.1050, 3.1500, 3.2200, 3.2800, 3.4550,
        3.6500, 3.7000, 4.2650, 4.4701, 4.7600, 4.8350, 4.8900, 5.1600, 6.1450,
        1.7750], dtype=torch.float64)

tensor([6.1450, 5.1600, 4.8900, 4.8350, 4.7600, 4.4701, 4.2650, 3.7000, 3.6500,
        3.4550, 3.2800, 3.2200, 3.1500, 3.1050, 2.8700, 2.6000, 2.3400, 2.2050,
        6.2000], dtype=torch.float64)

Edit: also, only 38/100 examples are sampled when DDP uses sorting, which should not happen.

Without DDP (and with DDP + shuffling), all works fine.

@anautsch
Collaborator Author

anautsch commented Dec 1, 2022

The test needed to be adjusted to DDP & PyTorch's DistributedSampler, which has:

        drop_last (bool, optional): if ``True``, then the sampler will drop the
            tail of the data to make it evenly divisible across the number of
            replicas. If ``False``, the sampler will add extra indices to make
            the data evenly divisible across the replicas. Default: ``False``.

The test works for the proposed fix on DDP & sorting. Because of drop_last & the replicas, the final elements of the last batch won't be sorted within that batch. Yet, DDP seems to distribute the batches with replicas, BUT

  • for sorted (filtered) datasets, the total number of examples does not change;
  • for random (shuffled) data, the number of examples increases, as replicas are included.

The sorting asserts are thus tailored to the dummy data (and a bit lazy, with "[:-1]" subset testing across all batches instead of the final batch only).
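
For illustration, a toy sketch (assumed setup, not the actual test) of the padding behaviour described above:

    # With drop_last=False, DistributedSampler pads the index list with repeated
    # indices so the data is evenly divisible across replicas; on a sorted
    # dataset the padded index lands out of order at the tail of one rank.
    from torch.utils.data import DistributedSampler

    durations = [1.0 + 0.5 * i for i in range(19)]  # 19 sorted dummy durations

    for rank in range(2):
        sampler = DistributedSampler(
            durations, num_replicas=2, rank=rank, shuffle=False, drop_last=False
        )
        print(rank, [durations[i] for i in sampler])
    # rank 1 gets its sorted half plus durations[0] appended at the end - the
    # same kind of stray short utterance seen at the tail of the sorted batch above.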

Ready for review.

Collaborator

@TParcollet TParcollet left a comment

See comments.

"Killing process " + str() + "\n"
"Not enough GPUs available!"
)
if not run_opts["distributed_backend"] == "gloo":
Collaborator

Again, why is gloo treated like that? :p

Collaborator Author

CPU vs GPU - there is no GPU testing on GitHub, unless you pay for it :p

Collaborator Author

@anautsch anautsch Dec 2, 2022

The way it is implemented for GPU is incompatible with CPU: gloo will crash when run on CPU only.

This PyTorch example for gloo is incompatible with SB, as it was before the contributed fix:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
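
For reference, a minimal sketch (assumed, not the PR's test code) of how a CPU-only gloo process group can be spawned for such a test:

    # Spawn two CPU processes and initialise a gloo process group in each;
    # no GPU is required, which matches what the GitHub runners provide.
    import os
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)
        # ... run the sorted-dataloader checks on this rank ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(2,), nprocs=2)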

Collaborator Author

For this part: if there is only a CPU, of course there are not enough GPUs ;)

Collaborator

@TParcollet TParcollet left a comment

LGTM

@TParcollet TParcollet merged commit b5c223c into speechbrain:develop Dec 2, 2022