
Conversation

@nie3e
Contributor

@nie3e nie3e commented Sep 11, 2024

What does this PR do?

Fixes #2638

Before iterating over the batches, it simply moves the tensor to the CPU and converts it to a NumPy array for faster iteration, then converts the result back to a CPU tensor.
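For context, a minimal sketch of the workaround described above (function name, shapes, and thresholds are illustrative, not the actual PR code):

```python
import numpy as np
import torch


def apply_threshold_cpu_workaround(vad_prob: torch.Tensor,
                                   activation_th: float = 0.5,
                                   deactivation_th: float = 0.25) -> torch.Tensor:
    """Hysteresis thresholding done on the CPU via NumPy.

    ``vad_prob`` is assumed to be [batch, time]. Moving to CPU once avoids a
    GPU sync per element access; the result is returned as a CPU tensor,
    matching the behavior described in the PR.
    """
    scores = vad_prob.detach().cpu().numpy()  # single transfer instead of per-element syncs
    vad_th = np.zeros_like(scores)
    for batch in range(scores.shape[0]):
        active = False
        for i in range(scores.shape[1]):
            if scores[batch, i] >= activation_th:
                active = True          # frame above activation threshold: activate
            elif scores[batch, i] < deactivation_th:
                active = False         # frame below deactivation threshold: deactivate
            vad_th[batch, i] = 1.0 if active else 0.0  # in-between frames keep the last state
    return torch.from_numpy(vad_th)
```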

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@asumagic asumagic self-assigned this Sep 11, 2024
Collaborator

@asumagic asumagic left a comment


The workaround seems OK in itself, but I am wary of this change because it means the return value lives on the CPU device rather than CUDA. We would have to test that it doesn't break any consumer of `apply_threshold`.

Maybe there is a way to express the loop as a tensor operation of some sort? Will have to look into it.

@nie3e
Contributor Author

nie3e commented Sep 16, 2024

I am not sure this is possible without looping.

`vad_th = vad_activation + vad_deactivation`

is a tensor consisting of 0s, 1s, and 2s. After encountering a value of 2, the loop changes subsequent 1s into 2s; likewise, after a 0, it changes subsequent 1s into 0s.
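To illustrate the propagation being described (the values below are made up; 2 = above the activation threshold, 1 = between the two thresholds, 0 = below the deactivation threshold):

```python
import torch

# Made-up 0/1/2 tensor as produced by vad_activation + vad_deactivation.
vad_th = torch.tensor([1, 2, 1, 1, 0, 1, 2, 1])

for i in range(1, len(vad_th)):
    if vad_th[i] == 1:              # "undecided" frames inherit the last decision:
        vad_th[i] = vad_th[i - 1]   # a 1 after a 2 becomes 2, a 1 after a 0 becomes 0

print(vad_th)  # tensor([1, 2, 2, 2, 0, 0, 2, 2])
```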

I can check the device that the input tensor is on and put it back on that device before returning the result.

@asumagic
Collaborator

Could you check if something like this solves your performance issue?

```python
# whether the n-th frame stays at or above the deactivation threshold
# (i.e. does not trigger deactivation)
frame_does_not_deactivate = values >= deactivation_th

# always start keeping frames over activation threshold activated
vad_th = values >= activation_th

# iterate over the time dimension (not the batch dimension)
for i in range(1, values.shape[-1]):
    vad_th[..., i] |= vad_th[..., i - 1]  # if the past frame was activated, keep activated
    vad_th[..., i] &= frame_does_not_deactivate[..., i]  # ... unless the i-th frame deactivates
```

It eliminates the batch iteration and should avoid CPU/GPU syncs. It probably fires a ton of kernel launches on CUDA but that might be ok...

@nie3e
Contributor Author

nie3e commented Sep 18, 2024

> Could you check if something like this solves your performance issue?
>
> ```python
> # whether the n-th frame stays at or above the deactivation threshold
> # (i.e. does not trigger deactivation)
> frame_does_not_deactivate = values >= deactivation_th
>
> # always start keeping frames over activation threshold activated
> vad_th = values >= activation_th
>
> # iterate over the time dimension (not the batch dimension)
> for i in range(1, values.shape[-1]):
>     vad_th[..., i] |= vad_th[..., i - 1]  # if the past frame was activated, keep activated
>     vad_th[..., i] &= frame_does_not_deactivate[..., i]  # ... unless the i-th frame deactivates
> ```
>
> It eliminates the batch iteration and should avoid CPU/GPU syncs. It probably fires a ton of kernel launches on CUDA but that might be ok...

It works! Execution times are very close to my workaround!

@asumagic
Collaborator

Ok, great to hear. I'll sanity check it before merging later but it seemed correct from my testing.

Returning a bool tensor doesn't break the consumers of `apply_threshold` inside of `VAD`. External consumers should largely be unaffected if they even exist, and it would be a trivial fix for them.

@asumagic
Collaborator

asumagic commented Sep 19, 2024

Heads up that I've tested this with speechbrain/vad-crdnn-libriparty and it seems to alter the detected boundaries slightly. I'll look into it.

@asumagic
Collaborator

I have pushed some fixed code and the code now emits the same boundaries as before. There were issues around shape handling.

Could you check if the performance is still OK? It was not iterating properly with the previous code. Sorry for the trouble.

@nie3e
Contributor Author

nie3e commented Sep 19, 2024

> I have pushed some fixed code and the code now emits the same boundaries as before. There were issues around shape handling.
>
> Could you check if the performance is still OK? It was not iterating properly with the previous code. Sorry for the trouble.

I made some measurements for my 6.5 h file:
  • Workaround with offloading to CPU: ~29 s
  • First loop elimination: ~31 s
  • Last fix: ~115 s

The performance has decreased a bit, but it is still much better than the original ~20 minutes :)

The previous code did result in far fewer kernel invocations, but it seems that doing it on the CPU is still significantly faster.

@asumagic asumagic merged commit df160c7 into speechbrain:develop Oct 9, 2024


Successfully merging this pull request may close these issues.

VAD - very slow "apply_threshold" when running on cuda device + possible solution
