
Conversation

@nie3e
Contributor

@nie3e nie3e commented Sep 11, 2024

What does this PR do?

Fixes #2638

Before iterating over the batches, it simply moves the tensor to the CPU and converts it to a NumPy array for faster iteration, then converts the result back to a CPU tensor.
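For context, a minimal sketch of the workaround described above (function name, shapes, and thresholds are illustrative, not the actual PR code):

```python
import numpy as np
import torch


def apply_threshold_cpu_workaround(vad_prob: torch.Tensor,
                                   activation_th: float = 0.5,
                                   deactivation_th: float = 0.25) -> torch.Tensor:
    """Hysteresis thresholding done on the CPU via NumPy.

    ``vad_prob`` is assumed to be [batch, time]. Moving to CPU once avoids a
    GPU sync per element access; the result is returned as a CPU tensor,
    matching the behavior described in the PR.
    """
    scores = vad_prob.detach().cpu().numpy()  # single transfer instead of per-element syncs
    vad_th = np.zeros_like(scores)
    for batch in range(scores.shape[0]):
        active = False
        for i in range(scores.shape[1]):
            if scores[batch, i] >= activation_th:
                active = True          # frame above activation threshold: activate
            elif scores[batch, i] < deactivation_th:
                active = False         # frame below deactivation threshold: deactivate
            vad_th[batch, i] = 1.0 if active else 0.0  # in-between frames keep the last state
    return torch.from_numpy(vad_th)
```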

Before submitting
  • Did you read the contributor guideline?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Does your code adhere to project-specific code style and conventions?

PR review

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Confirm that the changes adhere to compatibility requirements (e.g., Python version, platform)
  • Review the self-review checklist to ensure the code is ready for review

@asumagic asumagic self-assigned this Sep 11, 2024
Collaborator

@asumagic asumagic left a comment


The workaround seems OK in itself, but I am wary of this change because it means the return value lives on the CPU device rather than CUDA. We would have to test that it doesn't break any consumer of `apply_threshold`.

Maybe there is a way to express the loop as a tensor operation of some sort? Will have to look into it.

@nie3e
Contributor Author

nie3e commented Sep 16, 2024

I am not sure this is possible without looping.

`vad_th = vad_activation + vad_deactivation`

is a tensor consisting of 0s, 1s, and 2s. After encountering a value of 2, the loop changes subsequent 1s into 2s; likewise, after a 0, it changes subsequent 1s into 0s.
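To illustrate the propagation being described (the values below are made up; 2 = above the activation threshold, 1 = between the two thresholds, 0 = below the deactivation threshold):

```python
import torch

# Made-up 0/1/2 tensor as produced by vad_activation + vad_deactivation.
vad_th = torch.tensor([1, 2, 1, 1, 0, 1, 2, 1])

for i in range(1, len(vad_th)):
    if vad_th[i] == 1:              # "undecided" frames inherit the last decision:
        vad_th[i] = vad_th[i - 1]   # a 1 after a 2 becomes 2, a 1 after a 0 becomes 0

print(vad_th)  # tensor([1, 2, 2, 2, 0, 0, 2, 2])
```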

I can check the device that the input tensor is on and put it back on that device before returning the result.

@asumagic
Collaborator

Could you check if something like this solves your performance issue?

```python
# whether the n-th frame stays at or above the deactivation threshold
# (i.e. does not trigger deactivation)
frame_does_not_deactivate = values >= deactivation_th

# always start keeping frames over activation threshold activated
vad_th = values >= activation_th

# iterate over the time dimension (not the batch dimension)
for i in range(1, values.shape[-1]):
    vad_th[..., i] |= vad_th[..., i - 1]  # if the past frame was activated, keep activated
    vad_th[..., i] &= frame_does_not_deactivate[..., i]  # ... unless the i-th frame deactivates
```

It eliminates the batch iteration and should avoid CPU/GPU syncs. It probably fires a ton of kernel launches on CUDA but that might be ok...

@nie3e
Contributor Author

nie3e commented Sep 18, 2024

> Could you check if something like this solves your performance issue?
>
> ```python
> # whether the n-th frame stays at or above the deactivation threshold
> # (i.e. does not trigger deactivation)
> frame_does_not_deactivate = values >= deactivation_th
>
> # always start keeping frames over activation threshold activated
> vad_th = values >= activation_th
>
> # iterate over the time dimension (not the batch dimension)
> for i in range(1, values.shape[-1]):
>     vad_th[..., i] |= vad_th[..., i - 1]  # if the past frame was activated, keep activated
>     vad_th[..., i] &= frame_does_not_deactivate[..., i]  # ... unless the i-th frame deactivates
> ```
>
> It eliminates the batch iteration and should avoid CPU/GPU syncs. It probably fires a ton of kernel launches on CUDA but that might be ok...

It works! Execution times are very close to my workaround!

@asumagic
Collaborator

Ok, great to hear. I'll sanity check it before merging later but it seemed correct from my testing.

Returning a bool tensor doesn't break the consumers of `apply_threshold` inside of `VAD`. External consumers should largely be unaffected if they even exist, and it would be a trivial fix for them.

@asumagic
Collaborator

asumagic commented Sep 19, 2024

Heads up that I've tested this with speechbrain/vad-crdnn-libriparty and it seems to alter the detected boundaries slightly. I'll look into it.

@asumagic
Collaborator

I have pushed some fixed code and the code now emits the same boundaries as before. There were issues around shape handling.

Could you check if the performance is still OK? It was not iterating properly with the previous code. Sorry for the trouble.

@nie3e
Contributor Author

nie3e commented Sep 19, 2024

> I have pushed some fixed code and the code now emits the same boundaries as before. There were issues around shape handling.
>
> Could you check if the performance is still OK? It was not iterating properly with the previous code. Sorry for the trouble.

I made some measurements for my 6.5 h file:
  • Workaround with offloading to CPU: ~29 s
  • First loop elimination: ~31 s
  • Last fix: ~115 s

The performance has decreased a bit, but it is still much better than the original ~20 minutes :)

The previous code did result in far fewer kernel invocations, but it seems that doing it on the CPU is still significantly faster.

@asumagic asumagic merged commit df160c7 into speechbrain:develop Oct 9, 2024


Successfully merging this pull request may close these issues.

VAD - very slow "apply_threshold" when running on cuda device + possible solution
