Fixes slow iteration over tensor on GPU #2683
Conversation
asumagic left a comment
The workaround seems OK in itself, but I am wary of this change because it means the return value lives on the CPU rather than on the CUDA device. We would have to test that it doesn't break any consumer of `apply_threshold`.
Maybe there is a way to express the loop as a tensor operation of some sort? I will have to look into it.
I am not sure if this is possible without looping. `vad_th = vad_activation + vad_deactivation` is a tensor consisting of 0s, 1s, and 2s. After the loop encounters a 2, it changes subsequent 1s into 2s; likewise, after a 0 it changes subsequent 1s into 0s. I can check which device the input tensor is on and move the result back to that device before returning it.
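Roughly, the loop I am describing looks like this (a minimal sketch for illustration, not the exact SpeechBrain source; the input shape, probabilities, and thresholds are assumed):

```python
import torch

# Assumed inputs for illustration: per-frame speech probabilities and the two thresholds.
vad_prob = torch.rand(2, 10)              # shape assumed to be [batch, num_frames]
activation_th, deactivation_th = 0.5, 0.25

vad_activation = (vad_prob >= activation_th).int()      # 1 where the frame clears the activation threshold
vad_deactivation = (vad_prob >= deactivation_th).int()  # 1 where the frame clears the deactivation threshold
vad_th = vad_activation + vad_deactivation              # 0 (deactivate), 1 (ambiguous), 2 (activate)

# Per-element indexing of a CUDA tensor forces a sync on every access, which is what makes this slow on GPU.
for b in range(vad_th.shape[0]):
    for t in range(vad_th.shape[1] - 1):
        if vad_th[b, t] == 2 and vad_th[b, t + 1] == 1:
            vad_th[b, t + 1] = 2   # propagate activation through ambiguous frames
        elif vad_th[b, t] == 0 and vad_th[b, t + 1] == 1:
            vad_th[b, t + 1] = 0   # propagate deactivation through ambiguous frames
```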
Could you check if something like this solves your performance issue?

```python
# whether the n-th frame falls below threshold and triggers deactivation
frame_does_not_deactivate = values < deactivation_th
# always start keeping frames over activation threshold activated
vad_th = values >= activation_th
for i in range(1, len(values)):
    vad_th[..., i] |= vad_th[..., i - 1]  # if past frame was activated then keep activated
    vad_th[..., i] &= frame_does_not_deactivate[..., i]  # ... unless the i-th frame is below threshold
```

It eliminates the batch iteration and should avoid CPU/GPU syncs. It probably fires a ton of kernel launches on CUDA, but that might be OK...
It works! Execution times are very close to my workaround!
OK, great to hear. I'll sanity check it before merging, but it seemed correct from my testing.
Returning a bool tensor doesn't break the consumers of `apply_threshold` inside `VAD`. External consumers should largely be unaffected, if they even exist, and it would be a trivial fix for them.
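For an external caller that did rely on the old numeric output, the fix would be on the order of a single cast. Something like this (illustrative sketch; the model source and audio file are just placeholders):

```python
from speechbrain.inference.VAD import VAD  # on older SpeechBrain: from speechbrain.pretrained import VAD

vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty")
vad_prob = vad.get_speech_prob_file("example.wav")  # placeholder audio file

# apply_threshold now returns a bool tensor; callers that need numeric values can cast it back.
vad_th = vad.apply_threshold(vad_prob).int()
```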
Heads up that I've tested this with
I have pushed some fixed code, and it now emits the same boundaries as before. There were issues around shape handling, so the previous version was not iterating properly. Could you check whether the performance is still OK? Sorry for the trouble.
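In case it helps, a check along these lines should be enough (just a sketch; the file name is a placeholder and the import path depends on the SpeechBrain version):

```python
import time
from speechbrain.inference.VAD import VAD  # on older SpeechBrain: from speechbrain.pretrained import VAD

vad = VAD.from_hparams(source="speechbrain/vad-crdnn-libriparty", run_opts={"device": "cuda"})

start = time.time()
boundaries = vad.get_speech_segments("long_recording.wav")  # placeholder for a long audio file
print(f"elapsed: {time.time() - start:.1f} s")
print(boundaries)  # should match the boundaries produced before this change
```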
I made some measurements for my 6.5 h file: the performance has decreased a bit, but it is still much better than the original ~20 minutes :)
The previous code did result in far fewer kernel launches, but it seems doing it on the CPU is still significantly faster.
What does this PR do?
Fixes #2638
Before iterating over the batches, it simply moves the tensor to the CPU and converts it to a NumPy array for faster iteration.
It then converts the result back to a tensor on the CPU.
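A minimal sketch of that pattern (illustrative only, not the exact diff; it reuses the same 0/1/2 propagation loop discussed in the review comments, and the input shape is assumed to be [batch, num_frames]):

```python
import torch

def propagate_states_on_cpu(vad_th: torch.Tensor) -> torch.Tensor:
    """Run the sequential 0/1/2 state propagation on the CPU to avoid per-element GPU syncs."""
    arr = vad_th.cpu().numpy()  # one transfer to host; plain ndarray indexing is cheap
    for b in range(arr.shape[0]):
        for t in range(arr.shape[1] - 1):
            if arr[b, t] == 2 and arr[b, t + 1] == 1:
                arr[b, t + 1] = 2   # keep activated through ambiguous frames
            elif arr[b, t] == 0 and arr[b, t + 1] == 1:
                arr[b, t + 1] = 0   # keep deactivated through ambiguous frames
    # As described above, the result comes back as a CPU tensor
    # (hence the device discussion in the review comments).
    return torch.from_numpy(arr)
```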