Describe the bug
When running VAD on a 6-hour audio file with CUDA, the function apply_threshold is notably slow. I've observed that the loop within this function is the primary bottleneck:
# Loop over batches and time steps
for batch in range(vad_th.shape[0]):
    for time_step in range(vad_th.shape[1] - 1):
        if (
            vad_th[batch, time_step] == 2
            and vad_th[batch, time_step + 1] == 1
        ):
            vad_th[batch, time_step + 1] = 2

Expected behaviour
It will be quicker if we first move the vad_th tensor to the CPU: vad_th = vad_th.cpu()
It is even faster if we use a NumPy array and convert it back to a tensor before returning the result:
vad_th = vad_th.cpu().numpy()
[...]
vad_th = torch.from_numpy(vad_th)

Fixed function:
def apply_threshold(
    self, vad_prob, activation_th=0.5, deactivation_th=0.25
):
    """Scans the frame-level speech probabilities and applies a threshold
    on them. Speech starts when a value larger than activation_th is
    detected, while it ends when observing a value lower than
    the deactivation_th.

    Arguments
    ---------
    vad_prob: torch.Tensor
        Frame-level speech probabilities.
    activation_th: float
        Threshold for starting a speech segment.
    deactivation_th: float
        Threshold for ending a speech segment.

    Returns
    -------
    vad_th: torch.Tensor
        Tensor containing 1 for speech regions and 0 for non-speech regions.
    """
    vad_activation = (vad_prob >= activation_th).int()
    vad_deactivation = (vad_prob >= deactivation_th).int()
    vad_th = vad_activation + vad_deactivation

    # Move the tensor to the CPU and make a NumPy array
    vad_th = vad_th.cpu().numpy()

    # Loop over batches and time steps
    for batch in range(vad_th.shape[0]):
        for time_step in range(vad_th.shape[1] - 1):
            if (
                vad_th[batch, time_step] == 2
                and vad_th[batch, time_step + 1] == 1
            ):
                vad_th[batch, time_step + 1] = 2

    # Get a tensor back from the NumPy array
    vad_th = torch.from_numpy(vad_th)

    vad_th[vad_th == 1] = 0
    vad_th[vad_th == 2] = 1
    return vad_th

To Reproduce
This is my code:

from speechbrain.inference.VAD import VAD
from time import perf_counter

VAD = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn-libriparty",
    huggingface_cache_dir="./",
    run_opts={"device": "cuda"},
)

s = perf_counter()
boundaries = VAD.get_speech_segments(
    "path_to_my_6h_audio.wav"
)
e = perf_counter()

# Print the output
VAD.save_boundaries(boundaries)
print(f"It took {e - s}")

Environment Details
audioread==3.0.1
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
colorama==0.4.6
decorator==5.1.1
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.24.5
HyperPyYAML==1.2.2
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
lazy_loader==0.4
librosa==0.10.2.post1
llvmlite==0.43.0
MarkupSafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
numba==0.60.0
numpy==1.26.4
packaging==24.1
pillow==10.2.0
platformdirs==4.2.2
pooch==1.8.2
pycparser==2.22
PyYAML==6.0.1
requests==2.32.3
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
scikit-learn==1.5.1
scipy==1.14.0
sentencepiece==0.2.0
soundfile==0.12.1
soxr==0.4.0
speechbrain==1.0.0
sympy==1.13.1
threadpoolctl==3.5.0
torch==2.4.0+cu121
torchaudio==2.4.0
torchvision==0.19.0+cu121
tqdm==4.66.4
typing_extensions==4.12.2
urllib3==2.2.2
Relevant Log Output
No response
Additional Context
No response