Describe the bug
When running VAD on a 6-hour audio file with CUDA, the function apply_threshold is notably slow. I've observed that the loop within this function is the primary bottleneck:
# Loop over batches and time steps
for batch in range(vad_th.shape[0]):
    for time_step in range(vad_th.shape[1] - 1):
        if (
            vad_th[batch, time_step] == 2
            and vad_th[batch, time_step + 1] == 1
        ):
            vad_th[batch, time_step + 1] = 2

Expected behaviour
It will be quicker if we first move the vad_th tensor to the CPU: vad_th = vad_th.cpu()
It is even faster if we use a NumPy array and convert it back to a tensor before returning the result:
vad_th = vad_th.cpu().numpy()
[...]
vad_th = torch.from_numpy(vad_th)

Fixed function:
def apply_threshold(
    self, vad_prob, activation_th=0.5, deactivation_th=0.25
):
    """Scans the frame-level speech probabilities and applies a threshold
    on them. Speech starts when a value larger than activation_th is
    detected, while it ends when observing a value lower than
    the deactivation_th.

    Arguments
    ---------
    vad_prob: torch.Tensor
        Frame-level speech probabilities.
    activation_th: float
        Threshold for starting a speech segment.
    deactivation_th: float
        Threshold for ending a speech segment.

    Returns
    -------
    vad_th: torch.Tensor
        Tensor containing 1 for speech regions and 0 for non-speech regions.
    """
    vad_activation = (vad_prob >= activation_th).int()
    vad_deactivation = (vad_prob >= deactivation_th).int()
    vad_th = vad_activation + vad_deactivation

    # Move the tensor to the CPU and make a NumPy array
    vad_th = vad_th.cpu().numpy()

    # Loop over batches and time steps
    for batch in range(vad_th.shape[0]):
        for time_step in range(vad_th.shape[1] - 1):
            if (
                vad_th[batch, time_step] == 2
                and vad_th[batch, time_step + 1] == 1
            ):
                vad_th[batch, time_step + 1] = 2

    # Get a tensor back from the NumPy array
    vad_th = torch.from_numpy(vad_th)

    vad_th[vad_th == 1] = 0
    vad_th[vad_th == 2] = 1
    return vad_th

To Reproduce
This is my code:

from speechbrain.inference.VAD import VAD
from time import perf_counter

VAD = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn-libriparty",
    huggingface_cache_dir="./",
    run_opts={"device": "cuda"},
)

s = perf_counter()
boundaries = VAD.get_speech_segments(
    "path_to_my_6h_audio.wav"
)
e = perf_counter()

# Print the output
VAD.save_boundaries(boundaries)
print(f"It took {e - s}")

Environment Details
audioread==3.0.1
certifi==2024.7.4
cffi==1.16.0
charset-normalizer==3.3.2
colorama==0.4.6
decorator==5.1.1
filelock==3.15.4
fsspec==2024.6.1
huggingface-hub==0.24.5
HyperPyYAML==1.2.2
idna==3.7
Jinja2==3.1.4
joblib==1.4.2
lazy_loader==0.4
librosa==0.10.2.post1
llvmlite==0.43.0
MarkupSafe==2.1.5
mpmath==1.3.0
msgpack==1.0.8
networkx==3.3
numba==0.60.0
numpy==1.26.4
packaging==24.1
pillow==10.2.0
platformdirs==4.2.2
pooch==1.8.2
pycparser==2.22
PyYAML==6.0.1
requests==2.32.3
ruamel.yaml==0.18.6
ruamel.yaml.clib==0.2.8
scikit-learn==1.5.1
scipy==1.14.0
sentencepiece==0.2.0
soundfile==0.12.1
soxr==0.4.0
speechbrain==1.0.0
sympy==1.13.1
threadpoolctl==3.5.0
torch==2.4.0+cu121
torchaudio==2.4.0
torchvision==0.19.0+cu121
tqdm==4.66.4
typing_extensions==4.12.2
urllib3==2.2.2
Relevant Log Output
No response
Additional Context
No response