
Conversation

@GaelleLaperriere (Collaborator) commented Nov 29, 2021

Hello there,

I added in this new recipe:

  • the processing of the Media SLU dataset
  • the csv and txt files needed to do so
  • python/yaml scripts with a wav2vec encoder
  • python/yaml scripts without the wav2vec encoder
  • the Concept Error Rate in dataio.py / metric_stats.py
  • the Concept Value Error Rate in dataio.py / metric_stats.py

I still need to write the README.md. An ASR recipe is coming.
The MEDIA dataset is free for academic purposes, but must be requested from ELRA in order to retrieve it.
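For context, the Concept Error Rate listed above is, loosely speaking, a WER-style edit distance computed over concept sequences instead of words. A generic sketch of the idea (not the actual metric_stats.py implementation from this PR):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)]

def concept_error_rate(ref_concepts, hyp_concepts):
    """CER = (substitutions + insertions + deletions) / #reference concepts."""
    return edit_distance(ref_concepts, hyp_concepts) / max(len(ref_concepts), 1)

# one deletion out of three reference concepts -> 1/3
print(concept_error_rate(["command", "date", "room"], ["command", "room"]))
```

The Concept Value Error Rate follows the same scheme, but compares concept/value pairs instead of bare concepts.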

Thanks in advance.


EDIT

Tasks TODO:

  • run pre-commit
  • add the authors everywhere
  • keep one parser only
  • load the csv / txt files in memory (where possible)
  • rename keep_concepts_value -> extract_concepts_value
  • merge keep_concepts and keep_concepts_values into a single function
  • add a general tag for concepts
  • handle every error
  • test the code
  • list the contributors
  • redo CER / CVER

@TParcollet (Collaborator)

Thanks @GaelleLaperriere !!! We will discuss that directly at the lab, but I will certainly ask for a few changes ;-)

@TParcollet TParcollet self-assigned this Nov 29, 2021
@TParcollet TParcollet self-requested a review November 29, 2021 10:02
@TParcollet (Collaborator) left a comment

Hey @GaelleLaperriere thank you so much for your work. Comments are a bit short, but we have so many things to do on SB ... Once the changes are done, I will test the recipes!

@mravanelli mravanelli added the enhancement New feature or request label Dec 15, 2021
@mravanelli (Collaborator)

Hi @GaelleLaperriere and @TParcollet, where are we with this PR at this stage? I'm asking because I would like to figure out whether we want to include it in the upcoming new version of SpeechBrain.

@TParcollet (Collaborator)

Hey @mravanelli, I'm unsure that this will be possible.

@anautsch anautsch changed the base branch from develop to develop-v2 June 1, 2022 15:47
@Adel-Moumen Adel-Moumen self-assigned this Dec 2, 2022
@Adel-Moumen (Collaborator) left a comment

Hello,

Many thanks for the PR. You are doing a really great job!

This review covers only the data preparation file. Once the other files are updated, I will review everything.

First, could you please remove the .csv files? I have uploaded them to our Google Drive (https://drive.google.com/drive/u/1/folders/1z2zFZp3c0NYLFaUhhghhBakGcFdXVRyf), so please put the links for downloading the external data in the README.md, as well as in the docstrings of the prepare functions where needed.

Please fix the pre-commit/testing as well.

@anautsch (Collaborator) commented Mar 1, 2023

Hi @GaelleLaperriere, after running the recipe tests for MEDIA

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Dataset"], filters=[["MEDIA"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'

all tests failed with

ValueError: 'channels_path' is a !PLACEHOLDER and must be replaced.

When I put empty strings (it seems only relevant to the dataset preparation script, which this test skips), it gets stuck here:

    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["csv_test"], replacements={"data_root": csv_folder}
    )
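For context, HyperPyYAML raises this error whenever a !PLACEHOLDER entry in the hparams file is not replaced by an override. The mechanism can be mimicked in plain Python (a toy stand-in, not the real HyperPyYAML code):

```python
# Toy stand-in for HyperPyYAML's !PLACEHOLDER check: every placeholder
# must be replaced by an override before the hparams can be used.
PLACEHOLDER = object()

def resolve_hparams(hparams, overrides):
    resolved = {**hparams, **overrides}
    for key, value in resolved.items():
        if value is PLACEHOLDER:
            raise ValueError(f"'{key}' is a !PLACEHOLDER and must be replaced.")
    return resolved

# succeeds once the placeholder is overridden (even with an empty string)
print(resolve_hparams({"channels_path": PLACEHOLDER, "skip_prep": True},
                      {"channels_path": ""}))
```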

Please add an extra_requirements.txt file (to know which additional packages to install, like transformers).

@anautsch (Collaborator) commented Mar 1, 2023

Got curious; after using this recipe testing config

Task,Dataset,Script_file,Hparam_file,Data_prep_file,Readme_file,Result_url,HF_repo,test_debug_flags,test_debug_checks
SLU,MEDIA,recipes/MEDIA/SLU/CTC/train_hf_wav2vec.py,recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_full.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/SLU/CTC/README.md,https://drive.google.com/drive/folders/1LHKmtQ8Roz85GfwkYXHRv_zz-Z-2Qurf,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",
SLU,MEDIA,recipes/MEDIA/SLU/CTC/train_hf_wav2vec.py,recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_relax.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/SLU/CTC/README.md,https://drive.google.com/drive/folders/1ALtwmk3VUUM0XRToecQp1DKAh9FsGqMA,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",
ASR,MEDIA,recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py,recipes/MEDIA/ASR/CTC/hparams/train_hf_wav2vec.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/ASR/CTC/README.md,https://drive.google.com/drive/folders/1qJUKxsTKrYwzKz0LHzq67M4G06Mj-9fl,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",

I got these logs for the three recipes tests:

RuntimeError: These keys are still unaccounted for in the data pipeline: start_seg, end_seg

FileNotFoundError: [Errno 2] No such file or directory: 'tests/tmp/MEDIA_row_2/save/labelencoder.txt'

The last one is a consequence of the first, right?

    lab_enc_file = hparams["save_folder"] + "/labelencoder.txt"
    label_encoder.load_or_create(
        path=lab_enc_file,

which should create the labelencoder.txt file.
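The load-or-create pattern itself is straightforward; a generic sketch (not SpeechBrain's actual encoder class) looks like:

```python
import json
import os
import tempfile

def load_or_create(path, labels):
    """Reuse a saved label mapping if the file exists, else build and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    mapping = {lab: i for i, lab in enumerate(sorted(set(labels)))}
    with open(path, "w") as f:
        json.dump(mapping, f)
    return mapping

enc_file = os.path.join(tempfile.mkdtemp(), "labelencoder.json")
first = load_or_create(enc_file, ["room", "date", "room"])  # creates the file
second = load_or_create(enc_file, ["ignored"])              # loads it instead
print(first == second)  # True
```

So the file only appears once the dataio pipeline has actually run, which is why the first failure masks it.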

The earlier

    # 2. Define audio pipeline:
    @sb.utils.data_pipeline.takes("wav", "start_seg", "end_seg")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav, start_seg, end_seg):

needs to have start_seg & end_seg in the CSV files it reads from. For recipe testing, we use dummy files. Your database preparation presumably does everything correctly; I suppose you and Adel already went through it.

In the config above, the train CSV dummy is:
https://github.com/speechbrain/speechbrain/blob/develop/tests/samples/annotation/ASR_train.csv

and it does not have these two fields.

Please take a look at the other sample annotations, which you could use to satisfy your dataio pipeline; we collect them here:
https://github.com/speechbrain/speechbrain/tree/develop/tests/samples/annotation

If none fits, there are several options:

  • extend an existing one so it does what you need (just add the two columns with almost ANY data)
  • create a new one that is fairly general (so others can use it too)
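As a sketch of the first option, adding the two columns to a dummy annotation CSV could look like this (hypothetical ID, path, and values; a crash test only needs the fields to exist):

```python
import csv
import io

# Dummy annotation row with the extra "start" / "stop" columns.
fieldnames = ["ID", "duration", "wav", "start", "stop", "wrd"]
rows = [{"ID": "spk1_snt1", "duration": 2.0,
         "wav": "$data_root/spk1_snt1.wav",
         "start": 0, "stop": 32000, "wrd": "dummy transcript"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```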

@GaelleLaperriere (Collaborator, Author)

Hello,

Thank you for your help testing the recipe.

I solved the first issue. I just changed the tags "start_seg" and "end_seg" to "start" and "stop", as done in the testing samples.
For the second issue, with the label_encoder, I guess it solved itself now that the pipeline is fixed.

Please keep me updated when you run the tests again. Many thanks.

@anautsch (Collaborator) commented Mar 2, 2023

Hello @GaelleLaperriere,

Thank you for your help testing the recipe.

Welcome; no worries.

I solved the first issue. I just changed the tags "start_seg" and "end_seg" to "start" and "stop", as done in the testing samples.

Ok, that works. Adding to the test samples here would also have been no big deal. All tests are passing now; your solution works!

For the second issue, with the label_encoder, I guess it solved itself now that the pipeline is fixed.

Yes.

Please keep me updated when you run the tests again. Many thanks.

After running the tests with the above-mentioned tests/recipes/MEDIA.csv edits, I restored the file to its latest version from git.

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Dataset"], filters=[["MEDIA"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'

fails with:

ValueError: 'channels_path' is a !PLACEHOLDER and must be replaced.

Please:

  1. add extra_requirements.txt files (to know which additional packages to install, like transformers).
  2. adjust tests/recipes/MEDIA.csv so it works

Let me know if the above python -c command works for you when executed from your local repo root folder.

@GaelleLaperriere (Collaborator, Author) commented Mar 2, 2023

Placeholders are now overridden in MEDIA.csv with "Null". I don't know if it will work this way.
I added the extra_requirements.txt files.

I don't know if Adel told you about the error I get when running python -c for any recipe:

KeyError: "Override 'debug_persistently' not found in: ['seed', '__set_seed', 'output_folder', 'cer_file_test', 'ctc_file_test', 'save_folder', 'train_log', 'data_folder', 'channels_path', 'concepts_path', 'skip_wav', 'method', 'task', 'skip_prep', 'process_test2', 'wav2vec_url', 'csv_train', 'csv_valid', 'csv_test', 'batch_size', 'test_batch_size', 'avoid_if_longer_than', 'avoid_if_smaller_than', 'num_workers', 'dataloader_options', 'test_dataloader_options', 'sample_rate', 'feats_dim', 'number_of_epochs', 'lr', 'lr_wav2vec', 'annealing_factor', 'annealing_factor_wav2vec', 'improvement_threshold', 'improvement_threshold_wav2vec', 'patient', 'patient_wav2vec', 'sorting', 'activation', 'dnn_blocks', 'dnn_neurons', 'freeze', 'blank_index', 'output_neurons', 'epoch_counter', 'wav2vec2', 'enc', 'output_lin', 'log_softmax', 'ctc_cost', 'modules', 'model', 'model_wav2vec2', 'opt_class', 'opt_class_wav2vec', 'lr_annealing', 'lr_annealing_wav2vec', 'label_encoder', 'checkpointer', 'train_logger', 'ctc_computer', 'cer_computer']"

@anautsch (Collaborator) commented Mar 2, 2023

Placeholders are now overridden in MEDIA.csv with "Null". I don't know if it will work this way. I added the extra_requirements.txt files.

Thank you @GaelleLaperriere !
Looks good :)

I don't know if Adel told you about the error I get when running python -c for any recipe:

KeyError: "Override 'debug_persistently' not found in: ...

It's good that you wrote about it here (others might run into it too). You merged the latest develop on Feb 17; that one has all the testing tools required. The flag is used here:

    # add --debug if no do_checks, to save testing time
    if not do_checks:
        cmd += " --debug --debug_persistently"

Same in your repo/PR.

The log saying it isn't found is weird, since it is a run_opt:

    # Arguments passed via the run opts dictionary
    run_opt_defaults = {
        "debug": False,
        "debug_batches": 2,
        "debug_epochs": 2,
        "debug_persistently": False,

What might have happened: your python environment is not operating from your local SpeechBrain repo, which has the change. As for now, these testing tools are not available via the pip version of SpeechBrain. We will release a new version for PyPI soon after merging your PR.
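A quick way to check which copy a given environment actually imports is to print the module's location (shown here with the stdlib json package as a stand-in for speechbrain; if the printed path points into site-packages rather than your clone, install the clone in editable mode with pip install -e . from the repo root):

```python
import importlib
import os

def installed_location(module_name):
    """Return the directory a module is actually imported from."""
    mod = importlib.import_module(module_name)
    return os.path.dirname(os.path.abspath(mod.__file__))

# Replace "json" with "speechbrain" to check your own environment.
print(installed_location("json"))
```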


On my end, all fail now with

Traceback (most recent call last):
  File "recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py", line 337, in <module>
    train_data, valid_data, test_data, label_encoder = dataio_prepare(hparams)
  File "recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py", line 230, in dataio_prepare
    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
  File "speechbrain/dataio/dataset.py", line 365, in from_csv
    data = load_data_csv(csv_path, replacements)
  File "speechbrain/dataio/dataio.py", line 129, in load_data_csv
    with open(csv_path, newline="") as csvfile:
TypeError: expected str, bytes or os.PathLike object, not list

You still defined --csv_test=[tests/samples/annotation/ASR_train.csv], which makes it a list of test partitions, while your training code expects a single file path, i.e. without the '[' and ']'. I removed them and pushed (please pull).
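If one wanted to accept both forms, a small hypothetical guard before calling from_csv could normalize the value (this helper is an illustration, not part of the actual recipe):

```python
import os

def as_single_csv_path(value):
    """Accept "file.csv" or a one-element list like ["file.csv"] and
    always return a plain path string, as from_csv expects."""
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"expected exactly one test CSV, got {value!r}")
        value = value[0]
    if not isinstance(value, (str, os.PathLike)):
        raise TypeError(f"expected str or os.PathLike, got {type(value).__name__}")
    return value

print(as_single_csv_path(["tests/samples/annotation/ASR_train.csv"]))
```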

(1/3) Running test for MEDIA_row_2...
	... 62.91s
(2/3) Running test for MEDIA_row_3...
	... 69.42s
(3/3) Running test for MEDIA_row_4...
	... 71.78s
TEST PASSED

Note: I reduced the epoch number from 10 to 2; it's just a speed-up. We need 10 epochs to obtain somewhat stable debug results for comparison, but we do that only for a few core recipes, since it takes time and the objective here is simply: does the code crash, yes or no. It's just an FYI; you already suffered enough through this PR...


please let me know if testing works on your end, too, now.

@GaelleLaperriere (Collaborator, Author)

What might have happened: your python environment is not operating from your local SpeechBrain repo, which has the change. As for now, these testing tools are not available via the pip version of SpeechBrain. We will release a new version for PyPI soon after merging your PR.

Thank you, it was the case. Now the tests work fine. 👍

@anautsch (Collaborator) left a comment

lgtm

@anautsch anautsch dismissed stale reviews from Adel-Moumen and TParcollet March 2, 2023 12:41

Concerns have been addressed.

@anautsch anautsch merged commit 205e523 into speechbrain:develop Mar 2, 2023
