
Conversation

@GaelleLaperriere (Collaborator) commented Nov 29, 2021

Hello there,

I added in this new recipe:

  • the processing of the Media SLU dataset
  • the csv and txt files needed to do so
  • python/yaml scripts with a wav2vec encoder
  • python/yaml scripts without the wav2vec encoder
  • the Concept Error Rate in dataio.py / metric_stats.py
  • the Concept Value Error Rate in dataio.py / metric_stats.py

I still need to write the README.md. An ASR recipe is coming.
The MEDIA dataset is free for academic purposes, but must be requested from ELRA in order to retrieve it.
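For context, the Concept Error Rate listed above is, loosely speaking, a WER-style edit distance computed over concept sequences instead of words. A generic sketch of the idea (not the actual metric_stats.py implementation from this PR):

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over token lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)]

def concept_error_rate(ref_concepts, hyp_concepts):
    """CER = (substitutions + insertions + deletions) / #reference concepts."""
    return edit_distance(ref_concepts, hyp_concepts) / max(len(ref_concepts), 1)

# one deletion out of three reference concepts -> 1/3
print(concept_error_rate(["command", "date", "room"], ["command", "room"]))
```

The Concept Value Error Rate follows the same scheme, but compares concept/value pairs instead of bare concepts.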

Thanks in advance.


EDIT

Tasks TODO:

  • run pre-commit
  • add the authors everywhere
  • keep one parser only
  • load the csv / txt files in memory (where possible)
  • rename keep_concepts_value -> extract_concepts_value
  • merge keep_concepts and keep_concepts_values into a single function
  • add a general tag for concepts
  • handle every error
  • test the code
  • list the contributors
  • redo CER / CVER

@TParcollet (Collaborator)

Thanks @GaelleLaperriere !!! We will discuss that directly at the lab, but I will certainly ask for a few changes ;-)

@TParcollet TParcollet self-assigned this Nov 29, 2021
@TParcollet TParcollet self-requested a review November 29, 2021 10:02
@TParcollet (Collaborator) left a comment

Hey @GaelleLaperriere thank you so much for your work. Comments are a bit short, but we have so many things to do on SB ... Once the changes are done, I will test the recipes!

@mravanelli mravanelli added the enhancement New feature or request label Dec 15, 2021
@mravanelli (Collaborator)

Hi @GaelleLaperriere and @TParcollet, where are we with this PR at this stage? I'm asking because I would like to figure out whether we want to include it in the upcoming new version of SpeechBrain.

@TParcollet (Collaborator)

Hey @mravanelli, I'm unsure that this will be possible.

@anautsch anautsch changed the base branch from develop to develop-v2 June 1, 2022 15:47
@Adel-Moumen Adel-Moumen self-assigned this Dec 2, 2022
@Adel-Moumen (Collaborator) left a comment

Hello,

Many thanks for the PR. You are doing a really great job!

This review covers only the data preparation file. Once the other files are updated, I will review everything.

First, could you please remove the .csv files? I have uploaded them to our Google Drive (https://drive.google.com/drive/u/1/folders/1z2zFZp3c0NYLFaUhhghhBakGcFdXVRyf), so please put the links for downloading the external data in the README.md, as well as in the docstrings of the prepare functions where needed.

Please fix the pre-commit/testing as well.

@anautsch (Collaborator) commented Mar 1, 2023

Hi @GaelleLaperriere, after running the recipe tests for MEDIA

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Dataset"], filters=[["MEDIA"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'

all tests failed with

ValueError: 'channels_path' is a !PLACEHOLDER and must be replaced.

When I put empty strings (it seems only relevant to the dataset preparation script, which this test skips), it gets stuck here:

    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
        csv_path=hparams["csv_test"], replacements={"data_root": csv_folder}
    )
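For context, HyperPyYAML raises this error whenever a !PLACEHOLDER entry in the hparams file is not replaced by an override. The mechanism can be mimicked in plain Python (a toy stand-in, not the real HyperPyYAML code):

```python
# Toy stand-in for HyperPyYAML's !PLACEHOLDER check: every placeholder
# must be replaced by an override before the hparams can be used.
PLACEHOLDER = object()

def resolve_hparams(hparams, overrides):
    resolved = {**hparams, **overrides}
    for key, value in resolved.items():
        if value is PLACEHOLDER:
            raise ValueError(f"'{key}' is a !PLACEHOLDER and must be replaced.")
    return resolved

# succeeds once the placeholder is overridden (even with an empty string)
print(resolve_hparams({"channels_path": PLACEHOLDER, "skip_prep": True},
                      {"channels_path": ""}))
```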

Please add an extra_requirements.txt file (to know which additional packages to install, like transformers).

@anautsch (Collaborator) commented Mar 1, 2023

Got curious; after using this recipe testing config

Task,Dataset,Script_file,Hparam_file,Data_prep_file,Readme_file,Result_url,HF_repo,test_debug_flags,test_debug_checks
SLU,MEDIA,recipes/MEDIA/SLU/CTC/train_hf_wav2vec.py,recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_full.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/SLU/CTC/README.md,https://drive.google.com/drive/folders/1LHKmtQ8Roz85GfwkYXHRv_zz-Z-2Qurf,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",
SLU,MEDIA,recipes/MEDIA/SLU/CTC/train_hf_wav2vec.py,recipes/MEDIA/SLU/CTC/hparams/train_hf_wav2vec_relax.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/SLU/CTC/README.md,https://drive.google.com/drive/folders/1ALtwmk3VUUM0XRToecQp1DKAh9FsGqMA,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",
ASR,MEDIA,recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py,recipes/MEDIA/ASR/CTC/hparams/train_hf_wav2vec.yaml,recipes/MEDIA/media_prepare.py,recipes/MEDIA/ASR/CTC/README.md,https://drive.google.com/drive/folders/1qJUKxsTKrYwzKz0LHzq67M4G06Mj-9fl,,--data_folder=tests/samples/ASR/ --csv_train=tests/samples/annotation/ASR_train.csv --csv_valid=tests/samples/annotation/ASR_train.csv --csv_test=tests/samples/annotation/ASR_train.csv --number_of_epochs=2 --skip_prep=True --channels_path="" --concepts_path="",

I got these logs for the three recipes tests:

RuntimeError: These keys are still unaccounted for in the data pipeline: start_seg, end_seg

FileNotFoundError: [Errno 2] No such file or directory: 'tests/tmp/MEDIA_row_2/save/labelencoder.txt'

The last one is a consequence of the first, right?

    lab_enc_file = hparams["save_folder"] + "/labelencoder.txt"
    label_encoder.load_or_create(
        path=lab_enc_file,

which should create the labelencoder.txt file.
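The load-or-create pattern itself is straightforward; a generic sketch (not SpeechBrain's actual encoder class) looks like:

```python
import json
import os
import tempfile

def load_or_create(path, labels):
    """Reuse a saved label mapping if the file exists, else build and save it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    mapping = {lab: i for i, lab in enumerate(sorted(set(labels)))}
    with open(path, "w") as f:
        json.dump(mapping, f)
    return mapping

enc_file = os.path.join(tempfile.mkdtemp(), "labelencoder.json")
first = load_or_create(enc_file, ["room", "date", "room"])  # creates the file
second = load_or_create(enc_file, ["ignored"])              # loads it instead
print(first == second)  # True
```

So the file only appears once the dataio pipeline has actually run, which is why the first failure masks it.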

The earlier

    # 2. Define audio pipeline:
    @sb.utils.data_pipeline.takes("wav", "start_seg", "end_seg")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(wav, start_seg, end_seg):

needs to have start_seg & end_seg in the CSV files it reads from. For recipe testing, we use dummy files. Your database preparation presumably does everything correctly; I suppose you and Adel already went through it.

In the config above, the train CSV dummy is:
https://github.com/speechbrain/speechbrain/blob/develop/tests/samples/annotation/ASR_train.csv

and it does not have these two fields.

Please take a look at the other sample annotations, which you could use to satisfy your dataio pipeline; we collect them here:
https://github.com/speechbrain/speechbrain/tree/develop/tests/samples/annotation

If none fits, there are several options:

  • extend an existing one so it does what you need (just add the two columns with almost ANY data)
  • create a new one that is fairly general (so others can use it too)
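As a sketch of the first option, adding the two columns to a dummy annotation CSV could look like this (hypothetical ID, path, and values; a crash test only needs the fields to exist):

```python
import csv
import io

# Dummy annotation row with the extra "start" / "stop" columns.
fieldnames = ["ID", "duration", "wav", "start", "stop", "wrd"]
rows = [{"ID": "spk1_snt1", "duration": 2.0,
         "wav": "$data_root/spk1_snt1.wav",
         "start": 0, "stop": 32000, "wrd": "dummy transcript"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```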

@GaelleLaperriere (Collaborator, Author)

Hello,

Thank you for your help testing the recipe.

I solved the first issue. I just changed the tags "start_seg" and "end_seg" to "start" and "stop", as done in the testing samples.
For the second issue, with the label_encoder, I guess it solved itself now that the pipeline is fixed.

Please keep me updated when you run the tests again. Many thanks.

@anautsch (Collaborator) commented Mar 2, 2023

Hello @GaelleLaperriere,

Thank you for your help testing the recipe.

Welcome; no worries.

I solved the first issue. I just changed the tags "start_seg" and "end_seg" to "start" and "stop", as done in the testing samples.

Ok, that works. Adding to the test samples here would also have been no big deal. All tests are passing now; your solution works!

For the second issue, with the label_encoder, I guess it solved itself now that the pipeline is fixed.

Yes.

Please keep me updated when you run the tests again. Many thanks.

After running the tests with the above-mentioned tests/recipes/MEDIA.csv edits, I restored the file to its latest version from git.

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Dataset"], filters=[["MEDIA"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'

fails with:

ValueError: 'channels_path' is a !PLACEHOLDER and must be replaced.

Please:

  1. add extra_requirements.txt files (to know which additional packages to install, like transformers).
  2. adjust tests/recipes/MEDIA.csv so it works

Let me know if the above python -c command works for you when executed from your local repo root folder.

@GaelleLaperriere (Collaborator, Author) commented Mar 2, 2023

Placeholders are now overridden in MEDIA.csv with "Null". I don't know if it will work this way.
I added the extra_requirements.txt files.

I don't know if Adel told you about the error I get when running python -c for any recipe:

KeyError: "Override 'debug_persistently' not found in: ['seed', '__set_seed', 'output_folder', 'cer_file_test', 'ctc_file_test', 'save_folder', 'train_log', 'data_folder', 'channels_path', 'concepts_path', 'skip_wav', 'method', 'task', 'skip_prep', 'process_test2', 'wav2vec_url', 'csv_train', 'csv_valid', 'csv_test', 'batch_size', 'test_batch_size', 'avoid_if_longer_than', 'avoid_if_smaller_than', 'num_workers', 'dataloader_options', 'test_dataloader_options', 'sample_rate', 'feats_dim', 'number_of_epochs', 'lr', 'lr_wav2vec', 'annealing_factor', 'annealing_factor_wav2vec', 'improvement_threshold', 'improvement_threshold_wav2vec', 'patient', 'patient_wav2vec', 'sorting', 'activation', 'dnn_blocks', 'dnn_neurons', 'freeze', 'blank_index', 'output_neurons', 'epoch_counter', 'wav2vec2', 'enc', 'output_lin', 'log_softmax', 'ctc_cost', 'modules', 'model', 'model_wav2vec2', 'opt_class', 'opt_class_wav2vec', 'lr_annealing', 'lr_annealing_wav2vec', 'label_encoder', 'checkpointer', 'train_logger', 'ctc_computer', 'cer_computer']"

@anautsch (Collaborator) commented Mar 2, 2023

Placeholders are now overridden in MEDIA.csv with "Null". I don't know if it will work this way. I added the extra_requirements.txt files.

Thank you @GaelleLaperriere !
Looks good :)

I don't know if Adel told you about the error I get when running python -c for any recipe:

KeyError: "Override 'debug_persistently' not found in: ...

It's good that you wrote about it here (others might run into it too). You merged the latest develop on Feb 17; that one has all the testing tools required. The flag is used here:

    # add --debug if no do_checks, to save testing time
    if not do_checks:
        cmd += " --debug --debug_persistently"

Same in your repo/PR.

The log saying it isn't found is weird, since it is a run_opt:

    # Arguments passed via the run opts dictionary
    run_opt_defaults = {
        "debug": False,
        "debug_batches": 2,
        "debug_epochs": 2,
        "debug_persistently": False,

What might have happened: your python environment is not operating from your local SpeechBrain repo, which has the change. As for now, these testing tools are not available via the pip version of SpeechBrain. We will release a new version for PyPI soon after merging your PR.
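A quick way to check which copy a given environment actually imports is to print the module's location (shown here with the stdlib json package as a stand-in for speechbrain; if the printed path points into site-packages rather than your clone, install the clone in editable mode with pip install -e . from the repo root):

```python
import importlib
import os

def installed_location(module_name):
    """Return the directory a module is actually imported from."""
    mod = importlib.import_module(module_name)
    return os.path.dirname(os.path.abspath(mod.__file__))

# Replace "json" with "speechbrain" to check your own environment.
print(installed_location("json"))
```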


On my end, all fail now with

Traceback (most recent call last):
  File "recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py", line 337, in <module>
    train_data, valid_data, test_data, label_encoder = dataio_prepare(hparams)
  File "recipes/MEDIA/ASR/CTC/train_hf_wav2vec.py", line 230, in dataio_prepare
    test_data = sb.dataio.dataset.DynamicItemDataset.from_csv(
  File "speechbrain/dataio/dataset.py", line 365, in from_csv
    data = load_data_csv(csv_path, replacements)
  File "speechbrain/dataio/dataio.py", line 129, in load_data_csv
    with open(csv_path, newline="") as csvfile:
TypeError: expected str, bytes or os.PathLike object, not list

You still defined --csv_test=[tests/samples/annotation/ASR_train.csv], which makes it a list of test partitions, while your training code expects a single file path, i.e. without the '[' and ']'. I removed them and pushed (please pull).
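If one wanted to accept both forms, a small hypothetical guard before calling from_csv could normalize the value (this helper is an illustration, not part of the actual recipe):

```python
import os

def as_single_csv_path(value):
    """Accept "file.csv" or a one-element list like ["file.csv"] and
    always return a plain path string, as from_csv expects."""
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"expected exactly one test CSV, got {value!r}")
        value = value[0]
    if not isinstance(value, (str, os.PathLike)):
        raise TypeError(f"expected str or os.PathLike, got {type(value).__name__}")
    return value

print(as_single_csv_path(["tests/samples/annotation/ASR_train.csv"]))
```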

(1/3) Running test for MEDIA_row_2...
	... 62.91s
(2/3) Running test for MEDIA_row_3...
	... 69.42s
(3/3) Running test for MEDIA_row_4...
	... 71.78s
TEST PASSED

Note: I reduced the epoch number from 10 to 2; it's just a speed-up. We need 10 epochs to obtain somewhat stable debug results for comparison, but we do that only for a few core recipes, since it takes time and the objective here is simply: does the code crash, yes or no. It's just an FYI; you already suffered enough through this PR...


please let me know if testing works on your end, too, now.

@GaelleLaperriere (Collaborator, Author)

What might have happened: your python environment is not operating from your local SpeechBrain repo, which has the change. As for now, these testing tools are not available via the pip version of SpeechBrain. We will release a new version for PyPI soon after merging your PR.

Thank you, it was the case. Now the tests work fine. 👍

@anautsch (Collaborator) left a comment

lgtm

@anautsch anautsch dismissed stale reviews from Adel-Moumen and TParcollet March 2, 2023 12:41

Concerns have been addressed.

@anautsch anautsch merged commit 205e523 into speechbrain:develop Mar 2, 2023
