Conversation
@z-wony z-wony commented Aug 28, 2022

When batch_size is not 1, evaluate_batch() returns a tensor array (in the separation recipes).
This raises an exception in _fit_valid() (#988).
So, this commit fixes the return value to match the API reference of evaluate_batch().

But I expect there are a lot of identical issues in other recipes,
so I'd like to suggest changing update_average() in core.py as well.
If a list of tensors were allowed as an input argument, the solution would be simpler.
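As a rough sketch of that suggestion (the explicit step argument and the exact function shape here are hypothetical, not SpeechBrain's actual Brain.update_average() signature), update_average() could reduce a multi-element loss tensor to a scalar before the finiteness check:

```python
import torch

def update_average(loss, avg_loss, step):
    # Hypothetical variant of update_average(): if evaluate_batch() handed us
    # per-sample losses (batch_size > 1), reduce them to a scalar first so
    # the torch.isfinite() check below is unambiguous.
    if loss.numel() > 1:
        loss = loss.mean()
    if torch.isfinite(loss):
        # Running average over `step` batches, as in core.py.
        avg_loss -= avg_loss / step
        avg_loss += float(loss) / step
    return avg_loss
```

With this change, both a scalar loss and a per-sample loss tensor update the running average without raising.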

z-wony commented Aug 28, 2022

My call stack is below (with batch_size: 2):

speechbrain.core - Exception:
Traceback (most recent call last):
  File "train.py", line 629, in <module>
    separator.fit(
  File "/home/jwkim/github/speechbrain/speechbrain/core.py", line 1154, in fit
    self._fit_valid(valid_set=valid_set, epoch=epoch, enable=enable)
  File "/home/jwkim/github/speechbrain/speechbrain/core.py", line 1057, in _fit_valid
    avg_valid_loss = self.update_average(loss, avg_valid_loss)
  File "/home/jwkim/github/speechbrain/speechbrain/core.py", line 1303, in update_average
    if torch.isfinite(loss):
RuntimeError: Boolean value of Tensor with more than one value is ambiguous

In core.py:

1045     def _fit_valid(self, valid_set, epoch, enable):
1046         # Validation stage
1047         if valid_set is not None:
1048             self.on_stage_start(Stage.VALID, epoch)
1049             self.modules.eval()
1050             avg_valid_loss = 0.0
1051             with torch.no_grad():
1052                 for batch in tqdm(
1053                     valid_set, dynamic_ncols=True, disable=not enable
1054                 ):
1055                     self.step += 1
1056                     loss = self.evaluate_batch(batch, stage=Stage.VALID)
1057                     avg_valid_loss = self.update_average(loss, avg_valid_loss)

evaluate_batch() returns tensor([19.2439, 24.9512], device='cuda:0') as loss,
so `if torch.isfinite(loss):` in update_average() throws the exception.
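A minimal reproduction of that failure, and the scalar reduction this fix applies (the variable names are illustrative, not taken from the recipe):

```python
import torch

# A multi-element loss tensor, as evaluate_batch() returned with batch_size=2.
loss = torch.tensor([19.2439, 24.9512])

# torch.isfinite() on this tensor yields a bool tensor of the same shape;
# using it as a Python truth value is ambiguous and raises RuntimeError.
try:
    if torch.isfinite(loss):
        pass
except RuntimeError as exc:
    print(type(exc).__name__)  # RuntimeError

# Reducing to a scalar (the direction of this fix) restores the scalar
# return value documented for evaluate_batch(), and the check works again.
scalar_loss = loss.mean().detach()
assert torch.isfinite(scalar_loss)
```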

z-wony commented Sep 5, 2022

Dear @mravanelli, could you review this?

@ycemsubakan

@z-wony Sorry for my late reply. I will take a look soon. Thanks for the PR.

@ycemsubakan ycemsubakan self-requested a review September 14, 2022 14:11
z-wony commented Oct 3, 2022

ping? : )

@ycemsubakan

Guys, I'll take a look this week. Very sorry, I am swamped with several things.

@ycemsubakan

Alright, I just tried this branch, it seems this doesn't cause anything else to break. We can merge.

@z-wony Do you know when this started to break the eval loop? Because originally this wasn't causing any issues. Like do you know which commit started to cause this issue? (Just out of curiosity)

@ycemsubakan ycemsubakan merged commit 479fc62 into speechbrain:develop Oct 3, 2022
@z-wony z-wony deleted the fix-batch-eval branch October 3, 2022 14:48
z-wony commented Oct 3, 2022

> Alright, I just tried this branch, it seems this doesn't cause anything else to break. We can merge.
>
> @z-wony Do you know when this started to break the eval loop? Because originally this wasn't causing any issues. Like do you know which commit started to cause this issue? (Just out of curiosity)

@ycemsubakan Thank you for the review, and sorry, I have no idea about your question.
But some recipes use a separate batch_size configuration for evaluation and set it to 1 in their hparams (distinct from train_batch_size).
Thus, the issue is not raised there.

@mravanelli

mravanelli commented Oct 11, 2022 via email
