Avoid loading checkpoint parameters on the target device #1743
Conversation
When `map_location=device`, the checkpoint parameters are first loaded on the CPU (see the docs of `torch.load`) and then moved to the target device. This means that, before the checkpoint parameters are copied into the model, two full independent copies of the model parameters exist on the target device: the first is the set of parameters already in the model, the second is the set of parameters being recovered from the checkpoint. If the model is on the CPU, this memory overhead cannot be avoided. However, if the model is on a device other than the CPU (e.g. "cuda"), we can avoid moving the loaded checkpoint parameters to that device, and hence avoid wasting device memory and potentially hitting an out-of-memory error with huge models. Since `obj.load_state_dict` copies the loaded parameters into the model/optimizer/scheduler in place, one by one, it automatically takes care of moving them to the model's device, even if they are on the CPU.
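A minimal sketch of the idea (the module and checkpoint path below are placeholders for illustration, not SpeechBrain's actual code):

```python
import torch
import torch.nn as nn

# Placeholder model and checkpoint path, just to illustrate the memory pattern.
model = nn.Linear(1024, 1024).to("cuda")  # parameters already live on the GPU

# Wasteful: materializes a second full copy of the parameters on the GPU
# before load_state_dict even runs.
# state_dict = torch.load("model.ckpt", map_location="cuda")

# Preferred: keep the loaded tensors on the CPU; load_state_dict then copies
# each one in place into the existing GPU parameters, moving data as needed.
state_dict = torch.load("model.ckpt", map_location="cpu")
model.load_state_dict(state_dict)
```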
|
@Gastron could you please take a look? |
|
This is a good idea, I think we should go through with this change. Like noted, the `torch.load` documentation suggests this, for example. However, the `device` argument is still used elsewhere, e.g. speechbrain/core.py line 827 (at 46be2d1).
@lucadellalib would you be willing to go over the codebase and remove the unnecessary `device` arguments? |
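To make the scope concrete, here is a hypothetical before/after sketch of such a loading hook (function and argument names are illustrative, not SpeechBrain's actual API):

```python
import torch

# Before (illustrative): the hook takes a device and materializes the whole
# checkpoint there, duplicating the parameters on that device.
def load_checkpoint_old(obj, path, device):
    state_dict = torch.load(path, map_location=device)
    obj.load_state_dict(state_dict)

# After (illustrative): always load on the CPU; load_state_dict copies each
# tensor in place into obj's parameters on whatever device they already live.
def load_checkpoint_new(obj, path):
    state_dict = torch.load(path, map_location="cpu")
    obj.load_state_dict(state_dict)
```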
|
@Gastron I think it would be better to leave it as it is to minimize the number of changes (easier to debug in case of problems, lower risk of breaking other components, etc.) and especially for backward compatibility. Some users might be using those functions and suddenly find their code not working anymore because they are passing what is now an unexpected argument (`device`). |
|
What about adding the change in the unstable branch? This way it will be available for the next major version (where some interface changes are planned).
|
|
I think it would be better to take the whole step of removing the `device` argument. However, I agree with Mirco that this is a breaking change, meaning we should only make it in the next major version. Of course we understand if @lucadellalib doesn't want to take the time to remove the argument from so many places; how do you feel, Luca? Perhaps someone else should contribute that part? |
|
Hi Luca,
feel free to move on with it and open a PR on the "unstable" branch (the one for the new version of SpeechBrain). Note that we will soon merge the recipe testing PR that allows us to test every single recipe.
Best,
Mirco
|
|
@Gastron @mravanelli I removed the unnecessary `device` arguments. |
|
Great work! I browsed all the changes; looks good to me. This touches a lot of recipes, so is it ok to merge this now, before the recipe testing PR, @anautsch? |
|
lgtm! @Gastron we'll see what breaks & fix it - but this PR's changes seem complementary. Thank you, @lucadellalib ! |