Conversation

@rogiervd
Contributor

@rogiervd rogiervd commented Oct 28, 2024

What does this PR do?

This PR makes calls to open use UTF-8 encoding.

The current behaviour is that the encoding depends on the local environment. Under Linux, if the environment variable LC_ALL=C.UTF-8 is set, files are read as UTF-8. However, we have found that in practice, behaviour varies from user to user or machine to machine, and transcriptions with non-ASCII characters get corrupted. This PR makes the behaviour deterministic.

When opening CSV files, this also adds newline="", as required by https://docs.python.org/3/library/csv.html#csv.reader.
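In concrete terms, the pattern the PR moves to looks like this (a minimal sketch with a hypothetical file name, not code from the PR):

```python
import os
import tempfile

def roundtrip_utf8(text):
    """Write and re-read text with an explicit UTF-8 encoding."""
    path = os.path.join(tempfile.mkdtemp(), "transcript.txt")
    # An explicit encoding= makes the behaviour deterministic, regardless
    # of LC_ALL or the platform default (e.g. cp1252 on many Windows setups).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

With encoding="utf-8" passed explicitly, non-ASCII text survives the round trip on any machine, whatever the locale is set to.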

Change after review: the PR now also adds a check for the encoding to the pre-commit checks.

Breaking changes

For users/machines that have LC_ALL=C.UTF-8 set, nothing changes. For users/machines that have LC_ALL set differently, the behaviour changes to the correct one.

Remarks

  • I believe I have changed all relevant calls to open but it is possible I have missed some. This highlights that it might have been better to have all parsing, including calls to open, in one place instead of scattered around.
  • This is a bit of a big PR that touches many files. I'd be happy to split it up into multiple changes per subdirectory or something if that's more useful.
  • I have not been able to run all recipes.

cc @TParcollet for visibility.

@asumagic
Collaborator

asumagic commented Oct 30, 2024

I didn't realize Linux could even be configured to anything other than UTF-8, but this is likely more useful for Windows, which seems to default to cp1252 from what I could find.
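For the record, the platform dependence is easy to observe: on most Python versions, open() without encoding= falls back to the locale's preferred encoding (a small illustrative sketch):

```python
import locale

def default_open_encoding():
    # This is the encoding open() historically falls back to when no
    # encoding= is passed: UTF-8 where LC_ALL=C.UTF-8 is exported,
    # often cp1252 on Windows. Hence the machine-to-machine variation.
    return locale.getpreferredencoding(False)
```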

This is a bit of a big PR that touches many files. I'd be happy to split it up into multiple changes per subdirectory or something if that's more useful.

It shouldn't be necessary. I doubt this change can break anything (barring typos, and unless it was already broken).

I think the linter we use can detect open calls with a missing encoding; maybe we should look into making that a CI error down the line.

@asumagic
Collaborator

Although the CRLF thing wrt. CSV file reading is interesting... Didn't know about that gotcha.

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
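The gotcha is easy to demonstrate with a round trip of a field containing an embedded newline (an illustrative sketch, not code from the PR):

```python
import csv
import os
import tempfile

def csv_roundtrip(field):
    """Round-trip one CSV field, including any embedded newlines."""
    path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    # newline="" delegates newline handling to the csv module, so quoted
    # embedded newlines survive intact and no stray \r is added on
    # platforms that write \r\n line endings.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow(["id", field])
    with open(path, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f))[0][1]
```

Without newline="" on the write side, Windows would turn the \n inside the quoted field into \r\n, and the read-back value would gain a spurious \r.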

@rogiervd
Contributor Author

I think the linter we use can detect open calls with a missing encoding; maybe we should look into making that a CI error down the line.

That would be best. I'm happy to try and switch this on, but I don't see it. Where can I do this?

@asumagic
Collaborator

I remember encountering it while running linting tools locally, but upon closer inspection, I can only find it being implemented as part of a flake8 plugin, flake8-encodings. I didn't install it manually and it doesn't seem to be used in SpeechBrain, so I don't know what would have pulled it.

If I'm reading this right this might just be a matter of adding that package to the lint-requirements.txt?

TBH it wouldn't be a blocker for this PR but it would be better to have down the line.

@asumagic
Collaborator

asumagic commented Oct 30, 2024

Plus, potentially adding ENC to the select = list in .flake8.
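That would look roughly like this in .flake8 (a hypothetical snippet; the codes alongside ENC are placeholders for whatever select list the repo already has):

```ini
[flake8]
# Keep the existing select entries and append ENC so that
# flake8-encodings' checks (open without encoding=) are enabled.
select = E,W,F,ENC
```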

@rogiervd
Contributor Author

Let me try this, first in a separate PR. It would be best to solve this once and for all, if I can.

@rogiervd
Contributor Author

From local experimentation, it looks like installing flake8-encodings should be all that's necessary. c3d8b3b switches this on and finds 64 (!) instances of open that I'd missed.

@rogiervd
Contributor Author

rogiervd commented Oct 30, 2024

Is linting supposed to be run in CI? Because I don’t see where, and my commit attempting to switch on the detection of encodings hasn’t triggered a CI failure in the way I hoped it would.

@asumagic
Collaborator

Seems like the pre-commit workflow (pre-commit.yaml) does not actually honor lint-requirements.txt, unlike the test workflow (pythonapp.yml).

(Noting that requirements.txt does include lint-requirements.txt.)

Seems like an oversight, will address this in a separate PR quickly.

@asumagic
Collaborator

Actually it might just be simpler to include the dependency in .pre-commit-config.yaml under flake8. pre-commit kind of does its own dependency management so it might be better to avoid pulling all SB dependencies there.

Then, it would probably make sense to remove it from lint-requirements.txt.
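That pre-commit approach would look something like this in .pre-commit-config.yaml (a hypothetical sketch; the rev is a placeholder for whatever the repo already pins):

```yaml
repos:
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0  # placeholder; keep the rev the repo already pins
    hooks:
      - id: flake8
        # pre-commit installs hooks in isolated environments, so the
        # plugin must be declared here rather than in lint-requirements.txt.
        additional_dependencies: [flake8-encodings]
```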

@asumagic
Collaborator

Seems like this works as expected now.

@rogiervd
Contributor Author

Great. The linting (e34eab0) is now causing the CI to fail as expected.

@asumagic asumagic added this to the v1.1.0 milestone Oct 30, 2024
Rogier van Dalen added 5 commits October 30, 2024 16:08
They now use UTF-8 encoding.
This is the same as the old behaviour if (on Linux) environment variable LC_ALL=C.UTF-8 was set.
However, often this environment variable was not set and behaviour changed from environment to environment.

For CSV files, this also uses newline="".
These had been forgotten, but the added flake8 test found them.
@TParcollet
Collaborator

TParcollet commented Nov 4, 2024

@asumagic I also don't see how this could break anything serious from the files touched. Feel free to merge when you want. As this is touching the CI, maybe @pplantinga wants to have a very quick look at the new requirements.

@TParcollet TParcollet added the correctness Functionality not objectively broken, but may be surprising or wrong e.g. regarding literature label Nov 4, 2024
Collaborator

@pplantinga pplantinga left a comment


Good cleanup of the repo, making it more usable on Windows! LGTM

@pplantinga pplantinga merged commit c4a4243 into speechbrain:develop Nov 4, 2024