-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Make calls to "open" deterministic (code from Samsung AI Center Cambridge) #2734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I didn't realize Linux could even be configured to anything else than UTF-8, but this is likely more useful for Windows, which seems to default to cp1252 from what I could find.
Shouldn't be. I'm doubtful this change can break anything (save for typos, and unless it was already broken). I think the linter we use is able to detect the |
|
Although the CRLF thing wrt. CSV file reading is interesting... Didn't know about that gotcha.
|
That would be best. I'm happy to try and switch this on, but I don't see it. Where can I do this? |
|
I remember encountering it while running linting tools locally, but upon closer inspection, I can only find it being implemented as part of a flake8 plugin, If I'm reading this right this might just be a matter of adding that package to the TBH it wouldn't be a blocker for this PR but it would be better to have down the line. |
|
+ potentially in |
|
Let me try this, first in a separate PR. It would be best to solve this once and for all, if I can. |
|
From local experimentation, it looks like installing |
|
Is linting supposed to be run in CI? Because I don’t see where, and my commit attempting to switch on the detection of encodings hasn’t triggered a CI failure in the way I hoped it would. |
|
Seems like the pre-commit workflow ( (Noting that Seems like an oversight, will address this in a separate PR quickly. |
|
Actually it might just be simpler to include the dependency in Then, it would probably make sense to remove it from |
|
Seems like this works as expected now. |
|
Great. The linting (e34eab0) is now causing the CI to fail as expected. |
They now use UTF-8 encoding. This is the same as the old behaviour if (on Linux) environment variable LC_ALL=C.UTF-8 was set. However, often this environment variable was not set and behaviour changed from environment to environment. For CSV files, this also uses newline="".
These had been forgotten, but the added flake8 test found them.
b7f6fd9 to
cfdc9aa
Compare
|
@asumagic I also don't see how this could break anything serious from the files touched. Feel free to merge when you want. As this is touching the CI, maybe @pplantinga wants to have a very quick look at the new requirements. |
pplantinga
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good cleanup of the repo, making it more usable on Windows! LGTM
What does this PR do?
This PR makes calls to
openuse UTF-8 encoding.The current behaviour is that the encoding depends on the local environment. Under Linux, if environment variable
LC_ALL=C.UTF-8is set, files are read as UTF-8. However, we have found that in practice, behaviour changes from user to user or machine to machine, and transcriptions with non-ASCII characters get messed up. This PR makes the behaviour deterministic.When opening files for
csv, this also addnewline="", as required by https://docs.python.org/3/library/csv.html#csv.reader.Change after review: the PR now also adds a check for the encoding to the pre-commit checks.
Breaking changes
For users/machines that have
LC_ALL=C.UTF-8set, nothing changes. For users/machines that haveLC_ALLset differently, the behaviour changes to the correct one.Remarks
openbut it is possible I have missed some. This highlights that it might have been better to have all parsing, including calls toopen, in one place instead of scattered around.cc @TParcollet for visibility.