Conversation

@drizt
Contributor

@drizt drizt commented Jan 5, 2026

RFC 8259 does not require strings to be valid Unicode strings. In practice it allows them to contain any \uXXXX escape, including unpaired surrogates. This makes it possible to keep arbitrary binary data in JSON strings. This commit removes the restriction that strings must be valid UTF-8.

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.

WTF-8 strings are not compatible with the current tests. The tests use Python code that works only with valid UTF-8 strings. The test system needs to be upgraded, or replaced with something that has full JSON support.
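As a rough illustration of the test-side problem, here is a minimal Python sketch. It is not taken from this repository's test suite; the variable names are hypothetical. It shows that the standard `json` module parses a lone surrogate escape, that strict UTF-8 encoding then rejects the result, and that the `surrogatepass` error handler is one possible way to get the bytes through.

```python
import json

# A JSON document containing a lone high surrogate escape.
doc = '{"lone high surrogate": "\\uD800"}'

# The standard-library parser accepts the unpaired surrogate.
value = json.loads(doc)["lone high surrogate"]

# Strict UTF-8 encoding refuses to encode an unpaired surrogate.
try:
    value.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler emits the 3-byte form instead,
# which matches the WTF-8 encoding of a lone surrogate.
wobbly = value.encode("utf-8", "surrogatepass")
roundtrip = wobbly.decode("utf-8", "surrogatepass")
```

Note that `surrogatepass` also encodes each half of a surrogate pair separately if both halves are present in the `str` as individual code points, which WTF-8 does not allow, so a test harness relying on it would need to combine pairs first.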

@drizt
Contributor Author

drizt commented Jan 5, 2026

This commit allows strings like the following to be parsed:

{
  "valid surrogate pair (😀 U+1F600)": "\uD83D\uDE00",
  "lone high surrogate": "\uD800",
  "lone low surrogate": "\uDC00",
  "high surrogate not followed by low surrogate": "\uD834\u0061",
  "low surrogate not preceded by high surrogate": "\u0061\uDD1E",
  "reversed surrogate order (low then high)": "\uDC00\uD800",
  "two high surrogates in a row": "\uD800\uD801",
  "two low surrogates in a row": "\uDC00\uDC01",
  "surrogate pair split by space": "\uD83D\u0020\uDE00",
  "surrogate halves separated by text": "\uD83Dtest\uDE00",
  "high surrogate followed by another escape": "\uD83D\u000A",
  "high surrogate at end of string": "ABC\uD800"
}
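To make the expected byte output for these cases concrete, here is a minimal Python sketch of WTF-8 encoding from a sequence of UTF-16 code units, such as those produced by decoding `\uXXXX` escapes. It is an illustration under stated assumptions, not the code from this commit; `encode_code_point` and `wtf8_encode` are hypothetical names.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point with the UTF-8 bit layout, surrogates included."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def wtf8_encode(units: list[int]) -> bytes:
    """Encode UTF-16 code units as WTF-8.

    A high surrogate immediately followed by a low surrogate is combined
    into one supplementary code point (4 bytes). Any other surrogate is
    unpaired and is encoded on its own (3 bytes).
    """
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u
            i += 1
        out += encode_code_point(cp)
    return bytes(out)
```

With this sketch, `wtf8_encode([0xD83D, 0xDE00])` yields the ordinary 4-byte UTF-8 encoding of U+1F600, while `wtf8_encode([0xD800])` yields the 3 bytes `ED A0 80`.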

@drizt
Contributor Author

drizt commented Jan 5, 2026

My commit also fixes #58.

@drizt drizt force-pushed the wtf8 branch 2 times, most recently from 02431a9 to 2165f27 Compare January 5, 2026 15:12
@LB--
Member

LB-- commented Jan 5, 2026

Did you use any form of generative AI while authoring these changes or PRs?

@drizt
Contributor Author

drizt commented Jan 5, 2026

The code was written with the help of ChatGPT. I edited and tested it manually (in my own project). I studied the JSON RFC and the WTF-8 document before applying these changes to my own code. The test was written with ChatGPT.
