Add a utf8() mode that allows byte/UTF-8 strings as input & output. #78
Open
FGasper wants to merge 1 commit into makamaka:master from
JSON::PP already offers several options that accommodate applications’
nonstandard needs. For example, latin1() caters to applications that
use Latin-1 encoding rather than UTF-8, even though doing so violates
the JSON specification.
Some nontrivial Perl applications forgo character decoding. Their
authors/maintainers may not know “perlunitut”’s recommended workflow,
or the application may simply not care about Unicode. Either way, in
such applications it’s ideal for a JSON encoder & decoder to forgo
the usual UTF-8 decode/encode steps.
utf8(0) almost achieves this. It falls over, though, if the JSON
document contains a Unicode character escape (e.g., "\u00e9"), which
JSON::PP decodes as Perl "\xe9". This causes an inconsistency in the
decode logic: "é" in UTF-8 will yield a different result from "\u00e9".
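
The inconsistency can be reproduced with a short script. This is only a sketch of the behavior described above, run against stock JSON::PP; the variable names are illustrative, and allow_nonref is enabled only so that a bare JSON string can be decoded on older releases.

```perl
use strict;
use warnings;
use JSON::PP;

# utf8(0): input is treated as characters and output is characters.
my $json = JSON::PP->new->utf8(0)->allow_nonref;

# "é" supplied as two undecoded UTF-8 bytes inside a JSON string.
my $from_literal = $json->decode(qq<"\xc3\xa9">);

# The same character supplied as a JSON Unicode escape.
my $from_escape = $json->decode(q<"\u00e9">);

# %vd prints each character's ordinal, joined by dots.
printf "%vd\n", $from_literal;   # 195.169
printf "%vd\n", $from_escape;    # 233
```

The literal form passes through as the two code points 0xC3 0xA9, while the escaped form becomes the single code point 0xE9, so an application that treats its strings as bytes sees two different values for the same JSON text.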
Ordinarily it works to do encode_utf8( JSON::PP->new->utf8->decode(..) ),
but that falls over if applications need to allow non-UTF-8 sequences
in JSON inputs.
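
A minimal sketch of that workaround and of where it stops working, assuming stock JSON::PP and Encode. The sample inputs and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();
use JSON::PP;

# utf8(1): bytes in, characters out.
my $json = JSON::PP->new->utf8->allow_nonref;

# Valid UTF-8 input: re-encoding the result gives the same bytes
# whether "é" was written literally or as an escape.
my $literal = Encode::encode_utf8( $json->decode(qq<"\xc3\xa9">) );
my $escaped = Encode::encode_utf8( $json->decode(q<"\u00e9">) );

printf "%vd\n", $literal;   # 195.169
printf "%vd\n", $escaped;   # 195.169

# A lone 0xff byte is not valid UTF-8, so decode() is expected to
# croak before the workaround can be applied.
my $ok = eval { $json->decode(qq<"\xff\xc3\xa9">); 1 };
print $ok ? "decoded\n" : "decode failed\n";
```

The round trip is consistent for well-formed UTF-8, but it offers no path for inputs that carry arbitrary bytes, which is the gap the new mode is meant to fill.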
In short, a need exists for this Perl string:
qq<"\xff\xc3\xa9\xc3\xa9\\u00e9">
… to decode to "\xff\xc3\xa9\xc3\xa9\xc3\xa9".
This changeset adds a solution to this problem by changing utf8() from
a simple flag to an enum: the existing chars-in-chars-out (0) and
bytes-in-chars-out (1) options, plus a new bytes-in-bytes-out option.
Named constants are added to avoid “magic numbers”.
Collaborator
I understand your point but if the change breaks compatibility with JSON::XS, it's unacceptable because JSON::PP is basically a fallback module of it. I am also reluctant to add a new mode if it's for JSON::PP only. Could you discuss this with the JSON::XS maintainer first?
MAINTAINER: See what you think of this. I’ll add documentation updates if you’re amenable to the change itself.