gh-74668: Fix encoded unicode in url byte string #93757
Closed
When `urllib.parse.parse_qs` is called with a byte string, an encoding can be provided which is used to decode the byte string. However, when parsing, `parse.py` uses 'ascii' encoding to re-encode the parsed data. This breaks utf-8 encoded URLs received as byte strings.

This change uses the encoding passed to `parse_qs` to re-encode the parsed data.

Caveat
This is probably not the correct solution, but it gets us closer to a working implementation that, in the worst case, can be dictated by the caller of `parse_qs`.

My understanding of the problem is as follows. `parse_qs` detects that `qs` is a byte string and decodes it according to the `encoding` parameter. After parsing the decoded input, it re-encodes it (because it detected the input was a byte string), but instead of using the value of the `encoding` parameter, it uses 'ascii'. The decoding and encoding thus use different codecs. This PR fixes that by using the `encoding` parameter value to re-encode the parsed data.

However, this is not a complete solution.
The problem is that there are, in essence, two encodings involved: one used to encode/decode the byte string, and another used to encode/decode the URL (which is utf-8 encoded).
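The decode, parse, re-encode round trip described above can be modelled roughly as follows. This is only a sketch of the idea, not the code in `parse.py`: `parse_qs_bytes` is a hypothetical helper, and it assumes a single `encoding` is used on both sides, which is what this PR proposes.

```python
from urllib.parse import parse_qs


def parse_qs_bytes(qs: bytes, encoding: str = "utf-8") -> dict:
    """Hypothetical helper: decode, parse as str, re-encode with the same codec."""
    # Step 1: decode the byte string with the caller-supplied encoding.
    text = qs.decode(encoding)
    # Step 2: parse the str form; percent-escapes are decoded with `encoding`.
    parsed = parse_qs(text, encoding=encoding)
    # Step 3: re-encode keys and values with the same encoding (not 'ascii').
    return {
        key.encode(encoding): [value.encode(encoding) for value in values]
        for key, values in parsed.items()
    }
```

Because the same codec is used for step 1 and step 3, any character produced by percent-decoding in step 2 can be re-encoded as long as `encoding` can represent it.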
Take `b"a=a%E2%80%99b"` as an example. This is a valid `ascii` encoded byte string and can be decoded with an `ascii` decoder. However, after decoding and parsing, it will not produce valid ascii: parsing `a%E2%80%99b` produces `a’b`, which cannot be `ascii` encoded.

A possible solution would be to pass a `reencoding` parameter, but since it's unlikely callers will have different `encoding` and `reencoding` parameters, this PR opts for reusing the `encoding` parameter.
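The mismatch in the `b"a=a%E2%80%99b"` example can be reproduced without going through `parse_qs` at all. The snippet below is a minimal illustration using only `bytes.decode`, `urllib.parse.unquote` and `str.encode`; the variable names are chosen for this example and do not come from `parse.py`.

```python
from urllib.parse import unquote

raw = b"a=a%E2%80%99b"

# The byte string itself is pure ASCII, so an 'ascii' decode succeeds.
text = raw.decode("ascii")
key, _, value = text.partition("=")

# Percent-decoding interprets the escaped bytes as UTF-8 and yields U+2019.
decoded = unquote(value, encoding="utf-8")

# Re-encoding the parsed value as 'ascii' is the step that fails.
try:
    decoded.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Re-encoding with the codec used for the percent-escapes works.
reencoded = decoded.encode("utf-8")
```

Here the 'ascii' re-encode raises `UnicodeEncodeError`, while the utf-8 re-encode gives back the original three escaped bytes.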