
Conversation

@nalimilan
Member

@nalimilan nalimilan commented Apr 6, 2018

Strings are not guaranteed to contain valid UTF-8, and the PCRE documentation says that the behavior is undefined in that case. See the Discourse post.

This is WIP for two reasons:

  • match(r"a", "\U11000") is supposed to fail according to the docs (look for PCRE2_ERROR_UTF8_ERR14), but it doesn't. Am I missing something? EDIT: turns out this was a typo in the PCRE docs, one zero was missing.
  • For an obscure reason, r"\xff" does not throw an error even though Regex("\xff") and @r_str("\xff") do. This means that the UTF-8 validity check does not work in the most common cases.

Given that regexes are often written as literals, it could make sense for performance to check them ourselves at compile time in r_str, and keep track of that in the object to disable the PCRE check. But that optimization isn't breaking, so it can be applied later.
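
The compile-time check described here could look roughly like the following. This is a hypothetical sketch: the macro name and the validation strategy are illustrative, not Base's actual implementation.

```julia
# Hypothetical sketch of validating a regex literal at macro-expansion time,
# so the runtime PCRE check could be skipped for literals. Because the macro
# returns the Regex object itself, the pattern is validated and compiled
# exactly once, when the literal is parsed.
macro checked_r_str(pattern, flags...)
    # `pattern` is the raw literal string, available at parse time
    isvalid(String, pattern) ||
        throw(ArgumentError("regex literal is not valid UTF-8"))
    Regex(pattern, flags...)
end

m = match(checked_r"a.c", "xabc")   # matches "abc"
```
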

@nalimilan nalimilan added the unicode Related to unicode characters and encodings label Apr 6, 2018
Strings are not guaranteed to contain valid UTF-8, and PCRE documentation says
that the behavior is undefined in that case.
@StefanKarpinski
Member

Insisting that regular expressions be valid UTF-8 seems ok to me, but I don't think we should be throwing errors when the string contains invalid UTF-8. That violates the principle that bad data should not cause errors; only a bad program should cause errors. For example, match(r"a", "\xe2\x88") should simply find no match: the string does not contain the pattern r"a" anywhere.

@nalimilan
Member Author

In principle I agree, but PCRE doesn't seem to support it. It's clearly described as undefined behavior which could crash the program.

@StefanKarpinski
Member

StefanKarpinski commented Apr 6, 2018

Yes, but in practice, does it actually do so with the PCRE.NO_UTF_CHECK flag set?

@nalimilan
Member Author

Hard to tell. We'd need to run stress tests, as the simplest cases may not crash while more complex ones could. We could also inspect the code. However, even if it doesn't crash now, it could do so in the future, yet changing the behavior then would be breaking. If we want to keep the check disabled, at a minimum we should ask the PCRE developers about it.

@StefanKarpinski
Member

We can always implement our own regex engine that uses LLVM to generate fast matchers. This would be far easier in Julia than in most systems. It's actually a project I've long wanted to try out 😸

@bicycle1885
Member

bicycle1885 commented Apr 7, 2018

Setting NO_UTF_CHECK without validating the encoding of a subject would be too dangerous to have in Base, since the docs of PCRE2 clearly state:

When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string as a subject, or an invalid value of startoffset, is undefined. Your program may crash or loop indefinitely.

In my quick benchmarks, the performance degradation from the UTF-8 check is roughly 5%. I think that's a reasonable cost to pay for safety.

@nalimilan
Member Author

@bicycle1885 I don't think the question is whether it's too costly: it's rather that we would like to allow working with invalid strings without ever throwing an error, and unfortunately PCRE doesn't seem to share that objective at all.

I have no idea how hard it would be to implement PCRE in Julia. If that's the long-term goal, in the short term we should check whether the current PCRE codebase is robust in the presence of invalid UTF-8 despite the claims about undefined behavior. Asking its developers sounds like the best way to do that (I can write a message).

@StefanKarpinski
Member

StefanKarpinski commented Apr 8, 2018

Let's put aside, for the moment, what PCRE actually does, and consider what the ideal behavior would be given our general approach to strings.

Any pattern of characters can be converted to a pattern of bytes, and if the pattern is valid UTF-8, then bytes that match a character are guaranteed to match a valid UTF-8 subsequence.

Wildcard patterns like . are a bit subtle: should . match only valid characters, or invalid ones as well? My inclination is that it should match anything. This actually makes lowering it to a byte pattern very simple, since it becomes . as a byte pattern as well; otherwise you'd have to make it match only byte sequences that form a valid UTF-8 character, which can be done with zero-width negative lookaheads, but it's tricky.

On the other hand, since UTF-8 is self-synchronizing, there's no actual problem with letting . match arbitrary bytes: . matching invalid data doesn't affect the matching of parts of the pattern that are limited to particular characters and must be valid. Moreover, I suspect that when people write .* they don't actually mean "match as many valid UTF-8 characters as you can", they actually mean "match whatever—I don't care what it is".
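
The self-synchronization point can be illustrated at the byte level. This is a plain byte search in Julia standing in for the byte-pattern lowering, not PCRE itself:

```julia
# Because UTF-8 is self-synchronizing, the byte sequence of a valid
# character cannot start in the middle of another character's encoding,
# so a valid needle has an unambiguous byte-level match even when the
# haystack contains stray invalid bytes.
needle   = Vector{UInt8}(codeunits("é"))        # 0xc3 0xa9
haystack = UInt8[0x61, 0xff, 0xc3, 0xa9, 0x62]  # 'a', invalid 0xff, 'é', 'b'
findfirst(needle, haystack)                     # 3:4
```
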

On the other hand, if your regex contains invalid UTF-8, that's a much bigger problem. A partial character in the pattern could match the same partial byte sequence, or those bytes could occur inside a complete character, which means that matching as a sequence of characters and matching as a sequence of bytes would behave differently. This could still be handled by converting to a byte sequence with zero-width negative lookaheads to guarantee that the two levels match in the same way, but that seems unnecessary. It's much simpler to just disallow invalid UTF-8 in patterns.

Returning to what PCRE actually does, it seems like as long as we disallow invalid UTF-8 in regular expressions, the library most likely does exactly what we want if we tell it not to check for UTF-8 validity of the input string. The reason I say that is that the behavior we want is exactly the behavior you would get if you converted a character pattern to a byte pattern as straightforwardly as possible. We should verify that this is the case, but it seems based on all the data we have that it probably is.

@StefanKarpinski
Member

In short, I think we should require regular expressions to be valid UTF-8 and leave the PCRE.NO_UTF_CHECK flag set as we have been. There doesn't seem to be much reason for PCRE not to support this mode correctly, so if we do find any bugs, we can probably make a case to fix them upstream.

@nalimilan
Member Author

Yes, that makes sense to me, assuming that PCRE works correctly with invalid haystacks. But I think we'd better get in touch with PCRE developers now rather than waiting until we find bugs (which may or may not be easily fixable). For example it's possible/likely that for performance reasons PCRE relies on assumptions regarding the values of codeunits/codepoints when doing table lookups. In that case invalid UTF-8 could cause crashes and I'm afraid they wouldn't be interested in fixing it.

@StefanKarpinski
Member

Yes, we should probably get in touch with them.

@samoconnor
Contributor

when people write .* they don't actually mean "match as many valid UTF-8 characters as you can", they actually mean "match whatever—I don't care what it is".

👍

disallow invalid UTF-8 in patterns

👍

Some thoughts:

  • It might be useful to have the option to compile a Regex in ASCII mode (i.e. without the PCRE.UTF option). This would support the use case where the input might be invalid UTF-8 or contain arbitrary binary sequences, but the expression only cares about finding sequences of plain ASCII characters (e.g. the HTTP parser). It seems safe to assume that PCRE won't hang or crash on arbitrary input when it is not in UTF mode.

  • It might be useful to build a test that feeds random haystack bytes into a set of valid expressions to look for crash/hang cases.

  • "may crash or loop indefinitely" is fairly broad disclaimer language. It would be useful to know what this means in detail. May read random address? May write to random address? May return an error code? May loop forever if the expression contains X, Y or Z?

  • perhaps the end game could be developing a PR for PCRE to implement a PCRE2_LAZY_UTF_CHECK option that allows invalid bytes in the haystack.
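
The ASCII-mode idea from the first bullet could be sketched with the internal Regex constructor that takes raw PCRE option bits. Note this three-argument constructor is an internal, version-dependent Base detail, assumed here for illustration, not public API:

```julia
# Hypothetical sketch: compile a pattern with zero compile options, i.e.
# deliberately without Base.PCRE.UTF, so matching happens byte-by-byte.
# The three-argument Regex constructor is internal and may change.
r = Regex("GET /[a-z]+", UInt32(0), UInt32(0))
occursin(r, "GET /index HTTP/1.1")   # plain ASCII needle, byte-level match
```
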

@StefanKarpinski
Member

perhaps the end game could be developing a PR for PCRE to implement a PCRE2_LAZY_UTF_CHECK option that allows invalid bytes in the haystack.

You don't really want to do lazy UTF-8 validation, you want to not assume UTF-8 validity in order for the PCRE code to work correctly and find valid UTF-8 patterns within data which may not be valid.

@nalimilan
Member Author

I've sent a message to the PCRE development list: https://lists.exim.org/lurker/message/20180412.091038.080cf6f2.en.html

BTW, I've found simple cases where matches fail for invalid haystacks. So I think for now we'd better throw errors, until we are able to ensure correct behavior:

julia> match(r".*", "a\xffb")
ERROR: BoundsError: attempt to access "a\xffb"
  at index [8]
Stacktrace:
 [1] prevind(::String, ::Int64, ::Int64) at ./strings/basic.jl:428
 [2] prevind at ./strings/basic.jl:423 [inlined]
 [3] prevind at ./strings/basic.jl:422 [inlined]
 [4] match(::Regex, ::String, ::Int64, ::UInt32) at ./regex.jl:193
 [5] match at ./regex.jl:186 [inlined]
 [6] match(::Regex, ::String) at ./regex.jl:201
 [7] top-level scope

julia> match(r"a.*b", "a\xffb")
# nothing

@samoconnor
Contributor

attempt to access "a\xffb" at index [8]

Do you read this as saying that PCRE read past the end of the input array? (returning 7 in ovec[2])

@nalimilan
Member Author

Yes, I guess it's something like that. I haven't investigated what exactly is going on, though.

@StefanKarpinski
Member

For an obscure reason, r"\xff" does not throw an error even though Regex("\xff") and @r_str("\xff") do.

julia> r"\xff".pattern
"\\xff"

julia> Regex("\xff").pattern
"\xff"

julia> findfirst(r"\xff", "hi \uff there")
4:4

julia> findfirst(Regex("\xff"), "hi \uff there")

The r"\xff" form relies on PCRE interpreting the \xff escape, whereas the other two rely on Julia interpreting it. PCRE interprets it as a valid escape for U+FF, whereas Julia requires writing that as \uff and interprets \xff as encoding a 0xff byte in the middle of a string.

samoconnor added a commit to JuliaWeb/HTTP.jl that referenced this pull request Apr 16, 2018
@nalimilan nalimilan added the triage This should be discussed on a triage call label Apr 25, 2018
@JeffBezanson
Member

Proposal:

  • Throw an error in the Regex constructor if the pattern isn't valid UTF-8
  • Turn off the NO_UTF_CHECK flag when matching.
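
A minimal sketch of that proposal (an illustrative wrapper, not the PR's actual diff):

```julia
# Sketch: reject invalid patterns at construction time; subjects are then
# checked by PCRE itself because NO_UTF_CHECK is not passed at match time.
function safe_regex(pattern::String)
    isvalid(String, pattern) ||
        throw(ArgumentError("regex pattern is not valid UTF-8"))
    Regex(pattern)
end

safe_regex("a.*b")    # ok: valid UTF-8 pattern
safe_regex("\xff")    # throws ArgumentError: 0xff byte is invalid UTF-8
```
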

@JeffBezanson
Member

I guess that's basically the same as this PR. Probably doesn't matter much whether we pass the flag at regex compile time, or do the validity check ourselves.

@Keno
Member

Keno commented May 17, 2018

Possible future enhancement: make isvalid on a string cache its result in a bit of the string, but continue considering strings egal independent of that bit.

@JeffBezanson JeffBezanson added this to the 1.0 milestone May 17, 2018
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label May 17, 2018
@mbauman
Member

mbauman commented May 17, 2018

I think this PR is ready to go as it is, then. With regards to Milan's second bullet-point, ("For an obscure reason, r"\xff" does not throw an error…"), that's simply because r"\xff" is constructing Regex("\\xff") — so it is working just fine. I also sure would hope that PCRE only checks the needle validity once (at compile time), so that shouldn't be an issue, either.

@nalimilan
Member Author

Yes, I was confused about how \xx was interpreted. It may still make sense to check that regexes don't contain \xff since the Julia equivalent is actually \uff, but that's for a separate PR (and is much less breaking).

@nalimilan nalimilan changed the title WIP: Enable PCRE UTF-8 validity string checks Enable PCRE UTF-8 validity string checks May 17, 2018
@nalimilan nalimilan closed this May 17, 2018
@nalimilan nalimilan reopened this May 17, 2018
@JeffBezanson JeffBezanson merged commit 627173b into master May 18, 2018
@JeffBezanson JeffBezanson deleted the nl/pcre branch May 18, 2018 14:03
haampie pushed a commit to haampie/julia that referenced this pull request May 18, 2018
Strings are not guaranteed to contain valid UTF-8, and PCRE documentation says
that the behavior is undefined in that case.
@KristofferC
Member

This caused the following performance regression:

julia> @btime findnext(r,seq,1)
  60.168 μs (1 allocation: 32 bytes)

julia> r.match_options = Base.PCRE.NO_UTF_CHECK
0x40000000

julia> @btime findnext(r,seq,1)
  108.639 ns (1 allocation: 32 bytes)
1:22

Ref: #27030 (comment)

@KristofferC
Member

This also caused the following line in BaseBenchmarks to go from 0.5 seconds to many minutes: https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/problem/SpellCheck.jl#L34

(putting the option back and it is ok again)

julia> w.regex.match_options = Base.PCRE.NO_UTF_CHECK;

julia> @time train(w);
  0.540739 seconds (4.45 M allocations: 254.833 MiB, 22.07% gc time)

@aviks
Member

aviks commented Mar 8, 2022

For anyone coming across this old issue, this was eventually fixed by #39524

@KristofferC
Member

That issue number looks wrong.

@nalimilan
Member Author

I guess he meant #39524.

@aviks
Member

aviks commented Mar 9, 2022

Yeah, sorry, edited.
