Enable PCRE UTF-8 validity string checks #26731
Strings are not guaranteed to contain valid UTF-8, and the PCRE documentation says that the behavior is undefined in that case.

---

Insisting that regular expressions be valid UTF-8 seems OK to me, but I don't think we should be throwing errors when the string contains invalid UTF-8. That violates the principle that bad data should not cause errors; only a bad program should cause errors. For example, …
---

In principle I agree, but PCRE doesn't seem to support it. It's clearly described as undefined behavior, which could crash the program.

---

Yes, but in practice, does it actually do so with the …

---

Hard to tell. We'd need to run stress tests, as the simplest cases may not crash while more complex ones could. We could also inspect the code. However, even if it doesn't crash now, it could do so in the future, yet changing the behavior then would be breaking. If we want to keep the check disabled, at a minimum we should ask the PCRE developers about it.

---

We can always implement our own regex engine that uses LLVM to generate fast matchers. This would be far easier in Julia than in most systems. It's actually a project I've long wanted to try out 😸
---

In my quick benchmarks, the performance degradation from the UTF-8 check is roughly 5%. I think this is a reasonable cost to pay for safety.
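For a rough sense of where such a cost could come from, here is a sketch (not the benchmark referenced above) that times an explicit pre-match validity scan using `isvalid(String, s)`; the haystack and pattern are made up:

```julia
# Sketch: compare a plain regex match against a match preceded by
# an explicit UTF-8 validity scan (not the original benchmark).
s = repeat("some reasonably long ASCII haystack ", 1_000)

plain() = match(r"haystack", s)
checked() = (isvalid(String, s) || error("invalid UTF-8"); match(r"haystack", s))

plain(); checked()  # warm up compilation

t_plain   = @elapsed for _ in 1:100; plain();   end
t_checked = @elapsed for _ in 1:100; checked(); end
println("relative overhead ≈ ", round(t_checked / t_plain - 1, digits = 2))
```

On short haystacks the match itself dominates and the extra scan is nearly free; the ~5% figure above presumably refers to PCRE's internal check rather than an external scan like this.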
---

@bicycle1885 I don't think the question is whether it's too costly: it's rather that we would like to allow working with invalid strings without ever throwing an error, and unfortunately PCRE doesn't seem to share that objective at all. I have no idea how hard it would be to implement PCRE in Julia. If that's the long-term goal, in the short term we should check whether the current PCRE codebase is robust in the presence of invalid UTF-8 despite the claims about undefined behavior. Asking its developers sounds like the best way to do that (I can write a message).
---

Let's put aside for the moment what PCRE actually does and consider what the ideal behavior would be, given our general approach to strings. Any pattern of characters can be converted to a pattern of bytes, and if the pattern is valid UTF-8, then bytes that match a character are guaranteed to match a valid UTF-8 subsequence. Wildcard patterns like …

On the other hand, if your regex contains invalid UTF-8, that's a much bigger problem. A partial character pattern could match the same, or it could match a complete character, which means that the behavior when matching as a sequence of characters versus a sequence of bytes would be different. This could still be handled by converting to a byte sequence with zero-width negative lookaheads to guarantee that the two levels match in the same way, but that seems unnecessary. It's much simpler to just disallow invalid UTF-8 in patterns.

Returning to what PCRE actually does: it seems like, as long as we disallow invalid UTF-8 in regular expressions, the library most likely does exactly what we want if we tell it not to check the UTF-8 validity of the input string. The reason I say that is that the behavior we want is exactly the behavior you would get if you converted a character pattern to a byte pattern as straightforwardly as possible. We should verify that this is the case, but based on all the data we have, it probably is.
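To make the character-to-byte correspondence concrete, here is a small illustration (the particular character and strings are just examples):

```julia
# A valid UTF-8 character is a fixed byte sequence, so a character
# pattern can be translated directly into a byte pattern.
@assert codeunits("é") == UInt8[0xc3, 0xa9]

# Matching the character and matching its raw bytes agree on valid input:
s = "caf\u00e9"
@assert occursin(r"é", s)
@assert occursin(Regex("\xc3\xa9"), s)
```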
---

In short, I think we should require regular expressions to be valid UTF-8 and leave the …

---

Yes, that makes sense to me, assuming that PCRE works correctly with invalid haystacks. But I think we'd better get in touch with the PCRE developers now, rather than waiting until we find bugs (which may or may not be easily fixable). For example, it's possible/likely that for performance reasons PCRE relies on assumptions about the values of code units/code points when doing table lookups. In that case invalid UTF-8 could cause crashes, and I'm afraid they wouldn't be interested in fixing it.

---

Yes, we should probably get in touch with them.
---

Some thoughts:

You don't really want to do lazy UTF-8 validation; you want to not assume UTF-8 validity, so that the PCRE code works correctly and finds valid UTF-8 patterns within data which may not be valid.
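As an illustration of what "finding valid UTF-8 within possibly-invalid data" could mean, here is a hypothetical helper (`valid_runs` is an invented name, not part of Base or this PR):

```julia
# Hypothetical helper: extract the maximal valid-UTF-8 runs from a
# string that may contain invalid bytes.
function valid_runs(s::String)
    runs = String[]
    buf = IOBuffer()
    for c in s            # invalid byte sequences iterate as invalid Chars
        if isvalid(c)
            print(buf, c)
        elseif position(buf) > 0
            push!(runs, String(take!(buf)))
        end
    end
    position(buf) > 0 && push!(runs, String(take!(buf)))
    return runs
end

@assert valid_runs("a\xffb") == ["a", "b"]
```

A byte-level matcher that tolerates invalid input would, in effect, restrict matches to runs like these.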
---

I've sent a message to the PCRE development list: https://lists.exim.org/lurker/message/20180412.091038.080cf6f2.en.html

BTW, I've found simple cases where matches fail for invalid haystacks, so I think for now we'd better throw errors until we are able to ensure correct behavior:

julia> match(r".*", "a\xffb")
ERROR: BoundsError: attempt to access "a\xffb"
  at index [8]
Stacktrace:
 [1] prevind(::String, ::Int64, ::Int64) at ./strings/basic.jl:428
 [2] prevind at ./strings/basic.jl:423 [inlined]
 [3] prevind at ./strings/basic.jl:422 [inlined]
 [4] match(::Regex, ::String, ::Int64, ::UInt32) at ./regex.jl:193
 [5] match at ./regex.jl:186 [inlined]
 [6] match(::Regex, ::String) at ./regex.jl:201
 [7] top-level scope

julia> match(r"a.*b", "a\xffb")
# nothing
Do you read this as saying that PCRE read past the end of the input array? (returning …)

---

Yes, I guess it's something like that. I haven't investigated what exactly is going on, though.
---

julia> r"\xff".pattern
"\\xff"

julia> Regex("\xff").pattern
"\xff"

julia> findfirst(r"\xff", "hi \uff there")
4:4

julia> findfirst(Regex("\xff"), "hi \uff there")

The …
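One way to read the discrepancy above: in a regex literal, Julia passes the characters through untouched, so PCRE receives the four characters `\xff` and interprets the escape itself (as the code point U+00FF), whereas `Regex("\xff")` embeds a raw, invalid 0xff byte. A quick check of the literal half:

```julia
# In r"\xff" Julia does not process the escape; PCRE receives the
# four characters \ x f f and interprets them as code point U+00FF.
@assert r"\xff".pattern == "\\xff"
@assert length(r"\xff".pattern) == 4

# Hence the literal matches the character 'ÿ' (U+00FF):
@assert occursin(r"\xff", "hi \uff there")
```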
---

Proposal: …

---

I guess that's basically the same as this PR. Probably doesn't matter much whether we pass the flag at regex compile time, or do the validity check ourselves.

---

Possible future enhancement: Make …

---

I think this PR is ready to go as it is, then. With regard to Milan's second bullet point ("For an obscure reason, …"):

---

Yes, I was confused about how …
---

This caused the following performance regression: Ref: #27030 (comment)
---

This also caused the following line in BaseBenchmarks to go from 0.5 seconds to many minutes: https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/problem/SpellCheck.jl#L34 (putting the option back makes it OK again)
This reverts commit 627173b.
---

For anyone coming across this old issue, this was eventually fixed by #39524.

---

That issue number looks wrong.

---

I guess he meant #39524.

---

Yeah, sorry, edited.
---

Strings are not guaranteed to contain valid UTF-8, and the PCRE documentation says that the behavior is undefined in that case. See the Discourse post.

This is WIP for two reasons:

EDIT: turns out this was a typo in the PCRE docs, one zero was missing.

- match(r"a", "\U11000") is supposed to fail according to the docs (look for PCRE2_ERROR_UTF8_ERR14), but it doesn't. Am I missing something?
- r"\xff" does not throw an error even though Regex("\xff") and @r_str("\xff") do. This means that the UTF-8 validity check does not work in the most common cases.

Given that regexes are often written as literals, it could make sense for performance to check them ourselves at compile time in @r_str, and keep track of that in the object to disable the PCRE check. But that optimization isn't breaking, so it can be applied later.
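The compile-time idea in the last paragraph could look roughly like this sketch (the macro name `@vr_str` and the error message are invented for illustration; this is not the PR's implementation):

```julia
# Hypothetical sketch: validate a regex literal's UTF-8 at macro
# expansion time, so a runtime check could be skipped for literals.
macro vr_str(pattern)
    isvalid(String, pattern) ||
        throw(ArgumentError("regex pattern is not valid UTF-8"))
    return :(Regex($pattern))
end

@assert vr"a.*b" isa Regex
```

Because string-macro arguments arrive as plain `String` literals, the validity check here runs once, when the code is compiled, rather than on every construction.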