
Conversation

@nalimilan
Member

@nalimilan nalimilan commented Apr 6, 2018

Strings are not guaranteed to contain valid UTF-8, and the PCRE documentation says that the behavior is undefined in that case. See the Discourse post.

This is WIP for two reasons:

  • match(r"a", "\U11000") is supposed to fail according to the docs (look for PCRE2_ERROR_UTF8_ERR14), but it doesn't. Am I missing something? EDIT: turns out this was a typo in the PCRE docs, one zero was missing.
  • For an obscure reason, r"\xff" does not throw an error even though Regex("\xff") and @r_str("\xff") do. This means that the UTF-8 validity check does not work in the most common cases.

Given that regexes are often written as literals, it could make sense for performance to check them ourselves at compile time in r_str, and keep track of that in the object to disable the PCRE check. But that optimization isn't breaking, so it can be applied later.
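
The compile-time check described here could look roughly like the following. This is a hypothetical sketch: the macro name and the validation strategy are illustrative, not Base's actual implementation.

```julia
# Hypothetical sketch of validating a regex literal at macro-expansion time,
# so the runtime PCRE check could be skipped for literals. Because the macro
# returns the Regex object itself, the pattern is validated and compiled
# exactly once, when the literal is parsed.
macro checked_r_str(pattern, flags...)
    # `pattern` is the raw literal string, available at parse time
    isvalid(String, pattern) ||
        throw(ArgumentError("regex literal is not valid UTF-8"))
    Regex(pattern, flags...)
end

m = match(checked_r"a.c", "xabc")   # matches "abc"
```
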

@nalimilan nalimilan added the unicode Related to unicode characters and encodings label Apr 6, 2018
Strings are not guaranteed to contain valid UTF-8, and PCRE documentation says
that the behavior is undefined in that case.
@StefanKarpinski
Member

Insisting that regular expressions be valid UTF-8 seems ok to me, but I don't think we should be throwing errors when the string contains invalid UTF-8. That violates the principle that bad data should not cause errors; only a bad program should cause errors. For example, match(r"a", "\xe2\x88") should simply find no match: the string does not contain the pattern r"a" anywhere.

@nalimilan
Member Author

In principle I agree, but PCRE doesn't seem to support it. It's clearly described as undefined behavior which could crash the program.

@StefanKarpinski
Member

StefanKarpinski commented Apr 6, 2018

Yes, but in practice, does it actually do so with the PCRE.NO_UTF_CHECK flag set?

@nalimilan
Member Author

Hard to tell. We'd need to run stress tests, as the simplest cases may not crash while more complex ones could. We could also inspect the code. However, even if it doesn't crash now, it could do so in the future, yet changing the behavior then would be breaking. If we want to keep the check disabled, at a minimum we should ask the PCRE developers about it.

@StefanKarpinski
Member

We can always implement our own regex engine that uses LLVM to generate fast matchers. This would be far easier in Julia than in most systems. It's actually a project I've long wanted to try out 😸

@bicycle1885
Member

bicycle1885 commented Apr 7, 2018

Setting NO_UTF_CHECK without validating the encoding of a subject would be too dangerous to have in Base, since the docs of PCRE2 clearly state:

When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid string as a subject, or an invalid value of startoffset, is undefined. Your program may crash or loop indefinitely.

In my quick benchmarks, the performance degradation from the UTF-8 check is roughly 5%. I think that's a reasonable cost to pay for safety.

@nalimilan
Member Author

@bicycle1885 I don't think the question is whether it's too costly: it's rather that we would like to allow working with invalid strings without ever throwing an error, and unfortunately PCRE doesn't seem to share that objective at all.

I have no idea how hard it would be to implement PCRE in Julia. If that's the long-term goal, in the short term we should check whether the current PCRE codebase is robust in the presence of invalid UTF-8 despite the claims about undefined behavior. Asking its developers sounds like the best way to do that (I can write a message).

@StefanKarpinski
Member

StefanKarpinski commented Apr 8, 2018

Let's put aside, for the moment, what PCRE actually does, and consider what the ideal behavior would be given our general approach to strings.

Any pattern of characters can be converted to a pattern of bytes, and if the pattern is valid UTF-8, then bytes that match a character are guaranteed to match a valid UTF-8 subsequence.

Wildcard patterns like . are a bit subtle: should . match only valid characters, or invalid ones as well? My inclination is that it should match anything. This actually makes lowering it to a byte pattern very simple, since it becomes . as a byte pattern as well; otherwise you'd have to make it match only byte sequences that form a valid UTF-8 character, which can be done with zero-width negative lookaheads, but it's tricky.

On the other hand, since UTF-8 is self-synchronizing, there's no actual problem with letting . match arbitrary bytes: . matching invalid data doesn't affect the matching of parts of the pattern that are limited to particular characters and must be valid. Moreover, I suspect that when people write .* they don't actually mean "match as many valid UTF-8 characters as you can", they actually mean "match whatever—I don't care what it is".
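
The self-synchronization point can be illustrated at the byte level. This is a plain byte search in Julia standing in for the byte-pattern lowering, not PCRE itself:

```julia
# Because UTF-8 is self-synchronizing, the byte sequence of a valid
# character cannot start in the middle of another character's encoding,
# so a valid needle has an unambiguous byte-level match even when the
# haystack contains stray invalid bytes.
needle   = Vector{UInt8}(codeunits("é"))        # 0xc3 0xa9
haystack = UInt8[0x61, 0xff, 0xc3, 0xa9, 0x62]  # 'a', invalid 0xff, 'é', 'b'
findfirst(needle, haystack)                     # 3:4
```
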

On the other hand, if your regex contains invalid UTF-8, that's a much bigger problem. A partial character in the pattern could match the same partial byte sequence, or those bytes could occur inside a complete character, which means that matching as a sequence of characters and matching as a sequence of bytes would behave differently. This could still be handled by converting to a byte sequence with zero-width negative lookaheads to guarantee that the two levels match in the same way, but that seems unnecessary. It's much simpler to just disallow invalid UTF-8 in patterns.

Returning to what PCRE actually does, it seems like as long as we disallow invalid UTF-8 in regular expressions, the library most likely does exactly what we want if we tell it not to check for UTF-8 validity of the input string. The reason I say that is that the behavior we want is exactly the behavior you would get if you converted a character pattern to a byte pattern as straightforwardly as possible. We should verify that this is the case, but it seems based on all the data we have that it probably is.

@StefanKarpinski
Member

In short, I think we should require regular expressions to be valid UTF-8 and leave the PCRE.NO_UTF_CHECK flag set as we have been. There doesn't seem to be much reason for PCRE not to support this mode correctly, so if we do find any bugs, we can probably make a case to fix them upstream.

@nalimilan
Member Author

Yes, that makes sense to me, assuming that PCRE works correctly with invalid haystacks. But I think we'd better get in touch with PCRE developers now rather than waiting until we find bugs (which may or may not be easily fixable). For example it's possible/likely that for performance reasons PCRE relies on assumptions regarding the values of codeunits/codepoints when doing table lookups. In that case invalid UTF-8 could cause crashes and I'm afraid they wouldn't be interested in fixing it.

@StefanKarpinski
Member

Yes, we should probably get in touch with them.

@samoconnor
Contributor

when people write .* they don't actually mean "match as many valid UTF-8 characters as you can", they actually mean "match whatever—I don't care what it is".

👍

disallow invalid UTF-8 in patterns

👍

Some thoughts:

  • It might be useful to have the option to compile a Regex in ASCII mode (i.e. without the PCRE.UTF option). This would support the use case where the input might be invalid UTF-8 or contain arbitrary binary sequences, but the expression only cares about finding sequences of plain ASCII characters (e.g. the HTTP parser). It seems safe to assume that PCRE won't hang or crash on arbitrary input when it is not in UTF mode.

  • It might be useful to build a test that feeds random haystack bytes into a set of valid expressions to look for crash/hang cases.

  • "may crash or loop indefinitely" is fairly broad disclaimer language. It would be useful to know what this means in detail. May read random address? May write to random address? May return an error code? May loop forever if the expression contains X, Y or Z?

  • perhaps the end game could be developing a PR for PCRE to implement a PCRE2_LAZY_UTF_CHECK option that allows invalid bytes in the haystack.
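
The ASCII-mode idea from the first bullet could be sketched with the internal Regex constructor that takes raw PCRE option bits. Note this three-argument constructor is an internal, version-dependent Base detail, assumed here for illustration, not public API:

```julia
# Hypothetical sketch: compile a pattern with zero compile options, i.e.
# deliberately without Base.PCRE.UTF, so matching happens byte-by-byte.
# The three-argument Regex constructor is internal and may change.
r = Regex("GET /[a-z]+", UInt32(0), UInt32(0))
occursin(r, "GET /index HTTP/1.1")   # plain ASCII needle, byte-level match
```
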

@StefanKarpinski
Member

perhaps the end game could be developing a PR for PCRE to implement a PCRE2_LAZY_UTF_CHECK option that allows invalid bytes in the haystack.

You don't really want to do lazy UTF-8 validation, you want to not assume UTF-8 validity in order for the PCRE code to work correctly and find valid UTF-8 patterns within data which may not be valid.

@nalimilan
Member Author

I've sent a message to the PCRE development list: https://lists.exim.org/lurker/message/20180412.091038.080cf6f2.en.html

BTW, I've found simple cases where matches fail for invalid haystacks. So I think for now we'd better throw errors, until we are able to ensure correct behavior:

julia> match(r".*", "a\xffb")
ERROR: BoundsError: attempt to access "a\xffb"
  at index [8]
Stacktrace:
 [1] prevind(::String, ::Int64, ::Int64) at ./strings/basic.jl:428
 [2] prevind at ./strings/basic.jl:423 [inlined]
 [3] prevind at ./strings/basic.jl:422 [inlined]
 [4] match(::Regex, ::String, ::Int64, ::UInt32) at ./regex.jl:193
 [5] match at ./regex.jl:186 [inlined]
 [6] match(::Regex, ::String) at ./regex.jl:201
 [7] top-level scope

julia> match(r"a.*b", "a\xffb")
# nothing

@samoconnor
Contributor

attempt to access "a\xffb" at index [8]

Do you read this as saying that PCRE read past the end of the input array? (returning 7 in ovec[2])

@nalimilan
Member Author

Yes, I guess it's something like that. I haven't investigated what exactly is going on, though.

@StefanKarpinski
Member

For an obscure reason, r"\xff" does not throw an error even though Regex("\xff") and @r_str("\xff") do.

julia> r"\xff".pattern
"\\xff"

julia> Regex("\xff").pattern
"\xff"

julia> findfirst(r"\xff", "hi \uff there")
4:4

julia> findfirst(Regex("\xff"), "hi \uff there")

The r"\xff" form relies on PCRE interpreting the \xff escape, whereas the other two rely on Julia interpreting it. PCRE interprets it as a valid escape for U+FF, whereas Julia requires writing that as \uff and interprets \xff as encoding a 0xff byte in the middle of a string.

samoconnor added a commit to JuliaWeb/HTTP.jl that referenced this pull request Apr 16, 2018
@nalimilan nalimilan added the triage This should be discussed on a triage call label Apr 25, 2018
@JeffBezanson
Member

Proposal:

  • Throw an error in the Regex constructor if the pattern isn't valid UTF-8
  • Turn off the NO_UTF_CHECK flag when matching.
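
A minimal sketch of that proposal (an illustrative wrapper, not the PR's actual diff):

```julia
# Sketch: reject invalid patterns at construction time; subjects are then
# checked by PCRE itself because NO_UTF_CHECK is not passed at match time.
function safe_regex(pattern::String)
    isvalid(String, pattern) ||
        throw(ArgumentError("regex pattern is not valid UTF-8"))
    Regex(pattern)
end

safe_regex("a.*b")    # ok: valid UTF-8 pattern
safe_regex("\xff")    # throws ArgumentError: 0xff byte is invalid UTF-8
```
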

@JeffBezanson
Member

I guess that's basically the same as this PR. Probably doesn't matter much whether we pass the flag at regex compile time, or do the validity check ourselves.

@Keno
Member

Keno commented May 17, 2018

Possible future enhancement: make isvalid on a string cache its result in a bit of the string, but continue considering strings egal independent of that bit.

@JeffBezanson JeffBezanson added this to the 1.0 milestone May 17, 2018
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label May 17, 2018
@mbauman
Member

mbauman commented May 17, 2018

I think this PR is ready to go as it is, then. With regards to Milan's second bullet-point, ("For an obscure reason, r"\xff" does not throw an error…"), that's simply because r"\xff" is constructing Regex("\\xff") — so it is working just fine. I also sure would hope that PCRE only checks the needle validity once (at compile time), so that shouldn't be an issue, either.

@nalimilan
Member Author

Yes, I was confused about how \xx was interpreted. It may still make sense to check that regexes don't contain \xff since the Julia equivalent is actually \uff, but that's for a separate PR (and is much less breaking).

@nalimilan nalimilan changed the title WIP: Enable PCRE UTF-8 validity string checks Enable PCRE UTF-8 validity string checks May 17, 2018
@nalimilan nalimilan closed this May 17, 2018
@nalimilan nalimilan reopened this May 17, 2018
@JeffBezanson JeffBezanson merged commit 627173b into master May 18, 2018
@JeffBezanson JeffBezanson deleted the nl/pcre branch May 18, 2018 14:03
haampie pushed a commit to haampie/julia that referenced this pull request May 18, 2018
Strings are not guaranteed to contain valid UTF-8, and PCRE documentation says
that the behavior is undefined in that case.
@KristofferC
Member

This caused the following performance regression:

julia> @btime findnext(r,seq,1)
  60.168 μs (1 allocation: 32 bytes)

julia> r.match_options = Base.PCRE.NO_UTF_CHECK
0x40000000

julia> @btime findnext(r,seq,1)
  108.639 ns (1 allocation: 32 bytes)
1:22

Ref: #27030 (comment)

@KristofferC
Member

This also caused the following line in BaseBenchmarks to go from 0.5 seconds to many minutes: https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/problem/SpellCheck.jl#L34

(putting the option back and it is ok again)

julia> w.regex.match_options = Base.PCRE.NO_UTF_CHECK;

julia> @time train(w);
  0.540739 seconds (4.45 M allocations: 254.833 MiB, 22.07% gc time)

@aviks
Member

aviks commented Mar 8, 2022

For anyone coming across this old issue, this was eventually fixed by #39524

@KristofferC
Member

That issue number looks wrong.

@nalimilan
Member Author

I guess he meant #39524.

@aviks
Member

aviks commented Mar 9, 2022

Yeah, sorry, edited.
