Andrew-William-Smith

At present, Pcre2.exec_all does not return the full set of captures when matching patterns containing non-consuming captured subpatterns such as lookahead assertions, falsely detecting the end of the matched set before reaching the end of the string. In order to resolve this issue, I have implemented the algorithm recommended by the PCRE2 project for global regex matching, allowing for zero-length captures to be properly matched and returned by exec_all.

As a demonstration, prior to this patch:

# Pcre2.exec_all ~pat:"(?=(hello|ocaml))" "hellocaml" |> Array.map Pcre2.get_substrings;;
[|[|""; "hello"|]|]

And after this patch:

# Pcre2.exec_all ~pat:"(?=(hello|ocaml))" "hellocaml" |> Array.map Pcre2.get_substrings;;
[|[|""; "hello"|]; [|""; "ocaml"|]|]

Additionally, the test suite has been expanded to cover a number of evaluation scenarios for exec_all, including zero-length matches such as those shown above.
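For readers unfamiliar with the global-matching approach mentioned above, here is a minimal, stdlib-only sketch of the idea as the PCRE2 demo program describes it: after an empty match, retry at the same offset for a non-empty anchored match, and if that fails, step forward and resume normal searching. This is not the code in this patch. The `matcher` type, `lookahead_words`, and `find_all` are hypothetical names invented for illustration, and the toy matcher merely imitates the pattern (?=(hello|ocaml)) without calling PCRE2.

```ocaml
(* A match: start offset, end offset (exclusive), and the captured text. *)
type m = { start : int; stop : int; capture : string }

(* Hypothetical matcher signature, not part of the Pcre2 bindings.
   [find ~anchored_nonempty subject pos] searches from [pos]. When
   [anchored_nonempty] is true, it must match exactly at [pos] and must
   not be empty, imitating PCRE2_ANCHORED with PCRE2_NOTEMPTY_ATSTART. *)
type matcher = anchored_nonempty:bool -> string -> int -> m option

(* Toy stand-in for (?=(hello|ocaml)): a zero-length match wherever one
   of the two words begins. *)
let lookahead_words : matcher =
 fun ~anchored_nonempty subject pos ->
  if anchored_nonempty then None (* a lookahead match is always empty *)
  else
    let n = String.length subject in
    let starts_with w i =
      let l = String.length w in
      i + l <= n && String.sub subject i l = w
    in
    let rec scan i =
      if i > n then None
      else
        match List.find_opt (fun w -> starts_with w i) [ "hello"; "ocaml" ] with
        | Some w -> Some { start = i; stop = i; capture = w }
        | None -> scan (i + 1)
    in
    scan pos

(* Collect every match, continuing past empty ones. *)
let find_all (find : matcher) subject =
  let n = String.length subject in
  let rec loop pos prev_empty acc =
    if pos > n then List.rev acc
    else if prev_empty then
      (* The previous match was empty at [pos]: first try for a non-empty
         match anchored here; if there is none, step forward one byte. *)
      match find ~anchored_nonempty:true subject pos with
      | Some m -> loop m.stop (m.start = m.stop) (m :: acc)
      | None -> loop (pos + 1) false acc
    else
      match find ~anchored_nonempty:false subject pos with
      | Some m -> loop m.stop (m.start = m.stop) (m :: acc)
      | None -> List.rev acc
  in
  loop 0 false []

(* Captures found in "hellocaml": ["hello"; "ocaml"] *)
let captures =
  find_all lookahead_words "hellocaml" |> List.map (fun m -> m.capture)
```

The sketch advances one byte at a time for simplicity; a real implementation also has to step over a whole UTF-8 character or a CRLF newline where those apply.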

Andrew-William-Smith changed the title from "Fix Pcre2.exec_all with non-consuming matches" to "Fix Pcre2.exec_all for non-consuming patterns" on Sep 27, 2024

Andrew-William-Smith commented Sep 28, 2024

For what it's worth, I now have a branch on my fork that supersedes this patch and adds Seq.t variants of all functions that perform multiple matches (exec, extract, split). If you are interested in merging that functionality, I'm happy to close this PR in favour of a new one that includes the added sequence functions.
