-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Finish new code gen approach for RegexCompiler / source generator #62268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsContributes to #61902 This enables RegexCompiler / source generator to use the new source gen approach for all expressions except those that employ RegexOptions.RightToLeft and lookbehinds (which are implemented with RightToLeft). In a subsequent PR, I'll delete all of the now defunct code for the old approach and also clean up the new approach, including adding a lot more comments. I wanted to get this in now though to unblock concurrent efforts to move the code over to be span-based. I also want to then spend some time looking at the generated code and finding ways to clean it up, e.g. ensuring its as pay-for-play, such as not emitting code for undoing captures if we can prove there won't be any to undo. I also want to clean up some of the algorithms in a few places to avoid walking around the tree so much. I expect a bit of a tail of bugs from this, so it'll be good to also prioritize porting over more tests from other engines, to help fill any gaps we may have. cc: @joperezr, @danmoseley
|
6e796d1 to
9dc7dc5
Compare
|
(Investigating why this is failing only on mono) |
Is it reasonable (and possible) to also do a perf test dry run over this PR so that we can catch any potential regressions in perf given this is a bigger re-write? I know that in theory we expect that not to be the case, but it might be good to do a quick run in case we can catch any mishaps in terms of perf for a specific scenario. |
Yup. I did one yesterday just for https://github.com/dotnet/performance/blob/1d541b2a6ea90ee2abc84d7ab876079d97fb1d1c/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L44 and the three tests there improved a few percent. I can kick off a larger one. |
Since this specific text will be replaced later, are there any concerns in case there is a wild case where this specific string is part of the consumer's regex? #Resolved Refers to: src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs:31 in 9dc7dc5. [](commit_id = 9dc7dc5eb30db64a9c4749677f879d2d9db93383, deletion_comment = False) |
That's fair. I could add a Guid.NewGuid() to it :) Or if you have other suggestions, I'm open to them. I might be able to just note the position we're at in the underlying builder and insert the text at that position; that might be simpler. |
|
I think this is already a super edge case, but adding a GUID would make it basically impossible to ever collide so that might be a good idea. The other thought is to not force all declarations to be on the top of the method, I understand that it makes the code more readable, but would it be possible instead to just declare the variables we need at the time we need them? |
I actually prefer the readability when they are declared on first use, and that's how I initially had it. But the branching that ends up being employed by backtracking results in cases where, even though we know variables will have already been initialized, the C# compiler still complains. Consider a case like: using System;
public class C
{
public void M()
{
int a = 0;
goto Switch;
Zero:
a = 1;
int value = 42;
goto Switch;
One:
Console.WriteLine(value);
a = 2;
goto Switch;
Switch:
switch (a)
{
case 0: goto Zero;
case 1: goto One;
}
}
}There's no way to end up at that My solution then was to force the declarations to the beginning of the method and default initialize them. |
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
9dc7dc5 to
bbf5921
Compare
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Show resolved
Hide resolved
joperezr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there are a lot of changes here it is possible I may have missed something, but after following the changes going into the Emitter and compiler, things look good and I also tested at least the source generator changes here and even though we could probably simplify some things in the future, I'm happy with the way backtracking is looking now.
|
Great, thanks for reviewing. I'll merge this and have another PR out soon. |
|
Thanks for getting this in, I will start working on the Go methods for these two engines. |
Contributes to #61902
This enables RegexCompiler / source generator to use the new source gen approach for all expressions except those that employ RegexOptions.RightToLeft and lookbehinds (which are implemented with RightToLeft).
In a subsequent PR, I'll delete all of the now defunct code for the old approach and also clean up the new approach, including adding a lot more comments. I wanted to get this in now though to unblock concurrent efforts to move the code over to be span-based. I also want to then spend some time looking at the generated code and finding ways to clean it up, e.g. ensuring its as pay-for-play, such as not emitting code for undoing captures if we can prove there won't be any to undo. I also want to clean up some of the algorithms in a few places to avoid walking around the tree so much.
I expect a bit of a tail of bugs from this, so it'll be good to also prioritize porting over more tests from other engines, to help fill any gaps we may have.
cc: @joperezr, @danmoseley