Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Oct 22, 2025

Closes #117605

Summary

This PR adds comprehensive regression tests to verify that captures made within positive lookbehind assertions are preserved correctly across all regex engines (Interpreter, Compiled, and SourceGenerated).

Background

Issue #117605 reported that in Compiled mode, the .NET regex engine was incorrectly removing captures made by capturing groups within positive lookbehind assertions, while the Interpreted mode preserved them correctly. For example:

string pattern = "(?<=(abc)+?123)a";
string input = "abcabc123a";

var compiled = new Regex(pattern, RegexOptions.Compiled);
var interpreted = new Regex(pattern, RegexOptions.None);

var matchCompiled = compiled.Match(input);
var matchInterpreted = interpreted.Match(input);

// Expected: Both should have 1 capture in Group[1]
// Bug: Compiled showed 0 captures, Interpreted showed 1 capture

Changes

Upon investigation, I found that the issue has already been fixed in the current codebase. Both Compiled and Interpreted modes now produce identical results for all test cases mentioned in the issue.

However, the existing tests in Match_MemberData_Cases only verified match success and matched values, not the actual capture counts and values. This PR adds comprehensive tests to Match_Advanced_TestData that specifically verify:

  • Capture counts match across all engines
  • Capture values, indices, and lengths are correct
  • Various quantifier patterns work correctly (+?, {2,}?, etc.)
  • Multiple capturing groups in lookbehinds are handled properly

Tests Added

  1. Main issue case: (?<=(abc)+?123)a - Verifies lazy quantifier in lookbehind
  2. Multiple captures: (?<=(abc){2,}?123)a - Verifies bounded quantifier with 2 captures
  3. Simple lookbehind: (?<=(abc)+?)a - Verifies basic lazy quantifier behavior
  4. Multiple groups: (?<=(?'1'abc)+?(?'2')123)a - Verifies multiple named groups in lookbehind

All tests properly handle the NonBacktracking engine (which doesn't support lookbehinds) and run against all other available engines.

Test Results

All 30,335 regex tests pass successfully with no regressions.

Original prompt

This section details on the original issue you should resolve

<issue_title>Compiled regex incorrectly removes the captures which captured in positive lookbehind assertions</issue_title>
<issue_description>### Description

I saw u added a test yield return (@"(?<=(abc)+?123)", "abcabc123", RegexOptions.None, 0, 9, true, ""); on line 135 in file Regex.Match.Tests.cs.
Based on this, I found a new bug:
In Compiled mode, when leaving positive lookbehind assertions, the .NET regex engine will remove all the captures which captured by the last direct group construction(the last direct child node).


By the way, I found that when leaving negative assertions, the .NET regex engine would not delete captures which captured in the assertion on .NET8, but this bug fixed on .NET9.
For example:

using System;
using System.Text.RegularExpressions;

Console.WriteLine(new Regex("(?<!x())a",RegexOptions.Compiled).Match("a").Groups[1].Captures.Count);
Console.WriteLine(new Regex("(?!()x)a",RegexOptions.Compiled).Match("a").Groups[1].Captures.Count);

Output on .NET8:

1
1

Maybe they are related?

Reproduction Steps

using System;
using System.Text.RegularExpressions;

string input = "abcabc123a";
string pattern="(?<=(abc)+?123)a";
//pattern="(?'1')(?<=(?'1'abc)+?(?#<=just delete the left capture)(?'1'abc)+?123)a";
//pattern="(?'1')(?<=(?'1'abc)+?(?'2')123)a";//the capture of group#1 was removed, but the capture of group#2 wasn't
//pattern="(?<=(abc){2,}?123)a";//two captures of group#1 which captured in the positive lookbehind assertions(?<=) are both deleted

//pattern="(?<=(abc){2}?123)a";//not deleted
//pattern=@"(?<=\1?(abc){2,}?123)a";//not deleted
//pattern="(?<=(?(1))(abc){2,}?123)a";//not deleted
//pattern="(?<=(?'-1')?(abc){2,}?123)a";//not deleted

//pattern="1(?=x(abc)+?)";input="1xabc";//not deleted in positinve lookahead assertions
//pattern="1(?=(abc)+?x)";input="1abcx";//not deleted in positinve lookahead assertions

try
{
	Console.WriteLine("----------Compiled----------");
	Match matchCompiled = new Regex(pattern, RegexOptions.Compiled).Match(input);
	Console.WriteLine(matchCompiled.Value);
	Console.WriteLine(matchCompiled.Groups[1].Captures.Count);
	Console.WriteLine("Groups#1 Captures:");
	foreach(Capture c in matchCompiled.Groups[1].Captures)
		Console.WriteLine($"{c.Value}{c.Index}{c.Length}");
	
	Console.WriteLine("----------Interpreted----------");
	Match Interpreted= new Regex(pattern, RegexOptions.None).Match(input);
	Console.WriteLine(Interpreted.Value);
	Console.WriteLine(Interpreted.Groups[1].Captures.Count);
	Console.WriteLine("Groups#1 Captures:");
	foreach(Capture c in Interpreted.Groups[1].Captures)
		Console.WriteLine($"{c.Value}{c.Index}{c.Length}");
	
	Console.WriteLine("----------result comparion----------");
	Console.WriteLine($"Result equal:{matchCompiled.Groups[1].Captures.Count == Interpreted.Groups[1].Captures.Count}");
}
catch(Exception ex)
{
	Console.WriteLine(ex.Message);
}

Expected behavior

Output:

----------Compiled----------
a
1
Groups#1 Captures:
abc,3,3
----------Interpreted----------
a
1
Groups#1 Captures:
abc,3,3
----------result comparion----------
Result equal:True

Actual behavior

Output:

----------Compiled----------
a
0
Groups#1 Captures:
----------Interpreted----------
a
1
Groups#1 Captures:
abc,3,3
----------result comparion----------
Result equal:False

Regression?

No response

Known Workarounds

No response

Configuration

No response

Other information

No response</issue_description>

<agent_instructions>Please check whether this issue still repros. If it does, please fix it and add a test. If it doesn't, please just ensure there's one or more tests covering the cited issues.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stephentoub > Did it fixed by https://github.com//pull/120856 yet?

No, that issue is unrelated.

That label issue still exists in the source generator. It's adding braces in a way that's preventing the goto from seeing the target label, which exists, just not in scope. I'd need to debug through to see why the braces are being added.</comment_new>

Fixes #117651


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

@stephentoub stephentoub marked this pull request as ready for review October 22, 2025 20:15
Copilot AI review requested due to automatic review settings October 22, 2025 20:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a bug where compiled regex engines incorrectly removed captures made within positive lookbehind assertions. The issue has already been fixed in the current codebase, and this PR adds comprehensive regression tests to prevent future regressions.

Key Changes:

  • Added 4 new test cases specifically validating capture counts and values in lookbehind patterns
  • Tests cover lazy quantifiers, bounded quantifiers, simple lookbehinds, and multiple groups in lookbehinds
  • All tests properly skip the NonBacktracking engine which doesn't support lookbehinds

Copilot AI changed the title [WIP] Fix compiled regex incorrectly removing captures Add regression tests for captures in positive lookbehind assertions Oct 22, 2025
Copilot AI requested a review from stephentoub October 22, 2025 20:17
Copilot finished work on behalf of stephentoub October 22, 2025 20:17
@stephentoub stephentoub requested a review from tarekgh October 23, 2025 00:41
@stephentoub stephentoub enabled auto-merge (squash) October 23, 2025 00:41
@tarekgh tarekgh added this to the 11.0.0 milestone Oct 23, 2025
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member

/ba-g unrelated wasm failure (" session not created: DevToolsActivePort file doesn't exist (SessionNotCreated)")

@stephentoub stephentoub merged commit 8dc0941 into main Oct 23, 2025
88 of 91 checks passed
@stephentoub stephentoub deleted the copilot/fix-regex-capture-bug branch October 23, 2025 03:16
@longxya
Copy link

longxya commented Oct 23, 2025

However, the existing tests in Match_MemberData_Cases only verified match success and matched values, not the actual capture counts and values.

@stephentoub What does it mean?

In the past regression tests, did u only check whether a match succeeds and the overall matched values (the Index and Length of the overall match), but not the Index and Length of all captures in all groups?

For slightly complex regex, even if all values in the final result (the matched value and all captures value) are exactly the same, the underlying matching process might still be incorrect. The safest approach is to verify that the captured group content changes synchronously during the intermediate steps of matching.

For me, getting the captured values during the intermediate matching process isn’t very convenient.

I have tried using the following steps to obtain the values of an intermediate matching process:

  1. Modify the regex (?'1')((\d)+\D)+a to (?'1')((\d)+\D(?'g1_count'))+a to track how many times the quantifier + loops.
  2. Then use (?'1')(?>(?!(?'-stop'))((\d)+\D(?'g1_count')(?>(?'-g1_count'){3}(?'stop')|))|)+(?>(?!(?'-stop'))a|) to terminate the match after the 3rd loop and achieve a successful match.
  3. Ultimately, this is equivalent to getting the captured values of the original regex (?'1')((\d)+\D)+a during the intermediate matching steps.

However, (?'-g1_count'){n} isn’t a structure that can fail quickly when n is too large.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Compiled regex incorrectly removes the captures which captured in positive lookbehind assertions [Fuzzlyn] Assert in isEmbeddedMaskingCompatible

4 participants