Skip to content

Conversation

@alekitto
Copy link
Contributor

I tried to add PHP8 attributes parsing functionalities in PHP tokenizer.
Hope this could help completing PHP8 support on this great project.

PS: Since its my first PR here, please indicate me if there's something wrong with it. Thanks

@alekitto alekitto force-pushed the feature/tokenizer_attributes branch from 24288d1 to 3c01426 Compare January 29, 2021 22:33
@jrfnl
Copy link
Contributor

jrfnl commented Jan 29, 2021

Hi @alekitto Thank you so much for submitting this and for your willingness to take this on!

I'd like to take some time to go through this in detail and also do some benchmark checks, as I'm sure @gsherwood will as well, but at first glance it looks like a great step forward.

Some suggestions/feedback based on first impressions only:

  • Any particular reason not to turn the $findCloser closure in the parsePhpAttribute() method into a full fledged function as well ?
  • As some of the array_...() functions used can have a negative impact on performance, have you run any benchmarks on the code ?
  • Could/Should the setting of the opener/closer array keys be handled by the Tokenizer::createTokenMap() method instead of via PHP::processAdditional() ? Or even potentially straight away when the opener/closer is being retrieved ?
  • Looks like the code handling array return type tokens was touched - should a unit test be added to safeguard that change or is this already covered in some way ?
  • Looks like the getMemberProperties() method has also been touched - should a unit test be added to the Tests\Core\File\GetMemberPropertiesTest for this ?
  • Do the tests still pass when strict assertions - assertSame() rather than assertEquals() - are used ?
  • You may want to use assertArrayHasKey() in various places.
  • Would it be an idea to also test that the end token is correctly tokenized as T_ATTRIBUTE_END and that both the start and end token have the expected attribute_opener and attribute_closer array keys ?
    I realize this is implicitly tested via the attribute_closer array key, but I personally tend to err on the side of abundant redundancy in the tests for anything tokenizer related as if something breaks in the tokenizer, the fall out in the sniffs can be large.
  • Would it be an idea to add some tests ensuring that the tokens between the start and end markers for the attributes are tokenized the same cross-version ?
  • The parsePhpAttribute() function can potentially return null, but I don't see a test covering that part.
  • Nitpick: the define('T_ATTRIBUTE_END', 'PHPCS_T_ATTRIBUTE_END'); in the Tokens.php file should probably be moved up to line 80 as this appears to be a new PHPCS specific token. It is now grouped with the backfills for PHP 8.0 specific tokens.

@alekitto alekitto force-pushed the feature/tokenizer_attributes branch from 3c01426 to 40b9c23 Compare January 30, 2021 00:08
@alekitto
Copy link
Contributor Author

@jrfnl Thanks for the quick feedback. I'll answer here one question at a time:

  • Any particular reason not to turn the $findCloser closure in the parsePhpAttribute() method into a full fledged function as well ?

Not really, it changed four or five times while I was coding this and this was the final form. I just transformed it into a proper method.

  • As some of the array_...() functions used can have a negative impact on performance, have you run any benchmarks on the code ?

No, I think I need some help for this point: what kind of tests can I execute to have some useful information?

  • Could/Should the setting of the opener/closer array keys be handled by the Tokenizer::createTokenMap() method instead of via PHP::processAdditional() ? Or even potentially straight away when the opener/closer is being retrieved ?

Yes, you're right. I moved that part in createTokenMap. I think it cannot be done directly on parsing as indexes could be different.

  • Looks like the code handling array return type tokens was touched - should a unit test be added to safeguard that change or is this already covered in some way ?

Sorry, can't see it, maybe I'm just too tired 😅

  • Looks like the getMemberProperties() method has also been touched - should a unit test be added to the Tests\Core\File\GetMemberPropertiesTest for this ?

👍 Done.

  • Do the tests still pass when strict assertions - assertSame() rather than assertEquals() - are used ?
  • You may want to use assertArrayHasKey() in various places.

Correct. I converted assertEquals and assertTrue(array_has_key(...)) to assertSame and assertArrayHasKey.

  • Would it be an idea to also test that the end token is correctly tokenized as T_ATTRIBUTE_END and that both the start and end token have the expected attribute_opener and attribute_closer array keys ?
    I realize this is implicitly tested via the attribute_closer array key, but I personally tend to err on the side of abundant redundancy in the tests for anything tokenizer related as if something breaks in the tokenizer, the fall out in the sniffs can be large.

Tried to do it on testAttribute. Is this enough in your opinion?

  • Would it be an idea to add some tests ensuring that the tokens between the start and end markers for the attributes are tokenized the same cross-version ?

Ok, I've added lists of tokens in testAttribute to check that the content of the attribute remains the same on different PHP versions.

  • The parsePhpAttribute() function can potentially return null, but I don't see a test covering that part.

I missed that! Added testInvalidAttribute to check this case.

  • Nitpick: the define('T_ATTRIBUTE_END', 'PHPCS_T_ATTRIBUTE_END'); in the Tokens.php file should probably be moved up to line 80 as this appears to be a new PHPCS specific token. It is now grouped with the backfills for PHP 8.0 specific tokens.

Moved

@jrfnl
Copy link
Contributor

jrfnl commented Jan 30, 2021

@alekitto Thanks a lot for your quick turn around and the updates. I'm sort of between PCs at the moment, but will try to do some additional testing as soon as I'm properly back up and running again. I may have more questions after that.

Re: benchmarking - you could try to do some runs with BlackFire to see the performance impact of the changes and try some variations to see if it becomes better/worse. You'll need to create an account and install the Blackfire agent and phar locally - see: https://blackfire.io/docs/up-and-running/installation

I'll leave a comment in-line to pinpoint the array return type tokenizer code change.

@jrfnl
Copy link
Contributor

jrfnl commented Jan 30, 2021

I'll leave a comment in-line to pinpoint the array return type tokenizer code change.

Never mind that. Looks like that change got removed in the mean time.

@jrfnl
Copy link
Contributor

jrfnl commented Jan 30, 2021

@gsherwood As this introduces a new token, this PR - if accepted - will have to go into a minor rather than a patch release. With that in mind, could this PR still be earmarked for 3.6.0 ? Would be a shame to leave it until 3.7.0 which still may be a while.

@alekitto
Copy link
Contributor Author

alekitto commented Jan 30, 2021

@jrfnl I tried some variations and benchmarked them with Blackfire.

PHPUnit executing AttributesTest.php only, finishes in 82ms (first version of my PR with three array_splice).
To avoid array_ calls I've tried to reconstruct the array, but that solution was less performant (90ms trying to remove the first array_splice, 102ms removing all three).
Other array calls have sub-millisecond differences with the original version.

I chose to replace only one array_splice with array_merge for clarity. Its variation per Blackfire benchmark was 0.2ms on 10 run.

I think this is the most performant path since token_get_all creates the array while parsing the code.

Anyway I was thinking that the parsePhpAttribute call path is unlikely to be called in real case scenario: whoever uses attributes, uses PHP 8 and probably it wants to execute phpcs on PHP 8 too. In that environment, the token_get_all parses the attributes natively and the method call is skipped. What do you think about it?

@jrfnl
Copy link
Contributor

jrfnl commented Jan 30, 2021

@alekitto Thanks for running those tests & benchmarking.

Anyway I was thinking that the parsePhpAttribute call path is unlikely to be called in real case scenario: whoever uses attributes, uses PHP 8 and probably it wants to execute phpcs on PHP 8 too. In that environment, the token_get_all parses the attributes natively and the method call is skipped. What do you think about it?

We can hope, but that's not necessarily true, especially for open source projects.
Packages may have various versions of a feature available and may load the one most appropriate for the PHP version the code is being run on. Along the same lines, there may be different branches of a project targetting different PHP versions.

So a project (potentially) containing PHP 8 code, doesn't necessarily mean that a PHPCS run for such a package will be done using PHP 8.x. Different people in a team may be working on different versions of the code and because of that may use a different local PHP version, CI may use yet another PHP version.

This is all the more reason why the tokenizer should be stable and return the same tokens independently of the PHP version PHPCS is being run on as otherwise sniffs may return different results when run on different PHP versions.

@alekitto
Copy link
Contributor Author

This is all the more reason why the tokenizer should be stable and return the same tokens independently of the PHP version PHPCS is being run on as otherwise sniffs may return different results when run on different PHP versions.

Agree. My thoughts was meant on the performance side of the problem: a 3ms loss per file could be acceptable on a path that is unlikely to be called.

If a case of performance issues will be discovered, will be debugged and benchmarked to resolve that particular issue.

Anyway my thoughts are probably philosophy applied to code (and could be ignored easily).

I'll wait for your feedback if something new has to be fixed.

Thanks again for your work.

@gsherwood gsherwood added this to the 3.6.0 milestone Feb 1, 2021
@gsherwood
Copy link
Member

Just to help myself out with a review, below are the tokens produced by token_get_all() for the following code snippet (taken from the tests in the PR):

<?php
#[CustomAttribute, AttributeWithParams(\'foo\'), AttributeWithParams(\'foo\', bar: [\'bar\' => \'foobar\'])]
function attribute_grouping_test() {}

function multiple_attributes_on_parameter_test(#[ParamAttribute, AttributeWithParams(/* another comment */ \'foo\')] int $param) {}
Tokens in PHP < 8
0 : T_OPEN_TAG => <?php\n
1 : T_COMMENT => #[CustomAttribute, AttributeWithParams('foo'), AttributeWithParams('foo', bar: ['bar' => 'foobar'])]\n
2 : T_FUNCTION => function
3 : T_WHITESPACE =>  
4 : T_STRING => attribute_grouping_test
5 : (
6 : )
7 : T_WHITESPACE =>  
8 : {
9 : }
10 : T_WHITESPACE => \n\n
11 : T_FUNCTION => function
12 : T_WHITESPACE =>  
13 : T_STRING => multiple_attributes_on_parameter_test
14 : (
15 : T_COMMENT => #[ParamAttribute, AttributeWithParams(/* another comment */ 'foo')] int $param) {}\n
Tokens in PHP 8+
0 : T_OPEN_TAG => <?php\n
1 : T_ATTRIBUTE => #[
2 : T_STRING => CustomAttribute
3 : ,
4 : T_WHITESPACE =>  
5 : T_STRING => AttributeWithParams
6 : (
7 : T_CONSTANT_ENCAPSED_STRING => 'foo'
8 : )
9 : ,
10 : T_WHITESPACE =>  
11 : T_STRING => AttributeWithParams
12 : (
13 : T_CONSTANT_ENCAPSED_STRING => 'foo'
14 : ,
15 : T_WHITESPACE =>  
16 : T_STRING => bar
17 : :
18 : T_WHITESPACE =>  
19 : [
20 : T_CONSTANT_ENCAPSED_STRING => 'bar'
21 : T_WHITESPACE =>  
22 : T_DOUBLE_ARROW => =>
23 : T_WHITESPACE =>  
24 : T_CONSTANT_ENCAPSED_STRING => 'foobar'
25 : ]
26 : )
27 : ]
28 : T_WHITESPACE => \n
29 : T_FUNCTION => function
30 : T_WHITESPACE =>  
31 : T_STRING => attribute_grouping_test
32 : (
33 : )
34 : T_WHITESPACE =>  
35 : {
36 : }
37 : T_WHITESPACE => \n\n
38 : T_FUNCTION => function
39 : T_WHITESPACE =>  
40 : T_STRING => multiple_attributes_on_parameter_test
41 : (
42 : T_ATTRIBUTE => #[
43 : T_STRING => ParamAttribute
44 : ,
45 : T_WHITESPACE =>  
46 : T_STRING => AttributeWithParams
47 : (
48 : T_COMMENT => /* another comment */
49 : T_WHITESPACE =>  
50 : T_CONSTANT_ENCAPSED_STRING => 'foo'
51 : )
52 : ]
53 : T_WHITESPACE =>  
54 : T_STRING => int
55 : T_WHITESPACE =>  
56 : T_VARIABLE => $param
57 : )
58 : T_WHITESPACE =>  
59 : {
60 : }
61 : T_WHITESPACE => \n

When I get around to reviewing the PR properly, I'll use this output to make sure the backfill is working correctly. Given the backfill uses token_get_all() directly, I suspect it will be fine.

@jrfnl
Copy link
Contributor

jrfnl commented Feb 1, 2021

Just a thought about the attribute tokenization from a sniff writing perspective as (formatting) rules for tokens within an attribute may be different from the rules for (formatting) tokens in a non-attribute context.

#[CustomAttribute, AttributeWithParams(\'foo\'), AttributeWithParams(\'foo\', bar: [\'bar\' => \'foobar\'])]

In a sniff which registers the T_STRING token, something like the AttributeWithParams(\'foo\') could easily be confused with a function call.

Would it be a good idea to add the attribute_opener and attribute_closer index keys to all tokens within the attribute ?
That would allow to easily distinguish tokens used within an attribute from non-attribute tokens and for sniffs to adjust their behaviour accordingly (i.e. ignore tokens within an attribute òr examine the tokens within an attribute).

The alternative would be that sniffs would have to walk the token tree to see if a token is within an attribute, which I fear will quickly become a performance drain.

@alekitto
Copy link
Contributor Author

alekitto commented Feb 1, 2021

@jrfnl Ok, done.

@jrfnl
Copy link
Contributor

jrfnl commented Feb 1, 2021

@alekitto Thanks, you are quick ;-)

I actually meant for my remark to be the starting point for discussion about this.

Another things which comes to mind in that context: I also seem to remember that there is/was some discussion about possibly allowing nested attributes in PHP in the future.

Not sure what the current status is of that, but if that is a realistic possibility in the near future, this may need to be handled via a nested_attributes array key with the opener/closer as key/value pairs instead for the tokens within an attribute.

What do you think ?

@alekitto
Copy link
Contributor Author

alekitto commented Feb 1, 2021

I was thinking about it today. Probably there is this possibility already: attribute on closure or parameter of a closure passed as argument of attribute (don't know if it allowed, but I think it is).

A nested_attributes (stack of opener/closer arrays) property could handle this situation.

I'll try this tomorrow and write here the result.

@kukulich
Copy link
Contributor

kukulich commented Feb 2, 2021

I think you should test also attributes with namespaces, eg.#[Boo\QualifiedName, \Foo\FullyQualifiedName('foo')] because of the new tokens in PHP 8.

@jrfnl
Copy link
Contributor

jrfnl commented Feb 2, 2021

Good point by @kukulich. On that note and taking other tokenizer differences across PHP versions into account, the token_get_all() in the code handling the BC-retokenization, may need to be replaced with passing the tokens through a new instance of the PHPCS PHP tokenizer class ?

@kukulich
Copy link
Contributor

kukulich commented Feb 2, 2021

may need to be replaced with passing the tokens through a new instance of the PHPCS PHP tokenizer class

Yes, I think it's necessary.

@alekitto
Copy link
Contributor Author

alekitto commented Feb 2, 2021

may need to be replaced with passing the tokens through a new instance of the PHPCS PHP tokenizer class ?

Shouldn't be necessary. The tokens from token_get_all are added into the original token array replacing the T_COMMENT token starting with #[ for additional processing after the processAttribute has returned.

Only nested attributes can break something right now.

Anyway, I'll the test for FCQN token ASAP.

@jrfnl
Copy link
Contributor

jrfnl commented Feb 2, 2021

Additional unit tests with as many attribute code variants as possible, including verifying the token codes/types of the tokens within the attribute for all of them, should be able to verify whether or not it is necessary, as well as to safeguard it for the future.

Sorry to keep hammering on the test redundancy, but for things like this, I really feel it is important to cover this sufficiently.

@alekitto
Copy link
Contributor Author

alekitto commented Feb 3, 2021

@jrfnl Nothing to be sorry about :)
Tokenizer is the core of this tool, I understand the need of ensuring its correct behavior.

I just wanted to inform everyone of my thoughts while I was writing this: I tried to make this as little as possibile injecting the re-parsed tokens in the token stream to allow them to be parsed without losing the other tokenizer's changes/backfills.
That's why I said that a new instance of tokenizer to process the token array generated in parsePhpAttribute shouldn't be necessary (and a huge performance loss).

However, nested attributes tokenization was harder than I thought but it's done. It is not valid PHP code (the RFC does not allow nested attributes), but should be tokenized correctly.

EDIT: I've not added the nested_attributes stack to tokens yet. I'll do it as soon as possible

@alekitto
Copy link
Contributor Author

alekitto commented Feb 3, 2021

Ok, added nested_attributes property to tokens inside attributes in last commit

@kukulich
Copy link
Contributor

kukulich commented Feb 3, 2021

@alekitto Tested in slevomat/coding-standard#1158 and it works!

Nice work 👍

@jrfnl
Copy link
Contributor

jrfnl commented Feb 8, 2021

Just dropping a note here about a follow-up action item - this doesn't have to be addressed in this PR!

A scan/search needs to be done for sniffs which may need adjusting to allow for attributes.
Any necessary changes to be made to individual sniffs can only be done once the final attribute tokenization implementation is known.

Sniff(s) for which it is already known that they will most likely need adjusting:

@alekitto
Copy link
Contributor Author

alekitto commented Feb 8, 2021

Also Squiz.WhiteSpace.FunctionSpacing needs to be adjusted to not count attribute (before or after a doc block) as first token

@kukulich
Copy link
Contributor

@alekitto Can you solve the conflict please?

@alekitto alekitto force-pushed the feature/tokenizer_attributes branch from e464df8 to cd75b91 Compare February 15, 2021 09:08
@alekitto
Copy link
Contributor Author

@kukulich rebased on master

@vv12131415
Copy link
Contributor

How can I help with this one?
Really wait for 3.6.0 because it's the only thing that prevents us from PHP 8

@jrfnl
Copy link
Contributor

jrfnl commented Feb 22, 2021

@vladyslavstartsev Well, both this PR as well as #3226 (+ the follow-up PRs for sniff changes) still need to be merged for full PHP 8 support. AFAIK, the intention is to release 3.6.0 once these two PRs (+ sniff follow ups if they can be pulled quickly) have been merged.

Any testing you can do with either of those two currently open PRs would be helpful.
Similarly identifying sniffs which may need changes related to these two PRs would also be very helpful.

For match structures (PR #3226), you can find a list of sniffs already identified to need changes and their status here: #3037 (comment)

I don't think such a list exists yet for attributes.

@alekitto alekitto force-pushed the feature/tokenizer_attributes branch from cd75b91 to 539db72 Compare February 24, 2021 14:12
@alekitto alekitto force-pushed the feature/tokenizer_attributes branch from 539db72 to 6367e71 Compare March 19, 2021 09:32
@gsherwood gsherwood merged commit 013943d into squizlabs:master Mar 29, 2021
gsherwood added a commit that referenced this pull request Mar 29, 2021
@gsherwood
Copy link
Member

Thanks for all the effort you put into getting attribute support working. I've now reviewed and merged, and will start looking at sniffs that might be impacted, starting with the side effects sniff.

@alekitto alekitto deleted the feature/tokenizer_attributes branch March 31, 2021 08:17
jrfnl added a commit to jrfnl/box that referenced this pull request Oct 10, 2021
Up to now, when a Phar was being created using PHP < 8.0 with the PHP compactor enabled, PHP 8.0+ attributes would be incorrectly removed as if they were comments.

This has now been fixed.

Both the solution implemented, as well as the unit tests, are largely inspired by the solution for handling attributes when code is being tokenized on PHP < 8.0, as implemented in `PHP_CodeSniffer` in squizlabs/PHP_CodeSniffer#3203 and squizlabs/PHP_CodeSniffer#3299. Props to alekitto.

Fixes 567

Co-authored-by: Alessandro Chitolina <[email protected]>
jrfnl added a commit to jrfnl/box that referenced this pull request Oct 10, 2021
Up to now, when a Phar was being created using PHP < 8.0 with the PHP compactor enabled, PHP 8.0+ attributes would be incorrectly removed as if they were comments.

This has now been fixed.

Both the solution implemented, as well as the unit tests, are largely inspired by the solution for handling attributes when code is being tokenized on PHP < 8.0, as implemented in `PHP_CodeSniffer` in squizlabs/PHP_CodeSniffer#3203 and squizlabs/PHP_CodeSniffer#3299. Props to alekitto.

Fixes 567

Co-authored-by: Alessandro Chitolina <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants