Fix skipping parsing character #384

seaburg · 2016-03-17T20:52:49Z

Hi!
Example:
<<a href="https://github.com/htacg/tidy-html5">tidy-html5</a>>
After an unsuccessful attempt to parse << parsing continues with a character (instead of <)

Оutput:
<<a href="https://github.com/htacg/tidy-html5">tidy-html5</a>>
Instead of:
<<a href="https://github.com/htacg/tidy-html5">tidy-html5</a>>

geoffmcl · 2016-03-18T01:00:46Z

@seaburg, thank you for your PR. We love those ;=))

While your commit 7d28b21 certainly addresses an issue of fixing the bad html of <<a href="http://example.com">example</a>>, you seem to have also removed an earlier fix Bug 762102?

Now I have tried to reconstruct that old 762102, now sf 486, created 2003-06-27, by Igor Katraev, but I have not been very successful... not sure exactly what the input was supposed to be... too many 'escapes'!

Your fix certainly seems to produce a better output in the case of a duplicated leading <... and I certainly think the current output of <<a is quite stupid... and needs fixing... but your fix is not yet tried...

And maybe you have good reason why this old fix code was removed... in which case your PR is good... and will be merged... after testing...

Please explain...

seaburg · 2016-03-19T06:47:45Z

Bug 762102 is solution of special case of the problem (& after <).
After an unsuccessful attempt to parse a symbol in LEX_GT state we should try to parse current symbol in LEX_CONTENT state (instead of parse a next symbol in LEX_CONTENT)

Example:

Parse first < in LEX_CONTENT state. Put first < to lexer buffer and switch to LEX_GT state
Read from stream a next symbol (second <)
Unsuccessful parsing second < in LEX_GT state. Switch to LEX_CONTENT state.
Put second < to lexer buffer and parse next a symbol (a) in LEX_CONTENT state. (Bad. We should not put a current symbol (second <) into lexer buffer. We should try to parse a current symbol (second <) in LEX_CONTENT state)

smirn0v · 2016-03-25T16:19:31Z

Proposed solution seems to be more general fix for aforementioned https://sourceforge.net/p/tidy/bugs/486/
@geoffmcl do you have plans to merge it ?

geoffmcl · 2016-03-26T19:32:28Z

@seaburg, @smirn0v, yes I can see this is a more general change for every < + non-letter case, and still testing the full consequences of that... but so far things are looking good, but want to do a little more testing before merging to master...

Hopefully, if I get the time, in just a few more days...

geoffmcl · 2016-03-28T00:44:07Z

@seaburg, @smirn0v, ok after some more testing, I am re-thinking this...

Really, all we want here is to put back certain characters following the <, and back up the lexer, so that that particular character can be fetched again and assessed in the `LEX_CONTENT' state.

Previously we only put back an & 2nd character, so, in LEX_CONTENT, an entity can be tested...

And this issue specifically identifies also if the second character is another <, then this should be also put back, lexer backed up, and let content handle it... that much is agreed!

But should every next non-letter be put back in the stream to be re-assessed? The state is already being changed to LEX_CONTENT anyway... It is just a question whether this 2nd non-letter character after a < need be pushed back, and the lexer backed up each and every time...

So, at this point I am leaning towards a simpler fix of the current if (c == '&') being expanded to if ((c == '&') || (c == '<')).

Are there any other 2nd non-letter characters that need to be re-assessed?

Seek, and appreciate, more comment and testing on this...

If we can resolve this quickly, it can make it into 5.2, otherwise I will move the milestone to 5.3, since we hope to shortly issue a new 5.2 release... thanks...

seaburg · 2016-03-28T19:37:10Z

So, at this point I am leaning towards a simpler fix of the current if (c == '&') being expanded to if ((c == '&') || (c == '<')).

No, this is not enough. In this case necessary if ((c == '&') || (c == '<') || TY_(IsWhite)(c)) ("< "(< and two spaces) parsed as "< " (two spaces after <) instead "< "(one space after <) when mode != Preformatted && mode != IgnoreMarkup)

But a excess condition complicate the logic. Therefore it is better to put any non-letter character back to buffer?

geoffmcl · 2016-03-29T15:23:13Z

@seaburg yes, on tracing through it all again, and again, I am coming around to agreeing ;=))

It does seem better to put it back, back up the lexer, and let LEX_CONTENT deal with it, whatever it is, rather than trying to decide what extra logic need be added... can now not see any bad consequences with this...

Nearly out of time today, but should be able to get around to merging this tomorrow... thanks for hanging in there...

geoffmcl · 2016-03-30T12:21:18Z

@seaburg now merged... thanks... add comments, and bumped version to 5.1.49...

Fix skipping parsing character

7d28b21

geoffmcl added the Bug label Mar 18, 2016

geoffmcl added this to the 5.2 milestone Mar 18, 2016

geoffmcl merged commit 4b135d9 into htacg:master Mar 30, 2016

geoffmcl added a commit that referenced this pull request Mar 30, 2016

Issue #384 - insert comments

1830fdb

geoffmcl added a commit that referenced this pull request Mar 30, 2016

Issue #384 - bump to 5.1.49

c19d221

ralfjunker mentioned this pull request Sep 23, 2022

Noise inserted if "<" is followd by UTF-8 code points larger than 1 byte #1054

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix skipping parsing character #384

Fix skipping parsing character #384

Uh oh!

seaburg commented Mar 17, 2016

Uh oh!

geoffmcl commented Mar 18, 2016

Uh oh!

seaburg commented Mar 19, 2016

Uh oh!

smirn0v commented Mar 25, 2016

Uh oh!

geoffmcl commented Mar 26, 2016

Uh oh!

geoffmcl commented Mar 28, 2016

Uh oh!

seaburg commented Mar 28, 2016

Uh oh!

geoffmcl commented Mar 29, 2016

Uh oh!

geoffmcl commented Mar 30, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix skipping parsing character #384

Fix skipping parsing character #384

Uh oh!

Conversation

seaburg commented Mar 17, 2016

Uh oh!

geoffmcl commented Mar 18, 2016

Uh oh!

seaburg commented Mar 19, 2016

Uh oh!

smirn0v commented Mar 25, 2016

Uh oh!

geoffmcl commented Mar 26, 2016

Uh oh!

geoffmcl commented Mar 28, 2016

Uh oh!

seaburg commented Mar 28, 2016

Uh oh!

geoffmcl commented Mar 29, 2016

Uh oh!

geoffmcl commented Mar 30, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants