Skip to content

Conversation

@seaburg
Copy link
Contributor

@seaburg seaburg commented Mar 17, 2016

Hi!
Example:
<<a href="https://github.com/htacg/tidy-html5">tidy-html5</a>>
After an unsuccessful attempt to parse << parsing continues with a character (instead of <)

Оutput:
&lt;&lt;a href="https://github.com/htacg/tidy-html5"&gt;tidy-html5&lt;/a&gt;&gt;
Instead of:
&lt;<a href="https://github.com/htacg/tidy-html5">tidy-html5</a>&gt;

@geoffmcl
Copy link
Contributor

@seaburg, thank you for your PR. We love those ;=))

While your commit 7d28b21 certainly addresses an issue of fixing the bad html of <<a href="http://example.com">example</a>>, you seem to have also removed an earlier fix Bug 762102?

Now I have tried to reconstruct that old 762102, now sf 486, created 2003-06-27, by Igor Katraev, but I have not been very successful... not sure exactly what the input was supposed to be... too many 'escapes'!

Your fix certainly seems to produce a better output in the case of a duplicated leading <... and I certainly think the current output of &lt;&lt;a is quite stupid... and needs fixing... but your fix is not yet tried...

And maybe you have good reason why this old fix code was removed... in which case your PR is good... and will be merged... after testing...

Please explain...

@geoffmcl geoffmcl added the Bug label Mar 18, 2016
@geoffmcl geoffmcl added this to the 5.2 milestone Mar 18, 2016
@seaburg
Copy link
Contributor Author

seaburg commented Mar 19, 2016

Bug 762102 is solution of special case of the problem (& after <).
After an unsuccessful attempt to parse a symbol in LEX_GT state we should try to parse current symbol in LEX_CONTENT state (instead of parse a next symbol in LEX_CONTENT)

Example:

  1. Parse first < in LEX_CONTENT state. Put first < to lexer buffer and switch to LEX_GT state
  2. Read from stream a next symbol (second <)
  3. Unsuccessful parsing second < in LEX_GT state. Switch to LEX_CONTENT state.
  4. Put second < to lexer buffer and parse next a symbol (a) in LEX_CONTENT state. (Bad. We should not put a current symbol (second <) into lexer buffer. We should try to parse a current symbol (second <) in LEX_CONTENT state)

@smirn0v
Copy link

smirn0v commented Mar 25, 2016

Proposed solution seems to be more general fix for aforementioned https://sourceforge.net/p/tidy/bugs/486/
@geoffmcl do you have plans to merge it ?

@geoffmcl
Copy link
Contributor

@seaburg, @smirn0v, yes I can see this is a more general change for every < + non-letter case, and still testing the full consequences of that... but so far things are looking good, but want to do a little more testing before merging to master...

Hopefully, if I get the time, in just a few more days...

@geoffmcl
Copy link
Contributor

@seaburg, @smirn0v, ok after some more testing, I am re-thinking this...

Really, all we want here is to put back certain characters following the <, and back up the lexer, so that that particular character can be fetched again and assessed in the `LEX_CONTENT' state.

Previously we only put back an & 2nd character, so, in LEX_CONTENT, an entity can be tested...

And this issue specifically identifies also if the second character is another <, then this should be also put back, lexer backed up, and let content handle it... that much is agreed!

But should every next non-letter be put back in the stream to be re-assessed? The state is already being changed to LEX_CONTENT anyway... It is just a question whether this 2nd non-letter character after a < need be pushed back, and the lexer backed up each and every time...

So, at this point I am leaning towards a simpler fix of the current if (c == '&') being expanded to if ((c == '&') || (c == '<')).

Are there any other 2nd non-letter characters that need to be re-assessed?

Seek, and appreciate, more comment and testing on this...

If we can resolve this quickly, it can make it into 5.2, otherwise I will move the milestone to 5.3, since we hope to shortly issue a new 5.2 release... thanks...

@seaburg
Copy link
Contributor Author

seaburg commented Mar 28, 2016

So, at this point I am leaning towards a simpler fix of the current if (c == '&') being expanded to if ((c == '&') || (c == '<')).

No, this is not enough. In this case necessary if ((c == '&') || (c == '<') || TY_(IsWhite)(c)) ("< "(< and two spaces) parsed as "< " (two spaces after <) instead "< "(one space after <) when mode != Preformatted && mode != IgnoreMarkup)

But a excess condition complicate the logic. Therefore it is better to put any non-letter character back to buffer?

@geoffmcl
Copy link
Contributor

@seaburg yes, on tracing through it all again, and again, I am coming around to agreeing ;=))

It does seem better to put it back, back up the lexer, and let LEX_CONTENT deal with it, whatever it is, rather than trying to decide what extra logic need be added... can now not see any bad consequences with this...

Nearly out of time today, but should be able to get around to merging this tomorrow... thanks for hanging in there...

@geoffmcl geoffmcl merged commit 4b135d9 into htacg:master Mar 30, 2016
geoffmcl added a commit that referenced this pull request Mar 30, 2016
geoffmcl added a commit that referenced this pull request Mar 30, 2016
@geoffmcl
Copy link
Contributor

@seaburg now merged... thanks... add comments, and bumped version to 5.1.49...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants