-
Notifications
You must be signed in to change notification settings - Fork 690
Preserve images within figure elements. #989
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This CL fixes an issue where images nested within `<div>` elements inside a `<figure>` tag were being incorrectly removed by Readability's cleaning process. Specifically, a structure like `<figure><div><div><img></div></div></figure>` would first be transformed to `<figure><div><p><img></p></div></figure>`, and then the outer `<div>` (and its contents) would be erroneously identified as extraneous and removed. The fix introduces a targeted exception within _cleanConditionally(). It prevents the removal of `<div>` elements that meet the following criteria: * The element is a `<div>`. * The `<div>` is an ancestor of a `<figure>` element. * The `<div>` contains a single `<img>` element (potentially nested).
|
Context for this change: For recipe website: https://www.allrecipes.com/hot-honey-brussels-sprouts-recipe-11832010 Readability omits image under "Directions" section. |
|
The "1 failing check" was from |
|
Apologies for the slow response, I was at a dedicated workweek last week. In addition to reviewing this PR more closely, I'll try to find some time this week or next to get some other folks to sign up to help with reviews here.
Ideally submit a separate CL with just that change?
Would you mind adding this or a similar case as a testcase so we don't regress this in future? |
| } | ||
|
|
||
| // Handle <img> buried inside nested <div> layers in <figure>. | ||
| if (tag === "div" && this._hasAncestorTag(node, "figure") && this._isSingleImage(node)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works but is a little inefficient if the divs are all nested, because the getElementsByTagName call that is being used for the divs here will find each of the ancestor divs and re-walk it to find the same image (and the ancestor chain to find the same figure).
Would it be feasible to de-nest singly nested divs with images? Along the lines of _prepareBrs, do a prepareImages or something like that. Presumably the nesting of all the divs isn't actually useful for anything if we're removing content otherwise - though I suppose there will be a question of what to do with ids/classnames in terms of scoring...
Also, I'm curious how this works out when there are figure tags with these deeply nested images that also have other content in them (so they fail the _isSingleImage check). From a naive inspection of the page you linked (haven't tried to debug readability right now), it would seem that the page in question does have a noscript tag in the figure as well...
... in fact, why doesn't the logic in this helper already deal with all of this? Why does it fall down for this website?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The <noscript> tags in our example embeds <img> as escaped text content, not HTML tags. That seems to be why _isSingleImage() fails the check.
Perhaps this was #990 ? I just merged that, hopefully just rebasing will address this. |
|
Egh, sorry for the spam, but actually looking at the logfile for the failing PR I think the ? Which would kind of make sense, in that I didn't think the automation would block on |
This CL fixes an issue where images nested within
<div>elements inside a<figure>tag were being incorrectly removed by Readability's cleaning process. Specifically, a structure like<figure><div><div><img></div></div></figure>would first be transformed to<figure><div><p><img></p></div></figure>, and then the outer<div>(and its contents) would be erroneously identified as extraneous and removed.The fix introduces a targeted exception within _cleanConditionally(). It prevents the removal of
<div>elements that meet the following criteria:<div>.<div>is an ancestor of a<figure>element.<div>contains a single<img>element (potentially nested).