Skip to content

Conversation

@samuelhuang
Copy link
Contributor

This CL fixes an issue where images nested within <div> elements inside a <figure> tag were being incorrectly removed by Readability's cleaning process. Specifically, a structure like
<figure><div><div><img></div></div></figure> would first be transformed to <figure><div><p><img></p></div></figure>, and then the outer <div> (and its contents) would be erroneously identified as extraneous and removed.

The fix introduces a targeted exception within _cleanConditionally(). It prevents the removal of <div> elements that meet the following criteria:

  • The element is a <div>.
  • The <div> is an ancestor of a <figure> element.
  • The <div> contains a single <img> element (potentially nested).

This CL fixes an issue where images nested within `<div>` elements
inside a `<figure>` tag were being incorrectly removed by Readability's
cleaning process. Specifically, a structure like
`<figure><div><div><img></div></div></figure>` would first be
transformed to `<figure><div><p><img></p></div></figure>`, and then
the outer `<div>` (and its contents) would be erroneously identified
as extraneous and removed.

The fix introduces a targeted exception within _cleanConditionally().
It prevents the removal of `<div>` elements that meet the following
criteria:
* The element is a `<div>`.
* The `<div>` is an ancestor of a `<figure>` element.
* The `<div>` contains a single `<img>` element (potentially nested).
@samuelhuang
Copy link
Contributor Author

Context for this change: For recipe website:

https://www.allrecipes.com/hot-honey-brussels-sprouts-recipe-11832010

Readability omits image under "Directions" section.

@samuelhuang
Copy link
Contributor Author

The "1 failing check" was from npm audit, which disappears when I run npm audit fix. But running this updates package-lock.json. Should I bundle the change to this CL?

@gijsk
Copy link
Contributor

gijsk commented Nov 17, 2025

Apologies for the slow response, I was at a dedicated workweek last week. In addition to reviewing this PR more closely, I'll try to find some time this week or next to get some other folks to sign up to help with reviews here.

The "1 failing check" was from npm audit, which disappears when I run npm audit fix. But running this updates package-lock.json. Should I bundle the change to this CL?

Ideally submit a separate CL with just that change?

Context for this change: For recipe website:

https://www.allrecipes.com/hot-honey-brussels-sprouts-recipe-11832010

Readability omits image under "Directions" section.

Would you mind adding this or a similar case as a testcase so we don't regress this in future?

}

// Handle <img> buried inside nested <div> layers in <figure>.
if (tag === "div" && this._hasAncestorTag(node, "figure") && this._isSingleImage(node)) {
Copy link
Contributor

@gijsk gijsk Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works but is a little inefficient if the divs are all nested, because the getElementsByTagName call that is being used for the divs here will find each of the ancestor divs and re-walk it to find the same image (and the ancestor chain to find the same figure).

Would it be feasible to de-nest singly nested divs with images? Along the lines of _prepareBrs, do a prepareImages or something like that. Presumably the nesting of all the divs isn't actually useful for anything if we're removing content otherwise - though I suppose there will be a question of what to do with ids/classnames in terms of scoring...

Also, I'm curious how this works out when there are figure tags with these deeply nested images that also have other content in them (so they fail the _isSingleImage check). From a naive inspection of the page you linked (haven't tried to debug readability right now), it would seem that the page in question does have a noscript tag in the figure as well...

... in fact, why doesn't the logic in this helper already deal with all of this? Why does it fall down for this website?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The <noscript> tags in our example embeds <img> as escaped text content, not HTML tags. That seems to be why _isSingleImage() fails the check.

@gijsk
Copy link
Contributor

gijsk commented Nov 18, 2025

The "1 failing check" was from npm audit, which disappears when I run npm audit fix. But running this updates package-lock.json. Should I bundle the change to this CL?

Ideally submit a separate CL with just that change?

Perhaps this was #990 ? I just merged that, hopefully just rebasing will address this.

@gijsk
Copy link
Contributor

gijsk commented Nov 18, 2025

Egh, sorry for the spam, but actually looking at the logfile for the failing PR I think the npm audit thing is more visible due to the coloured formatting, but the actual problem is probably:

[warn] Readability.js
[warn] Code style issues found in the above file. Run Prettier with --write to fix.

? Which would kind of make sense, in that I didn't think the automation would block on npm audit issues at all...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants