5 changes: 5 additions & 0 deletions Readability.js
@@ -2506,6 +2506,11 @@ Readability.prototype = {
return false;
}

// Handle <img> buried inside nested <div> layers in <figure>.
if (tag === "div" && this._hasAncestorTag(node, "figure") && this._isSingleImage(node)) {
Contributor

@gijsk Nov 17, 2025

This works, but it is a little inefficient when the divs are nested: the getElementsByTagName call used to collect the divs here returns every ancestor div, and each one gets re-walked down to the same image (and up the ancestor chain to the same figure).

Would it be feasible to de-nest singly nested divs with images? Along the lines of _prepareBrs, do a prepareImages or something like that. Presumably the nesting of all the divs isn't actually useful for anything if we're removing content otherwise - though I suppose there will be a question of what to do with ids/classnames in terms of scoring...

Also, I'm curious how this works out when there are figure tags with these deeply nested images that also have other content in them (so they fail the _isSingleImage check). From a naive inspection of the page you linked (haven't tried to debug readability right now), it would seem that the page in question does have a noscript tag in the figure as well...

... in fact, why doesn't the logic in this helper already deal with all of this? Why does it fall down for this website?
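
To make the de-nesting suggestion concrete, here is a minimal sketch of what such a pass could look like. It is not code from Readability.js: the names `el`, `soleImageOfDivChain`, and `unwrapSingleImageDivs` are hypothetical, and nodes are modelled as plain objects (`tagName`, `children`, `text`) rather than DOM elements so the example runs without a browser. It also discards the wrapper divs' ids and class names, which leaves the scoring question raised above open.

```javascript
// Hypothetical sketch, not part of Readability.js.
// Nodes are plain objects so the example runs without a DOM.

// Build a tiny element-like node; `text` is the node's own text only.
function el(tagName, children = [], text = "") {
  return { tagName: tagName.toUpperCase(), children, text };
}

// If `node` is a chain of single-child DIVs that ends in an IMG, with no
// text on any DIV along the way, return that IMG; otherwise return null.
function soleImageOfDivChain(node) {
  let current = node;
  while (current.tagName === "DIV") {
    if (current.children.length !== 1 || current.text.trim() !== "") {
      return null;
    }
    current = current.children[0];
  }
  return current.tagName === "IMG" ? current : null;
}

// Walk the tree once, replacing each qualifying DIV chain with its IMG.
// Returns the number of chains collapsed.
function unwrapSingleImageDivs(root) {
  let collapsed = 0;
  root.children = root.children.map(child => {
    const img = child.tagName === "DIV" ? soleImageOfDivChain(child) : null;
    if (img) {
      collapsed++;
      return img;
    }
    collapsed += unwrapSingleImageDivs(child);
    return child;
  });
  return collapsed;
}

// A figure whose image is buried under three wrapper divs.
const figure = el("figure", [
  el("div", [el("div", [el("div", [el("img")])])]),
  el("figcaption", [], "A caption"),
]);
const collapsedCount = unwrapSingleImageDivs(figure);
console.log(collapsedCount, figure.children.map(c => c.tagName));
```

Because each chain is collapsed the first time it is encountered, a later per-div check would no longer see the ancestor wrappers at all. A div that holds anything besides the single image (a second child, or text) is left untouched.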

Contributor Author

The <noscript> tags in our example embed the <img> as escaped text content, not as HTML elements. That seems to be why the _isSingleImage() check fails.
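
The effect described in the reply above can be illustrated with a small, self-contained sketch. It rests on assumptions: `isSingleImage` below is a simplified stand-in for the helper, not its actual source, nodes are plain objects rather than DOM elements, the `<noscript>` is assumed to sit inside the node being checked, and the `photo.jpg` markup is made up for illustration.

```javascript
// Simplified stand-in for a single-image check, on plain objects.
// This is an illustration, not the Readability.js implementation.

// Build a tiny element-like node; `text` is the node's own text only.
function el(tagName, children = [], text = "") {
  return { tagName: tagName.toUpperCase(), children, text };
}

// Concatenated text of a node and its descendants (stand-in for textContent).
function textOf(node) {
  return node.text + node.children.map(textOf).join("");
}

// True if `node` is an IMG, or wraps exactly one IMG with no text around it.
function isSingleImage(node) {
  let current = node;
  while (current) {
    if (current.tagName === "IMG") {
      return true;
    }
    if (current.children.length !== 1 || textOf(current).trim() !== "") {
      return false;
    }
    current = current.children[0];
  }
  return false;
}

// Only nested wrappers around one image.
const imageOnly = el("div", [el("div", [el("img")])]);

// Same wrappers plus a <noscript> whose content is text, not a child element.
const withNoscript = el("div", [
  el("div", [el("img")]),
  el("noscript", [], '<img src="photo.jpg">'), // hypothetical escaped markup
]);

console.log(isSingleImage(imageOnly));    // true
console.log(isSingleImage(withNoscript)); // false
```

In this model the second case fails twice over: the container has two children, and the noscript's escaped markup counts as non-empty text.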

return false;
}

var weight = this._getClassWeight(node);

this.log("Cleaning Conditionally", node);