Allow SitemapRequestLoader to accept Request objects for sitemap_urls #1517
Replies: 3 comments 6 replies
Hello @loic-bellinger and thanks for using Crawlee! While I understand the use case, I'm afraid that implicitly propagating the However, it is true that this kind of behavior is pretty hard to achieve with the current sitemap loader. Paging @Mantisus as the author of the current implementation - any idea how to hack it together or how to improve the loader so that it's possible?
I think we could add support for It is more transparent and is already used in
Hi team,
First off, thanks for the great work on crawlee-python!
I'm proposing a small enhancement to make the SitemapRequestLoader more flexible: allowing the sitemap_urls parameter to accept a Sequence[str | Request].
This would allow users to attach metadata via user_data directly to sitemap sources, which is invaluable for data enrichment.
Use Case
A common task is scraping sites from different companies, where each company has its own metadata (e.g., a company_id). We need to attach this company_id to every page scraped from that company's sitemap.
Currently, this requires indirect workarounds because the context is lost when only URL strings are passed to the loader.
Proposed Solution
If the loader accepted Request objects, the process would be much cleaner. The user_data could then be propagated to every URL discovered within that sitemap, making data enrichment direct and reliable.
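To make the idea concrete, here is a minimal sketch of the propagation described above. It deliberately does not import crawlee: the Request dataclass is a hypothetical stand-in with only url and user_data, and normalize_sources and propagate_user_data are illustrative helper names, not part of the library. How the real loader would wire this in is an open question.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Request:
    """Hypothetical stand-in for crawlee's Request; only the fields used here."""

    url: str
    user_data: dict[str, Any] = field(default_factory=dict)


def normalize_sources(sitemap_urls: Sequence[str | Request]) -> list[Request]:
    """Accept plain URL strings (today's input) or Request objects (proposed)."""
    return [s if isinstance(s, Request) else Request(url=s) for s in sitemap_urls]


def propagate_user_data(source: Request, discovered_urls: Iterable[str]) -> list[Request]:
    """Copy the sitemap source's user_data onto every URL found in that sitemap."""
    # Each page gets its own dict so later mutations do not leak between requests.
    return [Request(url=u, user_data=dict(source.user_data)) for u in discovered_urls]


# Example: one plain string source and one source carrying company metadata.
sources = normalize_sources([
    "https://example.com/sitemap.xml",
    Request(url="https://acme.example/sitemap.xml", user_data={"company_id": 42}),
])
pages = propagate_user_data(
    sources[1],
    ["https://acme.example/a", "https://acme.example/b"],
)
```

Plain strings are wrapped into a Request with empty user_data, so existing callers that pass only URLs would see no change in behavior under this sketch.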
This change would be fully backward-compatible. Thanks for considering it.