rss-parser is a type-safe Python RSS/Atom parsing module built using pydantic and xmltodict.
pip install rss-parser
or
git clone https://github.com/dhvcc/rss-parser.git
cd rss-parser
poetry build
pip install dist/*.whl
- The Parser class has been renamed to RSSParser
- Models for RSS-specific schemas have been moved from rss_parser.models to rss_parser.models.rss. Generic types remain unchanged
- Date parsing has been improved and now uses pydantic's validator instead of email.utils, producing better datetime objects where it previously defaulted to str
NOTE: For parsing Atom, use AtomParser
from rss_parser import RSSParser
from requests import get # noqa
rss_url = "https://rss.art19.com/apology-line"
response = get(rss_url)
rss = RSSParser.parse(response.text)
# Print out RSS metadata
print("Language", rss.channel.language)
print("RSS", rss.version)
# Iteratively print feed items
for item in rss.channel.items:
    print(item.title)
    print(item.description[:50])
# Language en
# RSS 2.0
# Wondery Presents - Flipping The Bird: Elon vs Twitter
# <p>When Elon Musk posted a video of himself arrivi
# Introducing: The Apology Line
# <p>If you could call a number and say you’re sorry
Here we can see that the description still contains <p> tags. This is because it's wrapped in CDATA like so:
<![CDATA[<p>If you could call ...</p>]]>
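If you need plain text rather than the embedded markup, one option is to strip the tags yourself after parsing. The following is a minimal sketch using only the standard library's html.parser; strip_html is a hypothetical helper name and is not part of rss-parser.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Hypothetical helper: return the text of `markup` with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_html("<p>If you could call ...</p>"))
# If you could call ...
```

In the quickstart above this could be applied as strip_html(item.description) before slicing, assuming the description is a plain string.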
If you want to customize the schema or provide a custom one, use the schema keyword argument of the parser:
from rss_parser import RSSParser
from rss_parser.models import XMLBaseModel
from rss_parser.models.rss import RSS
from rss_parser.models.types import Tag
class CustomSchema(RSS, XMLBaseModel):
    channel: None = None  # Removing previous channel field
    custom: Tag[str]

with open("tests/samples/custom.xml") as f:
    data = f.read()
rss = RSSParser.parse(data, schema=CustomSchema)
print("RSS", rss.version)
print("Custom", rss.custom)
# RSS 2.0
# Custom Custom tag data
This library uses xmltodict to parse XML data. See the xmltodict documentation for details.
The key thing to understand is that your data is processed into dictionaries.
For example, this XML:
<tag>content</tag>
will result in the following dictionary:
{
    "tag": "content"
}
However, when handling attributes, the content of the tag will also be a dictionary:
<tag attr="1" data-value="data">data</tag>
This becomes:
{
    "tag": {
        "@attr": "1",
        "@data-value": "data",
        "#text": "data"
    }
}
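Because a tag's value can be either a plain string or a dictionary with "@"-prefixed attribute keys and a "#text" key, code that works on the raw dictionaries has to handle both shapes. Here is a small plain-Python sketch of that branching; split_tag is a hypothetical helper for illustration, not a function provided by xmltodict or rss-parser.

```python
from typing import Any, Dict, Tuple


def split_tag(value: Any) -> Tuple[Any, Dict[str, str]]:
    """Hypothetical helper: return (text, attributes) for either tag shape."""
    if isinstance(value, dict):
        # Attribute keys carry a leading "@"; the text lives under "#text".
        attributes = {k[1:]: v for k, v in value.items() if k.startswith("@")}
        return value.get("#text"), attributes
    # A tag without attributes is just its content.
    return value, {}


print(split_tag("content"))
# ('content', {})
print(split_tag({"@attr": "1", "@data-value": "data", "#text": "data"}))
# ('data', {'attr': '1', 'data-value': 'data'})
```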
Repeated child tags with the same name are collected into a list:
<div>
    <tag>content</tag>
    <tag>content2</tag>
</div>
This results in a list under the repeated tag's key:
{
    "div": {
        "tag": ["content", "content2"]
    }
}
If you don't want to deal with these conditions and want to always parse something as a list, please use rss_parser.models.types.only_list.OnlyList like we did in Channel:
from typing import Optional
from rss_parser.models.rss.item import Item
from rss_parser.models.types.only_list import OnlyList
from rss_parser.models.types.tag import Tag
from rss_parser.pydantic_proxy import import_v1_pydantic
pydantic = import_v1_pydantic()
...
class OptionalChannelElementsMixin(...):
    ...
    items: Optional[OnlyList[Tag[Item]]] = pydantic.Field(alias="item", default=[])
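To make the motivation concrete: with xmltodict, a tag that occurs once yields a single value, while a repeated tag yields a list, so consumers of the raw data would otherwise need to normalize it themselves. The sketch below shows that normalization in plain Python. It only illustrates the idea; as_list is a hypothetical helper, and this is not how OnlyList is implemented.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Hypothetical helper: always return a list, whatever shape was parsed."""
    if value is None:
        return []  # tag absent
    if isinstance(value, list):
        return value  # tag repeated, already a list
    return [value]  # tag occurred exactly once


print(as_list("content"))
# ['content']
print(as_list(["content", "content2"]))
# ['content', 'content2']
print(as_list(None))
# []
```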
Tag is a generic field type that handles tags as raw data or as a dictionary returned with attributes.
Example:
from rss_parser.models import XMLBaseModel
from rss_parser.models.types.tag import Tag
class Model(XMLBaseModel):
    width: Tag[int]
    category: Tag[str]

m = Model(
    width=48,
    category={"@someAttribute": "https://example.com", "#text": "valid string"},
)
# Content value is an integer, as per the generic type
assert m.width.content == 48
assert (type(m.width), type(m.width.content)) == (Tag[int], int)
# The attributes are empty by default
assert m.width.attributes == {}
# But are populated when provided.
# Note that the @ symbol is trimmed from the beginning and the name is converted to snake_case
assert m.category.attributes == {'some_attribute': 'https://example.com'}
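The attribute-name normalization described in the comments above (dropping the leading @ and converting camelCase to snake_case) can be approximated in plain Python as follows. This is only a sketch of the described behavior, not the library's actual code: normalize_attribute_name is a hypothetical name, and how the library treats other forms such as hyphenated names is not covered here.

```python
import re


def normalize_attribute_name(raw: str) -> str:
    """Hypothetical helper: '@someAttribute' -> 'some_attribute'."""
    name = raw.lstrip("@")
    # Insert an underscore before every capital letter except a leading one,
    # then lowercase the whole name.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


print(normalize_attribute_name("@someAttribute"))
# some_attribute
```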
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Install dependencies with poetry install (pip install poetry).
Using pre-commit is highly recommended. To install hooks, run:
poetry run pre-commit install -t=pre-commit -t=pre-push