@ghost ghost commented Sep 24, 2022

Illustrative work on Lark-based RDFLib parser implementations, available transiently as a draft PR.

NQuads disclaimer

I haven't back-ported the NQuads parser implementation/tests to use ConjunctiveGraph; they work just fine with the revamped Dataset codebase, which I'm using by pragmatic preference, fwiw.

test/test_w3c_spec/test_nquads_w3c.py ... [100%] Extant RDFLib NQuads parser
=== 78 passed, 1 skipped, 6 xfailed in 0.40s ===
test/test_experimental/test_lark/test_nquads_w3c.py ... [100%] Lark NQuads parser
=== 85 passed in 0.46s ===
test/test_experimental/test_lark/test_nquadsstar_w3c.py ... [100%] Lark NQuads Star parser
=== 85 passed in 0.38s ===

The Lark-based TriG parser is still W-I-P, as is the Lark-based Notation3 parser (I'm still working out whether the 2020 grammar is backward-compatible).

In-place vs Post-parse processing

Of interest are the wall-clock performance results for the larkturtle parser on 50k triples produced by sp2b. They compare in-place graph update against "classic" post-parse graph update (as per pymantic's implementation), each with and without the lark-cython plugin, and set both against the extant RDFLib turtle parser:

pytest -rP test/test_experimental/test_lark/test_comparative_turtle_parser_performance.py
  • Lark Turtle classic parser: 5.40572
  • Lark-Cython Turtle classic parser: 5.37419
  • Lark Turtle in-place parser: 4.24691
  • Lark-Cython Turtle in-place parser: 3.01477
  • Lark-Cython Turtle in-place RDF* parser: 3.47203
  • Extant Turtle parser: 2.36562
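
To make the in-place vs post-parse distinction concrete, here is a toy, stdlib-only sketch. It is not the Lark or RDFLib code from the PR: the regex, the function names and the use of a plain `set` as the "graph" are all hypothetical stand-ins, and it assumes "in-place" means updating the graph as each statement is recognised, while "classic" means materialising the whole parse result first and walking it afterwards.

```python
import re

# Hypothetical stand-in for a real grammar: one triple per line, IRIs,
# blank nodes and simple quoted literals only.
TRIPLE = re.compile(
    r'\s*(<[^>]*>|_:\w+)\s+(<[^>]*>)\s+(<[^>]*>|_:\w+|"[^"]*")\s*\.\s*$'
)


def parse_statements(text):
    """Yield (subject, predicate, object) tuples one statement at a time."""
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"cannot parse: {line!r}")
        yield match.groups()


def load_post_parse(text, graph):
    """'Classic' style: build the full parse result, then update the graph."""
    parsed = list(parse_statements(text))  # intermediate structure kept in memory
    for triple in parsed:
        graph.add(triple)
    return graph


def load_in_place(text, graph):
    """In-place style: update the graph while statements are being recognised."""
    for triple in parse_statements(text):
        graph.add(triple)
    return graph
```

Both loaders produce the same graph; the difference is only whether an intermediate result is held before the graph is touched. With Lark's LALR parser the analogous choice is, roughly, passing `transformer=` to the `Lark` constructor versus calling `transform()` on the tree returned by `parse()`.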

Overall State of play

NTriples

test/test_w3c_spec/test_nt_w3c.py ... [100%] Extant RDFLib NTriples parser
=== 60 passed, 8 xfailed in 0.27s === Full W3 NTriples test suite

test_ntriples_w3c.py ... [100%] Lark NTriples parser
=== 68 passed in 0.27s === Full W3 NTriples test suite

test_ntriplesstar_w3c.py ... [100%] Lark NTriples RDF Star parser
=== 68 passed in 0.30s === Full W3 NTriples test suite

test_ntriplesstar_w3c_rdfstar.py ... [100%] Lark NTriples RDF Star parser
=== 13 passed in 0.15s === RDF Star NTriples test suite

NQuads

test/test_w3c_spec/test_nquads_w3c.py ... [100%] Extant RDFLib NQuads parser
=== 78 passed, 1 skipped, 6 xfailed in 0.33s === Full W3 NQuads test suite

test_nquads_w3c.py ... [100%] Lark NQuads parser
=== 85 passed in 0.35s === Full W3 NQuads test suite

test_nquadsstar_w3c.py ... [100%] Lark NQuads RDF Star parser
=== 85 passed in 0.39s === Full W3 NQuads test suite

test_nquadsstar_w3c_rdfstar.py ... [100%] Lark NQuads RDF Star parser
=== 13 passed in 0.14s === RDF Star NQuads test suite

Turtle

test/test_w3c_spec/test_turtle_w3c.py ... [100%] Extant RDFLib Turtle parser
=== 262 passed, 29 xfailed in 1.02s === Full W3 Turtle test suite

test_turtle_w3c.py ... [100%] Lark Turtle parser
=== 278 passed, 13 xfailed in 1.26s === Full W3 Turtle test suite

test_turtlestar_w3c.py ... [100%] Lark Turtle RDF Star parser
=== 278 passed, 13 xfailed in 1.32s === Full W3 Turtle test suite

test_turtlestar_w3c_rdfstar.py ... [100%] Lark Turtle RDF Star parser
=== 43 passed, 4 xfailed in 0.66s === RDF Star Turtle test suite

Parse vs Process

Additionally, I did some experimental probing of the parsing/processing situation, taking advantage of tgbugs' implementation of librdf, a wrapper for RDFLib and the Redland C-implemented RDF/XML and Turtle parsers. It produced some illuminating results for the Redland Turtle parser with 500k triples:

  • redland_elapsed 0.8678569793701172 - parse only
  • redland_elapsed list 2.4557113647460938e-05 - parse only and list results
  • librdfpymantic_elapsed 11.232016324996948 - parse and store to a Pymantic Graph (stores triples in sets)
  • librdfrdflib_elapsed 18.871827363967896 - parse and store to an RDFLib Graph (stores triples as nested dicts)
  • pymanticlarkturtle_elapsed 50.34455347061157 - Pymantic Turtle parse and store
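
The "sets" vs "nested dicts" storage shapes mentioned in the list above can be sketched as follows. This is a minimal illustration of the two data structures only; `SetStore` and `NestedDictStore` are hypothetical names and neither class is the actual Pymantic or RDFLib code, which do more work per triple than this.

```python
from collections import defaultdict


class SetStore:
    """Triples held in one flat set of (s, p, o) tuples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))  # one hash and one insert per triple

    def __len__(self):
        return len(self.triples)

    def objects(self, s, p):
        # Without an index, a lookup has to scan every stored triple.
        return {o2 for (s2, p2, o2) in self.triples if s2 == s and p2 == p}


class NestedDictStore:
    """Triples held in a subject -> predicate -> {objects} index."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self._size = 0

    def add(self, s, p, o):
        objs = self.spo[s][p]  # two dict lookups before the insert
        if o not in objs:
            objs.add(o)
            self._size += 1

    def __len__(self):
        return self._size

    def objects(self, s, p):
        # .get() avoids creating empty entries for unknown subjects.
        return set(self.spo.get(s, {}).get(p, ()))
```

The nested layout costs more on every `add` but answers `objects(s, p)` without a scan, which is one plausible reason storing is slower than parsing alone in the figures above.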

Comparative stats on 500k triples for RDFLib Turtle parsers:

  • Lark-Cython Turtle in-place parser: 32.58854
  • Extant Turtle parser: 25.04332

Sorry it's such a scrappy interim report; I'm only in the middle of it all atm.

@ghost ghost mentioned this pull request Sep 24, 2022

ghost commented Sep 24, 2022

Updated post with performance stats for extant RDFLib parsers

@ghost ghost closed this by deleting the head repository Mar 22, 2023
@aucampia aucampia added the salvage This issue contains work that may be worth keeping. label Mar 22, 2023
This pull request was closed.