Experimental lark #2121

ghost · 2022-09-24T19:02:39Z

Illustrative work on Lark-based RDFLib parser implementations, available transiently as a draft PR.

NQuads disclaimer

I haven't back-ported the NQuads parser implementation/tests to use ConjunctiveGraph; they work just fine with the revamped Dataset codebase, which I'm using by pragmatic preference, fwiw.

test/test_w3c_spec/test_nquads_w3c.py ... [100%] Extant RDFLib NQuads parser
=== 78 passed, 1 skipped, 6 xfailed in 0.40s ===
test/test_experimental/test_lark/test_nquads_w3c.py ... [100%] Lark NQuads parser
=== 85 passed in 0.46s ===
test/test_experimental/test_lark/test_nquadsstar_w3c.py ...[100%] Lark NQuads Star parser
=== 85 passed in 0.38s ===

The Lark-based TriG parser is still W-I-P, as is the Lark-based Notation3 parser (I'm still working out whether the 2020 grammar is backward-compatible).

In-place vs Post-parse processing

Of interest: wall-clock performance results for the larkturtle parser on 50k triples produced by sp2b, comparing in-place graph update against "classic" post-parse graph update (as per pymantic's implementation), each with and without the lark-cython plugin, alongside the extant RDFLib Turtle parser:

pytest -rP test/test_experimental/test_lark/test_comparative_turtle_parser_performance.py

Lark Turtle classic parser: 5.40572
Lark-Cython Turtle classic parser: 5.37419
Lark Turtle in-place parser: 4.24691
Lark-Cython Turtle in-place parser: 3.01477
Lark-Cython Turtle in-place RDF* parser: 3.47203
Extant Turtle parser: 2.36562
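
To make the two tactics concrete, here is a minimal stdlib-only sketch. It is not the Lark-based code from this PR: the regex, the function names and the use of a plain set as the "graph" are all assumptions for illustration. The "classic" variant builds the complete parsed result before touching the graph; the "in-place" variant adds each triple as soon as it is recognised.

```python
import re
import time

# Toy N-Triples-like pattern: <s> <p> <o> .
# (IRIs only; a deliberate simplification, not the real grammar.)
TRIPLE_RE = re.compile(r"<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.")


def parse_classic(text, graph):
    """Post-parse tactic: build the full intermediate result, then update the graph."""
    parsed = [m.groups() for m in TRIPLE_RE.finditer(text)]  # intermediate structure
    for triple in parsed:  # second pass over the parsed result
        graph.add(triple)
    return graph


def parse_in_place(text, graph):
    """In-place tactic: add each triple to the graph as soon as it is recognised."""
    for m in TRIPLE_RE.finditer(text):
        graph.add(m.groups())
    return graph


def timed(fn, text):
    """Run one tactic against a fresh set-based graph and return (graph, seconds)."""
    start = time.perf_counter()
    graph = fn(text, set())
    return graph, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical sample data; the PR's figures used sp2b output instead.
    data = "\n".join(
        f"<http://example.org/s{i}> <http://example.org/p> <http://example.org/o{i}> ."
        for i in range(50_000)
    )
    for fn in (parse_classic, parse_in_place):
        graph, elapsed = timed(fn, data)
        print(f"{fn.__name__}: {len(graph)} triples in {elapsed:.5f}s")
```

Timings from a toy like this say nothing about the Lark parsers themselves; the point is only the shape of the two code paths, one of which holds a whole intermediate structure in memory before the graph is updated.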

Overall State of play

NTriples

test/test_w3c_spec/test_nt_w3c.py ... [100%] Extant RDFLib Ntriples parser
=== 60 passed, 8 xfailed in 0.27s === Full W3 NTriples test suite

test_ntriples_w3c.py ... [100%] Lark NTriples parser
=== 68 passed in 0.27s === Full W3 NTriples test suite

test_ntriplesstar_w3c.py ... [100%] Lark NTriples RDF Star parser
=== 68 passed in 0.30s === Full W3 NTriples test suite

test_ntriplesstar_w3c_rdfstar.py ... [100%] Lark NTriples RDF Star parser
=== 13 passed in 0.15s === RDF Star NTriples test suite

NQuads

test/test_w3c_spec/test_nquads_w3c.py ... [100%] Extant RDFLib NQuads parser
=== 78 passed, 1 skipped, 6 xfailed in 0.33s === Full W3 NQuads test suite

test_nquads_w3c.py ... [100%]  Lark NQuads parser
=== 85 passed in 0.35s === Full W3 NQuads test suite

test_nquadsstar_w3c.py ... [100%]  Lark NQuads RDF Star parser
=== 85 passed in 0.39s === Full W3 NQuads test suite

test_nquadsstar_w3c_rdfstar.py ... [100%] Lark NQuads RDF Star parser
=== 13 passed in 0.14s === RDF Star NQuads test suite

Turtle

test/test_w3c_spec/test_turtle_w3c.py ... [100%] Extant RDFLib Turtle parser
=== 262 passed, 29 xfailed in 1.02s === Full W3 Turtle test suite

test_turtle_w3c.py ... [100%] Lark Turtle parser
=== 278 passed, 13 xfailed in 1.26s === Full W3 Turtle test suite

test_turtlestar_w3c.py ... [100%] Lark Turtle RDF Star parser
=== 278 passed, 13 xfailed in 1.32s === Full W3 Turtle test suite

test_turtlestar_w3c_rdfstar.py ... [100%] Lark Turtle RDF Star parser
=== 43 passed, 4 xfailed in 0.66s ===  RDF Star Turtle test suite

Parse vs Process

Additionally, I did some experimental probing of the parsing/processing situation, taking advantage of tgbugs' implementation of librdf, a wrapper for RDFLib and the Redland C-implemented RDF/XML and Turtle parsers. This gave some illuminating results for the Redland Turtle parser with 500k triples:

redland_elapsed 0.8678569793701172 - parse only
redland_elapsed list = 2.4557113647460938e-05 parse only and list results
librdfpymantic_elapsed 11.232016324996948 - parse and store to a Pymantic Graph (stores triples in sets)
librdfrdflib_elapsed 18.871827363967896 - parse and store to an RDFLib Graph (stores triples as nested dicts)
pymanticlarkturtle_elapsed 50.34455347061157 - Pymantic Turtle parse and store

Comparative stats on 500k triples for RDFLib Turtle parsers:

Lark-Cython Turtle in-place parser: 32.58854
Extant Turtle parser: 25.04332
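
The storage notes above ("triples in sets" vs "triples as nested dicts") can be sketched roughly as follows. Both classes are hypothetical toys written for this illustration, not the actual Pymantic or RDFLib stores, and a real store will do more work per insert than this single index does.

```python
from collections import defaultdict


class SetStore:
    """Triples kept in a single set; pattern lookups scan every triple."""

    def __init__(self):
        self.triples = set()

    def add(self, triple):
        # One hash insert per triple.
        self.triples.add(triple)

    def objects(self, s, p):
        return {o for (s2, p2, o) in self.triples if s2 == s and p2 == p}


class NestedDictStore:
    """Triples indexed as subject -> predicate -> set of objects."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))

    def add(self, triple):
        # Two dict lookups plus a set insert per triple.
        s, p, o = triple
        self.spo[s][p].add(o)

    def objects(self, s, p):
        # .get avoids creating empty entries for unknown keys.
        return set(self.spo.get(s, {}).get(p, set()))
```

The trade-off this is meant to show: the set-based store is cheaper to fill, while the nested-dict store pays more on every add in exchange for direct lookups by subject and predicate. Whether that accounts for the gap between the two "parse and store" timings above is a guess, not something measured here.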

Sorry it's such a scrappy interim report; I'm only in the middle of it all atm.

ghost · 2022-09-24T20:01:03Z

Updated post with performance stats for extant RDFLib parsers

Graham Higgins added 4 commits September 24, 2022 18:20

migrate into transient branch for exposure

dc6094b

add illustrations of different tactics

b5ea792

add tactic-difference performance test, run with ...

0421d6d

`pytest -rP test/test_experimental/test_lark/test_comparative_turtle_parser_performance.py`

include notation3 work and sample SPARQL Lark grammar files

8f88ab1

ghost mentioned this pull request Sep 24, 2022

rdfstar parsers and serializers update #2115

Open

8 tasks

Graham Higgins added 5 commits September 24, 2022 21:04

ci-conformance

2fbf790

ci-conformance

b6dc087

remove unwanted

f40622d

ci-conformance

364b8e4

More work on lark notation3 parser.

22267c9

ghost closed this by deleting the head repository Mar 22, 2023

aucampia added the salvage This issue contains work that may be worth keeping. label Mar 22, 2023

This pull request was closed.
