Selector Schema codegen

Introduction

ssc-gen is a Python-based DSL for describing parsers of HTML documents. A schema written in this DSL is translated into a standalone parsing module.

For a better experience using this library, you should know:

CSS selectors (CSS3 standard at minimum) and XPath
regular expressions (PCRE)

The project addresses the following problems:

It is designed for parsers of SSR (server-side rendered) HTML pages, NOT for REST API or GraphQL endpoints.
It reduces boilerplate code.
It generates independent modules that can be reused outside the project.
It generates docstring documentation and the signature of the parser output.
For a better IDE experience, it generates typedefs and type annotations (if the target programming language supports them).
It supports annotating and parsing JSON-like strings from a document.
It exposes an AST API for code generation, which can be used to develop a new converter.

Supported converters

Currently supported converters

Language	HTML parser lib + dependencies	XPath	CSS3	CSS4	Generated annotations, types, structs	Formatter dependency
Python (3.8+)	bs4, lxml	N	Y	Y	TypedDict `1`, list, dict	ruff
...	parsel	Y	Y	N	...	...
...	selectolax (lexbor)	N	Y	N	...	...
...	lxml	Y	Y	N	...	...
JavaScript (ES6) `2`	pure (Firefox/Chrome extension/Node.js)	Y	Y	Y	Array, Map `3`	prettier
Go (1.10+) (UNSTABLE)	goquery, gjson `4`	N	Y	N	struct (+ json anchors), array, map	gofmt
Lua (5.2+), LuaJIT (2+) (UNSTABLE) `5`	lua-htmlparser, lrexlib (optional), dkjson	N	Y	N	EmmyLua	LuaFormatter

CSS3 means support for the following selectors:
- basic: (tag, .class, #id, tag1,tag2)
- combined: (div p, ul > li, h2 + p, title ~ head)
- attribute: (a[href], input[type='text'], a[href*='...'], ...)
- CSS3 pseudo-classes: (:nth-child(n), :first-child, :last-child)
CSS4 means support for the following selectors:
- :nth-of-type(), :where(), :is(), :not(), etc.
`1` This annotation type was deliberately chosen as a compromise: Python has many serialization options (namedtuple, dataclass, attrs, pydantic, msgspec, etc.).
- TypedDict behaves like the built-in dict but adds IDE and linter hints, and you can easily implement an adapter to the structure you need.
`2` ES2018 (ES9) or newer is required if you need the PCRE re.S | re.DOTALL flag (the JavaScript `s` regex flag).
`3` The JS converter does not use built-in serialization methods; it uses the standard Array and Map types. Rely on the generated signature documentation!
`4` The Go converter has not been tested much; there may be issues.
formatter dependency - an optional dependency used to prettify the generated code and fix its code style.
`5` Lua:
- Experimental research PoC; performance and stability are not guaranteed.
- Priority is given to generating pure Lua without C library dependencies, using mva/htmlparser and dhkolf/dkjson.
- Unsupported CSS3 selectors are translated into equivalent function calls:
  - for example, div + p is equivalent to CssExt.combine_plus(root:select("div"), "p")
- PCRE regexes are translated into Lua string pattern matching (with restrictions); for more information, see lua_re_compat.py.
Limitations

For maximum portability of the configuration to the target language:

If possible, use CSS selectors: they are guaranteed to be convertible to XPath.
Unlike JavaScript, most HTML parser libraries implement only the CSS3 selector standard, and they may not implement all of it! Check the HTML parser library's documentation on CSS selectors before writing a schema. Examples:
1. Several libraries do not support the + combinator (e.g. selectolax (modest), dart.universal_html).
2. For research purposes, the lua_htmlparser converter includes translation of unsupported CSS3 query syntax.

HTML parser libraries may not support the attribute selectors *=, ~=, |=, ^=, $=.
Several libraries do not support pseudo-classes (e.g. the standard dart.html library lacks this feature).
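
When a parser library lacks the + combinator, the selector can be emulated by walking sibling pairs by hand. The sketch below illustrates the idea with the standard library only; `adjacent_siblings` is a hypothetical helper, not part of ssc-gen, and `xml.etree.ElementTree` is an XML parser, so it assumes well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import Iterator


def adjacent_siblings(root: ET.Element, first: str, second: str) -> Iterator[ET.Element]:
    """Yield every <second> element that immediately follows a <first> sibling.

    This corresponds to the CSS selector "first + second".
    """
    for parent in root.iter():
        children = list(parent)
        # Compare each child with the element right after it.
        for prev, cur in zip(children, children[1:]):
            if prev.tag == first and cur.tag == second:
                yield cur


doc = ET.fromstring(
    "<body><div>a</div><p>first</p><p>second</p><span/><p>third</p></body>"
)
# Only the paragraph directly after the <div> matches "div + p".
matches = [el.text for el in adjacent_siblings(doc, "div", "p")]
```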

Getting started

ssc-gen requires Python 3.10 or higher.

Install

pip:

pip install ssc_codegen

uv:

uv pip install ssc_codegen

Example

Create a file `schema.py` with:

from ssc_codegen import ItemSchema, D

class HelloWorld(ItemSchema):
    title = D().css('title').text()
    a_hrefs = D().css_all('a').attr('href')
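
For comparison, here is roughly what the same extraction looks like when written by hand with only the standard library. This is an illustration of the boilerplate the schema replaces, not the code ssc-gen generates; `HelloWorldByHand` and `parse_hello_world` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class HelloWorldByHand(HTMLParser):
    """Hand-written counterpart of the HelloWorld schema (illustration only)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title: Optional[str] = None
        self.a_hrefs: List[Optional[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            # Capture only the first <title>, like a single-element selector.
            self._in_title = True
            self.title = ""
        elif tag == "a":
            # None is stored when the href attribute is missing.
            self.a_hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Text may arrive in several chunks, so accumulate it.
            self.title += data


def parse_hello_world(html: str) -> Dict[str, Any]:
    parser = HelloWorldByHand()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "a_hrefs": parser.a_hrefs}


result = parse_hello_world(
    "<html><head><title>Demo</title></head>"
    "<body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
)
```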

Try it in the CLI

Note

These tools were developed for testing purposes, not for web-scraping tasks.

Evaluate from a file

Download any HTML file and pass it as an argument:

ssc-gen parse-from-file index.html -t schema.py:HelloWorld

Short option descriptions:

-t, --target - the schema file and the class from which the parser starts

Send a GET request to a URL and parse the response

ssc-gen parse-from-url https://example.com -t schema.py:HelloWorld

Send a request via a Chromium browser (CDP)

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld

Note

If the script cannot find the Chrome executable, provide its path manually:

ssc-gen parse-from-chrome https://example.com -t schema.py:HelloWorld -sc /usr/bin/chromium

Convert to code

Convert the schema to code for use in your projects:

Note

This example uses JS because the output can be tested quickly in the browser developer console.

ssc-gen js schema.py -o .

Code output looks like this:

// autogenerated by ssc-gen DO NOT_EDIT
/***
 *
 * {
 *     "title": "String",
 *     "a_hrefs": "Array<String>"
 * }*/
class HelloWorld {
  constructor(doc) {
    if (typeof doc === "string") {
      this._doc = new DOMParser().parseFromString(doc, "text/html");
    } else if (doc instanceof Document || doc instanceof Element) {
      this._doc = doc;
    } else {
      throw new Error("Invalid input: Expected a Document, Element, or string");
    }
  }

  _parseTitle(v) {
    let v0 = v.querySelector("title");
    return typeof v0.textContent === "undefined"
      ? v0.documentElement.textContent
      : v0.textContent;
  }

  _parseAHrefs(v) {
    let v0 = Array.from(v.querySelectorAll("a"));
    return v0.map((e) => e.getAttribute("href"));
  }

  parse() {
    return {
      title: this._parseTitle(this._doc),
      a_hrefs: this._parseAHrefs(this._doc),
    };
  }
}

Copy the code output and paste it into the developer console:

Then print the result:

alert(JSON.stringify(new HelloWorld(document).parse()));

You can use any HTML source:

parse from HTML files
parse from HTTP responses
parse from browsers: Playwright, Selenium, chrome-cdp, etc.
call curl in a shell and parse STDIN
use in STDIN pipelines with third-party tools like projectdiscovery/httpx
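
The STDIN options can be wired up with a small wrapper that reads one HTML document from a stream and writes one JSON line. This is a sketch under assumptions: `run_pipeline` and `count_links` are hypothetical names, and `count_links` is a naive placeholder where a generated parser would normally be called (its exact Python call shape is not shown in this README).

```python
import io
import json
from typing import Any, Callable, Dict, TextIO


def run_pipeline(parse: Callable[[str], Dict[str, Any]], src: TextIO, dst: TextIO) -> None:
    """Read one HTML document from src, parse it, write one JSON line to dst."""
    html = src.read()
    dst.write(json.dumps(parse(html), ensure_ascii=False) + "\n")


def count_links(html: str) -> Dict[str, Any]:
    # Placeholder parser: swap in a call to your generated module here.
    return {"a_count": html.lower().count("<a ")}


# In-memory streams keep the demo self-contained.
src = io.StringIO("<a href='/x'>x</a>")
dst = io.StringIO()
run_pipeline(count_links, src, dst)
```

In a real script you would pass `sys.stdin` and `sys.stdout` instead of the in-memory streams, so that the output of curl or another tool can be piped straight into it.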
