This repo contains chain velds that encapsulate a comparison of two tokenization tools, teitok-tools and xmlanntools, using their veldified versions, veld_code__teitok-tools and veld_code__xmlanntools respectively. Additionally, veld_code__downloader is reused, and veld_code__jupyter_notebook_base is integrated directly into this chain repo as a git subtree.
- git
- docker compose (note: older docker compose versions require running docker-compose instead of docker compose)
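To avoid guessing which compose variant is installed, a small shell helper can probe for both. This is a minimal sketch, not part of the repo: the function name compose_cmd is hypothetical, and it assumes a POSIX-compatible shell.

```shell
# Hypothetical helper (not part of this repo): prints the compose command
# available on this machine, or fails if neither variant is found.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    # Newer installations: compose is a docker subcommand.
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    # Older installations: compose is a standalone binary.
    echo "docker-compose"
  else
    echo "no docker compose installation found" >&2
    return 1
  fi
}

# Example usage (left commented out so that sourcing this file starts nothing):
# COMPOSE=$(compose_cmd) && $COMPOSE -f veld_step_1_download.yaml up
```

The detected command can then stand in for docker compose in any of the commands in this README.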
Clone this repo with all its submodules:
git clone --recurse-submodules https://github.com/veldhub/veld_chain__compare_tokenizations.git
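If the repo was already cloned without --recurse-submodules, the submodules can usually be fetched afterwards with standard git commands. The following is a sketch under that assumption; the function name init_submodules is hypothetical and only wraps the two git calls so they can be pointed at any clone directory.

```shell
# Hypothetical helper (not part of this repo): initializes and fetches all
# submodules of an existing clone, then lists their checked-out commits.
# $1: path to the cloned repository
init_submodules() {
  git -C "$1" submodule update --init --recursive &&
  git -C "$1" submodule status
}

# Example usage (assumes the default clone directory name):
# init_submodules ./veld_chain__compare_tokenizations
```

In the git submodule status output, a line starting with "-" indicates a submodule that is still uninitialized.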
The following chain velds were used. Open the respective veld yaml file for more information.
This chain reuses the individual chains described below and allows batch execution of them all in one go.
docker compose -f veld_step_all.yaml up
Downloads a sample TEI XML file from the German ELTeC corpus.
docker compose -f veld_step_1_download.yaml up
./veld_step_2_xmlanntools.yaml
Runs xmlanntools to tokenize the TEI file.
docker compose -f veld_step_2_xmlanntools.yaml up
Runs teitok-tools to tokenize the TEI file.
docker compose -f veld_step_3_teitok.yaml up
./veld_step_4_jupyter_analysis.yaml
Launches a Jupyter notebook that compares the two enriched TEI files. After running the following command, the notebook can be opened at http://localhost:8888/. The notebook is persisted at ./code/veld_code__jupyter_analysis/src/enrichment_summary.ipynb
docker compose -f veld_step_4_jupyter_analysis.yaml up