Tools for building a sentence-segmented speech corpus from the Catalan government's language-learning resource Dictats en línia.
Important: While this code can scrape and process publicly accessible materials, the content itself is subject to a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license as stated in the credits page.
This license means:
- You must give appropriate credit
- You cannot use the materials for commercial purposes
- You cannot distribute modified versions or derivatives of the materials
Therefore, while this tool can create a corpus for research purposes, the resulting data cannot be:
- Used to train speech models (considered a derivative work)
- Used commercially
- Redistributed in modified form
The code itself is open source and can be freely used and modified.
This project consists of two main components:
- Scraper: Downloads audio files and transcripts from the Generalitat de Catalunya's language resources
- Segmenter: Processes the audio files to segment them into sentence-level audio clips with aligned transcripts
# Clone the repository
git clone https://github.com/yourusername/dictats-scrape.git
cd dictats-scrape
# Install dependencies
pip install -r requirements.txt
# Set up project for development
pip install -e .
# Set up API key for Replicate (required for audio alignment)
export REPLICATE_API_TOKEN=your_token_here
Run the main scraper script:
python scripts/gencat_main.py
This will:
- Download audio files from different Catalan learning levels (B1, B2, C1, C2)
- Extract and save transcripts
- Generate metadata for each topic
- Create a structured directory of files in downloaded_audio/
Run the segmenter script:
python scripts/segmenter_main.py
This will:
- Process the downloaded audio files
- Use an alignment API to segment audio into sentences
- Create a corpus of sentence-level audio clips with transcripts
- Generate a CSV file with all segments
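The last two steps above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it assumes the alignment step yields `(start, end, text)` tuples in seconds, and the function names, CSV columns, and clip-naming scheme are all hypothetical.

```python
import csv
import io
from typing import Iterable, List, Tuple

# Hypothetical alignment result: (start_seconds, end_seconds, sentence_text)
Segment = Tuple[float, float, str]


def ffmpeg_cut_command(src: str, dst: str, start: float, end: float) -> List[str]:
    """Build an ffmpeg command that extracts the span [start, end] of src into dst."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", src,
        "-ss", f"{start:.3f}",   # clip start, in seconds
        "-to", f"{end:.3f}",     # clip end, in seconds
        dst,
    ]


def segments_to_csv(segments: Iterable[Segment], clip_prefix: str) -> str:
    """Render one CSV row per segment; the column names are an assumption."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["clip", "start", "end", "text"])
    for i, (start, end, text) in enumerate(segments, start=1):
        clip_name = f"{clip_prefix}_{i:04d}.wav"
        writer.writerow([clip_name, f"{start:.3f}", f"{end:.3f}", text])
    return buf.getvalue()
```

Each command list could be passed to `subprocess.run(cmd, check=True)` to cut the clip. Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.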
Process only one file (for testing):
python scripts/segmenter_main.py --process-one
Specify custom directories:
python scripts/segmenter_main.py --data-dir custom_input --output-dir custom_output
Process a specific file:
python scripts/segmenter_main.py --specific-file path/to/audio.mp3 --transcript-file path/to/transcript.txt --level b1 --topic topic_name
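For orientation, the options shown above could be declared with `argparse` roughly as below. This is a sketch only: the flag names come from the commands above, but the defaults, the restriction of `--level` to lowercase values, and the rule that `--specific-file` needs `--transcript-file` are assumptions, not confirmed behaviour of `segmenter_main.py`.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Segment downloaded audio into sentence-level clips."
    )
    parser.add_argument("--process-one", action="store_true",
                        help="process a single file and stop (for testing)")
    # Default directories are assumptions made for this sketch.
    parser.add_argument("--data-dir", default="downloaded_audio",
                        help="directory containing downloaded audio and transcripts")
    parser.add_argument("--output-dir", default="corpus",
                        help="directory for segmented clips and the CSV")
    parser.add_argument("--specific-file", help="path to one audio file")
    parser.add_argument("--transcript-file", help="transcript for --specific-file")
    # Restricting the level to these values is an assumption.
    parser.add_argument("--level", choices=["b1", "b2", "c1", "c2"])
    parser.add_argument("--topic", help="topic name used to label the output")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: an audio file cannot be aligned without its transcript.
    if args.specific_file and not args.transcript_file:
        parser.error("--specific-file requires --transcript-file")
    return args
```

Calling `parse_args(["--process-one"])` returns a namespace with `process_one` set to `True` and the assumed default directories.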
- Python 3.7+
- ffmpeg (must be installed and in PATH)
- Replicate API access (for audio alignment)
- Chrome/Chromium (for web scraping with Selenium)
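A quick way to confirm these requirements before running either script is sketched below. The helper `check_prerequisites` is hypothetical and not part of this repository; the browser executable names it looks for are guesses, and the browser check is best-effort, since Chrome is often installed outside PATH (for example on macOS).

```python
import os
import shutil
import sys
from typing import Dict, Mapping, Optional

# Assumed executable names; actual names vary by platform and install method.
BROWSER_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Return a mapping of requirement name to whether it appears to be satisfied."""
    env = os.environ if env is None else env
    return {
        "python>=3.7": sys.version_info >= (3, 7),
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "REPLICATE_API_TOKEN": bool(env.get("REPLICATE_API_TOKEN")),
        # May report False even when a browser is installed outside PATH.
        "chrome": any(shutil.which(name) for name in BROWSER_CANDIDATES),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```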
- scripts/: Executable scripts
  - gencat_main.py: Main script for running the scraper
  - segmenter_main.py: Main script for running the segmenter
- src/: Source code
  - scraper/: Scraper components
    - gencat_scraper.py: Main scraper class for downloading content
    - progress_manager.py: Tracks progress of scraping
    - summary_manager.py: Generates summaries of downloaded content
  - segmenter/: Segmenter components
    - gencat_segmenter.py: Audio processing and segmentation
  - utils/: Utility functions
- data/: Data storage
  - downloaded_audio/: Raw downloaded audio files and transcripts
  - corpus/: Processed, segmented audio files and transcripts
This tool is provided for educational and research purposes only. Users are responsible for complying with the license terms of the materials they access.