This project consists of three core steps. The code is located in the scripts/
folder.
- Clip Segmentation
- Caption Generation
- Merge Captions
Please follow the order above when executing the steps.
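For orientation, the sketch below chains the three steps into a single-chunk end-to-end run. All paths and file names here are placeholders (including the per-chunk result file naming); the full parameter lists are given in the per-step sections that follow.

# Illustrative end-to-end run with one chunk (placeholder paths; adjust to your setup)
python clip_segmentation.py --input_folder /path/to/videos --output_dir /path/to/clips --max_workers 50
python caption_generator.py --model-path /path/to/model_weights --video-folder /path/to/clips \
  --jsonl-file /path/to/clip_info.jsonl --result-file /path/to/results/result_0_0.jsonl \
  --batch-size 16 --num-chunks 1 --chunk-idx 0
python caption_merger.py --cap-file /path/to/final_caption_output.jsonl \
  --result-file "/path/to/results/result_0_*.jsonl" --num-workers 50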
The clip segmentation script (clip_segmentation.py) splits the videos in the input folder into smaller clips according to a set of segmentation rules and saves them to the specified output directory.
Parameters:
- --input_folder: Path to the folder containing the original video files
- --output_dir: Output directory for the segmented video clips
- --max_workers: Maximum number of threads for concurrent processing
Example Command:
python clip_segmentation.py \
--input_folder /path/to/your/videos \
--output_dir /path/to/output/clips \
--max_workers 50
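As a quick sanity check after segmentation, you can count how many clips were written. The .mp4 extension and flat output layout below are assumptions; adjust them to the actual clip format and directory structure.

# Count generated clips (assumes .mp4 files under the output directory)
find /path/to/output/clips -type f -name "*.mp4" | wc -l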
The caption generation script (caption_generator.py) generates captions for the video clips produced in the previous step.
Parameters:
- --model-path: Path to the model weights file
- --video-folder: Directory containing the video clips from the previous step
- --jsonl-file: Input JSONL file recording video clip information
- --result-file: Output file path for results (generated per chunk)
- --batch-size: Batch size used during inference
- --num-chunks: Number of chunks to split the task into
- --chunk-idx: Index of the current chunk being processed (starting from 0)
- --LOG_FILE: Log output file path (optional)
Example Command (Chunked Processing):
export CKPT=/path/to/model_weights
export BASE=/path/to/workdir
CHUNKS=8
IDX=0
LOG_FILE=output_log_${IDX}.log
python caption_generator.py \
--model-path ${CKPT} \
--video-folder your/video/folder \
--jsonl-file your/jsonl/file \
--result-file your/result/file \
--batch-size 16 \
--num-chunks ${CHUNKS} \
--chunk-idx ${IDX} \
> "$LOG_FILE" 2>&1 &
The caption merging script (caption_merger.py) merges the per-chunk caption files generated in the previous step into a single final output file in JSONL format.
Parameters:
- --cap-file: Output file path after merging
- --result-file: Paths to all chunk result files (wildcard matching supported)
- --num-workers: Number of workers for parallel processing
Example Command:
python caption_merger.py \
--cap-file /path/to/final_caption_output.jsonl \
--result-file "/path/to/results/result_0_*.jsonl" \
--num-workers 50
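Because the chunk results and the merged output are plain JSONL, comparing line counts gives a rough consistency check (only rough, since the merger may reorder or deduplicate entries); this is not a replacement for caption_merger.py.

# Rough consistency check: total lines in chunk results vs. merged output
cat /path/to/results/result_0_*.jsonl | wc -l
wc -l /path/to/final_caption_output.jsonl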
The LoVR benchmark dataset has been released on Hugging Face:
LoVR-benchmark
The dataset includes:
- All videos and clips
- Caption annotations
- Video annotations
We encourage responsible usage of the LoVR benchmark. Users should not use the dataset to develop harmful, discriminatory, or privacy-invasive applications. We recommend performing fairness audits and adhering to ethical AI principles when using this dataset.
If you make use of this dataset in your work, please cite our paper (link coming soon).