This is the repository associated with the paper Practical Large-Scale Data Selection for Instruction Tuning.
We release data and models associated with this paper on HuggingFace at this collection.
Note: I assume you are working in a Linux environment with CUDA-compatible GPUs and conda installed.
Install the requirements in environment.yml and requirements.txt:
conda env create --file environment.yml --name mmft
conda activate mmft
pip install -r requirements.txt
Download the required data files (for eval and training):
./shell_scripts/download_eval_data.sh
./shell_scripts/download_train_data.sh # optional: skip if you don't plan to do data selection on the Tulu datasets.
Now you should be good to go! If you are at Ai2, you don't even need these steps, since we have a bunch of beaker-gantry scripts that should automatically set up the environment when you run them.
The pipeline for this paper has four steps, described below:
- Construct an index for training and eval data.
- Select data based on the computed scores.
- Train a model on the selected data.
- Evaluate the trained model.
Below, we detail the steps for selecting data and training with the RDS+ method. See "Other selection methods" below for how we ran the baselines.
Scripts for doing indexing and selection live in shell_scripts/core_commands/index_selection_and_creation. I have support for a few different approaches here, but the most important one is RDS+ (see our paper for more details on this).
We start by indexing our training data, using a given base model:
python -m minimal_multitask.compute_influence_cosinesim \
--model_name meta-llama/Llama-2-7b-hf \
--seed 42 \
--train_dataset data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--eval_dataset alpacafarm \
--index_path rds_200k_exp_from_base_weighted_mean_pool/cosine_train_reps.pt \
--save_dir rds_200k_exp_from_base_weighted_mean_pool/ \
--batch_size 1 \
--pooling weighted_mean
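For intuition: --pooling weighted_mean pools the model's final hidden states into a single vector per example, and RDS+ scores are cosine similarities between train and eval vectors. The sketch below illustrates one way of doing this (a position-weighted mean, weighting later tokens more heavily); the helper name and details here are hypothetical rather than the exact code in minimal_multitask:

import torch
from transformers import AutoModel, AutoTokenizer

def weighted_mean_rep(texts, model_name="meta-llama/Llama-2-7b-hf", device="cuda"):
    # Illustrative only: embed each text with a position-weighted mean pool
    # over the final hidden states of the frozen base model.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device).eval()
    reps = []
    with torch.no_grad():
        for text in texts:
            inputs = tok(text, return_tensors="pt", truncation=True).to(device)
            hidden = model(**inputs).last_hidden_state[0].float()       # (seq_len, dim)
            weights = torch.arange(1, hidden.shape[0] + 1, device=device).float()
            rep = (hidden * weights[:, None]).sum(0) / weights.sum()    # later tokens weighted more
            reps.append(torch.nn.functional.normalize(rep, dim=-1))
    return torch.stack(reps)   # unit-norm rows, so a dot product is a cosine similarity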
This gives us an index we can re-use for future queries. We then query for the eval set we care about, loading this train index:
for dataset in alpacafarm gsm8k_shots bbh_shots tydiqa_shots codex squad mmlu_shots; do
python -m minimal_multitask.compute_influence_cosinesim \
--model_name meta-llama/Llama-2-7b-hf \
--seed 42 \
--train_dataset data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--eval_dataset $dataset \
--index_path rds_200k_exp_from_base_weighted_mean_pool/cosine_train_reps.pt \
--save_dir rds_200k_exp_from_base_weighted_mean_pool/ \
--batch_size 1 \
--pooling weighted_mean
done
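Conceptually, each of these query runs embeds the eval examples the same way and scores them against the cached training index. A hypothetical sketch of that step, reusing the weighted_mean_rep helper from the sketch above (the saved index layout and pickle format are assumptions; check the actual script for the real formats):

import pickle
import torch

# Load the cached training representations built in the indexing step.
train_reps = torch.load("rds_200k_exp_from_base_weighted_mean_pool/cosine_train_reps.pt",
                        map_location="cpu").float()

# Embed the eval/query examples the same way as the training index.
eval_reps = weighted_mean_rep(["example eval prompt"]).cpu()   # hypothetical helper from above

scores = train_reps @ eval_reps.T                              # (num_train, num_eval) cosine similarities
with open("rds_200k_exp_from_base_weighted_mean_pool/alpacafarm_cossim.pkl", "wb") as f:
    pickle.dump(scores.numpy(), f)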
Finally, we want to use the computed scores to produce selected datasets for us:
for dataset in alpacafarm gsm8k_shots bbh_shots tydiqa_shots codex squad mmlu_shots; do
python -m minimal_multitask.get_top_influences \
--input_files rds_200k_exp_from_base_weighted_mean_pool/${dataset}_cossim.pkl \
--output_file rds_200k_exp_from_base_weighted_mean_pool/${dataset}_top10k.json \
--output_size 10000 --selection_method max \
--train_datasets data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--output_dataset
done
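For intuition, the max selection method scores each training example by its maximum similarity to any query example and keeps the top-k. A rough standalone sketch of the same idea (this assumes the pickle holds a train-by-eval score matrix; check the repo for the actual format the scripts use):

import json
import pickle
import numpy as np

with open("rds_200k_exp_from_base_weighted_mean_pool/alpacafarm_cossim.pkl", "rb") as f:
    scores = np.asarray(pickle.load(f))            # assumed shape: (num_train, num_eval)

per_example = scores.max(axis=1)                   # max similarity to any eval query
top_idx = np.argsort(-per_example)[:10000]         # indices of the 10k highest-scoring examples

with open("data/training_data/tulu_v2_unfiltered_data_dedup.jsonl") as f:
    train = [json.loads(line) for line in f]
with open("rds_200k_exp_from_base_weighted_mean_pool/alpacafarm_top10k.json", "w") as f:
    json.dump([train[i] for i in top_idx], f)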
We can then directly finetune on the resulting dataset with e.g.:
./shell_scripts/core_commands/model_training/full_finetune.sh rds_200k_exp_from_base_weighted_mean_pool/alpacafarm_top10k.json alpacafarm_exp
Some of the datasets we indexed are quite large, and so to process these efficiently we relied on sharding them. We provide easy ways to do this in this codebase.
First, shard your dataset, e.g. using the split command: split -l 200000 <filename>. Pick your shard size based on how many GPUs you can run in parallel. Note that the loops below expect shard files matching shards/*.jsonl, so rename or move the split output accordingly.
Then, for each shard, run the training data and eval data indexing:
for file in shards/*.jsonl; do
shard=$(basename $file .jsonl)
echo "Processing $shard"
python -m minimal_multitask.compute_influence_cosinesim \
--model_name meta-llama/Llama-2-7b-hf \
--seed 42 \
--train_dataset $file \
--eval_dataset alpacafarm \
--index_path rds_selection_${shard}/cosine_train_reps.pt \
--save_dir rds_selection_${shard}/ \
--batch_size 1 \
--pooling weighted_mean
done
for dataset in alpacafarm gsm8k_shots bbh_shots tydiqa_shots codex squad mmlu_shots; do
for file in shards/*.jsonl; do
shard=$(basename $file .jsonl)
echo "Processing $shard"
python -m minimal_multitask.compute_influence_cosinesim \
--model_name meta-llama/Llama-2-7b-hf \
--seed 42 \
--train_dataset $file \
--eval_dataset $dataset \
--index_path rds_selection_${shard}/cosine_train_reps.pt \
--save_dir rds_selection/${dataset}/${shard}/ \
--batch_size 1 \
--pooling weighted_mean
done
done
We then combine the sharded scores into one file. Note this takes a lot of RAM:
for dataset in alpacafarm bbh_shots codex gsm8k_shots tydiqa_shots mmlu_shots squad arena_hard wildchat; do
python scripts/combine_pickles.py --input_files rds_selection/${dataset}/**/*.pkl --output_file rds_selection/${dataset}_cossim.pkl
done
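If you want a sense of what the combination step is doing: it essentially concatenates the per-shard score matrices back together along the training-example axis, in shard order. A rough sketch of that idea (the paths and pickle layout here are assumptions; scripts/combine_pickles.py is the real implementation):

import glob
import pickle
import numpy as np

parts = sorted(glob.glob("rds_selection/alpacafarm/**/*.pkl", recursive=True))
combined = np.concatenate(
    [np.asarray(pickle.load(open(p, "rb"))) for p in parts], axis=0
)  # (total_num_train, num_eval); holding this in memory is why the step needs a lot of RAM
with open("rds_selection/alpacafarm_cossim.pkl", "wb") as f:
    pickle.dump(combined, f)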
Finally, given these combined scores, we can select datasets. I also have some more optimized selection code that helps when selecting from large datasets. Note that you need the full data file around to construct the train file.
for dataset in alpacafarm bbh_shots codex gsm8k_shots tydiqa_shots mmlu_shots squad arena_hard wildchat; do
python -m minimal_multitask.get_top_optimized \
--input_files rds_selection/${dataset}_cossim.pkl \
--output_file rds_selection/${dataset}_top10k.json \
--output_size 10000 --selection_method max \
--train_datasets <original full datafile>
done
# or if you want to use the round-robin setup
python -m minimal_multitask.get_top_aggregated_influences \
--input_files rds_selection/*_cossim.pkl \
--output_file rds_selection/multitask_rrmax_320k.json \
--output_size 320000 --selection_method max \
--train_dataset <original full datafile> \
--output_dataset
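For intuition, the round-robin (rrmax) setup has the tasks take turns: each task ranks the pool by its own max score and claims its best not-yet-selected example, cycling until the overall budget is filled. A simplified sketch of that idea (not the exact implementation in minimal_multitask):

import pickle
import numpy as np

def round_robin_select(score_files, budget):
    # score_files: {task_name: path to a pickled train-by-eval score matrix} (format assumed).
    # Each task ranks training examples by its max score over that task's queries.
    rankings = {}
    for task, path in score_files.items():
        scores = np.asarray(pickle.load(open(path, "rb")))
        rankings[task] = list(np.argsort(-scores.max(axis=1)))

    selected, chosen = [], set()
    pointers = {task: 0 for task in rankings}
    while len(selected) < budget:
        progressed = False
        for task, ranking in rankings.items():
            i = pointers[task]
            while i < len(ranking) and ranking[i] in chosen:   # skip examples another task already took
                i += 1
            if i < len(ranking):
                chosen.add(ranking[i])
                selected.append(ranking[i])
                progressed = True
            pointers[task] = i + 1
            if len(selected) >= budget:
                break
        if not progressed:
            break   # every task exhausted before reaching the budget
    return selected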
To train a model, you can use the scripts in shell_scripts/core_commands/model_training. The script names describe what they do. You have to provide a .jsonl file containing data formatted following the Tulu format (i.e. like the HuggingFace Tulu 2 dataset). You can also randomly subselect from this dataset, train LoRAs on top of a trained model, or train LoRAs and full-finetune. The scripts are mostly self-descriptive.
For example, to train Llama 2 7B on a given file, I can run:
./shell_scripts/core_commands/model_training/full_finetune.sh <filename> <run name>
Or to train on x random samples from a file:
./shell_scripts/core_commands/model_training/random_select.sh <filename> <run name> <x>
Please look into the scripts and feel free to edit them (in particular, you might wish to edit the output directory name).
You can also run locally with:
python -m minimal_multitask.instruction_tune \
--model_name /model \
--output_dir /results \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 128 \
--num_train_epochs 2 \
--learning_rate 2e-5 \
--seed 42 \
--warmup_ratio 0.03 \
--lr_scheduler_type linear \
--weight_decay 0. \
--evaluation_strategy no \
--save_strategy no \
--logging_steps 1 \
--is_llama=True \
--lora_alpha 512 \
--lora_rank 128 \
--use_hf_auth_token True \
--train $TRAIN_FILE
Just replace the flags with what makes sense for you. If you remove the lora_alpha and lora_rank flags, full finetuning will happen instead.
If you are at Ai2, you can use gantry with these scripts by setting GANTRY=1 before running the script, e.g.:
GANTRY=1 ./shell_scripts/core_commands/model_training/full_finetune.sh <filename> <run name>
Given a trained model, you can evaluate it using eval/eval.sh <model_path>. Look into that script if you want to run a particular eval.
We base our evaluations off the implementations in Open-Instruct.
**Note:** for non-Llama 2 models, please replace create_prompt_with_tulu_chat_format with create_prompt_with_huggingface_tokenizer_template -- this is to make sure chat templates are correctly set.
To evaluate internally at Ai2, just run ./shell_scripts/eval/eval_beaker.sh <model_name> <beaker dataset id>.
Generally, analysis and utility scripts live in scripts. These are undocumented for now, but feel free to poke around in there. I also have a tonne of old scripts from a previous version of this project in scripts/old.
While RDS+ is our flagship method, we also tested a variety of other data selection methods in the paper and we provide code for you to reproduce/try them as well.
As mentioned above, the data selection pipeline can be viewed as indexing (scoring) -> selection -> training. Different methods provide different scoring, but the rest of the procedure is the same. Note that some methods don't rely on a validation set and therefore have a slightly different pipeline.
Embedding
We support embedding models from sentence-transformers and HuggingFace. In the paper, we tested NV-Embed-V2 and GTR-t5-base.
To compute score with sentence embedding:
for dataset in alpacafarm gsm8k_shots bbh_shots tydiqa_shots codex squad mmlu_shots; do
python -m minimal_multitask.compute_influence_sentence_embedding \
--seed 42 \
--train_dataset data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--eval_dataset $dataset \
--index_path sentence_embedding_based/gtrt5base_train_reps.pt \
--save_dir sentence_embedding_based_${dataset}/ \
--batch_size 1 \
--model_name_or_path sentence-transformers/gtr-t5-base
done
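Conceptually this is just encoding both sides with the embedding model and taking cosine similarities. A minimal standalone sketch with sentence-transformers (the texts and shapes here are placeholders, not the repo's exact data handling):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/gtr-t5-base")
train_texts = ["example training instruction and response", "another training example"]
eval_texts = ["example eval prompt"]

# normalize_embeddings=True makes the dot product below a cosine similarity.
train_reps = model.encode(train_texts, batch_size=32, normalize_embeddings=True)
eval_reps = model.encode(eval_texts, batch_size=32, normalize_embeddings=True)
scores = np.asarray(train_reps) @ np.asarray(eval_reps).T   # (num_train, num_eval)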
Perplexity
We follow CCDS and support selection based on top-ppl (data points with highest perplexity) and mid-ppl (data points with middle perplexity).
To compute the perplexity on each training sample:
python -m minimal_multitask.compute_influence_perplexity \
--model_name meta-llama/Llama-2-7b-hf \
--seed 42 \
--train_dataset data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--save_dir perplexity/ \
--batch_size 1
Then the selection can be done with the following (by default using top-ppl selection; specify --mid_ppl for mid-ppl selection):
python scripts/ppl_selection.py \
--ppl_scores perplexity/nlls.pkl \
--train_dataset data/training_data/tulu_v2_unfiltered_data_dedup.jsonl \
--output_size 10000 \
--output_file_path perplexity/top_ppl_10k.json
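For intuition: top-ppl keeps the k highest-perplexity examples, while mid-ppl keeps k examples from the middle of the perplexity ranking. A rough sketch of both (this assumes perplexity/nlls.pkl holds one scalar per training example; scripts/ppl_selection.py is the real implementation):

import pickle
import numpy as np

nlls = np.asarray(pickle.load(open("perplexity/nlls.pkl", "rb")))
k = 10000

order = np.argsort(-nlls)                       # highest perplexity first
top_ppl_idx = order[:k]                         # top-ppl: the k hardest examples

mid = len(order) // 2                           # mid-ppl: k examples around the middle of the ranking
mid_ppl_idx = order[mid - k // 2 : mid + (k - k // 2)]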
LESS
We follow the procedure outlined in the paper, using the codebase shared by the authors.
IFD
We use the procedure outlined in the paper, using the codebase shared by the authors. We adjust the codebase slightly to accommodate multi-turn samples and custom chat templates.
Random
For random-unbalanced we simply randomly subsample with standard Linux utilities (shuf / seeded sort -R), e.g.:
sort --random-source=<(get_seeded_random 42) -R file.txt | head -n k > output.jsonl
Replace k with the number of samples you wish to take; get_seeded_random is a small helper function defined in the linked post. Credit to this StackOverflow post for this snippet.
For random-balanced, we provide a script:
python scripts/construct_downsampled_balanced.py <input_file> <output_file> --seed 42 --num_samples k
Again, set k to the number of samples you wish to take. This script works by sampling uniformly from the list of values in the dataset field of the dataset.
If we run out of data from a given source, we divide its remaining "budget" equally among the dataset sources we have not yet exhausted.
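That balanced sampling with budget redistribution can be sketched as follows; this is an illustration of the logic rather than the exact contents of scripts/construct_downsampled_balanced.py:

import random
from collections import defaultdict

def balanced_sample(examples, k, seed=42):
    # examples: list of dicts, each with a "dataset" field naming its source.
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["dataset"]].append(ex)
    for pool in by_source.values():
        rng.shuffle(pool)

    selected, budget = [], k
    while budget > 0 and by_source:
        # Split the remaining budget equally over the sources that still have data.
        per_source = max(budget // len(by_source), 1)
        for source in list(by_source):
            take = min(per_source, len(by_source[source]), budget)
            selected.extend(by_source[source][:take])
            by_source[source] = by_source[source][take:]
            budget -= take
            if not by_source[source]:
                del by_source[source]   # exhausted; its leftover budget goes to the remaining sources
            if budget <= 0:
                break
    return selected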
If you find this paper or code useful, please cite us with:
@misc{ivisondata2025,
title={{Large-Scale Data Selection for Instruction Tuning}},
author={Hamish Ivison and Muru Zhang and Faeze Brahman and Pang Wei Koh and Pradeep Dasigi},
year={2025},
eprint={2503.01807},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.01807},
}