Commit 1d0eb3c

committed
Added subset selection scripts for diverse dataset sampling v2
Key Features:
- Script-only structure (no __init__.py files) optimized for CLI usage
- Consolidated CLI with __main__.py entry point
- Python 3.12 and CUDA 12.1+ support
- Multi-GPU parallel processing with automatic detection
- Snowflake Arctic Embed encoder integration
- Fixed semaphore leak issues with explicit resource cleanup
- Simplified single-command installation
- Removed redundant variables

Credits: Original implementation by Krishnateja Killamsetty
Based on: https://github.com/krishnatejakk/DataCurate4LLMs

Signed-off-by: RobuRishabh <[email protected]>
1 parent 1fdc247 commit 1d0eb3c

File tree

7 files changed: +1972, -0 lines changed

Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
data/
# Python
__pycache__/
*.py[cod]
*.pyo
*.pyd
*.egg-info/
*.egg
dist/
build/
pip-wheel-metadata/

# Environments
.env
*.env
.venv/
venv/
env/

# Tooling caches
.mypy_cache/
.pytype/
.ruff_cache/
.pytest_cache/
.tox/
.nox/
.coverage
.coverage.*
coverage.xml

# Editors/IDEs
.vscode/
.idea/
.history/
.DS_Store
Thumbs.db

# Jupyter
.ipynb_checkpoints/

# Cursor
.cursor/

# Local runs / artifacts
local_outputs/

# Logs
*.log
logs/
venv/

scripts/subset_selection/README.md

Lines changed: 331 additions & 0 deletions
@@ -0,0 +1,331 @@
# Subset Selection Scripts

This package provides functionality for selecting diverse subsets of datasets using facility location maximization with embedding-based similarity.

## Overview

The subset selection scripts embed each sample and run facility location maximization over the embedding similarities to identify representative samples from large datasets. This is particularly useful for:
- Reducing dataset size while maintaining diversity
- Selecting training data that covers the full distribution
- Creating validation/test sets that represent the full dataset
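For intuition, facility location scores a candidate subset by how well every sample is covered by its most similar selected sample, and a greedy optimizer repeatedly adds whichever sample improves that coverage most. The sketch below is a toy illustration of that objective on a dense similarity matrix; it is not the package's optimizer, which uses a LazierThanLazyGreedy strategy over fold-level embeddings (controlled by `--epsilon`).

```python
import numpy as np

def greedy_facility_location(sim: np.ndarray, k: int) -> list[int]:
    """Greedily pick k indices maximizing sum_i max_{j in S} sim[i, j]."""
    n = sim.shape[0]
    selected: list[int] = []
    best_cover = np.zeros(n)  # best similarity of each sample to the selected set
    for _ in range(k):
        # Marginal gain of adding candidate j: how much total coverage improves.
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        if selected:
            gains[selected] = -np.inf  # never re-select an element
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy usage: cosine similarities of random embeddings, pick 10 "diverse" rows.
emb = np.random.rand(200, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(greedy_facility_location(emb @ emb.T, k=10))
```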
## Requirements

- **Python 3.12** (required for compatibility with the rest of the codebase)
- CUDA 12.1+ for GPU support (recommended)

## Installation

Install all dependencies, including PyTorch with CUDA support:

```bash
pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
```

**Note:** The CLI automatically configures multiprocessing to use the 'spawn' start method for CUDA compatibility, enabling efficient multi-GPU parallel processing.

## Model Setup

The default encoder (`Snowflake/snowflake-arctic-embed-l-v2.0`) needs to be available before running. Choose one of these options:

### Option 1: Auto-download with Testing Mode (Recommended for First Run)

Use `--testing-mode` to automatically download the model from HuggingFace:

```bash
python -m scripts.subset_selection \
  --input dataset.jsonl \
  --subset-sizes "0.1" \
  --testing-mode \
  --output-dir output/
```

**Important:** Despite the name, `--testing-mode` **still uses all available GPUs** for processing. It simply allows automatic model downloading from HuggingFace and provides CPU fallback if no GPUs are detected. After the first run, the model is cached and you can omit this flag.

### Option 2: Pre-download with ilab

If you have `ilab` installed:

```bash
ilab model download --repository Snowflake/snowflake-arctic-embed-l-v2.0
```

The model will be cached at `~/.cache/instructlab/models/Snowflake/snowflake-arctic-embed-l-v2.0`.

## Usage

### Command Line Interface (Recommended)

The easiest way to use subset selection is through the CLI:

```bash
# Basic usage - select 10% and 50% subsets
python -m scripts.subset_selection \
  --input path/to/dataset.jsonl \
  --subset-sizes "0.1,0.5" \
  --output-dir output/

# Absolute counts - select exactly 1000 and 5000 samples
python -m scripts.subset_selection \
  --input path/to/dataset.jsonl \
  --subset-sizes "1000,5000" \
  --output-dir output/

# Small dataset (< 100k samples) - adjust epsilon and num_folds
python -m scripts.subset_selection \
  --input path/to/small_dataset.jsonl \
  --subset-sizes "0.5" \
  --epsilon 0.1 \
  --num-folds 10 \
  --output-dir output/

# Multiple files combined
python -m scripts.subset_selection \
  --input file1.jsonl file2.jsonl file3.jsonl \
  --subset-sizes "0.25,0.5" \
  --combine-files \
  --output-dir output/

# Testing mode (allows model auto-download, still uses GPUs if available)
python -m scripts.subset_selection \
  --input dataset.jsonl \
  --subset-sizes "0.1" \
  --testing-mode \
  --output-dir output/
```

#### CLI Options

```
Required:
  --input <file> [<file> ...]  Input file(s) to process (JSONL, JSON, CSV, Parquet)
  --subset-sizes <sizes>       Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")

Optional:
  --output-dir <dir>           Output directory (default: output)
  --batch-size <int>           Batch size for processing (default: 100000)
  --num-folds <int>            Number of folds/partitions (default: 50)
  --epsilon <float>            Optimization parameter (default: 160.0)
  --num-gpus <int>             Number of GPUs to use (default: auto-detect)
  --combine-files              Combine multiple input files before processing
  --testing-mode               Enable model auto-download and CPU fallback
  --encoder-type <str>         Encoder type (default: arctic)
  --encoder-model <str>        Model name (default: Snowflake/snowflake-arctic-embed-l-v2.0)
  --template-name <str>        Template name (default: conversation)
  --seed <int>                 Random seed (default: 42)
```

### Python API

You can also use subset selection directly in Python:

```python
from scripts.subset_selection.subset_selection import subset_datasets

# Select subsets from your dataset
subset_datasets(
    input_files=["path/to/your/dataset.jsonl"],
    subset_sizes=[0.1, 0.5],  # 10% and 50% of the dataset
)
```

### Advanced Python Configuration

```python
from scripts.subset_selection.subset_selection import (
    subset_datasets,
    BasicConfig,
    EncoderConfig,
    TemplateConfig,
    SystemConfig,
)

# Configure subset selection
subset_datasets(
    input_files=["dataset1.jsonl", "dataset2.jsonl"],
    subset_sizes=[1000, 5000],  # select 1000 and 5000 samples
    output_dir="output",
    batch_size=100000,
    num_folds=50,
    combine_files=False,
    epsilon=160.0,
    encoder_type="arctic",
    encoder_model="Snowflake/snowflake-arctic-embed-l-v2.0",
    template_name="conversation",
)
```

## Configuration

### BasicConfig Parameters

- **`output_dir`**: Directory for output files (default: `"output"`)
- **`batch_size`**: Batch size for processing (default: `100000`)
- **`num_folds`**: Number of folds/partitions for subset selection (default: `50`)
  - The dataset is divided into folds for parallel processing across GPUs
  - **Recommendations based on dataset size:**
    - < 1,000 samples: use `5-10` folds
    - 1,000-10,000 samples: use `10-20` folds
    - 10,000-100,000 samples: use `20-50` folds
    - \> 100,000 samples: use `50-100` folds (default: 50)
  - More folds = better parallelization and lower memory usage per fold, at the cost of more scheduling overhead
  - Use fewer folds for small datasets to ensure each fold has enough samples
- **`combine_files`**: Whether to combine multiple input files (default: `False`)
- **`epsilon`**: Epsilon parameter for the LazierThanLazyGreedy optimizer (default: `160.0`)
  - Controls the trade-off between optimization quality and speed
  - **Recommendations based on dataset size** (see the helper sketch after this list):
    - < 1,000 samples: use `0.01-0.1`
    - 1,000-10,000 samples: use `0.1-1.0`
    - 10,000-100,000 samples: use `1.0-10.0`
    - \> 100,000 samples: use `160.0` (default)
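When driving the Python API, these recommendations can be folded into a small helper. The function below is a sketch of one possible convention based on the tables above (taking the upper end of each range); it is not part of the package:

```python
def recommended_settings(num_samples: int) -> dict:
    """Map dataset size to num_folds / epsilon per the recommendations above."""
    if num_samples < 1_000:
        return {"num_folds": 10, "epsilon": 0.1}
    if num_samples < 10_000:
        return {"num_folds": 20, "epsilon": 1.0}
    if num_samples < 100_000:
        return {"num_folds": 50, "epsilon": 10.0}
    return {"num_folds": 50, "epsilon": 160.0}  # large datasets: package defaults

# Hypothetical usage with the documented subset_datasets keyword arguments:
# subset_datasets(
#     input_files=["dataset.jsonl"],
#     subset_sizes=[0.1],
#     **recommended_settings(num_samples=50_000),
# )
```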
### EncoderConfig Parameters

- `encoder_type`: Type of encoder to use (default: "arctic")
- `encoder_model`: Model name for the encoder
- `instruction`: Custom instruction for embedding generation
- `testing_mode`: Enable model auto-download from HuggingFace and CPU fallback (default: False)

### TemplateConfig Parameters

- `template_name`: Name of the template to use (default: "conversation")
- `templates`: Custom templates for text formatting

### SystemConfig Parameters

- `num_gpus`: Number of GPUs to use (auto-detected by default)
- `seed`: Random seed for reproducibility (default: 42)
- `max_retries`: Maximum number of retries on failure (default: 3)
- `retry_delay`: Delay between retries in seconds (default: 30)
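The retry settings amount to a simple retry-with-delay loop around each processing attempt. A rough sketch of that behavior with the defaults above (not the package's actual implementation):

```python
import time

def run_with_retries(task, max_retries: int = 3, retry_delay: int = 30):
    """Call task(); on failure, sleep retry_delay seconds and try again,
    up to max_retries attempts (mirrors the SystemConfig defaults)."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as err:  # in practice, catch narrower exception types
            last_error = err
            if attempt < max_retries:
                time.sleep(retry_delay)
    raise RuntimeError(f"task failed after {max_retries} attempts") from last_error
```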
## Package Structure

```
scripts/
└── subset_selection/
    ├── __main__.py                  # Entry point for module execution
    ├── subset_selection.py          # Main subset selection logic, CLI, and encoder registry
    ├── requirements.txt             # Package dependencies
    ├── README.md                    # This file
    ├── encoders/
    │   └── arctic_encoder.py        # Arctic embedding encoder
    └── utils/
        └── subset_selection_utils.py  # Utility functions
```

## Supported Encoders

Currently supported encoders:
- `arctic`: Snowflake Arctic Embed models

To see all supported encoders:

```python
from scripts.subset_selection.subset_selection import get_supported_encoders
print(get_supported_encoders())
```

## Output Files

The script generates several output files:

1. **Embeddings**: Stored in HDF5 format in `{output_dir}/{dataset_name}/embeddings/`
2. **Metadata**: NPZ files containing indices and gains for each subset
3. **Subset Files**: Dataset subsets in the original file format (JSON, CSV, Parquet)
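To inspect these artifacts, standard `h5py`/`numpy` readers are sufficient. The snippet below is a sketch only; the exact file names, extensions, and NPZ key names (e.g. `indices`, `gains`) depend on the run and are assumptions here, so it simply lists what each file contains:

```python
import glob

import h5py
import numpy as np

# Embeddings (HDF5). The ".h5" extension and internal dataset names are assumptions.
for path in glob.glob("output/my_dataset/embeddings/*.h5"):
    with h5py.File(path, "r") as f:
        print(path)
        f.visit(print)  # print every group/dataset name stored in the file

# Subset metadata (NPZ) holding indices and gains for each subset.
for path in glob.glob("output/my_dataset/**/*.npz", recursive=True):
    with np.load(path) as npz:
        print(path, {key: npz[key].shape for key in npz.files})
```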
## Quick Start Example

Using a data file placed under `scripts/subset_selection/data/`:

```bash
# Navigate to the project root
cd /path/to/odh-data-processing

# Run subset selection - select 10% and 50% subsets
python -m scripts.subset_selection \
  --input scripts/subset_selection/data/combined_cut_50x.jsonl \
  --subset-sizes "0.1,0.5" \
  --output-dir scripts/subset_selection/data/output \
  --epsilon 0.1 \
  --num-folds 10

# Check results
ls scripts/subset_selection/data/output/
```

## Troubleshooting

### CUDA Multiprocessing Errors

The CLI automatically handles CUDA multiprocessing compatibility by setting the start method to 'spawn' (required on Linux, where 'fork' is the default). If you are using the Python API directly and encounter errors like:

```
RuntimeError: Cannot re-initialize CUDA in forked subprocess
```

add this at the start of your script:

```python
import multiprocessing
multiprocessing.set_start_method('spawn', force=True)
```

### Model Not Found Error

If you see `Model not found in available models: Snowflake/snowflake-arctic-embed-l-v2.0`, you have two options:

1. **Use `--testing-mode`** to auto-download from HuggingFace (still uses GPUs):
   ```bash
   python -m scripts.subset_selection --input data.jsonl --subset-sizes "0.1" --testing-mode --output-dir output/
   ```

2. **Pre-download with ilab**:
   ```bash
   ilab model download --repository Snowflake/snowflake-arctic-embed-l-v2.0
   ```

### Memory Issues

If you run out of GPU memory:
- Increase `--num-folds` so each fold (and its similarity computation) is smaller
- Reduce `--num-gpus` to use fewer GPUs
- For small datasets (<10k samples), use fewer folds (5-10) so each fold still has enough samples
- The default batch size is optimized for A100 GPUs; lower `--batch-size` if needed

### GPU Not Detected

Verify CUDA is properly installed and accessible:

```bash
# Check GPU availability
nvidia-smi

# Check PyTorch CUDA
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
```

## Notes

- **Dataset Size**: Subset selection is optimized for datasets >100k samples
  - For smaller datasets, adjust `--epsilon` and `--num-folds` accordingly
- **GPU Requirement**: GPU acceleration is strongly recommended for production use
  - The code automatically uses all available GPUs with parallel processing
  - CPU fallback is available with `--testing-mode` (much slower)
- **Multiple GPUs**: Automatically detects and utilizes all available GPUs
  - Uses the 'spawn' multiprocessing method for CUDA compatibility
  - Override with the `--num-gpus` flag if needed
- **Memory**: Each fold processes independently, so more folds = less memory per fold
- **Performance**:
  - Larger epsilon values = faster but potentially lower quality
  - More folds = better GPU utilization but more overhead
  - Multi-GPU processing scales roughly linearly with the number of GPUs

## Credits and Acknowledgements

This subset selection implementation is derived from the **DataCurate4LLMs** project.

### Original Author

**Krishnateja Killamsetty**

### Original Repository

The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
#!/usr/bin/env python3
"""
Entry point for subset selection when run as a module.
"""
import sys
from scripts.subset_selection.subset_selection import main

if __name__ == "__main__":
    sys.exit(main())
