Conversation

@RobuRishabh (Contributor) commented Oct 14, 2025

Description

https://issues.redhat.com/browse/RHAIENG-1146
This PR contains only the subset selection scripts (.py files).

How Has This Been Tested?

  • Tested locally in a CPU-only environment (no GPU).
  • Tested on AWS with a GPU.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added a comprehensive subset selection pipeline with multi-GPU support for processing datasets.
    • Introduced command-line interface for easy dataset subsetting with customizable parameters.
    • Added support for text embeddings using Snowflake Arctic model.
    • Implemented facility location optimization for intelligent subset generation.
  • Documentation

    • Added comprehensive README with installation, usage, and configuration guidelines.

@coderabbitai bot commented Oct 14, 2025

Walkthrough

This PR introduces a complete subset selection pipeline module at scripts/subset_selection/ featuring embedding generation using Snowflake's Arctic model, multi-GPU processing, facility location-based subset optimization, and CLI-driven orchestration with configurable templating and fold-based selection.
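To make the orchestration concrete, here is a minimal sketch of driving the pipeline from Python. `subset_datasets` is the public entry point re-exported by the package, but the keyword argument names shown are assumptions for illustration, not the verified signature:

```python
# Hedged sketch: subset_datasets exists in subset_selection.py, but these
# keyword argument names are assumed for illustration only.
from scripts.subset_selection import subset_datasets

subset_datasets(
    input_files=["data/train.jsonl"],  # JSONL/JSON/CSV/Parquet per the README
    subset_sizes=[0.1, 1000],          # a 10% subset and an absolute 1000-sample subset
    output_dir="output/",
)
```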

Changes

  • Configuration & Package Structure
    Files: scripts/subset_selection/__init__.py, scripts/subset_selection/.gitignore
    Summary: Package initializer re-exporting the public API (BasicConfig, DataProcessor, EncoderConfig, etc.); a .gitignore with standard Python, environment, tooling, and IDE exclusions.
  • Documentation
    Files: scripts/subset_selection/README.md, scripts/subset_selection/requirements.txt
    Summary: README documenting the package overview, installation, CLI/Python API usage, configuration parameters, and quick start; pinned dependencies for ML frameworks, data processing, subset selection, templating, and logging.
  • Encoder Implementation
    Files: scripts/subset_selection/encoders/arctic_encoder.py, scripts/subset_selection/encoders/__init__.py
    Summary: ArcticEmbedEncoder class for text embeddings using Snowflake's pretrained model with FP16 support and per-instance config; an encoder registry and discovery function mapping "arctic" to the encoder class.
  • Core Pipeline & Orchestration
    Files: scripts/subset_selection/subset_selection.py
    Summary: Central DataProcessor orchestrating the end-to-end workflow: configuration dataclasses (BasicConfig, EncoderConfig, TemplateConfig, SystemConfig, ProcessingConfig), multi-GPU embedding generation, fold-based subset selection via facility location, dataset loading/combining, templating, and comprehensive error handling.
  • CLI Interface
    Files: scripts/subset_selection/cli.py
    Summary: Command-line entry point with argument parsing for input files, subset sizes, output directory, GPU configuration, fold parameters, encoder settings, and template options; calls subset_datasets with the parsed configuration.
  • Utility Functions
    Files: scripts/subset_selection/utils/subset_selection_utils.py, scripts/subset_selection/utils/__init__.py
    Summary: Retry decorator for exception handling with CUDA cache cleanup; GPU availability detection; batched pairwise metric computation (cosine, euclidean, RBF) with device management; utility re-exports. A hedged sketch of the batched pairwise computation follows this list.
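As referenced above, a condensed sketch of what a batched pairwise-cosine helper such as compute_pairwise_dense does. This is an illustration under assumptions, not the code under review; the real function also supports euclidean and RBF metrics plus scaling options:

```python
import torch
import torch.nn.functional as F


def pairwise_cosine_batched(
    a: torch.Tensor, b: torch.Tensor, batch_size: int = 1024, device: str = "cpu"
) -> torch.Tensor:
    """Compute the full |a| x |b| cosine-similarity matrix in row batches."""
    a_norm = F.normalize(a.to(device), p=2, dim=1)
    b_norm = F.normalize(b.to(device), p=2, dim=1)
    rows = []
    for i in range(0, a_norm.shape[0], batch_size):
        # Each iteration fills a (batch, |b|) block of the similarity matrix,
        # so only one block at a time lives on the device.
        rows.append(a_norm[i : i + batch_size] @ b_norm.T)
    return torch.cat(rows, dim=0).cpu()
```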

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI Interface
    participant DP as DataProcessor
    participant Enc as ArcticEmbedEncoder
    participant GPU1 as GPU-1<br/>(Shard Process)
    participant GPU2 as GPU-2<br/>(Shard Process)
    participant Subset as Facility<br/>Location
    participant Out as Output<br/>Persistence

    CLI->>DP: parse config & call subset_datasets()
    DP->>DP: load & combine datasets
    DP->>DP: split dataset into shards per GPU
    par GPU Embedding Generation
        DP->>GPU1: _process_dataset_shard(shard_1)
        GPU1->>Enc: encode(texts)
        Enc->>Enc: tokenize, batch, inference
        Enc-->>GPU1: embeddings
        GPU1->>GPU1: save embeddings
        GPU1-->>DP: shard complete
    and
        DP->>GPU2: _process_dataset_shard(shard_2)
        GPU2->>Enc: encode(texts)
        Enc-->>GPU2: embeddings
        GPU2->>GPU2: save embeddings
        GPU2-->>DP: shard complete
    end
    DP->>DP: merge shard embeddings
    par Fold-based Subset Selection
        DP->>Subset: process_folds_with_gpu(fold_1)
        Subset->>Subset: facility location optimization
        Subset-->>DP: subset indices
    and
        DP->>Subset: process_folds_with_gpu(fold_2)
        Subset-->>DP: subset indices
    end
    DP->>Out: _save_subset(data, metadata)
    Out-->>CLI: completion signal
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • GPU coordination & multi-processing: Multi-GPU shard distribution, process pool management, and device placement logic in subset_selection.py and arctic_encoder.py require careful verification of synchronization and memory cleanup.
  • Embedding generation pipeline: Batching logic, tokenization with max_length handling, CLS pooling, L2 normalization, and tensor/array output options in arctic_encoder.py need validation for correctness across diverse input shapes.
  • Subset selection algorithm: Fold-based facility location optimization via process_folds_with_gpu() in subset_selection.py — verify fold splitting, GPU affinity, and result aggregation.
  • Error handling & retry logic: Decorator in subset_selection_utils.py with CUDA-specific exception handling and GC cleanup; decorator application across DataProcessor methods in subset_selection.py (a sketch of the pattern follows this list).
  • Configuration & dataclass orchestration: Five interdependent config dataclasses flow through DataProcessor; verify initialization order and default value consistency.
  • Dataset loading & templating: Custom format_text logic and dataset combining in subset_selection.py must correctly preserve indices for subset extraction.
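As referenced in the error-handling bullet, a minimal sketch of the retry-with-CUDA-cleanup pattern; the real retry_on_exception in subset_selection_utils.py differs in details such as the handled exception types, logging, and backoff:

```python
import functools
import gc
import time

import torch


def retry_on_exception(max_attempts: int = 3, delay: float = 1.0):
    """Sketch of a retry decorator that frees CUDA cache between attempts."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                # CUDA OOM surfaces as torch.cuda.OutOfMemoryError, a
                # RuntimeError subclass, so this catch covers both.
                except RuntimeError as e:
                    last_exception = e
                    # Free cached GPU memory and collect garbage before retrying.
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()
                    gc.collect()
                    time.sleep(delay)
            raise last_exception

        return wrapper

    return decorator
```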

Poem

🐰✨ A subset selection rabbit hops with glee,
Arctic embeddings dance on GPU spree!
Fold by fold, the sparse forests gleam,
Facility location realizes the dream!
Multi-GPU magic, embeddings so fine,
Your data subsets shall perfectly align! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

  • Title Check: ❓ Inconclusive
    Explanation: The PR title "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" describes the nature of the changes at a high level but lacks specificity about the actual content being introduced. While the title accurately reflects that this is an initial setup with testing and includes a clarifying issue reference, the phrase "Initial Repository Setup" is quite generic and doesn't clearly communicate that the main change is a complete subset selection pipeline with embeddings, utilities, and CLI infrastructure. Someone scanning the PR history would need to reference the issue number to understand what is being set up, rather than gleaning it directly from the title. The title is not misleading or unrelated to the changeset, but it is overly broad and could apply to many different kinds of repository initialization.
    Resolution: Consider revising the title to be more specific about the core content being added. For example, "Add end-to-end subset selection pipeline scripts with baseline testing" or "Add subset selection module with embeddings encoder and CLI" would more clearly convey to reviewers scanning the history what this PR introduces, while still maintaining brevity and clarity.
✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 91.18%, which is sufficient; the required threshold is 80.00%.

@RobuRishabh changed the title from "R1146 only scripts for subset selection" to "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" on Oct 14, 2025.
@coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cd429 and 0300fc6.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (2)
  • EncoderConfig (47-55)
  • encode (170-204)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.0)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


116-116: f-string without any placeholders

Remove extraneous f prefix

(F541)


147-147: Consider moving this statement to an else block

(TRY300)


148-148: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/__init__.py

17-20: Abstract raise to an inner function

(TRY301)


17-20: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Prefer TypeError exception for invalid type

(TRY004)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Prefer TypeError exception for invalid type

(TRY004)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


161-163: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Avoid specifying long messages outside the exception class

(TRY003)


181-181: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


235-237: Avoid specifying long messages outside the exception class

(TRY003)


256-258: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


412-412: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


424-427: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


494-494: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


494-494: Use explicit conversion flag

Replace with conversion flag

(RUF010)


558-558: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


558-558: Use explicit conversion flag

Replace with conversion flag

(RUF010)


614-614: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


640-640: Abstract raise to an inner function

(TRY301)


640-640: Avoid specifying long messages outside the exception class

(TRY003)


683-683: Consider moving this statement to an else block

(TRY300)


688-688: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


688-688: Use explicit conversion flag

Replace with conversion flag

(RUF010)


762-762: Abstract raise to an inner function

(TRY301)


762-762: Avoid specifying long messages outside the exception class

(TRY003)


831-833: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


832-832: Use explicit conversion flag

Replace with conversion flag

(RUF010)


847-847: Consider moving this statement to an else block

(TRY300)


849-849: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


849-849: Use explicit conversion flag

Replace with conversion flag

(RUF010)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)

Comment on lines +253 to +263

```python
        if isinstance(size_spec, float):
            # if not in range of 0 to 1, raise error
            if size_spec <= 0 or size_spec > 1:
                raise ValueError(
                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
                )
            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
            return max(1, int((size_spec) * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```


⚠️ Potential issue | 🔴 Critical

Fix subset percentage normalization

ProcessingConfig.__post_init__ explicitly allows float subset_sizes up to 100 to represent percentages, but calculate_subset_size then rejects anything greater than 1. A config such as subset_sizes=[10.0] sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.

```diff
         if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError(
+                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+                )
+            return max(1, int(percentage * total_samples))
```

Make the same normalization when computing per-fold budgets so both code paths agree.

🧰 Tools
🪛 Ruff (0.14.0)

256-258: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.

Comment on lines +769 to +827

```python
        for fold_idx, fold_indices in gpu_folds_info:
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

                fold_embeddings = embeddings[fold_indices].to(device)

                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()

                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )

                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        # Percentage-based selection
                        budget = max(
                            1, math.ceil(size_spec * similarity_matrix.shape[0])
                        )
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(
                                size_spec * (similarity_matrix.shape[0] / total_samples)
                            ),
                        )

                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )

                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )

                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```


⚠️ Potential issue | 🔴 Critical

Skip empty folds and cap per-fold budgets

When len(embeddings) < num_folds we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. subset_sizes=[500] on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases FacilityLocationFunction.maximize is asked to pick at least one item from a fold that has zero (or fewer than budget) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(
+                            1,
+                            math.ceil(size_spec * (fold_size / total_samples)),
+                        )
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```

Comment on lines +122 to +127

```python
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```

⚠️ Potential issue | 🟠 Major

Clamp the RBF scaling factor to avoid NaNs

When all embeddings in a batch are identical (or nearly so), squared_distance becomes 0 everywhere, making avg_dist zero. Dividing by kw * avg_dist (Line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.

```diff
         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
```
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices.

Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.


207-207: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.


270-270: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

`124-124`: **Remove unnecessary f-string prefix.**

The string has no placeholders, so the `f` prefix is unnecessary.

Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0300fc6 and eac077a.

📒 Files selected for processing (3)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

103-103: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


207-207: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


155-155: Consider moving this statement to an else block

(TRY300)


156-156: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (10)
scripts/subset_selection/README.md (7)

1-58: LGTM! Clear installation and model setup documentation.

The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.


59-100: LGTM! Comprehensive CLI usage examples.

The CLI usage examples cover a wide range of scenarios and provide clear guidance for different use cases.


122-161: LGTM! Clear Python API documentation.

The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.


162-204: LGTM! Thorough configuration documentation.

The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.


224-243: LGTM! Clear documentation of encoders and outputs.

The supported encoders and output files sections are concise and informative.


280-312: LGTM! Comprehensive troubleshooting guidance.

The troubleshooting section provides clear solutions for common issues users may encounter.


313-328: LGTM! Helpful notes for users.

The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.

scripts/subset_selection/cli.py (3)

13-101: LGTM! Well-structured argument parsing.

The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.


103-122: LGTM! Proper multiprocessing setup and argument parsing.

The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats.

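The parsing behavior praised above could look like the following hedged sketch; the helper name and exact rules are assumptions for illustration, not the CLI's actual code:

```python
def parse_subset_sizes(spec: str) -> list[float | int]:
    """Parse "0.1,0.5" into fractions and "1000,5000" into absolute counts."""
    sizes: list[float | int] = []
    for token in spec.split(","):
        token = token.strip()
        # Tokens with a decimal point are percentages/fractions; others are counts.
        sizes.append(float(token) if "." in token else int(token))
    return sizes


assert parse_subset_sizes("0.1,0.5") == [0.1, 0.5]
assert parse_subset_sizes("1000,5000") == [1000, 5000]
```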

147-162: LGTM! Appropriate exception handling for CLI entry point.

The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).

@alimaredia (Contributor) commented:

@RobuRishabh at a high level I can tell this took a lot of work and looks great to me. I really like how you got rid of the config file.

At a high level there's a few small things I would change about this PR:

  1. There's no mention of the author or the original repository this work came from
  2. Since we're not trying to make a regular package out of these scripts all of the __init__.py's can be dropped (https://stackoverflow.com/a/37140173) and relative imports can be used. The code in encoders/__init__.py can go into subset_selection.py.
  3. This is a personal preference, but I don't see the need to have a cli.py. I would combine that file with subset_selection.py and just have subset_selection.py be the entry point for the user.

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch 3 times, most recently from ae39d24 to e69bef8 on October 22, 2025 at 20:26.
@coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (6)
scripts/subset_selection/requirements.txt (2)

38-49: Blocker: CUDA-only wheels and pinned +cu121 builds break CPU installs and CI

The nvidia-* CUDA wheels and +cu121 variants force a CUDA 12.1 Linux stack; pip fails on CPU-only machines and many CI runners. Make the default requirements CPU-friendly and move CUDA wheels to an optional GPU requirements file or extras. Also drop the explicit triton pin (bundled by torch) to avoid resolver conflicts.

Suggested approach:

  • requirements.txt (CPU-default): remove lines 38–49 and use generic torch/vision/audio wheels.
  • requirements-gpu-cu121.txt: contain the CUDA builds and instruct installing with the PyTorch extra index.

Example edits (CPU default):

```diff
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-nccl-cu12==2.21.5
- nvidia-nvjitlink-cu12==12.9.86
- nvidia-nvtx-cu12==12.1.105
- torch==2.5.1+cu121
- torchaudio==2.5.1+cu121
- torchvision==0.20.1+cu121
- triton==3.1.0
+ torch==2.5.1
+ torchaudio==2.5.1
+ torchvision==0.20.1
```

Provide a GPU file with the CUDA-tagged trio and doc: pip install -r requirements-gpu-cu121.txt --extra-index-url https://download.pytorch.org/whl/cu121.

Please confirm the CPU path installs on a clean Python 3.12 environment and that GPU users can install via the GPU file.

Also applies to: 70-76


70-72: Security: pinned PyTorch build flagged by multiple critical advisories

torch==2.5.1+cu121 is flagged (GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Upgrade to a patched PyTorch release and align torchvision/torchaudio accordingly. If upgrade is blocked, document mitigations and avoid torch.load on untrusted inputs.

What is the first PyTorch version that fixes GHSA-53q9-r3pm-6pq6 and GHSA-887c-mr87-cxwp? Provide official PyTorch release notes or advisories.
scripts/subset_selection/README.md (1)

241-244: Replace developer-specific absolute path with a project-relative path

Use a repo-relative path to avoid confusion and leaking local FS details.

```diff
-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
```
scripts/subset_selection/utils/subset_selection_utils.py (1)

122-127: Clamp RBF scale to avoid NaNs when avg_dist ≈ 0

Division by kw * avg_dist can produce NaNs for identical embeddings. Clamp the denominator away from zero.

```diff
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
```
scripts/subset_selection/subset_selection.py (2)

277-285: Fix percentage handling: accept 0–100 floats and normalize

Config allows floats up to 100, but this path rejects values >1. Normalize percentages >1 to fractions and validate 0–100.

```diff
-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+            return max(1, int(percentage * total_samples))
```

816-875: Robustness: skip empty folds and cap per-fold budgets

Empty folds and over-large budgets crash the optimizer. Skip zero-size folds and clamp budget ≤ fold size. Normalize float percentages >1 here too.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+                continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
-                        # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```
🧹 Nitpick comments (12)
scripts/subset_selection/README.md (3)

21-23: Document CPU vs GPU install flows

Add a CPU-friendly install (default) and an optional GPU install with the PyTorch extra index, aligning with the split requirements.

````diff
-```bash
-pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
-```
+```bash
+# CPU (default)
+pip install -r scripts/subset_selection/requirements.txt
+
+# GPU (CUDA 12.1)
+pip install -r scripts/subset_selection/requirements-gpu-cu121.txt \
+  --extra-index-url https://download.pytorch.org/whl/cu121
+```
````

99-117: Add languages to fenced code blocks for linting and readability

Several code fences lack a language identifier (MD040). Tag them with bash, python, or text as appropriate.

Also applies to: 203-215, 262-271


329-331: Fix bare URL formatting

Use Markdown link syntax to avoid MD034.

```diff
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
```
scripts/subset_selection/encoders/arctic_encoder.py (2)

104-110: UX: testing_mode log claims “model not found” without checking

You log “Model not found locally” unconditionally in testing mode. Check os.path.exists(model_path) before warning, and prefer local files even in testing mode to avoid redundant downloads.

```diff
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
-            logger.warning(
-                f"Model not found locally at {model_path}. "
-                "Testing mode enabled - downloading from HuggingFace..."
-            )
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+            if not os.path.exists(model_path):
+                logger.warning(
+                    f"Model not found locally at {model_path}. "
+                    "Testing mode enabled - downloading from HuggingFace..."
+                )
+            else:
+                logger.info(f"Using local model path in testing mode: {model_path}")
```

Also applies to: 116-121


19-23: Dead code

safe_print is unused. Remove or use it.

scripts/subset_selection/utils/subset_selection_utils.py (2)

13-18: Library code shouldn’t call logging.basicConfig

Configure handlers in the application, not in reusable modules. Remove basicConfig and rely on module loggers.

```diff
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```

34-40: Preserve tracebacks with logger.exception

Replace logger.error(..., str(e)) with logger.exception("...") to keep stack traces.

```diff
-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d", attempt + 1)
...
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d", attempt + 1)
...
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d", attempt + 1)
...
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d", attempt + 1)
...
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d", attempt + 1)
...
-        raise last_exception
+        raise last_exception
```

Also applies to: 44-52, 80-82, 128-129

scripts/subset_selection/subset_selection.py (4)

361-369: Use Pool context manager to ensure clean teardown

Prefer with Pool(...) as pool: over manual close/join/terminate.

```diff
-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
```

426-433: Same: use Pool context manager here too

Mirror the pattern to avoid leaks on exceptions.

```diff
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
```

659-681: Memory pressure: avoid buffering all embeddings in RAM

all_embeddings holds all batches before writing, which can blow memory on large shards. Stream directly to HDF5 (resizable dataset) or flush periodically.

I can provide a streaming HDF5 writer snippet if you want to adopt it in this function.

Also applies to: 682-691, 703-717
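Such a writer might look like the following hedged sketch using an h5py resizable dataset; the function and dataset names are illustrative, not the module's actual API:

```python
import h5py
import numpy as np


def write_embeddings_streaming(batches, path: str, dim: int) -> None:
    """Append each (n, dim) batch to a resizable HDF5 dataset instead of
    accumulating all embeddings in memory first."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "embeddings",
            shape=(0, dim),
            maxshape=(None, dim),  # unlimited along the sample axis
            dtype="float32",
            chunks=True,
        )
        for batch in batches:
            n = batch.shape[0]
            dset.resize(dset.shape[0] + n, axis=0)
            dset[-n:] = np.asarray(batch, dtype=np.float32)
```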


1105-1111: Nit: remove unused f-string prefix

print(f"Starting subset selection...") has no placeholders.

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```
scripts/subset_selection/__main__.py (1)

6-6: Use a relative intra‑package import.

More robust within the package and avoids relying on the top‑level package name.

```diff
-from scripts.subset_selection.subset_selection import main
+from .subset_selection import main
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eac077a and e69bef8.

📒 Files selected for processing (7)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1081-1139)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (106-117)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (58-204)
  • EncoderConfig (47-55)
  • encode (170-204)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

99-99: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


203-203: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


262-262: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


327-327: Bare URL used

(MD034, no-bare-urls)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 362-362: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 426-426: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 613-613: Too many statements (57/50)

(R0915)


[refactor] 786-786: Too many branches (16/12)

(R0912)


[refactor] 786-786: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

53-56: Abstract raise to an inner function

(TRY301)


53-56: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Use explicit conversion flag

Replace with conversion flag

(RUF010)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


179-179: Prefer TypeError exception for invalid type

(TRY004)


179-179: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Prefer TypeError exception for invalid type

(TRY004)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


185-187: Avoid specifying long messages outside the exception class

(TRY003)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


230-230: Avoid specifying long messages outside the exception class

(TRY003)


259-261: Avoid specifying long messages outside the exception class

(TRY003)


280-282: Avoid specifying long messages outside the exception class

(TRY003)


374-374: Avoid specifying long messages outside the exception class

(TRY003)


458-461: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


528-528: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


528-528: Use explicit conversion flag

Replace with conversion flag

(RUF010)


592-592: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


592-592: Use explicit conversion flag

Replace with conversion flag

(RUF010)


648-648: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


674-674: Abstract raise to an inner function

(TRY301)


674-674: Avoid specifying long messages outside the exception class

(TRY003)


727-727: Consider moving this statement to an else block

(TRY300)


732-732: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


732-732: Use explicit conversion flag

Replace with conversion flag

(RUF010)


810-810: Abstract raise to an inner function

(TRY301)


810-810: Avoid specifying long messages outside the exception class

(TRY003)


879-881: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


880-880: Use explicit conversion flag

Replace with conversion flag

(RUF010)


900-900: Consider moving this statement to an else block

(TRY300)


902-902: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


902-902: Use explicit conversion flag

Replace with conversion flag

(RUF010)


973-973: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


973-973: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1105-1105: f-string without any placeholders

Remove extraneous f prefix

(F541)


1136-1136: Consider moving this statement to an else block

(TRY300)


1137-1137: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
scripts/subset_selection/requirements.txt (1)

35-37: Sanity-check numeric stack in isolation

  • Use python3 (matching CI) instead of python3.12 to avoid interpreter mismatches
  • Exclude GPU-specific PyTorch wheels (torch==2.5.1+cu121) when verifying core numeric libs, e.g.:
    grep -Ev '^torch==' scripts/subset_selection/requirements.txt > tmp-reqs.txt
  • Run:
    python3 -m venv /tmp/odh-reqs && source /tmp/odh-reqs/bin/activate
    pip install --upgrade pip
    pip install -r tmp-reqs.txt
    python - <<'PY'
    import numpy, scipy, sklearn, numba, llvmlite; print("OK")
    PY
scripts/subset_selection/__main__.py (1)

8-10: Entry point looks good. Document invocation.

LGTM. Please ensure README shows usage: python -m scripts.subset_selection ... so users don’t try to run the file directly.

Comment on lines +103 to +115

```python
        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,
            )
        else:
```

⚠️ Potential issue | 🟠 Major

Security posture: unrestricted remote code execution during model load

trust_remote_code=True allows arbitrary code from model repos. For reproducibility and safety, pin a specific revision (commit hash) and maintain an allowlist of approved repos; set trust_remote_code only for those.

```diff
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
+            revision = os.environ.get("ARCTIC_MODEL_REV", None)
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=True,  # allowlisted repo
+                revision=revision,
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,
            )
        else:

        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            revision = os.environ.get("ARCTIC_MODEL_REV", None)
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.cfg.model_name,
                revision=revision,
            )
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,  # allowlisted repo
                revision=revision,
            )
        else:
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 103 to 115,
the code loads a remote HF model with trust_remote_code=True which permits
arbitrary remote code execution; change the loader to require a pinned revision
(pass revision=<commit hash> from config) and only enable trust_remote_code when
the model repo is on an approved allowlist. Concretely: add a cfg field for
revision (fail if missing in testing_mode), add or read an allowlist of approved
model IDs, check that cfg.model_name is in that allowlist before setting
trust_remote_code=True (otherwise set it False), and pass
revision=self.cfg.revision to both AutoTokenizer.from_pretrained and
AutoModel.from_pretrained; log a warning and abort/raise if testing_mode
requests an unapproved repo or missing revision.

Comment on lines +181 to +199
        encodings = self.tokenizer(
            inputs,
            max_length=self.cfg.model_config["max_length"],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.cfg.device)

        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())


⚠️ Potential issue | 🔴 Critical

Critical: pre-tokenizing all inputs then moving to GPU will OOM on large batches

self.tokenizer(...).to(self.cfg.device) tokenizes the entire input list and moves all tensors to GPU, defeating batching. Tokenize and move per batch inside the loop.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
-        embeddings_list = []
-        for i in tqdm(
+        embeddings_list = []
+        for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
             outputs = self.model(**batch)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        encodings = self.tokenizer(
            inputs,
            max_length=self.cfg.model_config["max_length"],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.cfg.device)
        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())

        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch_texts = inputs[i : i + self.cfg.batch_size]
            enc = self.tokenizer(
                batch_texts,
                max_length=self.cfg.model_config["max_length"],
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 181 to 199,
the code currently tokenizes the entire inputs list and calls
.to(self.cfg.device) on the resulting tensors which moves all tokenized data to
GPU and can OOM for large datasets; change to tokenizing inside the loop per
batch and only move that batch's tensors to the device. Specifically, remove the
single pre-tokenize call, and inside the tqdm loop build a slice of raw inputs,
call self.tokenizer(batch_inputs, max_length=..., padding=True, truncation=True,
return_tensors="pt") and then call .to(self.cfg.device) on that batch dict
before passing to model; keep collecting normalized embeddings.cpu() into
embeddings_list as before.

Comment on lines +604 to +611
        extension = input_file.split(".")[-1]
        if extension in ["json", "jsonl"]:
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)


⚠️ Potential issue | 🟡 Minor

Write JSON vs JSONL correctly

to_json(..., lines=True) produces JSONL. Only set lines=True for .jsonl; use lines=False for .json.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        extension = input_file.split(".")[-1]
        if extension in ["json", "jsonl"]:
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)

        extension = input_file.split(".")[-1]
        if extension == "jsonl":
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "json":
            subset_data.to_json(output_file, orient="records", lines=False)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 604 to 611, the
code always calls to_json(..., lines=True) which writes JSONL; change the logic
to write JSONL only when the extension is ".jsonl" and write normal JSON for
".json" (i.e., call to_json(..., lines=False) or omit lines), keeping
orient="records" for both if desired; update the extension check to distinguish
"json" vs "jsonl" and call to_json with lines=False for ".json" and lines=True
for ".jsonl".

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from e69bef8 to 1d0eb3c (October 22, 2025 21:29)
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
notebooks/use-cases/document-conversion-vlm.ipynb (1)

60-65: Fix undefined variable in exception example.

The new try/except snippet calls converter.convert(file_path) and logs file_path, but in this notebook file_path isn’t defined until after convert succeeds. Copying the snippet as-is will raise a NameError. Please either call converter.convert(file) (and log file) or show how to define file_path before the conversion.

-    result = converter.convert(file_path)
+    result = converter.convert(file)
 ...
-    print(f"❌ Failed to convert {file_path}: {str(e)}")
+    print(f"❌ Failed to convert {file}: {e}")
notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)

485-488: Pin revision and restrict trust_remote_code when loading HF model.

Unpinned trust_remote_code=True permits arbitrary code execution and non-reproducible results. Pin a commit and optionally enforce an allowlist; set device explicitly.

Replace the model init cell with:

import os, torch
ALLOWED = {"Snowflake/snowflake-arctic-embed-l"}
REV = os.environ.get("ARCTIC_MODEL_REV")  # set to pinned commit hash
if REV is None:
    raise ValueError("Set ARCTIC_MODEL_REV to a pinned commit for reproducibility.")

embedding_model = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-l",
    trust_remote_code=("Snowflake/snowflake-arctic-embed-l" in ALLOWED),
    revision=REV,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
♻️ Duplicate comments (7)
scripts/subset_selection/__main__.py (1)

1-1: Drop the shebang (module entrypoint).

Remove the shebang to avoid Ruff EXE001; this module is invoked via python -m.

-#!/usr/bin/env python3
scripts/subset_selection/encoders/arctic_encoder.py (2)

104-116: Harden HF model loading: pin revision and restrict trust_remote_code.

Remote loading with trust_remote_code=True without a pinned revision allows arbitrary code execution and non-reproducible runs. Require a pinned revision and only allow trust_remote_code for an explicit allowlist.

@@
-        # In testing mode, allow direct download from HuggingFace
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+        # In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+        ALLOWLIST = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+        revision = os.environ.get("ARCTIC_MODEL_REV", None)
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
             logger.warning(
                 f"Model not found locally at {model_path}. "
                 "Testing mode enabled - downloading from HuggingFace..."
             )
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
+            if revision is None:
+                raise ValueError("ARCTIC_MODEL_REV must be set to a pinned commit hash when testing_mode=True")
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
             self.model = AutoModel.from_pretrained(
                 self.cfg.model_name,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=(self.cfg.model_name in ALLOWLIST),
+                revision=revision,
             )
@@
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=(self.cfg.model_name in ALLOWLIST),
                 local_files_only=True,
             )

Also applies to: 123-129


182-189: Avoid OOM: tokenize and move to device per-batch, not for the entire corpus.

Current code pre-tokenizes and moves all tensors to GPU, which will OOM on large inputs. Tokenize inside the loop and move only that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
         embeddings_list = []
         for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-            outputs = self.model(**batch)
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+            outputs = self.model(**batch)

Also applies to: 191-199

scripts/subset_selection/utils/subset_selection_utils.py (1)

123-128: Clamp RBF scale to avoid 0/0 NaNs.

When all vectors are identical, avg_dist=0 leads to NaNs. Clamp before division.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
scripts/subset_selection/subset_selection.py (3)

278-286: Fix percentage normalization (floats >1 should mean percent).

Config allows floats up to 100, but here values >1 are rejected. Normalize >1 by /100 and validate.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+            return max(1, int(percentage * total_samples))

831-889: Skip empty folds and cap per-fold budgets to fold size.

Empty folds and oversized budgets crash the optimizer. Guard folds and clamp budgets; also normalize float percentages >1 here.

         for fold_idx, fold_indices in gpu_folds_info:
             try:
                 logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
 
-                fold_embeddings = embeddings[fold_indices].to(device)
+                fold_embeddings = embeddings[fold_indices].to(device)
+                fold_size = fold_embeddings.shape[0]
+                if fold_size == 0:
+                    logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                    continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
-                        # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

610-617: Write JSON vs JSONL correctly.

Use lines=True only for .jsonl; write standard JSON for .json.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
🧹 Nitpick comments (6)
scripts/subset_selection/encoders/arctic_encoder.py (1)

75-77: Optional: support Apple Metal (MPS) device.

Improve device fallback to prefer MPS on Apple if CUDA is unavailable.

-        self.device = device or torch.device(
-            "cuda" if torch.cuda.is_available() else "cpu"
-        )
+        if device is not None:
+            self.device = device
+        else:
+            if torch.cuda.is_available():
+                self.device = torch.device("cuda")
+            elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
+                self.device = torch.device("mps")
+            else:
+                self.device = torch.device("cpu")
notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)

43-44: Reproducibility: pin package versions (or use a constraints file).

Unpinned installs can break the notebook over time. Pin docling and sentence-transformers (you already pin pymilvus).

Example:

%pip install -qq "docling==2.43.0" "sentence-transformers==3.2.1" "numpy==2.0.2" "pymilvus[milvus_lite]==2.6.2"
scripts/subset_selection/utils/subset_selection_utils.py (2)

107-107: Preserve tensor dtype in results.

Initialize results with the input dtype to avoid unintended downcasting.

-    results = torch.zeros(n_samples1, n_samples2, device="cpu")
+    results = torch.zeros(n_samples1, n_samples2, device="cpu", dtype=tensor1.dtype)

35-35: Use logger.exception to keep stack traces on retries.

Retain traceback for failed attempts to aid debugging.

-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d: %s", attempt + 1, e)

Also applies to: 39-41, 45-45, 49-49, 53-53

scripts/subset_selection/subset_selection.py (2)

365-372: Use context managers for Pools to avoid resource leaks.

with Pool(...) ensures proper cleanup even on exceptions.

-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
@@
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)

Also applies to: 429-436


206-209: Optional: enable Jinja2 autoescape for safety.

Set autoescape to reduce surprises when rendering user-provided text.

-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))
@@
-        env = Environment(loader=BaseLoader())
+        env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))

Also applies to: 656-657

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d0eb3c and 7a580fa.

📒 Files selected for processing (8)
  • notebooks/tutorials/dataset-preparation-for-RAG.ipynb (38 hunks)
  • notebooks/use-cases/document-conversion-standard.ipynb (18 hunks)
  • notebooks/use-cases/document-conversion-vlm.ipynb (17 hunks)
  • notebooks/use-cases/hybrid-chunking-demonstration.ipynb (20 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • notebooks/use-cases/hybrid-chunking-demonstration.ipynb
  • notebooks/use-cases/document-conversion-standard.ipynb
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (59-205)
  • EncoderConfig (48-56)
  • encode (171-205)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (87-148)
  • get_default_num_gpus (67-84)
  • retry_on_exception (21-64)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1094-1152)
🪛 Pylint (4.0.1)
scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 60-60: Too many positional arguments (6/5)

(R0917)


[refactor] 59-59: Too few public methods (1/2)

(R0903)

scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 87-87: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/subset_selection.py

[refactor] 365-365: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 429-429: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 619-619: Too many statements (57/50)

(R0915)


[refactor] 800-800: Too many branches (16/12)

(R0912)


[refactor] 800-800: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/encoders/arctic_encoder.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)


118-121: Avoid specifying long messages outside the exception class

(TRY003)


149-152: Avoid specifying long messages outside the exception class

(TRY003)


162-165: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


129-129: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


180-180: Prefer TypeError exception for invalid type

(TRY004)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Prefer TypeError exception for invalid type

(TRY004)


184-184: Avoid specifying long messages outside the exception class

(TRY003)


186-188: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


260-262: Avoid specifying long messages outside the exception class

(TRY003)


281-283: Avoid specifying long messages outside the exception class

(TRY003)


377-377: Avoid specifying long messages outside the exception class

(TRY003)


534-534: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


534-534: Use explicit conversion flag

Replace with conversion flag

(RUF010)


598-598: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


598-598: Use explicit conversion flag

Replace with conversion flag

(RUF010)


656-656: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


684-684: Abstract raise to an inner function

(TRY301)


684-684: Avoid specifying long messages outside the exception class

(TRY003)


741-741: Consider moving this statement to an else block

(TRY300)


746-746: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


746-746: Use explicit conversion flag

Replace with conversion flag

(RUF010)


824-824: Abstract raise to an inner function

(TRY301)


824-824: Avoid specifying long messages outside the exception class

(TRY003)


893-895: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


894-894: Use explicit conversion flag

Replace with conversion flag

(RUF010)


914-914: Consider moving this statement to an else block

(TRY300)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)


987-987: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


987-987: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1149-1149: Consider moving this statement to an else block

(TRY300)


1150-1150: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

Comment on lines +131 to +136
        if self.cfg.use_fp16:
            self.model = self.model.half()

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")


⚠️ Potential issue | 🟠 Major

Guard fp16 on CPU.

Calling .half() on CPU models can degrade or break ops. Apply fp16 only on CUDA; warn otherwise.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        if self.cfg.use_fp16:
            self.model = self.model.half()

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")

        if self.cfg.use_fp16:
            if self.cfg.device.type == "cuda":
                self.model = self.model.to(dtype=torch.float16)
            else:
                logger.warning("use_fp16=True ignored on non-CUDA device")

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")

@alimaredia (Contributor)

@RobuRishabh it looks like you accidentally pushed notebooks into this PR that have already been merged.

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from 7a580fa to abb5c87 (October 23, 2025 16:24)
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (9)
scripts/subset_selection/encoders/arctic_encoder.py (3)

104-115: Pin HF revision and restrict trust_remote_code (remote code execution risk).

Loading with trust_remote_code=True without a pinned revision allows arbitrary remote code execution and non-reproducible builds. Gate trust_remote_code behind an allowlist and require a pinned revision (commit SHA) for both tokenizer and model. Also pass revision when downloading; keep local_files_only=True for local paths.

Apply:

@@
-        # In testing mode, allow direct download from HuggingFace
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+        # In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
             logger.warning(
                 f"Model not found locally at {model_path}. "
                 "Testing mode enabled - downloading from HuggingFace..."
             )
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
-                add_pooling_layer=False,
-                trust_remote_code=True,
-            )
+            revision = os.environ.get("ARCTIC_MODEL_REV")
+            allowlist = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+            trust_rc = self.cfg.model_name in allowlist
+            if revision is None:
+                raise RuntimeError("ARCTIC_MODEL_REV env var must be set to a pinned commit SHA in testing_mode")
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
+                add_pooling_layer=False,
+                trust_remote_code=trust_rc,
+                revision=revision,
+            )
         else:
             if not os.path.exists(model_path):
                 raise ValueError(
                     f"Model not found in available models: {self.cfg.model_name}\n"
                     "Please run `ilab model download` and download the necessary model"
                 )
 
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=True,  # local allowlisted artifact
                 local_files_only=True,
             )

Also applies to: 117-130


131-135: Guard fp16 on CPU.

.half() on CPU can break ops. Apply fp16 only on CUDA; warn otherwise.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")

182-200: Batch-tokenize and move per-batch to device to avoid OOM.

Pre-tokenizing the entire corpus and calling .to(device) moves all tensors to GPU at once. Tokenize inside the loop and move only that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
-        embeddings_list = []
-        for i in tqdm(
+        embeddings_list = []
+        for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
             outputs = self.model(**batch)
             # Take the first token embedding (CLS) and normalize it
             embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
             embeddings_list.append(embeddings.cpu())
scripts/subset_selection/README.md (1)

269-276: Replace developer-specific absolute path with project-relative path.

-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
scripts/subset_selection/__main__.py (1)

1-1: Drop the shebang; module entry points don’t need it.

-#!/usr/bin/env python3
scripts/subset_selection/utils/subset_selection_utils.py (1)

122-129: Clamp RBF scaling to avoid NaNs when avg_dist ≈ 0.

Division by kw * avg_dist can be 0/0 for identical embeddings. Clamp the scale away from zero.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
scripts/subset_selection/subset_selection.py (3)

278-287: Fix percentage normalization inconsistency (floats up to 100 allowed in config).

ProcessingConfig.__post_init__ allows floats in 0–100, but calculate_subset_size rejects floats >1. Normalize floats >1 to percentages.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be in (0, 100]")
+            return max(1, int(percentage * total_samples))

829-887: Skip empty folds and cap per‑fold budgets to fold size.

Empty folds and over‑budget requests crash FacilityLocationFunction.maximize. Skip zero‑sized folds and clamp budgets.

-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+                continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

608-615: Write JSON vs JSONL correctly.

Only use lines=True for .jsonl; for .json write a normal JSON array.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
         elif extension == "csv":
             subset_data.to_csv(output_file, index=False)
         elif extension == "parquet":
             subset_data.to_parquet(output_file)
🧹 Nitpick comments (6)
scripts/subset_selection/README.md (1)

99-116: Docs polish: fenced code languages + install note.

  • Add language to fenced blocks to satisfy markdownlint (CLI options, tree, traceback).
  • Prefer repo-root pinned install as per reviewer guidance; keep CUDA index hint.
-```
+```text
@@
-```
+```
@@
-```
+```text
@@
-```
+```python
@@
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)

Optionally, update Installation to:

- pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
+ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

Would you like me to push the README tweaks as a follow-up commit?

Also applies to: 231-242, 290-299, 353-359

scripts/subset_selection/utils/subset_selection_utils.py (2)

35-53: Prefer logger.exception to retain stack traces on failures.

Switch logger.error(f"...{str(e)}") to logger.exception("...%s", e) to capture tracebacks; applies to all except blocks in this wrapper.

-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d: %s", attempt + 1, e)

96-108: Memory caution: NxN result matrix can be large.

results is n1 x n2 on CPU; for big folds this can be GBs. Consider streaming selection or memory-mapped storage if folds exceed a threshold, or compute sparse top‑k similarities per row to reduce footprint.

Would you like a prototype that backs results with a memmap and yields row-chunks to the optimizer?
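
For reference, a minimal sketch of the memmap idea (the function name, chunk size, and file path are hypothetical; it assumes L2-normalized rows with a cosine metric rather than the full metric/device handling of compute_pairwise_dense):

import numpy as np
import torch

def compute_pairwise_memmap(tensor1, tensor2, path, chunk_size=1024):
    n1, n2 = tensor1.shape[0], tensor2.shape[0]
    # Disk-backed result matrix: only one row-chunk is ever materialized in RAM
    out = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(n1, n2))
    t2 = tensor2.float().T  # cosine similarity on L2-normalized rows
    for i in range(0, n1, chunk_size):
        sims = tensor1[i : i + chunk_size].float() @ t2
        out[i : i + chunk_size] = sims.cpu().numpy()
    out.flush()
    return out  # the optimizer can then consume row-chunks lazily via out[i:j]

The sparse alternative (keeping only the top-k similarities per row) trades a little selection quality for a much smaller footprint; happy to prototype either variant.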

scripts/subset_selection/subset_selection.py (3)

365-372: Use context manager for multiprocessing.Pool to ensure cleanup.

Replace manual close/join/terminate with with Pool(...) as pool: for safer resource management and clearer intent.

-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
@@
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)

Also applies to: 429-436


654-675: Jinja2 autoescape off by default; enable for safety in templating.

Even for CLI utilities, enabling autoescape for common text types is safer.

-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))
@@
-        env = Environment(loader=BaseLoader())
+        env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))

Also applies to: 206-213


531-534: Preserve stack traces in logs for easier debugging.

Use logger.exception("...%s", e) inside except blocks instead of logger.error(...).

I can push a quick pass converting these to logger.exception if you’d like.

Also applies to: 595-598, 984-986, 891-894, 914-919

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a580fa and abb5c87.

📒 Files selected for processing (5)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1093-1174)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (59-205)
  • EncoderConfig (48-56)
  • encode (171-205)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (87-148)
  • get_default_num_gpus (67-84)
  • retry_on_exception (21-64)
🪛 LanguageTool
scripts/subset_selection/README.md

[style] ~129-~129: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...

(ADVERB_REPETITION_PREMIUM)

🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

99-99: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


231-231: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


290-290: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


355-355: Bare URL used

(MD034, no-bare-urls)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 87-87: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 60-60: Too many positional arguments (6/5)

(R0917)


[refactor] 59-59: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 365-365: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 429-429: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 617-617: Too many statements (57/50)

(R0915)


[refactor] 798-798: Too many branches (16/12)

(R0912)


[refactor] 798-798: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


129-129: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/arctic_encoder.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)


118-121: Avoid specifying long messages outside the exception class

(TRY003)


149-152: Avoid specifying long messages outside the exception class

(TRY003)


162-165: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


180-180: Prefer TypeError exception for invalid type

(TRY004)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Prefer TypeError exception for invalid type

(TRY004)


184-184: Avoid specifying long messages outside the exception class

(TRY003)


186-188: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


260-262: Avoid specifying long messages outside the exception class

(TRY003)


281-283: Avoid specifying long messages outside the exception class

(TRY003)


377-377: Avoid specifying long messages outside the exception class

(TRY003)


532-532: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


532-532: Use explicit conversion flag

Replace with conversion flag

(RUF010)


596-596: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


596-596: Use explicit conversion flag

Replace with conversion flag

(RUF010)


654-654: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


682-682: Abstract raise to an inner function

(TRY301)


682-682: Avoid specifying long messages outside the exception class

(TRY003)


739-739: Consider moving this statement to an else block

(TRY300)


744-744: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


744-744: Use explicit conversion flag

Replace with conversion flag

(RUF010)


822-822: Abstract raise to an inner function

(TRY301)


822-822: Avoid specifying long messages outside the exception class

(TRY003)


891-893: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


892-892: Use explicit conversion flag

Replace with conversion flag

(RUF010)


912-912: Consider moving this statement to an else block

(TRY300)


914-914: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


914-914: Use explicit conversion flag

Replace with conversion flag

(RUF010)


985-985: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


985-985: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1116-1118: Avoid specifying long messages outside the exception class

(TRY003)


1128-1130: Avoid specifying long messages outside the exception class

(TRY003)


1136-1136: Avoid specifying long messages outside the exception class

(TRY003)


1171-1171: Consider moving this statement to an else block

(TRY300)


1172-1172: Do not catch blind exception: Exception

(BLE001)

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from abb5c87 to 158f73b (October 24, 2025 02:03)
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (9)
scripts/subset_selection/requirements.txt (1)

70-72: Duplicate: torch 2.5.2 remains vulnerable to critical CVEs

This issue was already flagged in previous reviews. OSV continues to report 4 CRITICAL vulnerabilities for torch 2.5.2 (PYSEC-2025-41, GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Please verify if a patched version is available or document mitigation strategy.
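
If an upgrade is not immediately feasible, one partial mitigation for the torch.load advisory (GHSA-53q9-r3pm-6pq6) is to avoid torch.load on untrusted checkpoints altogether; a sketch, assuming the artifacts are also published in safetensors format (the file path is illustrative):

# Partial mitigation sketch, assuming a .safetensors artifact is available:
# safetensors deserialization uses no pickle, so an untrusted file cannot
# execute code at load time.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # tensors only, no code execution

Note this does not cover the other three advisories, so pinning a patched torch release remains the preferred fix.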

scripts/subset_selection/encoders/arctic_encoder.py (3)

97-132: Duplicate: Security risk with unrestricted trust_remote_code

This issue was flagged in previous reviews. All three model loading paths (custom path, local cache, HuggingFace download) use trust_remote_code=True without pinning a revision or maintaining an allowlist, allowing arbitrary remote code execution.


183-201: Duplicate: Pre-tokenizing entire input list causes OOM on large batches

This issue was flagged in previous reviews. Line 183-189 tokenizes the entire inputs list and moves all tensors to GPU via .to(self.cfg.device) before batching, defeating the purpose of batching and causing OOM on large datasets. Tokenization should happen inside the batch loop.


134-135: Duplicate: FP16 applied without checking device type

This issue was flagged in previous reviews. Calling .half() on CPU models can degrade performance or break operations. FP16 should only be applied to CUDA devices.

scripts/subset_selection/utils/subset_selection_utils.py (1)

116-122: Duplicate: RBF scaling can divide by zero causing NaNs

This issue was flagged in previous reviews. When embeddings are identical or nearly so, avg_dist becomes zero, making the division at line 120 perform 0/0 and yielding NaNs that propagate through the similarity matrix.

scripts/subset_selection/subset_selection.py (3)

277-286: Duplicate: Float percentage values > 1 are rejected despite config allowing them

This issue was flagged in previous reviews. Lines 279-282 reject float values > 1, but ProcessingConfig.__post_init__ (line 184) explicitly allows floats up to 100 to represent percentages. A config with subset_sizes=[10.0] passes validation but fails here at runtime.


607-608: Duplicate: JSON files written as JSONL

This issue was flagged in previous reviews. Line 608 uses lines=True for both .json and .jsonl extensions, producing JSONL format for .json files. Only .jsonl should use lines=True.


824-882: Duplicate: Empty folds and uncapped budgets cause crashes

This issue was flagged in previous reviews. When len(embeddings) < num_folds, empty folds are created. When subset size exceeds dataset size, per-fold budgets can exceed fold cardinality. Both cases cause FacilityLocationFunction.maximize to fail. Empty folds should be skipped and budgets capped to fold size.

scripts/subset_selection/__main__.py (1)

1-1: Shebang still present despite being marked as addressed

The shebang on line 1 is unnecessary for module execution via python -m subset_selection and triggers Ruff EXE001. Although this was marked as addressed in a previous review, the shebang remains in the code.

Apply this diff to remove it:

-#!/usr/bin/env python3
 """
 Entry point for subset selection when run as a module.
 """

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abb5c87 and 158f73b.

📒 Files selected for processing (7)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (58-206)
  • EncoderConfig (48-55)
  • encode (172-206)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (81-142)
  • get_default_num_gpus (67-78)
  • retry_on_exception (21-64)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1090-1173)
🪛 LanguageTool
scripts/subset_selection/README.md

[style] ~130-~130: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...

(ADVERB_REPETITION_PREMIUM)

🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

100-100: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


233-233: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


300-300: Bare URL used

(MD034, no-bare-urls)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.2: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.2: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.2: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.2: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Pylint (4.0.1)
scripts/subset_selection/subset_selection.py

[refactor] 364-364: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 427-427: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 615-615: Too many statements (57/50)

(R0915)


[refactor] 796-796: Too many branches (15/12)

(R0912)


[refactor] 796-796: Too many statements (51/50)

(R0915)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 81-81: Too many positional arguments (7/5)

(R0917)

🪛 Ruff (0.14.1)
scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


179-179: Prefer TypeError exception for invalid type

(TRY004)


179-179: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Prefer TypeError exception for invalid type

(TRY004)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


185-187: Avoid specifying long messages outside the exception class

(TRY003)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


230-230: Avoid specifying long messages outside the exception class

(TRY003)


259-261: Avoid specifying long messages outside the exception class

(TRY003)


280-282: Avoid specifying long messages outside the exception class

(TRY003)


376-376: Avoid specifying long messages outside the exception class

(TRY003)


530-530: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


530-530: Use explicit conversion flag

Replace with conversion flag

(RUF010)


594-594: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Use explicit conversion flag

Replace with conversion flag

(RUF010)


652-652: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


680-680: Abstract raise to an inner function

(TRY301)


680-680: Avoid specifying long messages outside the exception class

(TRY003)


737-737: Consider moving this statement to an else block

(TRY300)


742-742: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


742-742: Use explicit conversion flag

Replace with conversion flag

(RUF010)


886-888: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


887-887: Use explicit conversion flag

Replace with conversion flag

(RUF010)


907-907: Consider moving this statement to an else block

(TRY300)


909-909: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


909-909: Use explicit conversion flag

Replace with conversion flag

(RUF010)


979-979: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


979-979: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1113-1115: Avoid specifying long messages outside the exception class

(TRY003)


1125-1127: Avoid specifying long messages outside the exception class

(TRY003)


1133-1133: Avoid specifying long messages outside the exception class

(TRY003)


1170-1170: Consider moving this statement to an else block

(TRY300)


1171-1171: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


150-153: Avoid specifying long messages outside the exception class

(TRY003)


163-166: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
scripts/subset_selection/README.md (1)

294-303: Credits section properly addresses attribution requirement

The addition of the Credits and Acknowledgements section properly attributes the original author (Krishnateja Killamsetty) and links to the DataCurate4LLMs repository, addressing the PR objective to add attribution.

Comment context from scripts/subset_selection/subset_selection.py (top-of-file imports):

from tqdm import tqdm

# Local
from encoders.arctic_encoder import ArcticEmbedEncoder

⚠️ Potential issue | 🟡 Minor

Import uses a bare module path instead of the package's absolute path

The import on line 22, from encoders.arctic_encoder import ArcticEmbedEncoder, only resolves when scripts/subset_selection itself is on sys.path, rather than referencing the module through the package. Based on feedback from alimaredia, this should use an absolute import path that references the module within the package structure.

Consider updating to the full module path, or ensure the package layout supports this import pattern consistently.

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py at line 22, the import uses a
relative/non-absolute form "from encoders.arctic_encoder import
ArcticEmbedEncoder"; change it to the package's absolute module path (for
example "from my_package.encoders.arctic_encoder import ArcticEmbedEncoder" or
whatever the repository's root package is) so imports resolve consistently; if
the project uses a top-level package name adjust the prefix accordingly and
update other imports or __init__.py files to maintain consistent absolute
imports across the package.

Signed-off-by: roburishabh <[email protected]>
@RobuRishabh RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from 158f73b to 408e383 on October 27, 2025 at 20:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (9)
scripts/subset_selection/README.md (1)

211-214: Replace developer-specific absolute path with a portable example.

Use a relative path to avoid leaking local paths and to make instructions reusable.

 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing
+# or: cd <project-root>
scripts/subset_selection/cli.py (1)

1-1: Make CLI script executable (shebang present).

Set the executable bit so it can be run directly.

chmod +x scripts/subset_selection/cli.py
scripts/subset_selection/utils/subset_selection_utils.py (1)

121-127: Prevent NaNs in RBF scaling when avg_dist → 0.

avg_dist can be zero for identical embeddings; divide-by-zero yields NaNs that break downstream selection. Clamp the scale.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance

As previously raised by the bot.

scripts/subset_selection/subset_selection.py (3)

253-263: Normalize percentage floats and align with config (0–100).

Config accepts floats up to 100, but this method rejects >1. Normalize >1 by dividing by 100, then validate and compute.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            # Accept 0–1 (fraction) or 0–100 (percent)
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be in (0, 100]")
+            return max(1, int(percentage * total_samples))

As previously raised by the bot.


738-847: Skip empty folds and clamp per-fold budgets to fold size.

Prevents optimizer crashes when a fold is empty or its budget exceeds the fold size. Also normalizes float percentages consistently.

-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        # Percentage-based selection (accept 0–1 or 0–100)
+                        percentage = (size_spec / 100.0) if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

As previously raised by the bot.


570-577: Write JSON vs JSONL correctly.

Only use lines=True for .jsonl; write standard JSON for .json.

-        extension = input_file.split(".")[-1]
-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        extension = input_file.split(".")[-1]
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
         elif extension == "csv":
             subset_data.to_csv(output_file, index=False)
         elif extension == "parquet":
             subset_data.to_parquet(output_file)

As previously raised by the bot.

scripts/subset_selection/encoders/arctic_encoder.py (3)

130-134: Guard fp16 on CPU devices.

Calling half() on CPU can degrade/break operations. Apply fp16 only on CUDA.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")

181-199: Critical: avoid pre-tokenizing entire corpus on GPU — batch-tokenize inside the loop.

Current code tokenizes all inputs and moves them to the device, which will OOM on large datasets. Tokenize per batch and only move that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
         embeddings_list = []
         for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-            outputs = self.model(**batch)
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+            outputs = self.model(**batch)
             # Take the first token embedding (CLS) and normalize it
             embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
             embeddings_list.append(embeddings.cpu())

103-115: Security posture: gate trust_remote_code and pin revisions.

trust_remote_code=True executes arbitrary repo code. Pin a revision and limit trust to allowlisted repos (even in testing mode).

-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
-                add_pooling_layer=False,
-                trust_remote_code=True,
-            )
+            revision = os.environ.get("ARCTIC_MODEL_REV")  # optional: pass via config
+            allowlisted = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+            trust = self.cfg.model_name in allowlisted
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
+                add_pooling_layer=False,
+                trust_remote_code=trust,
+                revision=revision,
+            )
@@
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            revision = os.environ.get("ARCTIC_MODEL_REV")
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=False,
                 local_files_only=True,
+                revision=revision,
             )

Consider surfacing revision in config instead of env var for reproducibility. Based on learnings.

🧹 Nitpick comments (8)
scripts/subset_selection/README.md (3)

64-81: Add language to fenced block to satisfy MD040 and improve readability.

Specify a language for the CLI options block (e.g., "text") to fix markdownlint MD040.

-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")

 Optional:
   --output-dir <dir>             Output directory (default: output)
   --batch-size <int>             Batch size for processing (default: 100000)
   --num-folds <int>              Number of folds/partitions (default: 50)
   --epsilon <float>              Optimization parameter (default: 160.0)
   --num-gpus <int>               Number of GPUs to use (default: auto-detect)
   --combine-files                Combine multiple input files before processing
   --testing-mode                 Enable CPU mode for testing
   --encoder-type <str>           Encoder type (default: arctic)
   --encoder-model <str>          Model name (default: Snowflake/snowflake-arctic-embed-l-v2.0)
   --template-name <str>          Template name (default: conversation)
   --seed <int>                   Random seed (default: 42)


168-183: Add language to directory tree block to satisfy MD040.

Mark the directory tree block as "text".


-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
     ├── __init__.py          # Subset selection package initialization
     ├── subset_selection.py  # Main subset selection logic
     ├── cli.py              # Command-line interface
     ├── requirements.txt    # Package dependencies
     ├── README.md          # This file
     ├── encoders/
     │   ├── __init__.py     # Encoder registry
     │   └── arctic_encoder.py  # Arctic embedding encoder
     └── utils/
         ├── __init__.py     # Utils initialization
         └── subset_selection_utils.py  # Utility functions

12-18: Installation note: pin Python and show single-command install (optional).

Docs currently install from scripts/subset_selection/requirements.txt. Consider adding:

  • An explicit Python version (3.12), as requested by the reviewer.
  • A single install command using the CUDA wheel index for PyTorch (if CUDA 12.1 is the target).

Proposed addition after the current snippet:

# Requires Python 3.12
python3.12 -m pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

Please confirm which requirements file is the canonical source (repo root vs scripts/subset_selection) and the intended CUDA toolchain for this project. Based on learnings.

scripts/subset_selection/encoders/__init__.py (1)

12-23: Avoid blanket try/except; raise precise errors.

The extra try/except re-wraps all errors as ValueError, obscuring root causes. Let the explicit unsupported-type check raise, and return directly otherwise.

-def get_encoder_class(encoder_type: str):
-    """Get the encoder class based on the encoder type."""
-    try:
-        if encoder_type not in ENCODER_REGISTRY:
-            supported_encoders = list(ENCODER_REGISTRY.keys())
-            raise ValueError(
-                f"Unsupported encoder type: '{encoder_type}'. "
-                f"Supported types are: {supported_encoders}"
-            )
-        return ENCODER_REGISTRY[encoder_type]
-    except Exception as e:
-        raise ValueError(f"Error getting encoder class: {str(e)}") from e
+def get_encoder_class(encoder_type: str):
+    """Get the encoder class based on the encoder type."""
+    if encoder_type not in ENCODER_REGISTRY:
+        supported = list(ENCODER_REGISTRY.keys())
+        raise ValueError(f"Unsupported encoder type: '{encoder_type}'. Supported types: {supported}")
+    return ENCODER_REGISTRY[encoder_type]
scripts/subset_selection/cli.py (1)

106-114: Clarify subset-size parsing; support '%' to avoid ambiguity.

Current logic treats "1" as absolute size 1, not 100%. Add "%" support and keep decimals for percentages to prevent surprises.

-    subset_sizes = []
-    for size_str in args.subset_sizes.split(","):
-        size_str = size_str.strip()
-        if "." in size_str:
-            subset_sizes.append(float(size_str))
-        else:
-            subset_sizes.append(int(size_str))
+    subset_sizes = []
+    for raw in args.subset_sizes.split(","):
+        s = raw.strip()
+        if s.endswith("%"):
+            subset_sizes.append(float(s[:-1]) / 100.0)
+            continue
+        # Decimals (e.g., 0.1) are treated as percentages; integers are absolute counts
+        subset_sizes.append(float(s) if "." in s else int(s))

Consider documenting the "%" option in README. Based on learnings.
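
Under that parsing, a mixed spec would be unambiguous, e.g.:

```python
# Hypothetical example of the proposed behavior:
# --subset-sizes "10%,0.5,1000"  ->  [0.1, 0.5, 1000]
#   10%   -> 10 percent of the dataset
#   0.5   -> 50 percent (decimal fraction)
#   1000  -> 1000 samples (absolute count)
```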

scripts/subset_selection/utils/subset_selection_utils.py (1)

14-18: Do not call logging.basicConfig in library code.

basicConfig at import time can override app logging. Let the application configure logging; keep a module logger only.

-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
scripts/subset_selection/subset_selection.py (2)

31-36: Avoid setting logging.basicConfig in a library module.

Leave logging configuration to the application/CLI; keep module logger only.

-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)

275-277: Name percentage subsets as whole percentages (e.g., 10.0%).

Prefer human-friendly names like "percent_10.0" instead of "percent_0.1".

-        if isinstance(size_spec, float):
-            return f"percent_{size_spec:.1f}"
+        if isinstance(size_spec, float):
+            pct = (size_spec * 100.0) if size_spec <= 1 else size_spec
+            return f"percent_{pct:.1f}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 158f73b and 408e383.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/subset_selection/requirements.txt
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • encode (170-204)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🔇 Additional comments (2)
scripts/subset_selection/utils/__init__.py (1)

1-16: LGTM — clean re-exports.

scripts/subset_selection/__init__.py (1)

1-29: LGTM — clear public surface and re-exports.

@alimaredia alimaredia merged commit 2db1ad5 into opendatahub-io:main Oct 28, 2025
2 checks passed
@RobuRishabh RobuRishabh deleted the R1146-Only-scripts-for-subset-selection branch November 3, 2025 14:30