Conversation

@RobuRishabh (Contributor) commented Oct 14, 2025

Description

https://issues.redhat.com/browse/RHAIENG-1146
This PR contains only the subset selection scripts (.py files).

How Has This Been Tested?

  • Tested locally in a CPU-only environment (no GPU).
  • Tested on AWS with a GPU.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added a comprehensive subset selection pipeline with multi-GPU support for processing datasets.
    • Introduced command-line interface for easy dataset subsetting with customizable parameters.
    • Added support for text embeddings using Snowflake Arctic model.
    • Implemented facility location optimization for intelligent subset generation.
  • Documentation

    • Added comprehensive README with installation, usage, and configuration guidelines.

@coderabbitai bot commented Oct 14, 2025

Walkthrough

This PR introduces a complete subset selection pipeline module at scripts/subset_selection/ featuring embedding generation using Snowflake's Arctic model, multi-GPU processing, facility location-based subset optimization, and CLI-driven orchestration with configurable templating and fold-based selection.
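To make the orchestration concrete, here is a minimal sketch of driving the pipeline from Python. `subset_datasets` is the public entry point re-exported by the package, but the keyword argument names shown are assumptions for illustration, not the verified signature:

```python
# Hedged sketch: subset_datasets exists in subset_selection.py, but these
# keyword argument names are assumed for illustration only.
from scripts.subset_selection import subset_datasets

subset_datasets(
    input_files=["data/train.jsonl"],  # JSONL/JSON/CSV/Parquet per the README
    subset_sizes=[0.1, 1000],          # a 10% subset and an absolute 1000-sample subset
    output_dir="output/",
)
```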

Changes

  • Configuration & Package Structure
    Files: scripts/subset_selection/__init__.py, scripts/subset_selection/.gitignore
    Summary: Package initializer re-exporting the public API (BasicConfig, DataProcessor, EncoderConfig, etc.); a .gitignore with standard Python, environment, tooling, and IDE exclusions.
  • Documentation
    Files: scripts/subset_selection/README.md, scripts/subset_selection/requirements.txt
    Summary: README documenting the package overview, installation, CLI/Python API usage, configuration parameters, and quick start; pinned dependencies for ML frameworks, data processing, subset selection, templating, and logging.
  • Encoder Implementation
    Files: scripts/subset_selection/encoders/arctic_encoder.py, scripts/subset_selection/encoders/__init__.py
    Summary: ArcticEmbedEncoder class for text embeddings using Snowflake's pretrained model with FP16 support and per-instance config; an encoder registry and discovery function mapping "arctic" to the encoder class.
  • Core Pipeline & Orchestration
    Files: scripts/subset_selection/subset_selection.py
    Summary: Central DataProcessor orchestrating the end-to-end workflow: configuration dataclasses (BasicConfig, EncoderConfig, TemplateConfig, SystemConfig, ProcessingConfig), multi-GPU embedding generation, fold-based subset selection via facility location, dataset loading/combining, templating, and comprehensive error handling.
  • CLI Interface
    Files: scripts/subset_selection/cli.py
    Summary: Command-line entry point with argument parsing for input files, subset sizes, output directory, GPU configuration, fold parameters, encoder settings, and template options; calls subset_datasets with the parsed configuration.
  • Utility Functions
    Files: scripts/subset_selection/utils/subset_selection_utils.py, scripts/subset_selection/utils/__init__.py
    Summary: Retry decorator for exception handling with CUDA cache cleanup; GPU availability detection; batched pairwise metric computation (cosine, euclidean, RBF) with device management; utility re-exports. A hedged sketch of the batched pairwise computation follows this list.
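As referenced above, a condensed sketch of what a batched pairwise-cosine helper such as compute_pairwise_dense does. This is an illustration under assumptions, not the code under review; the real function also supports euclidean and RBF metrics plus scaling options:

```python
import torch
import torch.nn.functional as F


def pairwise_cosine_batched(
    a: torch.Tensor, b: torch.Tensor, batch_size: int = 1024, device: str = "cpu"
) -> torch.Tensor:
    """Compute the full |a| x |b| cosine-similarity matrix in row batches."""
    a_norm = F.normalize(a.to(device), p=2, dim=1)
    b_norm = F.normalize(b.to(device), p=2, dim=1)
    rows = []
    for i in range(0, a_norm.shape[0], batch_size):
        # Each iteration fills a (batch, |b|) block of the similarity matrix,
        # so only one block at a time lives on the device.
        rows.append(a_norm[i : i + batch_size] @ b_norm.T)
    return torch.cat(rows, dim=0).cpu()
```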

Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI Interface
    participant DP as DataProcessor
    participant Enc as ArcticEmbedEncoder
    participant GPU1 as GPU-1<br/>(Shard Process)
    participant GPU2 as GPU-2<br/>(Shard Process)
    participant Subset as Facility<br/>Location
    participant Out as Output<br/>Persistence

    CLI->>DP: parse config & call subset_datasets()
    DP->>DP: load & combine datasets
    DP->>DP: split dataset into shards per GPU
    par GPU Embedding Generation
        DP->>GPU1: _process_dataset_shard(shard_1)
        GPU1->>Enc: encode(texts)
        Enc->>Enc: tokenize, batch, inference
        Enc-->>GPU1: embeddings
        GPU1->>GPU1: save embeddings
        GPU1-->>DP: shard complete
    and
        DP->>GPU2: _process_dataset_shard(shard_2)
        GPU2->>Enc: encode(texts)
        Enc-->>GPU2: embeddings
        GPU2->>GPU2: save embeddings
        GPU2-->>DP: shard complete
    end
    DP->>DP: merge shard embeddings
    par Fold-based Subset Selection
        DP->>Subset: process_folds_with_gpu(fold_1)
        Subset->>Subset: facility location optimization
        Subset-->>DP: subset indices
    and
        DP->>Subset: process_folds_with_gpu(fold_2)
        Subset-->>DP: subset indices
    end
    DP->>Out: _save_subset(data, metadata)
    Out-->>CLI: completion signal
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • GPU coordination & multi-processing: Multi-GPU shard distribution, process pool management, and device placement logic in subset_selection.py and arctic_encoder.py require careful verification of synchronization and memory cleanup.
  • Embedding generation pipeline: Batching logic, tokenization with max_length handling, CLS pooling, L2 normalization, and tensor/array output options in arctic_encoder.py need validation for correctness across diverse input shapes.
  • Subset selection algorithm: Fold-based facility location optimization via process_folds_with_gpu() in subset_selection.py — verify fold splitting, GPU affinity, and result aggregation.
  • Error handling & retry logic: Decorator in subset_selection_utils.py with CUDA-specific exception handling and GC cleanup; decorator application across DataProcessor methods in subset_selection.py (a sketch of the pattern follows this list).
  • Configuration & dataclass orchestration: Five interdependent config dataclasses flow through DataProcessor; verify initialization order and default value consistency.
  • Dataset loading & templating: Custom format_text logic and dataset combining in subset_selection.py must correctly preserve indices for subset extraction.
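As referenced in the error-handling bullet, a minimal sketch of the retry-with-CUDA-cleanup pattern; the real retry_on_exception in subset_selection_utils.py differs in details such as the handled exception types, logging, and backoff:

```python
import functools
import gc
import time

import torch


def retry_on_exception(max_attempts: int = 3, delay: float = 1.0):
    """Sketch of a retry decorator that frees CUDA cache between attempts."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                # CUDA OOM surfaces as torch.cuda.OutOfMemoryError, a
                # RuntimeError subclass, so this catch covers both.
                except RuntimeError as e:
                    last_exception = e
                    # Free cached GPU memory and collect garbage before retrying.
                    if torch.cuda.is_available():
                        torch.cuda.empty_cache()
                    gc.collect()
                    time.sleep(delay)
            raise last_exception

        return wrapper

    return decorator
```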

Poem

🐰✨ A subset selection rabbit hops with glee,
Arctic embeddings dance on GPU spree!
Fold by fold, the sparse forests gleam,
Facility location realizes the dream!
Multi-GPU magic, embeddings so fine,
Your data subsets shall perfectly align! 🌟

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

  • Title Check: ❓ Inconclusive
    Explanation: The PR title "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" describes the nature of the changes at a high level but lacks specificity about the actual content being introduced. While the title accurately reflects that this is an initial setup with testing and includes a clarifying issue reference, the phrase "Initial Repository Setup" is quite generic and doesn't clearly communicate that the main change is a complete subset selection pipeline with embeddings, utilities, and CLI infrastructure. Someone scanning the PR history would need to reference the issue number to understand what is being set up, rather than gleaning it directly from the title. The title is not misleading or unrelated to the changeset, but it is overly broad and could apply to many different kinds of repository initialization.
    Resolution: Consider revising the title to be more specific about the core content being added. For example, "Add end-to-end subset selection pipeline scripts with baseline testing" or "Add subset selection module with embeddings encoder and CLI" would more clearly convey to reviewers scanning the history what this PR introduces, while still maintaining brevity and clarity.
✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 91.18%, which is sufficient; the required threshold is 80.00%.

@RobuRishabh changed the title from "R1146 only scripts for subset selection" to "[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts)" on Oct 14, 2025.
@coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14cd429 and 0300fc6.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (2)
  • EncoderConfig (47-55)
  • encode (170-204)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.0)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


116-116: f-string without any placeholders

Remove extraneous f prefix

(F541)


147-147: Consider moving this statement to an else block

(TRY300)


148-148: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/__init__.py

17-20: Abstract raise to an inner function

(TRY301)


17-20: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Avoid specifying long messages outside the exception class

(TRY003)


23-23: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

59-59: Avoid specifying long messages outside the exception class

(TRY003)


155-155: Prefer TypeError exception for invalid type

(TRY004)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


159-159: Prefer TypeError exception for invalid type

(TRY004)


159-159: Avoid specifying long messages outside the exception class

(TRY003)


161-163: Avoid specifying long messages outside the exception class

(TRY003)


165-165: Avoid specifying long messages outside the exception class

(TRY003)


181-181: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


206-206: Avoid specifying long messages outside the exception class

(TRY003)


235-237: Avoid specifying long messages outside the exception class

(TRY003)


256-258: Avoid specifying long messages outside the exception class

(TRY003)


345-345: Avoid specifying long messages outside the exception class

(TRY003)


412-412: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


424-427: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


494-494: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


494-494: Use explicit conversion flag

Replace with conversion flag

(RUF010)


558-558: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


558-558: Use explicit conversion flag

Replace with conversion flag

(RUF010)


614-614: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


640-640: Abstract raise to an inner function

(TRY301)


640-640: Avoid specifying long messages outside the exception class

(TRY003)


683-683: Consider moving this statement to an else block

(TRY300)


688-688: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


688-688: Use explicit conversion flag

Replace with conversion flag

(RUF010)


762-762: Abstract raise to an inner function

(TRY301)


762-762: Avoid specifying long messages outside the exception class

(TRY003)


831-833: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


832-832: Use explicit conversion flag

Replace with conversion flag

(RUF010)


847-847: Consider moving this statement to an else block

(TRY300)


849-849: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


849-849: Use explicit conversion flag

Replace with conversion flag

(RUF010)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)

Comment on lines +253 to +263

```python
        if isinstance(size_spec, float):
            # if not in range of 0 to 1, raise error
            if size_spec <= 0 or size_spec > 1:
                raise ValueError(
                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
                )
            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
            return max(1, int((size_spec) * total_samples))
        # Treat as absolute number
        return min(size_spec, total_samples)
```


⚠️ Potential issue | 🔴 Critical

Fix subset percentage normalization

ProcessingConfig.__post_init__ explicitly allows float subset_sizes up to 100 to represent percentages, but calculate_subset_size then rejects anything greater than 1. A config such as subset_sizes=[10.0] sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.

```diff
         if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError(
+                    "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+                )
+            return max(1, int(percentage * total_samples))
```

Make the same normalization when computing per-fold budgets so both code paths agree.

🧰 Tools
🪛 Ruff (0.14.0)

256-258: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.

Comment on lines +769 to +827

```python
        for fold_idx, fold_indices in gpu_folds_info:
            try:
                logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

                fold_embeddings = embeddings[fold_indices].to(device)

                logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
                max_sim_mat = compute_pairwise_dense(
                    fold_embeddings,
                    batch_size=50000,
                    metric="cosine",
                    device=device,
                    scaling="additive",
                )
                similarity_matrix = max_sim_mat.cpu().numpy()

                subsets = {}
                ds_func = FacilityLocationFunction(
                    n=similarity_matrix.shape[0],
                    sijs=similarity_matrix,
                    mode="dense",
                    separate_rep=False,
                )

                for size_spec in subset_sizes:
                    if isinstance(size_spec, float):
                        # Percentage-based selection
                        budget = max(
                            1, math.ceil(size_spec * similarity_matrix.shape[0])
                        )
                    else:
                        # Absolute number-based selection
                        budget = max(
                            1,
                            math.ceil(
                                size_spec * (similarity_matrix.shape[0] / total_samples)
                            ),
                        )

                    logger.info(
                        f"Selecting subset of size {budget} for fold {fold_idx + 1}"
                    )

                    subset_result = ds_func.maximize(
                        budget=budget,
                        optimizer="LazierThanLazyGreedy",
                        epsilon=epsilon,
                        stopIfZeroGain=False,
                        stopIfNegativeGain=False,
                        verbose=False,
                    )

                    subset_indices = [fold_indices[x[0]] for x in subset_result]
                    subset_gains = [x[1] for x in subset_result]
                    subsets[size_spec] = {
                        "indices": subset_indices,
                        "gains": subset_gains,
                    }
```


⚠️ Potential issue | 🔴 Critical

Skip empty folds and cap per-fold budgets

When len(embeddings) < num_folds we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. subset_sizes=[500] on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases FacilityLocationFunction.maximize is asked to pick at least one item from a fold that has zero (or fewer than budget) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(
+                            1,
+                            math.ceil(size_spec * (fold_size / total_samples)),
+                        )
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```

Comment on lines +122 to +127

```python
            distance = torch.cdist(a, b)
            squared_distance = distance**2
            avg_dist = torch.mean(squared_distance)
            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
            torch.exp(-squared_distance, out=squared_distance)
            return squared_distance
```

⚠️ Potential issue | 🟠 Major

Clamp the RBF scaling factor to avoid NaNs

When all embeddings in a batch are identical (or nearly so), squared_distance becomes 0 everywhere, making avg_dist zero. Dividing by kw * avg_dist (Line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.

```diff
         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
```
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.

@coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices.

Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.


207-207: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.


270-270: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting.

Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

`124-124`: **Remove unnecessary f-string prefix.**

The string has no placeholders, so the `f` prefix is unnecessary.

Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0300fc6 and eac077a.

📒 Files selected for processing (3)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

103-103: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


207-207: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


270-270: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py

1-1: Shebang is present but file is not executable

(EXE001)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


155-155: Consider moving this statement to an else block

(TRY300)


156-156: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (10)
scripts/subset_selection/README.md (7)

1-58: LGTM! Clear installation and model setup documentation.

The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.


59-100: LGTM! Comprehensive CLI usage examples.

The CLI usage examples cover a wide range of scenarios and provide clear guidance for different use cases.


122-161: LGTM! Clear Python API documentation.

The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.


162-204: LGTM! Thorough configuration documentation.

The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.


224-243: LGTM! Clear documentation of encoders and outputs.

The supported encoders and output files sections are concise and informative.


280-312: LGTM! Comprehensive troubleshooting guidance.

The troubleshooting section provides clear solutions for common issues users may encounter.


313-328: LGTM! Helpful notes for users.

The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.

scripts/subset_selection/cli.py (3)

13-101: LGTM! Well-structured argument parsing.

The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.


103-122: LGTM! Proper multiprocessing setup and argument parsing.

The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats.

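The parsing behavior praised above could look like the following hedged sketch; the helper name and exact rules are assumptions for illustration, not the CLI's actual code:

```python
def parse_subset_sizes(spec: str) -> list[float | int]:
    """Parse "0.1,0.5" into fractions and "1000,5000" into absolute counts."""
    sizes: list[float | int] = []
    for token in spec.split(","):
        token = token.strip()
        # Tokens with a decimal point are percentages/fractions; others are counts.
        sizes.append(float(token) if "." in token else int(token))
    return sizes


assert parse_subset_sizes("0.1,0.5") == [0.1, 0.5]
assert parse_subset_sizes("1000,5000") == [1000, 5000]
```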

147-162: LGTM! Appropriate exception handling for CLI entry point.

The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).

@alimaredia (Contributor) commented:

@RobuRishabh at a high level I can tell this took a lot of work and looks great to me. I really like how you got rid of the config file.

At a high level there's a few small things I would change about this PR:

  1. There's no mention of the author or the original repository this work came from
  2. Since we're not trying to make a regular package out of these scripts all of the __init__.py's can be dropped (https://stackoverflow.com/a/37140173) and relative imports can be used. The code in encoders/__init__.py can go into subset_selection.py.
  3. This is a personal preference, but I don't see the need to have a cli.py. I would combine that file with subset_selection.py and just have subset_selection.py be the entry point for the user.

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch 3 times, most recently from ae39d24 to e69bef8 on October 22, 2025 at 20:26.
@coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (6)
scripts/subset_selection/requirements.txt (2)

38-49: Blocker: CUDA-only wheels and pinned +cu121 builds break CPU installs and CI

The nvidia-* CUDA wheels and +cu121 variants force a CUDA 12.1 Linux stack; pip fails on CPU-only machines and many CI runners. Make the default requirements CPU-friendly and move CUDA wheels to an optional GPU requirements file or extras. Also drop the explicit triton pin (bundled by torch) to avoid resolver conflicts.

Suggested approach:

  • requirements.txt (CPU-default): remove lines 38–49 and use generic torch/vision/audio wheels.
  • requirements-gpu-cu121.txt: contain the CUDA builds and instruct installing with the PyTorch extra index.

Example edits (CPU default):

```diff
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==9.1.0.70
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-nccl-cu12==2.21.5
- nvidia-nvjitlink-cu12==12.9.86
- nvidia-nvtx-cu12==12.1.105
- torch==2.5.1+cu121
- torchaudio==2.5.1+cu121
- torchvision==0.20.1+cu121
- triton==3.1.0
+ torch==2.5.1
+ torchaudio==2.5.1
+ torchvision==0.20.1
```

Provide a GPU file with the CUDA-tagged trio and doc: pip install -r requirements-gpu-cu121.txt --extra-index-url https://download.pytorch.org/whl/cu121.

Please confirm the CPU path installs on a clean Python 3.12 environment and that GPU users can install via the GPU file.

Also applies to: 70-76


70-72: Security: pinned PyTorch build flagged by multiple critical advisories

torch==2.5.1+cu121 is flagged (GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Upgrade to a patched PyTorch release and align torchvision/torchaudio accordingly. If upgrade is blocked, document mitigations and avoid torch.load on untrusted inputs.

What is the first PyTorch version that fixes GHSA-53q9-r3pm-6pq6 and GHSA-887c-mr87-cxwp? Provide official PyTorch release notes or advisories.
scripts/subset_selection/README.md (1)

241-244: Replace developer-specific absolute path with a project-relative path

Use a repo-relative path to avoid confusion and leaking local FS details.

```diff
-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
```
scripts/subset_selection/utils/subset_selection_utils.py (1)

122-127: Clamp RBF scale to avoid NaNs when avg_dist ≈ 0

Division by kw * avg_dist can produce NaNs for identical embeddings. Clamp the denominator away from zero.

```diff
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
```
scripts/subset_selection/subset_selection.py (2)

277-285: Fix percentage handling: accept 0–100 floats and normalize

Config allows floats up to 100, but this path rejects values >1. Normalize percentages >1 to fractions and validate 0–100.

```diff
-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+            return max(1, int(percentage * total_samples))
```

816-875: Robustness: skip empty folds and cap per-fold budgets

Empty folds and over-large budgets crash the optimizer. Skip zero-size folds and clamp budget ≤ fold size. Normalize float percentages >1 here too.

```diff
-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+                continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
-                        # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue
```
🧹 Nitpick comments (12)
scripts/subset_selection/README.md (3)

21-23: Document CPU vs GPU install flows

Add a CPU-friendly install (default) and an optional GPU install with the PyTorch extra index, aligning with the split requirements.

````diff
-```bash
-pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
-```
+```bash
+# CPU (default)
+pip install -r scripts/subset_selection/requirements.txt
+
+# GPU (CUDA 12.1)
+pip install -r scripts/subset_selection/requirements-gpu-cu121.txt \
+  --extra-index-url https://download.pytorch.org/whl/cu121
+```
````

99-117: Add languages to fenced code blocks for linting and readability

Several code fences lack a language identifier (MD040). Tag them with bash, python, or text as appropriate.

Also applies to: 203-215, 262-271


329-331: Fix bare URL formatting

Use Markdown link syntax to avoid MD034.

```diff
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
```
scripts/subset_selection/encoders/arctic_encoder.py (2)

104-110: UX: testing_mode log claims “model not found” without checking

You log “Model not found locally” unconditionally in testing mode. Check os.path.exists(model_path) before warning, and prefer local files even in testing mode to avoid redundant downloads.

```diff
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
-            logger.warning(
-                f"Model not found locally at {model_path}. "
-                "Testing mode enabled - downloading from HuggingFace..."
-            )
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+            if not os.path.exists(model_path):
+                logger.warning(
+                    f"Model not found locally at {model_path}. "
+                    "Testing mode enabled - downloading from HuggingFace..."
+                )
+            else:
+                logger.info(f"Using local model path in testing mode: {model_path}")
```

Also applies to: 116-121


19-23: Dead code

safe_print is unused. Remove or use it.

scripts/subset_selection/utils/subset_selection_utils.py (2)

13-18: Library code shouldn’t call logging.basicConfig

Configure handlers in the application, not in reusable modules. Remove basicConfig and rely on module loggers.

```diff
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```

34-40: Preserve tracebacks with logger.exception

Replace logger.error(..., str(e)) with logger.exception("...") to keep stack traces.

```diff
-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d", attempt + 1)
...
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d", attempt + 1)
...
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d", attempt + 1)
...
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d", attempt + 1)
...
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d", attempt + 1)
...
-        raise last_exception
+        raise last_exception
```

Also applies to: 44-52, 80-82, 128-129

scripts/subset_selection/subset_selection.py (4)

361-369: Use Pool context manager to ensure clean teardown

Prefer with Pool(...) as pool: over manual close/join/terminate.

```diff
-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
```

426-433: Same: use Pool context manager here too

Mirror the pattern to avoid leaks on exceptions.

```diff
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
```

659-681: Memory pressure: avoid buffering all embeddings in RAM

all_embeddings holds all batches before writing, which can blow memory on large shards. Stream directly to HDF5 (resizable dataset) or flush periodically.

I can provide a streaming HDF5 writer snippet if you want to adopt it in this function.

Also applies to: 682-691, 703-717
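Such a writer might look like the following hedged sketch using an h5py resizable dataset; the function and dataset names are illustrative, not the module's actual API:

```python
import h5py
import numpy as np


def write_embeddings_streaming(batches, path: str, dim: int) -> None:
    """Append each (n, dim) batch to a resizable HDF5 dataset instead of
    accumulating all embeddings in memory first."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "embeddings",
            shape=(0, dim),
            maxshape=(None, dim),  # unlimited along the sample axis
            dtype="float32",
            chunks=True,
        )
        for batch in batches:
            n = batch.shape[0]
            dset.resize(dset.shape[0] + n, axis=0)
            dset[-n:] = np.asarray(batch, dtype=np.float32)
```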


1105-1111: Nit: remove unused f-string prefix

print(f"Starting subset selection...") has no placeholders.

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```
scripts/subset_selection/__main__.py (1)

6-6: Use a relative intra‑package import.

More robust within the package and avoids relying on the top‑level package name.

```diff
-from scripts.subset_selection.subset_selection import main
+from .subset_selection import main
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eac077a and e69bef8.

📒 Files selected for processing (7)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1081-1139)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (106-117)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (58-204)
  • EncoderConfig (47-55)
  • encode (170-204)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

99-99: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


203-203: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


262-262: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


327-327: Bare URL used

(MD034, no-bare-urls)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.1+cu121: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 362-362: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 426-426: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 613-613: Too many statements (57/50)

(R0915)


[refactor] 786-786: Too many branches (16/12)

(R0912)


[refactor] 786-786: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/utils/subset_selection_utils.py

34-34: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


34-34: Use explicit conversion flag

Replace with conversion flag

(RUF010)


38-40: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


39-39: Use explicit conversion flag

Replace with conversion flag

(RUF010)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


44-44: Use explicit conversion flag

Replace with conversion flag

(RUF010)


48-48: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


48-48: Use explicit conversion flag

Replace with conversion flag

(RUF010)


52-52: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


52-52: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-82: Avoid specifying long messages outside the exception class

(TRY003)


128-128: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


117-120: Avoid specifying long messages outside the exception class

(TRY003)


148-151: Avoid specifying long messages outside the exception class

(TRY003)


161-164: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

53-56: Abstract raise to an inner function

(TRY301)


53-56: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Use explicit conversion flag

Replace with conversion flag

(RUF010)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


179-179: Prefer TypeError exception for invalid type

(TRY004)


179-179: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Prefer TypeError exception for invalid type

(TRY004)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


185-187: Avoid specifying long messages outside the exception class

(TRY003)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


230-230: Avoid specifying long messages outside the exception class

(TRY003)


259-261: Avoid specifying long messages outside the exception class

(TRY003)


280-282: Avoid specifying long messages outside the exception class

(TRY003)


374-374: Avoid specifying long messages outside the exception class

(TRY003)


458-461: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


528-528: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


528-528: Use explicit conversion flag

Replace with conversion flag

(RUF010)


592-592: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


592-592: Use explicit conversion flag

Replace with conversion flag

(RUF010)


648-648: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


674-674: Abstract raise to an inner function

(TRY301)


674-674: Avoid specifying long messages outside the exception class

(TRY003)


727-727: Consider moving this statement to an else block

(TRY300)


732-732: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


732-732: Use explicit conversion flag

Replace with conversion flag

(RUF010)


810-810: Abstract raise to an inner function

(TRY301)


810-810: Avoid specifying long messages outside the exception class

(TRY003)


879-881: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


880-880: Use explicit conversion flag

Replace with conversion flag

(RUF010)


900-900: Consider moving this statement to an else block

(TRY300)


902-902: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


902-902: Use explicit conversion flag

Replace with conversion flag

(RUF010)


973-973: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


973-973: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1105-1105: f-string without any placeholders

Remove extraneous f prefix

(F541)


1136-1136: Consider moving this statement to an else block

(TRY300)


1137-1137: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (2)
scripts/subset_selection/requirements.txt (1)

35-37: Sanity-check numeric stack in isolation

  • Use python3 (matching CI) instead of python3.12 to avoid interpreter mismatches
  • Exclude GPU-specific PyTorch wheels (torch==2.5.1+cu121) when verifying core numeric libs, e.g.:
    grep -Ev '^torch==' scripts/subset_selection/requirements.txt > tmp-reqs.txt
  • Run:
    python3 -m venv /tmp/odh-reqs && source /tmp/odh-reqs/bin/activate
    pip install --upgrade pip
    pip install -r tmp-reqs.txt
    python - <<'PY'
    import numpy, scipy, sklearn, numba, llvmlite; print("OK")
    PY
scripts/subset_selection/__main__.py (1)

8-10: Entry point looks good. Document invocation.

LGTM. Please ensure README shows usage: python -m scripts.subset_selection ... so users don’t try to run the file directly.

Comment on lines +103 to +115

```python
        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,
            )
        else:
```

⚠️ Potential issue | 🟠 Major

Security posture: unrestricted remote code execution during model load

trust_remote_code=True allows arbitrary code from model repos. For reproducibility and safety, pin a specific revision (commit hash) and maintain an allowlist of approved repos; set trust_remote_code only for those.

```diff
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
+            revision = os.environ.get("ARCTIC_MODEL_REV", None)
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=True,  # allowlisted repo
+                revision=revision,
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,
            )
        else:

        # In testing mode, allow direct download from HuggingFace
        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
            logger.warning(
                f"Model not found locally at {model_path}. "
                "Testing mode enabled - downloading from HuggingFace..."
            )
            revision = os.environ.get("ARCTIC_MODEL_REV", None)
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.cfg.model_name,
                revision=revision,
            )
            self.model = AutoModel.from_pretrained(
                self.cfg.model_name,
                add_pooling_layer=False,
                trust_remote_code=True,  # allowlisted repo
                revision=revision,
            )
        else:
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 103 to 115,
the code loads a remote HF model with trust_remote_code=True which permits
arbitrary remote code execution; change the loader to require a pinned revision
(pass revision=<commit hash> from config) and only enable trust_remote_code when
the model repo is on an approved allowlist. Concretely: add a cfg field for
revision (fail if missing in testing_mode), add or read an allowlist of approved
model IDs, check that cfg.model_name is in that allowlist before setting
trust_remote_code=True (otherwise set it False), and pass
revision=self.cfg.revision to both AutoTokenizer.from_pretrained and
AutoModel.from_pretrained; log a warning and abort/raise if testing_mode
requests an unapproved repo or missing revision.

Comment on lines +181 to +199
        encodings = self.tokenizer(
            inputs,
            max_length=self.cfg.model_config["max_length"],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.cfg.device)

        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())


⚠️ Potential issue | 🔴 Critical

Critical: pre-tokenizing all inputs then moving to GPU will OOM on large batches

self.tokenizer(...).to(self.cfg.device) tokenizes the entire input list and moves all tensors to GPU, defeating batching. Tokenize and move per batch inside the loop.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
-        embeddings_list = []
-        for i in tqdm(
+        embeddings_list = []
+        for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
             outputs = self.model(**batch)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        encodings = self.tokenizer(
            inputs,
            max_length=self.cfg.model_config["max_length"],
            padding=True,
            truncation=True,
            return_tensors="pt",
        ).to(self.cfg.device)
        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())

        embeddings_list = []
        for i in tqdm(
            range(0, len(inputs), self.cfg.batch_size),
            disable=not show_progress or len(inputs) < 256,
        ):
            batch_texts = inputs[i : i + self.cfg.batch_size]
            enc = self.tokenizer(
                batch_texts,
                max_length=self.cfg.model_config["max_length"],
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
            outputs = self.model(**batch)
            # Take the first token embedding (CLS) and normalize it
            embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
            embeddings_list.append(embeddings.cpu())
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 181 to 199,
the code currently tokenizes the entire inputs list and calls
.to(self.cfg.device) on the resulting tensors which moves all tokenized data to
GPU and can OOM for large datasets; change to tokenizing inside the loop per
batch and only move that batch's tensors to the device. Specifically, remove the
single pre-tokenize call, and inside the tqdm loop build a slice of raw inputs,
call self.tokenizer(batch_inputs, max_length=..., padding=True, truncation=True,
return_tensors="pt") and then call .to(self.cfg.device) on that batch dict
before passing to model; keep collecting normalized embeddings.cpu() into
embeddings_list as before.

Comment on lines +604 to +611
        extension = input_file.split(".")[-1]
        if extension in ["json", "jsonl"]:
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)


⚠️ Potential issue | 🟡 Minor

Write JSON vs JSONL correctly

to_json(..., lines=True) produces JSONL. Only set lines=True for .jsonl; use lines=False for .json.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        extension = input_file.split(".")[-1]
        if extension in ["json", "jsonl"]:
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)

        extension = input_file.split(".")[-1]
        if extension == "jsonl":
            subset_data.to_json(output_file, orient="records", lines=True)
        elif extension == "json":
            subset_data.to_json(output_file, orient="records", lines=False)
        elif extension == "csv":
            subset_data.to_csv(output_file, index=False)
        elif extension == "parquet":
            subset_data.to_parquet(output_file)
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 604 to 611, the
code always calls to_json(..., lines=True) which writes JSONL; change the logic
to write JSONL only when the extension is ".jsonl" and write normal JSON for
".json" (i.e., call to_json(..., lines=False) or omit lines), keeping
orient="records" for both if desired; update the extension check to distinguish
"json" vs "jsonl" and call to_json with lines=False for ".json" and lines=True
for ".jsonl".

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from e69bef8 to 1d0eb3c (October 22, 2025 21:29)
@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
notebooks/use-cases/document-conversion-vlm.ipynb (1)

60-65: Fix undefined variable in exception example.

The new try/except snippet calls converter.convert(file_path) and logs file_path, but in this notebook file_path isn’t defined until after convert succeeds. Copying the snippet as-is will raise a NameError. Please either call converter.convert(file) (and log file) or show how to define file_path before the conversion.

-    result = converter.convert(file_path)
+    result = converter.convert(file)
 ...
-    print(f"❌ Failed to convert {file_path}: {str(e)}")
+    print(f"❌ Failed to convert {file}: {e}")
notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)

485-488: Pin revision and restrict trust_remote_code when loading HF model.

Unpinned trust_remote_code=True permits arbitrary code execution and non-reproducible results. Pin a commit and optionally enforce an allowlist; set device explicitly.

Replace the model init cell with:

import os, torch
ALLOWED = {"Snowflake/snowflake-arctic-embed-l"}
REV = os.environ.get("ARCTIC_MODEL_REV")  # set to pinned commit hash
if REV is None:
    raise ValueError("Set ARCTIC_MODEL_REV to a pinned commit for reproducibility.")

embedding_model = SentenceTransformer(
    "Snowflake/snowflake-arctic-embed-l",
    trust_remote_code=("Snowflake/snowflake-arctic-embed-l" in ALLOWED),
    revision=REV,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
♻️ Duplicate comments (7)
scripts/subset_selection/__main__.py (1)

1-1: Drop the shebang (module entrypoint).

Remove the shebang to avoid Ruff EXE001; this module is invoked via python -m.

-#!/usr/bin/env python3
scripts/subset_selection/encoders/arctic_encoder.py (2)

104-116: Harden HF model loading: pin revision and restrict trust_remote_code.

Remote loading with trust_remote_code=True without a pinned revision allows arbitrary code execution and non-reproducible runs. Require a pinned revision and only allow trust_remote_code for an explicit allowlist.

@@
-        # In testing mode, allow direct download from HuggingFace
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+        # In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+        ALLOWLIST = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+        revision = os.environ.get("ARCTIC_MODEL_REV", None)
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
             logger.warning(
                 f"Model not found locally at {model_path}. "
                 "Testing mode enabled - downloading from HuggingFace..."
             )
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
+            if revision is None:
+                raise ValueError("ARCTIC_MODEL_REV must be set to a pinned commit hash when testing_mode=True")
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
             self.model = AutoModel.from_pretrained(
                 self.cfg.model_name,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=(self.cfg.model_name in ALLOWLIST),
+                revision=revision,
             )
@@
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=(self.cfg.model_name in ALLOWLIST),
                 local_files_only=True,
             )

Also applies to: 123-129


182-189: Avoid OOM: tokenize and move to device per-batch, not for the entire corpus.

Current code pre-tokenizes and moves all tensors to GPU, which will OOM on large inputs. Tokenize inside the loop and move only that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
         embeddings_list = []
         for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-            outputs = self.model(**batch)
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+            outputs = self.model(**batch)

Also applies to: 191-199

scripts/subset_selection/utils/subset_selection_utils.py (1)

123-128: Clamp RBF scale to avoid 0/0 NaNs.

When all vectors are identical, avg_dist=0 leads to NaNs. Clamp before division.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
scripts/subset_selection/subset_selection.py (3)

278-286: Fix percentage normalization (floats >1 should mean percent).

Config allows floats up to 100, but here values >1 are rejected. Normalize >1 by /100 and validate.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+            return max(1, int(percentage * total_samples))

831-889: Skip empty folds and cap per-fold budgets to fold size.

Empty folds and oversized budgets crash the optimizer. Guard folds and clamp budgets; also normalize float percentages >1 here.

         for fold_idx, fold_indices in gpu_folds_info:
             try:
                 logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
 
-                fold_embeddings = embeddings[fold_indices].to(device)
+                fold_embeddings = embeddings[fold_indices].to(device)
+                fold_size = fold_embeddings.shape[0]
+                if fold_size == 0:
+                    logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                    continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
-                        # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

610-617: Write JSON vs JSONL correctly.

Use lines=True only for .jsonl; write standard JSON for .json.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
🧹 Nitpick comments (6)
scripts/subset_selection/encoders/arctic_encoder.py (1)

75-77: Optional: support Apple Metal (MPS) device.

Improve device fallback to prefer MPS on Apple if CUDA is unavailable.

-        self.device = device or torch.device(
-            "cuda" if torch.cuda.is_available() else "cpu"
-        )
+        if device is not None:
+            self.device = device
+        else:
+            if torch.cuda.is_available():
+                self.device = torch.device("cuda")
+            elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
+                self.device = torch.device("mps")
+            else:
+                self.device = torch.device("cpu")
notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)

43-44: Reproducibility: pin package versions (or use a constraints file).

Unpinned installs can break the notebook over time. Pin docling and sentence-transformers (you already pin pymilvus).

Example:

%pip install -qq "docling==2.43.0" "sentence-transformers==3.2.1" "numpy==2.0.2" "pymilvus[milvus_lite]==2.6.2"
scripts/subset_selection/utils/subset_selection_utils.py (2)

107-107: Preserve tensor dtype in results.

Initialize results with the input dtype to avoid unintended downcasting.

-    results = torch.zeros(n_samples1, n_samples2, device="cpu")
+    results = torch.zeros(n_samples1, n_samples2, device="cpu", dtype=tensor1.dtype)

35-35: Use logger.exception to keep stack traces on retries.

Retain traceback for failed attempts to aid debugging.

-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d: %s", attempt + 1, e)

Also applies to: 39-41, 45-45, 49-49, 53-53

scripts/subset_selection/subset_selection.py (2)

365-372: Use context managers for Pools to avoid resource leaks.

with Pool(...) ensures proper cleanup even on exceptions.

-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
@@
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)

Also applies to: 429-436


206-209: Optional: enable Jinja2 autoescape for safety.

Set autoescape to reduce surprises when rendering user-provided text.

-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))
@@
-        env = Environment(loader=BaseLoader())
+        env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))

Also applies to: 656-657

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1d0eb3c and 7a580fa.

📒 Files selected for processing (8)
  • notebooks/tutorials/dataset-preparation-for-RAG.ipynb (38 hunks)
  • notebooks/use-cases/document-conversion-standard.ipynb (18 hunks)
  • notebooks/use-cases/document-conversion-vlm.ipynb (17 hunks)
  • notebooks/use-cases/hybrid-chunking-demonstration.ipynb (20 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • notebooks/use-cases/hybrid-chunking-demonstration.ipynb
  • notebooks/use-cases/document-conversion-standard.ipynb
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (59-205)
  • EncoderConfig (48-56)
  • encode (171-205)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (87-148)
  • get_default_num_gpus (67-84)
  • retry_on_exception (21-64)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1094-1152)
🪛 Pylint (4.0.1)
scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 60-60: Too many positional arguments (6/5)

(R0917)


[refactor] 59-59: Too few public methods (1/2)

(R0903)

scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 87-87: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/subset_selection.py

[refactor] 365-365: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 429-429: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 619-619: Too many statements (57/50)

(R0915)


[refactor] 800-800: Too many branches (16/12)

(R0912)


[refactor] 800-800: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/encoders/arctic_encoder.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)


118-121: Avoid specifying long messages outside the exception class

(TRY003)


149-152: Avoid specifying long messages outside the exception class

(TRY003)


162-165: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


129-129: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


180-180: Prefer TypeError exception for invalid type

(TRY004)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Prefer TypeError exception for invalid type

(TRY004)


184-184: Avoid specifying long messages outside the exception class

(TRY003)


186-188: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


260-262: Avoid specifying long messages outside the exception class

(TRY003)


281-283: Avoid specifying long messages outside the exception class

(TRY003)


377-377: Avoid specifying long messages outside the exception class

(TRY003)


534-534: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


534-534: Use explicit conversion flag

Replace with conversion flag

(RUF010)


598-598: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


598-598: Use explicit conversion flag

Replace with conversion flag

(RUF010)


656-656: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


684-684: Abstract raise to an inner function

(TRY301)


684-684: Avoid specifying long messages outside the exception class

(TRY003)


741-741: Consider moving this statement to an else block

(TRY300)


746-746: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


746-746: Use explicit conversion flag

Replace with conversion flag

(RUF010)


824-824: Abstract raise to an inner function

(TRY301)


824-824: Avoid specifying long messages outside the exception class

(TRY003)


893-895: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


894-894: Use explicit conversion flag

Replace with conversion flag

(RUF010)


914-914: Consider moving this statement to an else block

(TRY300)


916-916: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


916-916: Use explicit conversion flag

Replace with conversion flag

(RUF010)


987-987: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


987-987: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1149-1149: Consider moving this statement to an else block

(TRY300)


1150-1150: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

Comment on lines +131 to +136
        if self.cfg.use_fp16:
            self.model = self.model.half()

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")


⚠️ Potential issue | 🟠 Major

Guard fp16 on CPU.

Calling .half() on CPU models can degrade or break ops. Apply fp16 only on CUDA; warn otherwise.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        if self.cfg.use_fp16:
            self.model = self.model.half()

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")

        if self.cfg.use_fp16:
            if self.cfg.device.type == "cuda":
                self.model = self.model.to(dtype=torch.float16)
            else:
                logger.warning("use_fp16=True ignored on non-CUDA device")

        self.model = self.model.to(self.cfg.device)
        logger.info(f"Model loaded on device: {self.cfg.device}")

@alimaredia (Contributor)

@RobuRishabh it looks like you accidentally pushed notebooks into this PR that have already been merged.

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from 7a580fa to abb5c87 (October 23, 2025 16:24)
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (9)
scripts/subset_selection/encoders/arctic_encoder.py (3)

104-115: Pin HF revision and restrict trust_remote_code (remote code execution risk).

Loading with trust_remote_code=True without a pinned revision allows arbitrary remote code execution and non-reproducible builds. Gate trust_remote_code behind an allowlist and require a pinned revision (commit SHA) for both tokenizer and model. Also pass revision when downloading; keep local_files_only=True for local paths.

Apply:

@@
-        # In testing mode, allow direct download from HuggingFace
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+        # In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
             logger.warning(
                 f"Model not found locally at {model_path}. "
                 "Testing mode enabled - downloading from HuggingFace..."
             )
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
-                add_pooling_layer=False,
-                trust_remote_code=True,
-            )
+            revision = os.environ.get("ARCTIC_MODEL_REV")
+            allowlist = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+            trust_rc = self.cfg.model_name in allowlist
+            if revision is None:
+                raise RuntimeError("ARCTIC_MODEL_REV env var must be set to a pinned commit SHA in testing_mode")
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
+                add_pooling_layer=False,
+                trust_remote_code=trust_rc,
+                revision=revision,
+            )
         else:
             if not os.path.exists(model_path):
                 raise ValueError(
                     f"Model not found in available models: {self.cfg.model_name}\n"
                     "Please run `ilab model download` and download the necessary model"
                 )
 
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=True,  # local allowlisted artifact
                 local_files_only=True,
             )

Also applies to: 117-130


131-135: Guard fp16 on CPU.

.half() on CPU can break ops. Apply fp16 only on CUDA; warn otherwise.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")

182-200: Batch-tokenize and move per-batch to device to avoid OOM.

Pre-tokenizing the entire corpus and calling .to(device) moves all tensors to GPU at once. Tokenize inside the loop and move only that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
-        embeddings_list = []
-        for i in tqdm(
+        embeddings_list = []
+        for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
             outputs = self.model(**batch)
             # Take the first token embedding (CLS) and normalize it
             embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
             embeddings_list.append(embeddings.cpu())
scripts/subset_selection/README.md (1)

269-276: Replace developer-specific absolute path with project-relative path.

-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
scripts/subset_selection/__main__.py (1)

1-1: Drop the shebang; module entry points don’t need it.

-#!/usr/bin/env python3
scripts/subset_selection/utils/subset_selection_utils.py (1)

122-129: Clamp RBF scaling to avoid NaNs when avg_dist ≈ 0.

Division by kw * avg_dist can be 0/0 for identical embeddings. Clamp the scale away from zero.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance
scripts/subset_selection/subset_selection.py (3)

278-287: Fix percentage normalization inconsistency (floats up to 100 allowed in config).

ProcessingConfig.__post_init__ allows floats in 0–100, but calculate_subset_size rejects floats >1. Normalize floats >1 to percentages.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be in (0, 100]")
+            return max(1, int(percentage * total_samples))

829-887: Skip empty folds and cap per‑fold budgets to fold size.

Empty folds and over‑budget requests crash FacilityLocationFunction.maximize. Skip zero‑sized folds and clamp budgets.

-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+                continue
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

608-615: Write JSON vs JSONL correctly.

Only use lines=True for .jsonl; for .json write a normal JSON array.

-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
         elif extension == "csv":
             subset_data.to_csv(output_file, index=False)
         elif extension == "parquet":
             subset_data.to_parquet(output_file)
🧹 Nitpick comments (6)
scripts/subset_selection/README.md (1)

99-116: Docs polish: fenced code languages + install note.

  • Add language to fenced blocks to satisfy markdownlint (CLI options, tree, traceback).
  • Prefer repo-root pinned install as per reviewer guidance; keep CUDA index hint.
-```
+```text
@@
-```
+```
@@
-```
+```text
@@
-```
+```python
@@
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)

Optionally, update Installation to:

- pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
+ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

Would you like me to push the README tweaks as a follow-up commit?

Also applies to: 231-242, 290-299, 353-359

scripts/subset_selection/utils/subset_selection_utils.py (2)

35-53: Prefer logger.exception to retain stack traces on failures.

Switch logger.error(f"...{str(e)}") to logger.exception("...%s", e) to capture tracebacks; applies to all except blocks in this wrapper.

-                logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+                logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(
-                    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-                )
+                logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-                logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+                logger.exception("Index error on attempt %d: %s", attempt + 1, e)

96-108: Memory caution: NxN result matrix can be large.

results is n1 x n2 on CPU; for big folds this can be GBs. Consider streaming selection or memory-mapped storage if folds exceed a threshold, or compute sparse top‑k similarities per row to reduce footprint.

Would you like a prototype that backs results with a memmap and yields row-chunks to the optimizer?
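
For reference, a minimal sketch of the memmap idea (the function name, chunk size, and file path are hypothetical; it assumes L2-normalized rows with a cosine metric rather than the full metric/device handling of compute_pairwise_dense):

import numpy as np
import torch

def compute_pairwise_memmap(tensor1, tensor2, path, chunk_size=1024):
    n1, n2 = tensor1.shape[0], tensor2.shape[0]
    # Disk-backed result matrix: only one row-chunk is ever materialized in RAM
    out = np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(n1, n2))
    t2 = tensor2.float().T  # cosine similarity on L2-normalized rows
    for i in range(0, n1, chunk_size):
        sims = tensor1[i : i + chunk_size].float() @ t2
        out[i : i + chunk_size] = sims.cpu().numpy()
    out.flush()
    return out  # the optimizer can then consume row-chunks lazily via out[i:j]

The sparse alternative (keeping only the top-k similarities per row) trades a little selection quality for a much smaller footprint; happy to prototype either variant.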

scripts/subset_selection/subset_selection.py (3)

365-372: Use context manager for multiprocessing.Pool to ensure cleanup.

Replace manual close/join/terminate with with Pool(...) as pool: for safer resource management and clearer intent.

-        pool = Pool(processes=num_gpus)
-        try:
-            shard_files = pool.map(_process_dataset_shard, args_list)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=num_gpus) as pool:
+            shard_files = pool.map(_process_dataset_shard, args_list)
@@
-        pool = Pool(processes=self.config.system.num_gpus)
-        try:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-        finally:
-            pool.close()
-            pool.join()
-            pool.terminate()
+        with Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)

Also applies to: 429-436


654-675: Jinja2 autoescape off by default; enable for safety in templating.

Even for CLI utilities, enabling autoescape for common text types is safer.

-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))
@@
-        env = Environment(loader=BaseLoader())
+        env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))

Also applies to: 206-213


531-534: Preserve stack traces in logs for easier debugging.

Use logger.exception("...%s", e) inside except blocks instead of logger.error(...).

I can push a quick pass converting these to logger.exception if you’d like.

Also applies to: 595-598, 984-986, 891-894, 914-919

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a580fa and abb5c87.

📒 Files selected for processing (5)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1093-1174)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (59-205)
  • EncoderConfig (48-56)
  • encode (171-205)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (87-148)
  • get_default_num_gpus (67-84)
  • retry_on_exception (21-64)
🪛 LanguageTool
scripts/subset_selection/README.md

[style] ~129-~129: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...

(ADVERB_REPETITION_PREMIUM)

🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

99-99: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


231-231: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


290-290: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


355-355: Bare URL used

(MD034, no-bare-urls)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 87-87: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 60-60: Too many positional arguments (6/5)

(R0917)


[refactor] 59-59: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 365-365: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 429-429: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 617-617: Too many statements (57/50)

(R0915)


[refactor] 798-798: Too many branches (16/12)

(R0912)


[refactor] 798-798: Too many statements (53/50)

(R0915)

🪛 Ruff (0.14.1)
scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


129-129: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/encoders/arctic_encoder.py

70-72: Avoid specifying long messages outside the exception class

(TRY003)


118-121: Avoid specifying long messages outside the exception class

(TRY003)


149-152: Avoid specifying long messages outside the exception class

(TRY003)


162-165: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


180-180: Prefer TypeError exception for invalid type

(TRY004)


180-180: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Prefer TypeError exception for invalid type

(TRY004)


184-184: Avoid specifying long messages outside the exception class

(TRY003)


186-188: Avoid specifying long messages outside the exception class

(TRY003)


190-190: Avoid specifying long messages outside the exception class

(TRY003)


206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


231-231: Avoid specifying long messages outside the exception class

(TRY003)


260-262: Avoid specifying long messages outside the exception class

(TRY003)


281-283: Avoid specifying long messages outside the exception class

(TRY003)


377-377: Avoid specifying long messages outside the exception class

(TRY003)


532-532: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


532-532: Use explicit conversion flag

Replace with conversion flag

(RUF010)


596-596: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


596-596: Use explicit conversion flag

Replace with conversion flag

(RUF010)


654-654: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


682-682: Abstract raise to an inner function

(TRY301)


682-682: Avoid specifying long messages outside the exception class

(TRY003)


739-739: Consider moving this statement to an else block

(TRY300)


744-744: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


744-744: Use explicit conversion flag

Replace with conversion flag

(RUF010)


822-822: Abstract raise to an inner function

(TRY301)


822-822: Avoid specifying long messages outside the exception class

(TRY003)


891-893: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


892-892: Use explicit conversion flag

Replace with conversion flag

(RUF010)


912-912: Consider moving this statement to an else block

(TRY300)


914-914: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


914-914: Use explicit conversion flag

Replace with conversion flag

(RUF010)


985-985: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


985-985: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1116-1118: Avoid specifying long messages outside the exception class

(TRY003)


1128-1130: Avoid specifying long messages outside the exception class

(TRY003)


1136-1136: Avoid specifying long messages outside the exception class

(TRY003)


1171-1171: Consider moving this statement to an else block

(TRY300)


1172-1172: Do not catch blind exception: Exception

(BLE001)

@RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from abb5c87 to 158f73b (October 24, 2025 02:03)
@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (9)
scripts/subset_selection/requirements.txt (1)

70-72: Duplicate: torch 2.5.2 remains vulnerable to critical CVEs

This issue was already flagged in previous reviews. OSV continues to report 4 CRITICAL vulnerabilities for torch 2.5.2 (PYSEC-2025-41, GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Please verify if a patched version is available or document mitigation strategy.
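
If an upgrade is not immediately feasible, one partial mitigation for the torch.load advisory (GHSA-53q9-r3pm-6pq6) is to avoid torch.load on untrusted checkpoints altogether; a sketch, assuming the artifacts are also published in safetensors format (the file path is illustrative):

# Partial mitigation sketch, assuming a .safetensors artifact is available:
# safetensors deserialization uses no pickle, so an untrusted file cannot
# execute code at load time.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")  # tensors only, no code execution

Note this does not cover the other three advisories, so pinning a patched torch release remains the preferred fix.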

scripts/subset_selection/encoders/arctic_encoder.py (3)

97-132: Duplicate: Security risk with unrestricted trust_remote_code

This issue was flagged in previous reviews. All three model loading paths (custom path, local cache, HuggingFace download) use trust_remote_code=True without pinning a revision or maintaining an allowlist, allowing arbitrary remote code execution.


183-201: Duplicate: Pre-tokenizing entire input list causes OOM on large batches

This issue was flagged in previous reviews. Line 183-189 tokenizes the entire inputs list and moves all tensors to GPU via .to(self.cfg.device) before batching, defeating the purpose of batching and causing OOM on large datasets. Tokenization should happen inside the batch loop.


134-135: Duplicate: FP16 applied without checking device type

This issue was flagged in previous reviews. Calling .half() on CPU models can degrade performance or break operations. FP16 should only be applied to CUDA devices.

scripts/subset_selection/utils/subset_selection_utils.py (1)

116-122: Duplicate: RBF scaling can divide by zero causing NaNs

This issue was flagged in previous reviews. When embeddings are identical or nearly so, avg_dist becomes zero, making the division at line 120 perform 0/0 and yielding NaNs that propagate through the similarity matrix.

scripts/subset_selection/subset_selection.py (3)

277-286: Duplicate: Float percentage values > 1 are rejected despite config allowing them

This issue was flagged in previous reviews. Lines 279-282 reject float values > 1, but ProcessingConfig.__post_init__ (line 184) explicitly allows floats up to 100 to represent percentages. A config with subset_sizes=[10.0] passes validation but fails here at runtime.


607-608: Duplicate: JSON files written as JSONL

This issue was flagged in previous reviews. Line 608 uses lines=True for both .json and .jsonl extensions, producing JSONL format for .json files. Only .jsonl should use lines=True.


824-882: Duplicate: Empty folds and uncapped budgets cause crashes

This issue was flagged in previous reviews. When len(embeddings) < num_folds, empty folds are created. When subset size exceeds dataset size, per-fold budgets can exceed fold cardinality. Both cases cause FacilityLocationFunction.maximize to fail. Empty folds should be skipped and budgets capped to fold size.

scripts/subset_selection/__main__.py (1)

1-1: Shebang still present despite being marked as addressed

The shebang on line 1 is unnecessary for module execution via python -m subset_selection and triggers Ruff EXE001. Although this was marked as addressed in a previous review, the shebang remains in the code.

Apply this diff to remove it:

-#!/usr/bin/env python3
 """
 Entry point for subset selection when run as a module.
 """

Based on static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abb5c87 and 158f73b.

📒 Files selected for processing (7)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__main__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
  • ArcticEmbedEncoder (58-206)
  • EncoderConfig (48-55)
  • encode (172-206)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (81-142)
  • get_default_num_gpus (67-78)
  • retry_on_exception (21-64)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (107-118)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
  • main (1090-1173)
🪛 LanguageTool
scripts/subset_selection/README.md

[style] ~130-~130: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...

(ADVERB_REPETITION_PREMIUM)

🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

100-100: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


233-233: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


300-300: Bare URL used

(MD034, no-bare-urls)

🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt

[CRITICAL] 1-1: torch 2.5.2: undefined

(PYSEC-2025-41)


[CRITICAL] 1-1: torch 2.5.2: PyTorch susceptible to local Denial of Service

(GHSA-3749-ghw9-m3mg)


[CRITICAL] 1-1: torch 2.5.2: PyTorch: torch.load with weights_only=True leads to remote code execution

(GHSA-53q9-r3pm-6pq6)


[CRITICAL] 1-1: torch 2.5.2: PyTorch Improper Resource Shutdown or Release vulnerability

(GHSA-887c-mr87-cxwp)

🪛 Pylint (4.0.1)
scripts/subset_selection/subset_selection.py

[refactor] 364-364: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 427-427: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 615-615: Too many statements (57/50)

(R0915)


[refactor] 796-796: Too many branches (15/12)

(R0912)


[refactor] 796-796: Too many statements (51/50)

(R0915)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 81-81: Too many positional arguments (7/5)

(R0917)

🪛 Ruff (0.14.1)
scripts/subset_selection/subset_selection.py

54-57: Abstract raise to an inner function

(TRY301)


54-57: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Avoid specifying long messages outside the exception class

(TRY003)


60-60: Use explicit conversion flag

Replace with conversion flag

(RUF010)


84-84: Avoid specifying long messages outside the exception class

(TRY003)


179-179: Prefer TypeError exception for invalid type

(TRY004)


179-179: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Prefer TypeError exception for invalid type

(TRY004)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


185-187: Avoid specifying long messages outside the exception class

(TRY003)


189-189: Avoid specifying long messages outside the exception class

(TRY003)


205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


230-230: Avoid specifying long messages outside the exception class

(TRY003)


259-261: Avoid specifying long messages outside the exception class

(TRY003)


280-282: Avoid specifying long messages outside the exception class

(TRY003)


376-376: Avoid specifying long messages outside the exception class

(TRY003)


530-530: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


530-530: Use explicit conversion flag

Replace with conversion flag

(RUF010)


594-594: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Use explicit conversion flag

Replace with conversion flag

(RUF010)


652-652: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


680-680: Abstract raise to an inner function

(TRY301)


680-680: Avoid specifying long messages outside the exception class

(TRY003)


737-737: Consider moving this statement to an else block

(TRY300)


742-742: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


742-742: Use explicit conversion flag

Replace with conversion flag

(RUF010)


886-888: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


887-887: Use explicit conversion flag

Replace with conversion flag

(RUF010)


907-907: Consider moving this statement to an else block

(TRY300)


909-909: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


909-909: Use explicit conversion flag

Replace with conversion flag

(RUF010)


979-979: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


979-979: Use explicit conversion flag

Replace with conversion flag

(RUF010)


1113-1115: Avoid specifying long messages outside the exception class

(TRY003)


1125-1127: Avoid specifying long messages outside the exception class

(TRY003)


1133-1133: Avoid specifying long messages outside the exception class

(TRY003)


1170-1170: Consider moving this statement to an else block

(TRY300)


1171-1171: Do not catch blind exception: Exception

(BLE001)

scripts/subset_selection/encoders/arctic_encoder.py

69-71: Avoid specifying long messages outside the exception class

(TRY003)


150-153: Avoid specifying long messages outside the exception class

(TRY003)


163-166: Avoid specifying long messages outside the exception class

(TRY003)

scripts/subset_selection/__main__.py

1-1: Shebang is present but file is not executable

(EXE001)

scripts/subset_selection/utils/subset_selection_utils.py

35-35: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


35-35: Use explicit conversion flag

Replace with conversion flag

(RUF010)


39-41: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


40-40: Use explicit conversion flag

Replace with conversion flag

(RUF010)


45-45: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Use explicit conversion flag

Replace with conversion flag

(RUF010)


49-49: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


49-49: Use explicit conversion flag

Replace with conversion flag

(RUF010)


53-53: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


53-53: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
scripts/subset_selection/README.md (1)

294-303: Credits section properly addresses attribution requirement

The addition of the Credits and Acknowledgements section properly attributes the original author (Krishnateja Killamsetty) and links to the DataCurate4LLMs repository, addressing the PR objective to add attribution.

Comment context from scripts/subset_selection/subset_selection.py (top-of-file imports):

from tqdm import tqdm

# Local
from encoders.arctic_encoder import ArcticEmbedEncoder

⚠️ Potential issue | 🟡 Minor

Import uses a bare module path instead of the package's absolute path

The import on line 22, from encoders.arctic_encoder import ArcticEmbedEncoder, only resolves when scripts/subset_selection itself is on sys.path, rather than referencing the module through the package. Based on feedback from alimaredia, this should use an absolute import path that references the module within the package structure.

Consider updating to the full module path, or ensure the package layout supports this import pattern consistently.

🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py at line 22, the import uses a
relative/non-absolute form "from encoders.arctic_encoder import
ArcticEmbedEncoder"; change it to the package's absolute module path (for
example "from my_package.encoders.arctic_encoder import ArcticEmbedEncoder" or
whatever the repository's root package is) so imports resolve consistently; if
the project uses a top-level package name adjust the prefix accordingly and
update other imports or __init__.py files to maintain consistent absolute
imports across the package.

Signed-off-by: roburishabh <[email protected]>
@RobuRishabh RobuRishabh force-pushed the R1146-Only-scripts-for-subset-selection branch from 158f73b to 408e383 on October 27, 2025 at 20:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (9)
scripts/subset_selection/README.md (1)

211-214: Replace developer-specific absolute path with a portable example.

Use a relative path to avoid leaking local paths and to make instructions reusable.

 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing
+# or: cd <project-root>
scripts/subset_selection/cli.py (1)

1-1: Make CLI script executable (shebang present).

Set the executable bit so it can be run directly.

chmod +x scripts/subset_selection/cli.py
scripts/subset_selection/utils/subset_selection_utils.py (1)

121-127: Prevent NaNs in RBF scaling when avg_dist → 0.

avg_dist can be zero for identical embeddings; divide-by-zero yields NaNs that break downstream selection. Clamp the scale.

         if metric == "rbf":
             distance = torch.cdist(a, b)
             squared_distance = distance**2
-            avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+            avg_dist = torch.mean(squared_distance)
+            eps = torch.finfo(squared_distance.dtype).eps
+            scale = kw * torch.clamp(avg_dist, min=eps)
+            torch.div(squared_distance, scale, out=squared_distance)
             torch.exp(-squared_distance, out=squared_distance)
             return squared_distance

As previously raised by the bot.

scripts/subset_selection/subset_selection.py (3)

253-263: Normalize percentage floats and align with config (0–100).

Config accepts floats up to 100, but this method rejects >1. Normalize >1 by dividing by 100, then validate and compute.

-        if isinstance(size_spec, float):
-            # if not in range of 0 to 1, raise error
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-            return max(1, int((size_spec) * total_samples))
+        if isinstance(size_spec, float):
+            # Accept 0–1 (fraction) or 0–100 (percent)
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            if percentage <= 0 or percentage > 1:
+                raise ValueError("Percentage values must be in (0, 100]")
+            return max(1, int(percentage * total_samples))

As previously raised by the bot.


738-847: Skip empty folds and clamp per-fold budgets to fold size.

Prevents optimizer crashes when a fold is empty or its budget exceeds the fold size. Also normalizes float percentages consistently.

-        for fold_idx, fold_indices in gpu_folds_info:
+        for fold_idx, fold_indices in gpu_folds_info:
+            fold_size = len(fold_indices)
+            if fold_size == 0:
+                logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                continue
             try:
@@
-                for size_spec in subset_sizes:
+                for size_spec in subset_sizes:
                     if isinstance(size_spec, float):
-                        # Percentage-based selection
-                        budget = max(
-                            1, math.ceil(size_spec * similarity_matrix.shape[0])
-                        )
+                        # Percentage-based selection (accept 0–1 or 0–100)
+                        percentage = (size_spec / 100.0) if size_spec > 1 else size_spec
+                        budget = max(1, math.ceil(percentage * fold_size))
                     else:
                         # Absolute number-based selection
-                        budget = max(
-                            1,
-                            math.ceil(
-                                size_spec * (similarity_matrix.shape[0] / total_samples)
-                            ),
-                        )
+                        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+                    budget = min(budget, fold_size)
+                    if budget == 0:
+                        continue

As previously raised by the bot.


570-577: Write JSON vs JSONL correctly.

Only use lines=True for .jsonl; write standard JSON for .json.

-        extension = input_file.split(".")[-1]
-        if extension in ["json", "jsonl"]:
-            subset_data.to_json(output_file, orient="records", lines=True)
+        extension = input_file.split(".")[-1]
+        if extension == "jsonl":
+            subset_data.to_json(output_file, orient="records", lines=True)
+        elif extension == "json":
+            subset_data.to_json(output_file, orient="records", lines=False)
         elif extension == "csv":
             subset_data.to_csv(output_file, index=False)
         elif extension == "parquet":
             subset_data.to_parquet(output_file)

As previously raised by the bot.

scripts/subset_selection/encoders/arctic_encoder.py (3)

130-134: Guard fp16 on CPU devices.

Calling half() on CPU can degrade/break operations. Apply fp16 only on CUDA.

-        if self.cfg.use_fp16:
-            self.model = self.model.half()
+        if self.cfg.use_fp16:
+            if self.cfg.device.type == "cuda":
+                self.model = self.model.to(dtype=torch.float16)
+            else:
+                logger.warning("use_fp16=True ignored on non-CUDA device")

181-199: Critical: avoid pre-tokenizing entire corpus on GPU — batch-tokenize inside the loop.

Current code tokenizes all inputs and moves them to the device, which will OOM on large datasets. Tokenize per batch and only move that batch.

-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
         embeddings_list = []
         for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-            outputs = self.model(**batch)
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+            outputs = self.model(**batch)
             # Take the first token embedding (CLS) and normalize it
             embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
             embeddings_list.append(embeddings.cpu())

103-115: Security posture: gate trust_remote_code and pin revisions.

trust_remote_code=True executes arbitrary repo code. Pin a revision and limit trust to allowlisted repos (even in testing mode).

-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
-                add_pooling_layer=False,
-                trust_remote_code=True,
-            )
+            revision = os.environ.get("ARCTIC_MODEL_REV")  # optional: pass via config
+            allowlisted = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+            trust = self.cfg.model_name in allowlisted
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
+                add_pooling_layer=False,
+                trust_remote_code=trust,
+                revision=revision,
+            )
@@
-            self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+            revision = os.environ.get("ARCTIC_MODEL_REV")
+            self.tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
             self.model = AutoModel.from_pretrained(
                 model_path,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=False,
                 local_files_only=True,
+                revision=revision,
             )

Consider surfacing revision in config instead of env var for reproducibility. Based on learnings.

🧹 Nitpick comments (8)
scripts/subset_selection/README.md (3)

64-81: Add language to fenced block to satisfy MD040 and improve readability.

Specify a language for the CLI options block (e.g., "text") to fix markdownlint MD040.

-```
+```text
 Required:
   --input <file> [<file> ...]    Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>         Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")

 Optional:
   --output-dir <dir>             Output directory (default: output)
   --batch-size <int>             Batch size for processing (default: 100000)
   --num-folds <int>              Number of folds/partitions (default: 50)
   --epsilon <float>              Optimization parameter (default: 160.0)
   --num-gpus <int>               Number of GPUs to use (default: auto-detect)
   --combine-files                Combine multiple input files before processing
   --testing-mode                 Enable CPU mode for testing
   --encoder-type <str>           Encoder type (default: arctic)
   --encoder-model <str>          Model name (default: Snowflake/snowflake-arctic-embed-l-v2.0)
   --template-name <str>          Template name (default: conversation)
   --seed <int>                   Random seed (default: 42)


168-183: Add language to directory tree block to satisfy MD040.

Mark the directory tree block as "text".


-```
+```text
 scripts/
 ├── __init__.py              # Top-level package initialization
 └── subset_selection/
     ├── __init__.py          # Subset selection package initialization
     ├── subset_selection.py  # Main subset selection logic
     ├── cli.py              # Command-line interface
     ├── requirements.txt    # Package dependencies
     ├── README.md          # This file
     ├── encoders/
     │   ├── __init__.py     # Encoder registry
     │   └── arctic_encoder.py  # Arctic embedding encoder
     └── utils/
         ├── __init__.py     # Utils initialization
         └── subset_selection_utils.py  # Utility functions

12-18: Installation note: pin Python and show single-command install (optional).

Docs currently install from scripts/subset_selection/requirements.txt. Consider adding:

  • An explicit Python version (3.12), as requested by the reviewer.
  • A single install command using the CUDA wheel index for PyTorch (if CUDA 12.1 is the target).

Proposed addition after the current snippet:

# Requires Python 3.12
python3.12 -m pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

Please confirm which requirements file is the canonical source (repo root vs scripts/subset_selection) and the intended CUDA toolchain for this project. Based on learnings.

scripts/subset_selection/encoders/__init__.py (1)

12-23: Avoid blanket try/except; raise precise errors.

The extra try/except re-wraps all errors as ValueError, obscuring root causes. Let the explicit unsupported-type check raise, and return directly otherwise.

-def get_encoder_class(encoder_type: str):
-    """Get the encoder class based on the encoder type."""
-    try:
-        if encoder_type not in ENCODER_REGISTRY:
-            supported_encoders = list(ENCODER_REGISTRY.keys())
-            raise ValueError(
-                f"Unsupported encoder type: '{encoder_type}'. "
-                f"Supported types are: {supported_encoders}"
-            )
-        return ENCODER_REGISTRY[encoder_type]
-    except Exception as e:
-        raise ValueError(f"Error getting encoder class: {str(e)}") from e
+def get_encoder_class(encoder_type: str):
+    """Get the encoder class based on the encoder type."""
+    if encoder_type not in ENCODER_REGISTRY:
+        supported = list(ENCODER_REGISTRY.keys())
+        raise ValueError(f"Unsupported encoder type: '{encoder_type}'. Supported types: {supported}")
+    return ENCODER_REGISTRY[encoder_type]
scripts/subset_selection/cli.py (1)

106-114: Clarify subset-size parsing; support '%' to avoid ambiguity.

Current logic treats "1" as absolute size 1, not 100%. Add "%" support and keep decimals for percentages to prevent surprises.

-    subset_sizes = []
-    for size_str in args.subset_sizes.split(","):
-        size_str = size_str.strip()
-        if "." in size_str:
-            subset_sizes.append(float(size_str))
-        else:
-            subset_sizes.append(int(size_str))
+    subset_sizes = []
+    for raw in args.subset_sizes.split(","):
+        s = raw.strip()
+        if s.endswith("%"):
+            subset_sizes.append(float(s[:-1]) / 100.0)
+            continue
+        # Decimals (e.g., 0.1) are treated as percentages; integers are absolute counts
+        subset_sizes.append(float(s) if "." in s else int(s))

Consider documenting the "%" option in README. Based on learnings.
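
Under that parsing, a mixed spec would be unambiguous, e.g.:

```python
# Hypothetical example of the proposed behavior:
# --subset-sizes "10%,0.5,1000"  ->  [0.1, 0.5, 1000]
#   10%   -> 10 percent of the dataset
#   0.5   -> 50 percent (decimal fraction)
#   1000  -> 1000 samples (absolute count)
```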

scripts/subset_selection/utils/subset_selection_utils.py (1)

14-18: Do not call logging.basicConfig in library code.

basicConfig at import time can override app logging. Let the application configure logging; keep a module logger only.

-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
scripts/subset_selection/subset_selection.py (2)

31-36: Avoid setting logging.basicConfig in a library module.

Leave logging configuration to the application/CLI; keep module logger only.

-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)

275-277: Name percentage subsets as whole percentages (e.g., 10.0%).

Prefer human-friendly names like "percent_10.0" instead of "percent_0.1".

-        if isinstance(size_spec, float):
-            return f"percent_{size_spec:.1f}"
+        if isinstance(size_spec, float):
+            pct = (size_spec * 100.0) if size_spec <= 1 else size_spec
+            return f"percent_{pct:.1f}"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 158f73b and 408e383.

📒 Files selected for processing (10)
  • scripts/subset_selection/.gitignore (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • scripts/subset_selection/__init__.py (1 hunks)
  • scripts/subset_selection/cli.py (1 hunks)
  • scripts/subset_selection/encoders/__init__.py (1 hunks)
  • scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
  • scripts/subset_selection/requirements.txt (1 hunks)
  • scripts/subset_selection/subset_selection.py (1 hunks)
  • scripts/subset_selection/utils/__init__.py (1 hunks)
  • scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • scripts/subset_selection/requirements.txt
  • scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
  • BasicConfig (39-78)
  • DataProcessor (168-576)
  • EncoderConfig (82-93)
  • ProcessingConfig (127-165)
  • SystemConfig (112-123)
  • TemplateConfig (97-108)
  • get_supported_encoders (853-861)
  • subset_datasets (864-923)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • EncoderConfig (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • ArcticEmbedEncoder (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
  • EncoderConfig (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
  • get_encoder_class (12-23)
scripts/subset_selection/utils/subset_selection_utils.py (3)
  • compute_pairwise_dense (86-147)
  • get_default_num_gpus (66-83)
  • retry_on_exception (20-63)
scripts/subset_selection/encoders/arctic_encoder.py (1)
  • encode (170-204)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
  • subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md

64-64: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


168-168: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py

[refactor] 86-86: Too many positional arguments (7/5)

(R0917)

scripts/subset_selection/encoders/arctic_encoder.py

[refactor] 59-59: Too many positional arguments (6/5)

(R0917)


[refactor] 58-58: Too few public methods (1/2)

(R0903)

scripts/subset_selection/subset_selection.py

[refactor] 738-738: Too many branches (14/12)

(R0912)

🔇 Additional comments (2)
scripts/subset_selection/utils/__init__.py (1)

1-16: LGTM — clean re-exports.

scripts/subset_selection/__init__.py (1)

1-29: LGTM — clear public surface and re-exports.

@alimaredia alimaredia merged commit 2db1ad5 into opendatahub-io:main Oct 28, 2025
2 checks passed
@RobuRishabh RobuRishabh deleted the R1146-Only-scripts-for-subset-selection branch November 3, 2025 14:30