[RHAIENG]-1146 Initial Repository Setup and Baseline Testing (only scripts) #30
Conversation
Walkthrough: This PR introduces a complete subset selection pipeline module at `scripts/subset_selection/`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI Interface
    participant DP as DataProcessor
    participant Enc as ArcticEmbedEncoder
    participant GPU1 as GPU-1<br/>(Shard Process)
    participant GPU2 as GPU-2<br/>(Shard Process)
    participant Subset as Facility<br/>Location
    participant Out as Output<br/>Persistence

    CLI->>DP: parse config & call subset_datasets()
    DP->>DP: load & combine datasets
    DP->>DP: split dataset into shards per GPU

    par GPU Embedding Generation
        DP->>GPU1: _process_dataset_shard(shard_1)
        GPU1->>Enc: encode(texts)
        Enc->>Enc: tokenize, batch, inference
        Enc-->>GPU1: embeddings
        GPU1->>GPU1: save embeddings
        GPU1-->>DP: shard complete
    and
        DP->>GPU2: _process_dataset_shard(shard_2)
        GPU2->>Enc: encode(texts)
        Enc-->>GPU2: embeddings
        GPU2->>GPU2: save embeddings
        GPU2-->>DP: shard complete
    end

    DP->>DP: merge shard embeddings

    par Fold-based Subset Selection
        DP->>Subset: process_folds_with_gpu(fold_1)
        Subset->>Subset: facility location optimization
        Subset-->>DP: subset indices
    and
        DP->>Subset: process_folds_with_gpu(fold_2)
        Subset-->>DP: subset indices
    end

    DP->>Out: _save_subset(data, metadata)
    Out-->>CLI: completion signal
```
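For orientation, here is a minimal sketch of driving the pipeline from Python. `subset_datasets` is the real entry point exported by the package, but the keyword names below are assumptions for illustration, not taken from the diff:

```python
# Hypothetical usage sketch - keyword names are assumptions, not the PR's API.
from scripts.subset_selection import subset_datasets

# Floats are treated as percentage specs and ints as absolute counts,
# mirroring the "0.1,0.5" / "1000,5000" forms accepted by the CLI.
subset_datasets(
    input_files=["train.jsonl"],  # assumed parameter name
    subset_sizes=[0.1, 1000],     # a 10% subset and a 1000-sample subset
)
```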
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches: ❌ 1 check failed (inconclusive), ✅ 2 checks passed.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- scripts/subset_selection/.gitignore (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__init__.py (1 hunks)
- scripts/subset_selection/cli.py (1 hunks)
- scripts/subset_selection/encoders/__init__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/__init__.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
- scripts/subset_selection/utils/subset_selection_utils.py (3): compute_pairwise_dense (86-147), get_default_num_gpus (66-83), retry_on_exception (20-63)

scripts/subset_selection/cli.py (1)
- scripts/subset_selection/subset_selection.py (1): subset_datasets (864-923)

scripts/subset_selection/__init__.py (2)
- scripts/subset_selection/subset_selection.py (8): BasicConfig (39-78), DataProcessor (168-576), EncoderConfig (82-93), ProcessingConfig (127-165), SystemConfig (112-123), TemplateConfig (97-108), get_supported_encoders (853-861), subset_datasets (864-923)
- scripts/subset_selection/encoders/arctic_encoder.py (1): EncoderConfig (47-55)

scripts/subset_selection/encoders/__init__.py (1)
- scripts/subset_selection/encoders/arctic_encoder.py (1): ArcticEmbedEncoder (58-204)

scripts/subset_selection/encoders/arctic_encoder.py (1)
- scripts/subset_selection/subset_selection.py (1): EncoderConfig (82-93)

scripts/subset_selection/subset_selection.py (3)
- scripts/subset_selection/encoders/__init__.py (1): get_encoder_class (12-23)
- scripts/subset_selection/utils/subset_selection_utils.py (3): compute_pairwise_dense (86-147), get_default_num_gpus (66-83), retry_on_exception (20-63)
- scripts/subset_selection/encoders/arctic_encoder.py (2): EncoderConfig (47-55), encode (170-204)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
168-168: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Pylint (4.0.0)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 86-86: Too many positional arguments (7/5)
(R0917)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 59-59: Too many positional arguments (6/5)
(R0917)
[refactor] 58-58: Too few public methods (1/2)
(R0903)
scripts/subset_selection/subset_selection.py
[refactor] 738-738: Too many branches (14/12)
(R0912)
🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py
1-1: Shebang is present but file is not executable
(EXE001)
116-116: f-string without any placeholders
Remove extraneous f prefix
(F541)
147-147: Consider moving this statement to an else block
(TRY300)
148-148: Do not catch blind exception: Exception
(BLE001)
scripts/subset_selection/utils/subset_selection_utils.py
34-34: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
34-34: Use explicit conversion flag
Replace with conversion flag
(RUF010)
38-40: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
39-39: Use explicit conversion flag
Replace with conversion flag
(RUF010)
44-44: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
44-44: Use explicit conversion flag
Replace with conversion flag
(RUF010)
48-48: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
48-48: Use explicit conversion flag
Replace with conversion flag
(RUF010)
52-52: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
52-52: Use explicit conversion flag
Replace with conversion flag
(RUF010)
80-82: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/encoders/__init__.py
17-20: Abstract raise to an inner function
(TRY301)
17-20: Avoid specifying long messages outside the exception class
(TRY003)
23-23: Avoid specifying long messages outside the exception class
(TRY003)
23-23: Use explicit conversion flag
Replace with conversion flag
(RUF010)
scripts/subset_selection/encoders/arctic_encoder.py
69-71: Avoid specifying long messages outside the exception class
(TRY003)
117-120: Avoid specifying long messages outside the exception class
(TRY003)
148-151: Avoid specifying long messages outside the exception class
(TRY003)
161-164: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/subset_selection.py
59-59: Avoid specifying long messages outside the exception class
(TRY003)
155-155: Prefer TypeError exception for invalid type
(TRY004)
155-155: Avoid specifying long messages outside the exception class
(TRY003)
159-159: Prefer TypeError exception for invalid type
(TRY004)
159-159: Avoid specifying long messages outside the exception class
(TRY003)
161-163: Avoid specifying long messages outside the exception class
(TRY003)
165-165: Avoid specifying long messages outside the exception class
(TRY003)
181-181: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
206-206: Avoid specifying long messages outside the exception class
(TRY003)
235-237: Avoid specifying long messages outside the exception class
(TRY003)
256-258: Avoid specifying long messages outside the exception class
(TRY003)
345-345: Avoid specifying long messages outside the exception class
(TRY003)
412-412: Loop control variable fold_idx not used within loop body
Rename unused fold_idx to _fold_idx
(B007)
424-427: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
494-494: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
494-494: Use explicit conversion flag
Replace with conversion flag
(RUF010)
558-558: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
558-558: Use explicit conversion flag
Replace with conversion flag
(RUF010)
614-614: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
640-640: Abstract raise to an inner function
(TRY301)
640-640: Avoid specifying long messages outside the exception class
(TRY003)
683-683: Consider moving this statement to an else block
(TRY300)
688-688: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
688-688: Use explicit conversion flag
Replace with conversion flag
(RUF010)
762-762: Abstract raise to an inner function
(TRY301)
762-762: Avoid specifying long messages outside the exception class
(TRY003)
831-833: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
832-832: Use explicit conversion flag
Replace with conversion flag
(RUF010)
847-847: Consider moving this statement to an else block
(TRY300)
849-849: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
849-849: Use explicit conversion flag
Replace with conversion flag
(RUF010)
916-916: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
916-916: Use explicit conversion flag
Replace with conversion flag
(RUF010)
```python
if isinstance(size_spec, float):
    # if not in range of 0 to 1, raise error
    if size_spec <= 0 or size_spec > 1:
        raise ValueError(
            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
        )
    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
    return max(1, int((size_spec) * total_samples))
# Treat as absolute number
return min(size_spec, total_samples)
```
Fix subset percentage normalization
ProcessingConfig.__post_init__ explicitly allows float subset_sizes up to 100 to represent percentages, but calculate_subset_size then rejects anything greater than 1. A config such as subset_sizes=[10.0] sails through validation yet explodes here, so the CLI fails on a perfectly valid 10% request. Normalize values >1 to fractions (and keep the validation aligned) so we honor the documented percentage semantics and avoid the runtime error.
```diff
 if isinstance(size_spec, float):
-    # if not in range of 0 to 1, raise error
-    if size_spec <= 0 or size_spec > 1:
-        raise ValueError(
-            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-        )
-    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-    return max(1, int((size_spec) * total_samples))
+    percentage = size_spec / 100 if size_spec > 1 else size_spec
+    if percentage <= 0 or percentage > 1:
+        raise ValueError(
+            "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
+        )
+    return max(1, int(percentage * total_samples))
```

Make the same normalization when computing per-fold budgets so both code paths agree.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if isinstance(size_spec, float):
    percentage = size_spec / 100 if size_spec > 1 else size_spec
    if percentage <= 0 or percentage > 1:
        raise ValueError(
            "Percentage values must be between 0 (exclusive) and 100 (inclusive)"
        )
    return max(1, int(percentage * total_samples))
# Treat as absolute number
return min(size_spec, total_samples)
```
🧰 Tools
🪛 Ruff (0.14.0)
256-258: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 253 to 263, float
size_spec values greater than 1 are currently rejected even though config allows
percentages expressed as >1 (e.g., 10.0 meaning 10%); adjust
calculate_subset_size to first normalize float percentage values >1 by dividing
by 100 (so 10.0 -> 0.10), then validate the resulting fraction is >0 and <=1 and
compute the sample count; also apply the same normalization logic in the
per-fold budget computation path so both code paths treat percentage semantics
consistently.
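To make the suggested semantics concrete, here is a standalone sketch of the normalized calculation with a few worked examples (simplified from the suggestion above; not the PR's exact code):

```python
def calculate_subset_size(size_spec, total_samples: int) -> int:
    """Floats are percentages (0-1 fractions or 1-100 percents); ints are absolute counts."""
    if isinstance(size_spec, float):
        # Normalize 10.0 -> 0.10 so both spellings of "10%" are accepted.
        percentage = size_spec / 100 if size_spec > 1 else size_spec
        if percentage <= 0 or percentage > 1:
            raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
        return max(1, int(percentage * total_samples))
    return min(size_spec, total_samples)

assert calculate_subset_size(10.0, 2000) == 200   # 10% of 2000
assert calculate_subset_size(0.5, 2000) == 1000   # 50% of 2000
assert calculate_subset_size(5000, 2000) == 2000  # absolute count, capped at dataset size
```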
```python
for fold_idx, fold_indices in gpu_folds_info:
    try:
        logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")

        fold_embeddings = embeddings[fold_indices].to(device)

        logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
        max_sim_mat = compute_pairwise_dense(
            fold_embeddings,
            batch_size=50000,
            metric="cosine",
            device=device,
            scaling="additive",
        )
        similarity_matrix = max_sim_mat.cpu().numpy()

        subsets = {}
        ds_func = FacilityLocationFunction(
            n=similarity_matrix.shape[0],
            sijs=similarity_matrix,
            mode="dense",
            separate_rep=False,
        )

        for size_spec in subset_sizes:
            if isinstance(size_spec, float):
                # Percentage-based selection
                budget = max(
                    1, math.ceil(size_spec * similarity_matrix.shape[0])
                )
            else:
                # Absolute number-based selection
                budget = max(
                    1,
                    math.ceil(
                        size_spec * (similarity_matrix.shape[0] / total_samples)
                    ),
                )

            logger.info(
                f"Selecting subset of size {budget} for fold {fold_idx + 1}"
            )

            subset_result = ds_func.maximize(
                budget=budget,
                optimizer="LazierThanLazyGreedy",
                epsilon=epsilon,
                stopIfZeroGain=False,
                stopIfNegativeGain=False,
                verbose=False,
            )

            subset_indices = [fold_indices[x[0]] for x in subset_result]
            subset_gains = [x[1] for x in subset_result]
            subsets[size_spec] = {
                "indices": subset_indices,
                "gains": subset_gains,
            }
```
Skip empty folds and cap per-fold budgets
When len(embeddings) < num_folds we manufacture empty folds. Likewise, requesting an absolute subset larger than the dataset (e.g. subset_sizes=[500] on a 200-row corpus) makes the fold-level budget exceed the fold cardinality. In both cases FacilityLocationFunction.maximize is asked to pick at least one item from a fold that has zero (or fewer than budget) candidates, and we crash. Guard the empty folds and clamp budgets to the fold size so the optimizer only runs on feasible sets.
```diff
-for fold_idx, fold_indices in gpu_folds_info:
+for fold_idx, fold_indices in gpu_folds_info:
+    fold_size = len(fold_indices)
+    if fold_size == 0:
+        logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+        continue
     try:
@@
-    for size_spec in subset_sizes:
+    for size_spec in subset_sizes:
         if isinstance(size_spec, float):
-            # Percentage-based selection
-            budget = max(
-                1, math.ceil(size_spec * similarity_matrix.shape[0])
-            )
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            budget = max(1, math.ceil(percentage * fold_size))
         else:
             # Absolute number-based selection
-            budget = max(
-                1,
-                math.ceil(
-                    size_spec * (similarity_matrix.shape[0] / total_samples)
-                ),
-            )
+            budget = max(
+                1,
+                math.ceil(size_spec * (fold_size / total_samples)),
+            )
+        budget = min(budget, fold_size)
+        if budget == 0:
+            continue
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for fold_idx, fold_indices in gpu_folds_info:
    fold_size = len(fold_indices)
    if fold_size == 0:
        logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
        continue
    try:
        logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
        fold_embeddings = embeddings[fold_indices].to(device)
        logger.info(f"Computing similarity matrix for fold {fold_idx + 1}")
        max_sim_mat = compute_pairwise_dense(
            fold_embeddings,
            batch_size=50000,
            metric="cosine",
            device=device,
            scaling="additive",
        )
        similarity_matrix = max_sim_mat.cpu().numpy()
        subsets = {}
        ds_func = FacilityLocationFunction(
            n=similarity_matrix.shape[0],
            sijs=similarity_matrix,
            mode="dense",
            separate_rep=False,
        )
        for size_spec in subset_sizes:
            if isinstance(size_spec, float):
                percentage = size_spec / 100 if size_spec > 1 else size_spec
                budget = max(1, math.ceil(percentage * fold_size))
            else:
                # Absolute number-based selection
                budget = max(
                    1,
                    math.ceil(size_spec * (fold_size / total_samples)),
                )
            budget = min(budget, fold_size)
            if budget == 0:
                continue
            logger.info(
                f"Selecting subset of size {budget} for fold {fold_idx + 1}"
            )
            subset_result = ds_func.maximize(
                budget=budget,
                optimizer="LazierThanLazyGreedy",
                epsilon=epsilon,
                stopIfZeroGain=False,
                stopIfNegativeGain=False,
                verbose=False,
            )
            subset_indices = [fold_indices[x[0]] for x in subset_result]
            subset_gains = [x[1] for x in subset_result]
            subsets[size_spec] = {
                "indices": subset_indices,
                "gains": subset_gains,
            }
```
```python
distance = torch.cdist(a, b)
squared_distance = distance**2
avg_dist = torch.mean(squared_distance)
torch.div(squared_distance, kw * avg_dist, out=squared_distance)
torch.exp(-squared_distance, out=squared_distance)
return squared_distance
```
Clamp the RBF scaling factor to avoid NaNs
When all embeddings in a batch are identical (or nearly so), squared_distance becomes 0 everywhere, making avg_dist zero. Dividing by kw * avg_dist (Line 126) therefore performs a 0/0, yielding NaNs that propagate through the similarity matrix and break downstream selection. Clamp the scale away from zero before dividing.
if metric == "rbf":
distance = torch.cdist(a, b)
squared_distance = distance**2
- avg_dist = torch.mean(squared_distance)
- torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+ avg_dist = torch.mean(squared_distance)
+ eps = torch.finfo(squared_distance.dtype).eps
+ scale = kw * torch.clamp(avg_dist, min=eps)
+ torch.div(squared_distance, scale, out=squared_distance)
torch.exp(-squared_distance, out=squared_distance)
return squared_distance📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if metric == "rbf":
    distance = torch.cdist(a, b)
    squared_distance = distance**2
    avg_dist = torch.mean(squared_distance)
    eps = torch.finfo(squared_distance.dtype).eps
    scale = kw * torch.clamp(avg_dist, min=eps)
    torch.div(squared_distance, scale, out=squared_distance)
    torch.exp(-squared_distance, out=squared_distance)
    return squared_distance
```
🤖 Prompt for AI Agents
In scripts/subset_selection/utils/subset_selection_utils.py around lines 122 to
127, the RBF scaling divides by kw * avg_dist which can be zero for identical
embeddings causing 0/0 NaNs; clamp avg_dist away from zero before dividing (e.g.
compute eps = torch.finfo(squared_distance.dtype).eps * squared_distance.numel()
or a small constant like 1e-12 on the same device/dtype, then use
avg_dist_clamped = torch.clamp_min(avg_dist, eps) or max(avg_dist, eps) and
divide by kw * avg_dist_clamped) so the division cannot produce NaN and retains
device/dtype consistency.
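A quick standalone repro of the failure mode and the clamped fix (illustrative only, with kw = 1.0):

```python
import torch

emb = torch.ones(4, 8)            # identical embeddings
sq = torch.cdist(emb, emb) ** 2   # all-zero squared distances
print(sq / torch.mean(sq))        # 0/0 -> every entry is NaN

eps = torch.finfo(sq.dtype).eps
scale = 1.0 * torch.clamp(torch.mean(sq), min=eps)  # clamp denominator away from zero
print(torch.exp(-(sq / scale)))   # all ones, no NaN
```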
Actionable comments posted: 4
🧹 Nitpick comments (4)
scripts/subset_selection/README.md (3)

103-103: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting and to comply with Markdown best practices. Apply this diff:

````diff
-```
+```text
 Required:
   --input <file> [<file> ...]   Input file(s) to process (JSONL, JSON, CSV, Parquet)
   --subset-sizes <sizes>        Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")
````

Based on static analysis hints.

207-207: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting. Apply this diff:

````diff
-```
+```text
 scripts/
 ├── __init__.py          # Top-level package initialization
 └── subset_selection/
````

Based on static analysis hints.

270-270: Add language specifier to code block.

The code block should specify a language for proper syntax highlighting. Apply this diff:

````diff
-```
+```text
 RuntimeError: Cannot re-initialize CUDA in forked subprocess
````

Based on static analysis hints.

scripts/subset_selection/cli.py (1)

124-124: Remove unnecessary f-string prefix.

The string has no placeholders, so the `f` prefix is unnecessary. Apply this diff:

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/cli.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
scripts/subset_selection/cli.py (1)
- scripts/subset_selection/subset_selection.py (1): subset_datasets (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
103-103: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
207-207: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
270-270: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt
[CRITICAL] 1-1: torch 2.5.1+cu121: undefined
(PYSEC-2025-41)
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability
🪛 Ruff (0.14.0)
scripts/subset_selection/cli.py
1-1: Shebang is present but file is not executable
(EXE001)
124-124: f-string without any placeholders
Remove extraneous f prefix
(F541)
155-155: Consider moving this statement to an else block
(TRY300)
156-156: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (10)
scripts/subset_selection/README.md (7)

1-58: LGTM! Clear installation and model setup documentation.

The overview, installation instructions, and model setup options are well-documented and provide clear guidance for users.

59-100: LGTM! Comprehensive CLI usage examples.

The CLI usage examples cover a wide range of scenarios and provide clear guidance for different use cases.

122-161: LGTM! Clear Python API documentation.

The Python API examples are well-structured and demonstrate both basic and advanced usage patterns.

162-204: LGTM! Thorough configuration documentation.

The configuration section provides detailed parameter descriptions with helpful recommendations based on dataset size.

224-243: LGTM! Clear documentation of encoders and outputs.

The supported encoders and output files sections are concise and informative.

280-312: LGTM! Comprehensive troubleshooting guidance.

The troubleshooting section provides clear solutions for common issues users may encounter.

313-328: LGTM! Helpful notes for users.

The notes section provides important context about dataset size recommendations, GPU requirements, and performance considerations.
scripts/subset_selection/cli.py (3)
13-101: LGTM! Well-structured argument parsing.

The argument parsing is comprehensive with clear help text and appropriate defaults for all parameters.
103-122: LGTM! Proper multiprocessing setup and argument parsing.

The multiprocessing setup for CUDA compatibility is appropriate, and the subset size parsing correctly handles both percentage and absolute count formats (see the sketch below).
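For reference, a parser matching that behavior could look like the sketch below (the helper name and exact implementation in cli.py are not shown in this review, so treat this as an illustration):

```python
def parse_subset_sizes(spec: str) -> list:
    """Parse '0.1,0.5' or '1000,5000' into floats (percentages) or ints (absolute counts)."""
    sizes = []
    for token in spec.split(","):
        token = token.strip()
        # A decimal point marks a percentage float; otherwise an absolute int count.
        sizes.append(float(token) if "." in token else int(token))
    return sizes

assert parse_subset_sizes("0.1,0.5") == [0.1, 0.5]
assert parse_subset_sizes("1000,5000") == [1000, 5000]
```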
147-162: LGTM! Appropriate exception handling for CLI entry point.

The broad exception handling is acceptable for a CLI entry point, as it ensures user-friendly error messages are displayed regardless of the error type. The function properly returns exit codes for success (0) and failure (1).
@RobuRishabh at a high level I can tell this took a lot of work and it looks great to me. I really like how you got rid of the config file. At a high level there are a few small things I would change about this PR:
Force-pushed from ae39d24 to e69bef8 (Compare)
Actionable comments posted: 4
♻️ Duplicate comments (6)
scripts/subset_selection/requirements.txt (2)
38-49: Blocker: CUDA-only wheels and pinned +cu121 builds break CPU installs and CI

The nvidia-* CUDA wheels and +cu121 variants force a CUDA 12.1 Linux stack; pip fails on CPU-only machines and many CI runners. Make the default requirements CPU-friendly and move CUDA wheels to an optional GPU requirements file or extras. Also drop the explicit `triton` pin (bundled by torch) to avoid resolver conflicts.

Suggested approach:

- requirements.txt (CPU-default): remove lines 38–49 and use generic torch/vision/audio wheels.
- requirements-gpu-cu121.txt: contain the CUDA builds and instruct installing with the PyTorch extra index.

Example edits (CPU default):

```diff
-nvidia-cublas-cu12==12.1.3.1
-nvidia-cuda-cupti-cu12==12.1.105
-nvidia-cuda-nvrtc-cu12==12.1.105
-nvidia-cuda-runtime-cu12==12.1.105
-nvidia-cudnn-cu12==9.1.0.70
-nvidia-cufft-cu12==11.0.2.54
-nvidia-curand-cu12==10.3.2.106
-nvidia-cusolver-cu12==11.4.5.107
-nvidia-cusparse-cu12==12.1.0.106
-nvidia-nccl-cu12==2.21.5
-nvidia-nvjitlink-cu12==12.9.86
-nvidia-nvtx-cu12==12.1.105
-torch==2.5.1+cu121
-torchaudio==2.5.1+cu121
-torchvision==0.20.1+cu121
-triton==3.1.0
+torch==2.5.1
+torchaudio==2.5.1
+torchvision==0.20.1
```

Provide a GPU file with the CUDA-tagged trio and document:
`pip install -r requirements-gpu-cu121.txt --extra-index-url https://download.pytorch.org/whl/cu121`.

Please confirm the CPU path installs on a clean Python 3.12 environment and that GPU users can install via the GPU file.
Also applies to: 70-76
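For illustration, the optional GPU file could carry just the CUDA-tagged pins quoted above (a sketch of the proposed split, not a tested set):

```text
# requirements-gpu-cu121.txt (sketch)
# Install with:
#   pip install -r requirements-gpu-cu121.txt --extra-index-url https://download.pytorch.org/whl/cu121
torch==2.5.1+cu121
torchaudio==2.5.1+cu121
torchvision==0.20.1+cu121
```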
70-72: Security: pinned PyTorch build flagged by multiple critical advisories

`torch==2.5.1+cu121` is flagged (GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Upgrade to a patched PyTorch release and align `torchvision`/`torchaudio` accordingly. If the upgrade is blocked, document mitigations and avoid `torch.load` on untrusted inputs.

What is the first PyTorch version that fixes GHSA-53q9-r3pm-6pq6 and GHSA-887c-mr87-cxwp? Provide official PyTorch release notes or advisories.

scripts/subset_selection/README.md (1)

241-244: Replace developer-specific absolute path with a project-relative path

Use a repo-relative path to avoid confusion and leaking local FS details.

```diff
-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
```

scripts/subset_selection/utils/subset_selection_utils.py (1)

122-127: Clamp RBF scale to avoid NaNs when avg_dist ≈ 0

Division by `kw * avg_dist` can produce NaNs for identical embeddings. Clamp the denominator away from zero.

```diff
-    avg_dist = torch.mean(squared_distance)
-    torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+    avg_dist = torch.mean(squared_distance)
+    eps = torch.finfo(squared_distance.dtype).eps
+    scale = kw * torch.clamp(avg_dist, min=eps)
+    torch.div(squared_distance, scale, out=squared_distance)
```

scripts/subset_selection/subset_selection.py (2)

277-285: Fix percentage handling: accept 0–100 floats and normalize

Config allows floats up to 100, but this path rejects values >1. Normalize percentages >1 to fractions and validate 0–100.

```diff
-    if isinstance(size_spec, float):
-        # if not in range of 0 to 1, raise error
-        if size_spec <= 0 or size_spec > 1:
-            raise ValueError(
-                "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-            )
-        # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-        return max(1, int((size_spec) * total_samples))
+    if isinstance(size_spec, float):
+        percentage = size_spec / 100 if size_spec > 1 else size_spec
+        if percentage <= 0 or percentage > 1:
+            raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+        return max(1, int(percentage * total_samples))
```

816-875: Robustness: skip empty folds and cap per-fold budgets

Empty folds and over-large budgets crash the optimizer. Skip zero-size folds and clamp `budget` ≤ fold size. Normalize float percentages >1 here too.

```diff
-    for fold_idx, fold_indices in gpu_folds_info:
+    for fold_idx, fold_indices in gpu_folds_info:
+        fold_size = len(fold_indices)
+        if fold_size == 0:
+            logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+            continue
@@
-        for size_spec in subset_sizes:
+        for size_spec in subset_sizes:
             if isinstance(size_spec, float):
-                # Percentage-based selection
-                budget = max(
-                    1, math.ceil(size_spec * similarity_matrix.shape[0])
-                )
+                percentage = size_spec / 100 if size_spec > 1 else size_spec
+                budget = max(1, math.ceil(percentage * fold_size))
             else:
-                # Absolute number-based selection
-                budget = max(
-                    1,
-                    math.ceil(
-                        size_spec * (similarity_matrix.shape[0] / total_samples)
-                    ),
-                )
+                budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+            budget = min(budget, fold_size)
+            if budget == 0:
+                continue
```
🧹 Nitpick comments (12)
scripts/subset_selection/README.md (3)
21-23: Document CPU vs GPU install flows

Add a CPU-friendly install (default) and an optional GPU install with the PyTorch extra index, aligning with the split requirements.

````diff
-```bash
-pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
-```
+```bash
+# CPU (default)
+pip install -r scripts/subset_selection/requirements.txt
+
+# GPU (CUDA 12.1)
+pip install -r scripts/subset_selection/requirements-gpu-cu121.txt \
+  --extra-index-url https://download.pytorch.org/whl/cu121
+```
````

99-117: Add languages to fenced code blocks for linting and readability

Several code fences lack a language identifier (MD040). Tag them with `bash`, `python`, or `text` as appropriate.

Also applies to: 203-215, 262-271

329-331: Fix bare URL formatting

Use Markdown link syntax to avoid MD034.

```diff
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
```

scripts/subset_selection/encoders/arctic_encoder.py (2)

104-110: UX: testing_mode log claims "model not found" without checking

You log "Model not found locally" unconditionally in testing mode. Check `os.path.exists(model_path)` before warning, and prefer local files even in testing mode to avoid redundant downloads.

```diff
-        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
-            logger.warning(
-                f"Model not found locally at {model_path}. "
-                "Testing mode enabled - downloading from HuggingFace..."
-            )
+        if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+            if not os.path.exists(model_path):
+                logger.warning(
+                    f"Model not found locally at {model_path}. "
+                    "Testing mode enabled - downloading from HuggingFace..."
+                )
+            else:
+                logger.info(f"Using local model path in testing mode: {model_path}")
```

Also applies to: 116-121

19-23: Dead code

`safe_print` is unused. Remove or use it.

scripts/subset_selection/utils/subset_selection_utils.py (2)

13-18: Library code shouldn't call logging.basicConfig

Configure handlers in the application, not in reusable modules. Remove `basicConfig` and rely on module loggers.

```diff
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```

34-40: Preserve tracebacks with logger.exception

Replace `logger.error(..., str(e))` with `logger.exception("...")` to keep stack traces.

```diff
-            logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+            logger.exception("GPU out of memory on attempt %d", attempt + 1)
...
-            logger.error(
-                f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-            )
+            logger.exception("PyTorch runtime error on attempt %d", attempt + 1)
...
-            logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+            logger.exception("Value error on attempt %d", attempt + 1)
...
-            logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+            logger.exception("Type error on attempt %d", attempt + 1)
...
-            logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+            logger.exception("Index error on attempt %d", attempt + 1)
...
-    raise last_exception
+    raise last_exception
```

Also applies to: 44-52, 80-82, 128-129
scripts/subset_selection/subset_selection.py (4)
361-369: Use Pool context manager to ensure clean teardown

Prefer `with Pool(...) as pool:` over manual close/join/terminate.

```diff
-    pool = Pool(processes=num_gpus)
-    try:
-        shard_files = pool.map(_process_dataset_shard, args_list)
-    finally:
-        pool.close()
-        pool.join()
-        pool.terminate()
+    with Pool(processes=num_gpus) as pool:
+        shard_files = pool.map(_process_dataset_shard, args_list)
```

426-433: Same: use Pool context manager here too

Mirror the pattern to avoid leaks on exceptions.

```diff
-    pool = Pool(processes=self.config.system.num_gpus)
-    try:
-        gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-    finally:
-        pool.close()
-        pool.join()
-        pool.terminate()
+    with Pool(processes=self.config.system.num_gpus) as pool:
+        gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
```
659-681: Memory pressure: avoid buffering all embeddings in RAM
`all_embeddings` holds all batches before writing, which can blow memory on large shards. Stream directly to HDF5 (a resizable dataset) or flush periodically. I can provide a streaming HDF5 writer snippet if you want to adopt it in this function.
Also applies to: 682-691, 703-717
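For reference, a minimal sketch of the streaming pattern described above, using h5py's resizable datasets (names such as `output_path`, `batches`, and `embedding_dim` are illustrative, not from the PR):

```python
import h5py
import numpy as np

def stream_embeddings_to_hdf5(output_path, batches, embedding_dim):
    """Append each batch to a resizable HDF5 dataset instead of buffering everything in RAM."""
    with h5py.File(output_path, "w") as f:
        dset = f.create_dataset(
            "embeddings",
            shape=(0, embedding_dim),
            maxshape=(None, embedding_dim),  # resizable along axis 0
            dtype="float32",
            chunks=True,
        )
        for batch in batches:  # e.g., numpy arrays yielded per encoder batch
            arr = np.asarray(batch, dtype="float32")
            dset.resize(dset.shape[0] + arr.shape[0], axis=0)
            dset[-arr.shape[0]:] = arr  # write only the new rows
```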
1105-1111: Nit: remove unused f-string prefix

`print(f"Starting subset selection...")` has no placeholders.

```diff
-    print(f"Starting subset selection...")
+    print("Starting subset selection...")
```

scripts/subset_selection/__main__.py (1)

6-6: Use a relative intra-package import.

More robust within the package and avoids relying on the top-level package name.

```diff
-from scripts.subset_selection.subset_selection import main
+from .subset_selection import main
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- scripts/subset_selection/.gitignore (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__main__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
- scripts/subset_selection/subset_selection.py (1): main (1081-1139)

scripts/subset_selection/encoders/arctic_encoder.py (1)
- scripts/subset_selection/subset_selection.py (1): EncoderConfig (106-117)

scripts/subset_selection/subset_selection.py (2)
- scripts/subset_selection/encoders/arctic_encoder.py (3): ArcticEmbedEncoder (58-204), EncoderConfig (47-55), encode (170-204)
- scripts/subset_selection/utils/subset_selection_utils.py (3): compute_pairwise_dense (86-147), get_default_num_gpus (66-83), retry_on_exception (20-63)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
99-99: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
203-203: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
262-262: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
327-327: Bare URL used
(MD034, no-bare-urls)
🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt
[CRITICAL] 1-1: torch 2.5.1+cu121: undefined
(PYSEC-2025-41)
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch susceptible to local Denial of Service
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch: torch.load with weights_only=True leads to remote code execution
[CRITICAL] 1-1: torch 2.5.1+cu121: PyTorch Improper Resource Shutdown or Release vulnerability
🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 86-86: Too many positional arguments (7/5)
(R0917)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 59-59: Too many positional arguments (6/5)
(R0917)
[refactor] 58-58: Too few public methods (1/2)
(R0903)
scripts/subset_selection/subset_selection.py
[refactor] 362-362: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 426-426: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 613-613: Too many statements (57/50)
(R0915)
[refactor] 786-786: Too many branches (16/12)
(R0912)
[refactor] 786-786: Too many statements (53/50)
(R0915)
🪛 Ruff (0.14.1)
scripts/subset_selection/utils/subset_selection_utils.py
34-34: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
34-34: Use explicit conversion flag
Replace with conversion flag
(RUF010)
38-40: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
39-39: Use explicit conversion flag
Replace with conversion flag
(RUF010)
44-44: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
44-44: Use explicit conversion flag
Replace with conversion flag
(RUF010)
48-48: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
48-48: Use explicit conversion flag
Replace with conversion flag
(RUF010)
52-52: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
52-52: Use explicit conversion flag
Replace with conversion flag
(RUF010)
80-82: Avoid specifying long messages outside the exception class
(TRY003)
128-128: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/__main__.py
1-1: Shebang is present but file is not executable
(EXE001)
scripts/subset_selection/encoders/arctic_encoder.py
69-71: Avoid specifying long messages outside the exception class
(TRY003)
117-120: Avoid specifying long messages outside the exception class
(TRY003)
148-151: Avoid specifying long messages outside the exception class
(TRY003)
161-164: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/subset_selection.py
53-56: Abstract raise to an inner function
(TRY301)
53-56: Avoid specifying long messages outside the exception class
(TRY003)
59-59: Avoid specifying long messages outside the exception class
(TRY003)
59-59: Use explicit conversion flag
Replace with conversion flag
(RUF010)
83-83: Avoid specifying long messages outside the exception class
(TRY003)
179-179: Prefer TypeError exception for invalid type
(TRY004)
179-179: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Prefer TypeError exception for invalid type
(TRY004)
183-183: Avoid specifying long messages outside the exception class
(TRY003)
185-187: Avoid specifying long messages outside the exception class
(TRY003)
189-189: Avoid specifying long messages outside the exception class
(TRY003)
205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
230-230: Avoid specifying long messages outside the exception class
(TRY003)
259-261: Avoid specifying long messages outside the exception class
(TRY003)
280-282: Avoid specifying long messages outside the exception class
(TRY003)
374-374: Avoid specifying long messages outside the exception class
(TRY003)
458-461: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
528-528: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
528-528: Use explicit conversion flag
Replace with conversion flag
(RUF010)
592-592: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
592-592: Use explicit conversion flag
Replace with conversion flag
(RUF010)
648-648: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
674-674: Abstract raise to an inner function
(TRY301)
674-674: Avoid specifying long messages outside the exception class
(TRY003)
727-727: Consider moving this statement to an else block
(TRY300)
732-732: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
732-732: Use explicit conversion flag
Replace with conversion flag
(RUF010)
810-810: Abstract raise to an inner function
(TRY301)
810-810: Avoid specifying long messages outside the exception class
(TRY003)
879-881: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
880-880: Use explicit conversion flag
Replace with conversion flag
(RUF010)
900-900: Consider moving this statement to an else block
(TRY300)
902-902: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
902-902: Use explicit conversion flag
Replace with conversion flag
(RUF010)
973-973: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
973-973: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1105-1105: f-string without any placeholders
Remove extraneous f prefix
(F541)
1136-1136: Consider moving this statement to an else block
(TRY300)
1137-1137: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (2)
scripts/subset_selection/requirements.txt (1)
35-37: Sanity-check numeric stack in isolation
- Use `python3` (matching CI) instead of `python3.12` to avoid interpreter mismatches
- Exclude GPU-specific PyTorch wheels (`torch==2.5.1+cu121`) when verifying core numeric libs, e.g.:

```bash
grep -Ev '^torch==' scripts/subset_selection/requirements.txt > tmp-reqs.txt
```

- Run:

```bash
python3 -m venv /tmp/odh-reqs && source /tmp/odh-reqs/bin/activate
pip install --upgrade pip
pip install -r tmp-reqs.txt
python - <<'PY'
import numpy, scipy, sklearn, numba, llvmlite; print("OK")
PY
```

scripts/subset_selection/__main__.py (1)
8-10: Entry point looks good. Document invocation.

LGTM. Please ensure the README shows usage (`python -m scripts.subset_selection ...`) so users don't try to run the file directly.
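For example, an invocation combining that entry point with the flags quoted from the README earlier might look like this (file names are placeholders):

```bash
python -m scripts.subset_selection \
  --input train.jsonl extra.jsonl \
  --subset-sizes "0.1,0.5"
```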
```python
# In testing mode, allow direct download from HuggingFace
if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
    logger.warning(
        f"Model not found locally at {model_path}. "
        "Testing mode enabled - downloading from HuggingFace..."
    )
    self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
    self.model = AutoModel.from_pretrained(
        self.cfg.model_name,
        add_pooling_layer=False,
        trust_remote_code=True,
    )
else:
```
Security posture: unrestricted remote code execution during model load
trust_remote_code=True allows arbitrary code from model repos. For reproducibility and safety, pin a specific revision (commit hash) and maintain an allowlist of approved repos; set trust_remote_code only for those.
```diff
-            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-            self.model = AutoModel.from_pretrained(
-                self.cfg.model_name,
+            revision = os.environ.get("ARCTIC_MODEL_REV", None)
+            self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+            self.model = AutoModel.from_pretrained(
+                self.cfg.model_name,
                 add_pooling_layer=False,
-                trust_remote_code=True,
+                trust_remote_code=True,  # allowlisted repo
+                revision=revision,
             )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# In testing mode, allow direct download from HuggingFace
if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
    logger.warning(
        f"Model not found locally at {model_path}. "
        "Testing mode enabled - downloading from HuggingFace..."
    )
    revision = os.environ.get("ARCTIC_MODEL_REV", None)
    self.tokenizer = AutoTokenizer.from_pretrained(
        self.cfg.model_name,
        revision=revision,
    )
    self.model = AutoModel.from_pretrained(
        self.cfg.model_name,
        add_pooling_layer=False,
        trust_remote_code=True,  # allowlisted repo
        revision=revision,
    )
else:
```
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 103 to 115,
the code loads a remote HF model with trust_remote_code=True which permits
arbitrary remote code execution; change the loader to require a pinned revision
(pass revision=<commit hash> from config) and only enable trust_remote_code when
the model repo is on an approved allowlist. Concretely: add a cfg field for
revision (fail if missing in testing_mode), add or read an allowlist of approved
model IDs, check that cfg.model_name is in that allowlist before setting
trust_remote_code=True (otherwise set it False), and pass
revision=self.cfg.revision to both AutoTokenizer.from_pretrained and
AutoModel.from_pretrained; log a warning and abort/raise if testing_mode
requests an unapproved repo or missing revision.
```python
encodings = self.tokenizer(
    inputs,
    max_length=self.cfg.model_config["max_length"],
    padding=True,
    truncation=True,
    return_tensors="pt",
).to(self.cfg.device)

embeddings_list = []
for i in tqdm(
    range(0, len(inputs), self.cfg.batch_size),
    disable=not show_progress or len(inputs) < 256,
):
    batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
    outputs = self.model(**batch)
    # Take the first token embedding (CLS) and normalize it
    embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
    embeddings_list.append(embeddings.cpu())
```
Critical: pre-tokenizing all inputs then moving to GPU will OOM on large batches
self.tokenizer(...).to(self.cfg.device) tokenizes the entire input list and moves all tensors to GPU, defeating batching. Tokenize and move per batch inside the loop.
```diff
-        encodings = self.tokenizer(
-            inputs,
-            max_length=self.cfg.model_config["max_length"],
-            padding=True,
-            truncation=True,
-            return_tensors="pt",
-        ).to(self.cfg.device)
-
-        embeddings_list = []
-        for i in tqdm(
+        embeddings_list = []
+        for i in tqdm(
             range(0, len(inputs), self.cfg.batch_size),
             disable=not show_progress or len(inputs) < 256,
         ):
-            batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+            batch_texts = inputs[i : i + self.cfg.batch_size]
+            enc = self.tokenizer(
+                batch_texts,
+                max_length=self.cfg.model_config["max_length"],
+                padding=True,
+                truncation=True,
+                return_tensors="pt",
+            )
+            batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
             outputs = self.model(**batch)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
embeddings_list = []
for i in tqdm(
    range(0, len(inputs), self.cfg.batch_size),
    disable=not show_progress or len(inputs) < 256,
):
    batch_texts = inputs[i : i + self.cfg.batch_size]
    enc = self.tokenizer(
        batch_texts,
        max_length=self.cfg.model_config["max_length"],
        padding=True,
        truncation=True,
        return_tensors="pt",
    )
    batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
    outputs = self.model(**batch)
    # Take the first token embedding (CLS) and normalize it
    embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
    embeddings_list.append(embeddings.cpu())
```
🤖 Prompt for AI Agents
In scripts/subset_selection/encoders/arctic_encoder.py around lines 181 to 199,
the code currently tokenizes the entire inputs list and calls
.to(self.cfg.device) on the resulting tensors which moves all tokenized data to
GPU and can OOM for large datasets; change to tokenizing inside the loop per
batch and only move that batch's tensors to the device. Specifically, remove the
single pre-tokenize call, and inside the tqdm loop build a slice of raw inputs,
call self.tokenizer(batch_inputs, max_length=..., padding=True, truncation=True,
return_tensors="pt") and then call .to(self.cfg.device) on that batch dict
before passing to model; keep collecting normalized embeddings.cpu() into
embeddings_list as before.
```python
extension = input_file.split(".")[-1]
if extension in ["json", "jsonl"]:
    subset_data.to_json(output_file, orient="records", lines=True)
elif extension == "csv":
    subset_data.to_csv(output_file, index=False)
elif extension == "parquet":
    subset_data.to_parquet(output_file)
```
Write JSON vs JSONL correctly
to_json(..., lines=True) produces JSONL. Only set lines=True for .jsonl; use lines=False for .json.
```diff
-if extension in ["json", "jsonl"]:
-    subset_data.to_json(output_file, orient="records", lines=True)
+if extension == "jsonl":
+    subset_data.to_json(output_file, orient="records", lines=True)
+elif extension == "json":
+    subset_data.to_json(output_file, orient="records", lines=False)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
extension = input_file.split(".")[-1]
if extension == "jsonl":
    subset_data.to_json(output_file, orient="records", lines=True)
elif extension == "json":
    subset_data.to_json(output_file, orient="records", lines=False)
elif extension == "csv":
    subset_data.to_csv(output_file, index=False)
elif extension == "parquet":
    subset_data.to_parquet(output_file)
```
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py around lines 604 to 611, the
code always calls to_json(..., lines=True) which writes JSONL; change the logic
to write JSONL only when the extension is ".jsonl" and write normal JSON for
".json" (i.e., call to_json(..., lines=False) or omit lines), keeping
orient="records" for both if desired; update the extension check to distinguish
"json" vs "jsonl" and call to_json with lines=False for ".json" and lines=True
for ".jsonl".
Force-pushed from e69bef8 to 1d0eb3c (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
notebooks/use-cases/document-conversion-vlm.ipynb (1)
60-65: Fix undefined variable in exception example.

The new try/except snippet calls `converter.convert(file_path)` and logs `file_path`, but in this notebook `file_path` isn't defined until after `convert` succeeds. Copying the snippet as-is will raise a `NameError`. Please either call `converter.convert(file)` (and log `file`) or show how to define `file_path` before the conversion.

```diff
-result = converter.convert(file_path)
+result = converter.convert(file)
 ...
-print(f"❌ Failed to convert {file_path}: {str(e)}")
+print(f"❌ Failed to convert {file}: {e}")
```

notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)
485-488: Pin revision and restrict trust_remote_code when loading HF model.Unpinned trust_remote_code=True permits arbitrary code execution and non-reproducible results. Pin a commit and optionally enforce an allowlist; set device explicitly.
Replace the model init cell with:
import os, torch ALLOWED = {"Snowflake/snowflake-arctic-embed-l"} REV = os.environ.get("ARCTIC_MODEL_REV") # set to pinned commit hash if REV is None: raise ValueError("Set ARCTIC_MODEL_REV to a pinned commit for reproducibility.") embedding_model = SentenceTransformer( "Snowflake/snowflake-arctic-embed-l", trust_remote_code=("Snowflake/snowflake-arctic-embed-l" in ALLOWED), revision=REV, device="cuda" if torch.cuda.is_available() else "cpu", )
♻️ Duplicate comments (7)
scripts/subset_selection/__main__.py (1)
1-1: Drop the shebang (module entrypoint).

Remove the shebang to avoid Ruff EXE001; this module is invoked via `python -m`.

```diff
-#!/usr/bin/env python3
```

scripts/subset_selection/encoders/arctic_encoder.py (2)
104-116: Harden HF model loading: pin revision and restrict trust_remote_code.

Remote loading with trust_remote_code=True without a pinned revision allows arbitrary code execution and non-reproducible runs. Require a pinned revision and only allow trust_remote_code for an explicit allowlist.

```diff
-# In testing mode, allow direct download from HuggingFace
-if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+# In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+ALLOWLIST = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+revision = os.environ.get("ARCTIC_MODEL_REV", None)
+if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
     logger.warning(
         f"Model not found locally at {model_path}. "
         "Testing mode enabled - downloading from HuggingFace..."
     )
-    self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
+    if revision is None:
+        raise ValueError("ARCTIC_MODEL_REV must be set to a pinned commit hash when testing_mode=True")
+    self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
     self.model = AutoModel.from_pretrained(
         self.cfg.model_name,
         add_pooling_layer=False,
-        trust_remote_code=True,
+        trust_remote_code=(self.cfg.model_name in ALLOWLIST),
+        revision=revision,
     )
@@
 self.tokenizer = AutoTokenizer.from_pretrained(model_path)
 self.model = AutoModel.from_pretrained(
     model_path,
     add_pooling_layer=False,
-    trust_remote_code=True,
+    trust_remote_code=(self.cfg.model_name in ALLOWLIST),
     local_files_only=True,
 )
```

Also applies to: 123-129
182-189: Avoid OOM: tokenize and move to device per-batch, not for the entire corpus.

Current code pre-tokenizes and moves all tensors to GPU, which will OOM on large inputs. Tokenize inside the loop and move only that batch.

```diff
-encodings = self.tokenizer(
-    inputs,
-    max_length=self.cfg.model_config["max_length"],
-    padding=True,
-    truncation=True,
-    return_tensors="pt",
-).to(self.cfg.device)
-
 embeddings_list = []
 for i in tqdm(
     range(0, len(inputs), self.cfg.batch_size),
     disable=not show_progress or len(inputs) < 256,
 ):
-    batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-    outputs = self.model(**batch)
+    batch_texts = inputs[i : i + self.cfg.batch_size]
+    enc = self.tokenizer(
+        batch_texts,
+        max_length=self.cfg.model_config["max_length"],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
+    batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+    outputs = self.model(**batch)
```

Also applies to: 191-199
scripts/subset_selection/utils/subset_selection_utils.py (1)
123-128: Clamp RBF scale to avoid 0/0 NaNs.

When all vectors are identical, avg_dist=0 leads to NaNs. Clamp before division.

```diff
 if metric == "rbf":
     distance = torch.cdist(a, b)
     squared_distance = distance**2
-    avg_dist = torch.mean(squared_distance)
-    torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+    avg_dist = torch.mean(squared_distance)
+    eps = torch.finfo(squared_distance.dtype).eps
+    scale = kw * torch.clamp(avg_dist, min=eps)
+    torch.div(squared_distance, scale, out=squared_distance)
     torch.exp(-squared_distance, out=squared_distance)
     return squared_distance
```

scripts/subset_selection/subset_selection.py (3)
278-286: Fix percentage normalization (floats >1 should mean percent).

Config allows floats up to 100, but here values >1 are rejected. Normalize >1 by /100 and validate.

```diff
-if isinstance(size_spec, float):
-    # if not in range of 0 to 1, raise error
-    if size_spec <= 0 or size_spec > 1:
-        raise ValueError(
-            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-        )
-    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-    return max(1, int((size_spec) * total_samples))
+if isinstance(size_spec, float):
+    percentage = size_spec / 100 if size_spec > 1 else size_spec
+    if percentage <= 0 or percentage > 1:
+        raise ValueError("Percentage values must be between 0 (exclusive) and 100 (inclusive)")
+    return max(1, int(percentage * total_samples))
```
831-889: Skip empty folds and cap per-fold budgets to fold size.

Empty folds and oversized budgets crash the optimizer. Guard folds and clamp budgets; also normalize float percentages >1 here.

```diff
 for fold_idx, fold_indices in gpu_folds_info:
     try:
         logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
         fold_embeddings = embeddings[fold_indices].to(device)
+        fold_size = fold_embeddings.shape[0]
+        if fold_size == 0:
+            logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+            continue
@@
         for size_spec in subset_sizes:
             if isinstance(size_spec, float):
-                # Percentage-based selection
-                budget = max(
-                    1, math.ceil(size_spec * similarity_matrix.shape[0])
-                )
+                percentage = size_spec / 100 if size_spec > 1 else size_spec
+                budget = max(1, math.ceil(percentage * fold_size))
             else:
-                # Absolute number-based selection
-                budget = max(
-                    1,
-                    math.ceil(
-                        size_spec * (similarity_matrix.shape[0] / total_samples)
-                    ),
-                )
+                budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+            budget = min(budget, fold_size)
+            if budget == 0:
+                continue
```
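To make the proposed budget arithmetic concrete, a small standalone sketch with hypothetical numbers (the helper name is illustrative):

```python
import math

def fold_budget(size_spec, fold_size: int, total_samples: int) -> int:
    """Per-fold budget under the proposed rules: floats are fractions
    (values > 1 are percentages), ints are absolute sample counts."""
    if isinstance(size_spec, float):
        fraction = size_spec / 100 if size_spec > 1 else size_spec
        budget = max(1, math.ceil(fraction * fold_size))
    else:
        budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
    return min(budget, fold_size)  # never ask for more than the fold holds

# e.g. an absolute spec of 5000 samples over a 200-sample fold of a 10000-sample set
assert fold_budget(5000, fold_size=200, total_samples=10_000) == 100
assert fold_budget(0.5, fold_size=200, total_samples=10_000) == 100
assert fold_budget(90.0, fold_size=200, total_samples=10_000) == 180
```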
610-617: Write JSON vs JSONL correctly.

Use lines=True only for .jsonl; write standard JSON for .json.

```diff
-if extension in ["json", "jsonl"]:
-    subset_data.to_json(output_file, orient="records", lines=True)
+if extension == "jsonl":
+    subset_data.to_json(output_file, orient="records", lines=True)
+elif extension == "json":
+    subset_data.to_json(output_file, orient="records", lines=False)
 elif extension == "csv":
     subset_data.to_csv(output_file, index=False)
 elif extension == "parquet":
     subset_data.to_parquet(output_file)
```
🧹 Nitpick comments (6)
scripts/subset_selection/encoders/arctic_encoder.py (1)
75-77: Optional: support Apple Metal (MPS) device.

Improve device fallback to prefer MPS on Apple if CUDA is unavailable.

```diff
-self.device = device or torch.device(
-    "cuda" if torch.cuda.is_available() else "cpu"
-)
+if device is not None:
+    self.device = device
+else:
+    if torch.cuda.is_available():
+        self.device = torch.device("cuda")
+    elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
+        self.device = torch.device("mps")
+    else:
+        self.device = torch.device("cpu")
```

notebooks/tutorials/dataset-preparation-for-RAG.ipynb (1)
43-44: Reproducibility: pin package versions (or use a constraints file).

Unpinned installs can break the notebook over time. Pin docling and sentence-transformers (you already pin pymilvus).

Example:

```text
%pip install -qq "docling==2.43.0" "sentence-transformers==3.2.1" "numpy==2.0.2" "pymilvus[milvus_lite]==2.6.2"
```

scripts/subset_selection/utils/subset_selection_utils.py (2)
107-107: Preserve tensor dtype in results.

Initialize results with the input dtype to avoid unintended downcasting.

```diff
-results = torch.zeros(n_samples1, n_samples2, device="cpu")
+results = torch.zeros(n_samples1, n_samples2, device="cpu", dtype=tensor1.dtype)
```
35-35: Use logger.exception to keep stack traces on retries.

Retain traceback for failed attempts to aid debugging.

```diff
-logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-logger.error(
-    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-)
+logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Index error on attempt %d: %s", attempt + 1, e)
```

Also applies to: 39-41, 45-45, 49-49, 53-53
scripts/subset_selection/subset_selection.py (2)
365-372: Use context managers for Pools to avoid resource leaks.

with Pool(...) ensures proper cleanup even on exceptions.

```diff
-pool = Pool(processes=num_gpus)
-try:
-    shard_files = pool.map(_process_dataset_shard, args_list)
-finally:
-    pool.close()
-    pool.join()
-    pool.terminate()
+with Pool(processes=num_gpus) as pool:
+    shard_files = pool.map(_process_dataset_shard, args_list)
@@
-pool = Pool(processes=self.config.system.num_gpus)
-try:
-    gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-finally:
-    pool.close()
-    pool.join()
-    pool.terminate()
+with Pool(processes=self.config.system.num_gpus) as pool:
+    gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
```

Also applies to: 429-436
206-209: Optional: enable Jinja2 autoescape for safety.

Set autoescape to reduce surprises when rendering user-provided text.

```diff
-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-self.env = Environment(loader=BaseLoader())
+self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))
@@
-env = Environment(loader=BaseLoader())
+env = Environment(loader=BaseLoader(), autoescape=select_autoescape(default_for_string=True))
```

Also applies to: 656-657
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- notebooks/tutorials/dataset-preparation-for-RAG.ipynb (38 hunks)
- notebooks/use-cases/document-conversion-standard.ipynb (18 hunks)
- notebooks/use-cases/document-conversion-vlm.ipynb (17 hunks)
- notebooks/use-cases/hybrid-chunking-demonstration.ipynb (20 hunks)
- scripts/subset_selection/__main__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- notebooks/use-cases/hybrid-chunking-demonstration.ipynb
- notebooks/use-cases/document-conversion-standard.ipynb
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
EncoderConfig(107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
ArcticEmbedEncoder (59-205), EncoderConfig (48-56), encode (171-205)

scripts/subset_selection/utils/subset_selection_utils.py (3)

compute_pairwise_dense (87-148), get_default_num_gpus (67-84), retry_on_exception (21-64)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
main(1094-1152)
🪛 Pylint (4.0.1)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 60-60: Too many positional arguments (6/5)
(R0917)
[refactor] 59-59: Too few public methods (1/2)
(R0903)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 87-87: Too many positional arguments (7/5)
(R0917)
scripts/subset_selection/subset_selection.py
[refactor] 365-365: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 429-429: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 619-619: Too many statements (57/50)
(R0915)
[refactor] 800-800: Too many branches (16/12)
(R0912)
[refactor] 800-800: Too many statements (53/50)
(R0915)
🪛 Ruff (0.14.1)
scripts/subset_selection/encoders/arctic_encoder.py
70-72: Avoid specifying long messages outside the exception class
(TRY003)
118-121: Avoid specifying long messages outside the exception class
(TRY003)
149-152: Avoid specifying long messages outside the exception class
(TRY003)
162-165: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/utils/subset_selection_utils.py
35-35: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
35-35: Use explicit conversion flag
Replace with conversion flag
(RUF010)
39-41: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
40-40: Use explicit conversion flag
Replace with conversion flag
(RUF010)
45-45: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
45-45: Use explicit conversion flag
Replace with conversion flag
(RUF010)
49-49: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
49-49: Use explicit conversion flag
Replace with conversion flag
(RUF010)
53-53: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
53-53: Use explicit conversion flag
Replace with conversion flag
(RUF010)
81-83: Avoid specifying long messages outside the exception class
(TRY003)
129-129: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/subset_selection.py
54-57: Abstract raise to an inner function
(TRY301)
54-57: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Use explicit conversion flag
Replace with conversion flag
(RUF010)
84-84: Avoid specifying long messages outside the exception class
(TRY003)
180-180: Prefer TypeError exception for invalid type
(TRY004)
180-180: Avoid specifying long messages outside the exception class
(TRY003)
184-184: Prefer TypeError exception for invalid type
(TRY004)
184-184: Avoid specifying long messages outside the exception class
(TRY003)
186-188: Avoid specifying long messages outside the exception class
(TRY003)
190-190: Avoid specifying long messages outside the exception class
(TRY003)
206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
260-262: Avoid specifying long messages outside the exception class
(TRY003)
281-283: Avoid specifying long messages outside the exception class
(TRY003)
377-377: Avoid specifying long messages outside the exception class
(TRY003)
534-534: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
534-534: Use explicit conversion flag
Replace with conversion flag
(RUF010)
598-598: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
598-598: Use explicit conversion flag
Replace with conversion flag
(RUF010)
656-656: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
684-684: Abstract raise to an inner function
(TRY301)
684-684: Avoid specifying long messages outside the exception class
(TRY003)
741-741: Consider moving this statement to an else block
(TRY300)
746-746: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
746-746: Use explicit conversion flag
Replace with conversion flag
(RUF010)
824-824: Abstract raise to an inner function
(TRY301)
824-824: Avoid specifying long messages outside the exception class
(TRY003)
893-895: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
894-894: Use explicit conversion flag
Replace with conversion flag
(RUF010)
914-914: Consider moving this statement to an else block
(TRY300)
916-916: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
916-916: Use explicit conversion flag
Replace with conversion flag
(RUF010)
987-987: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
987-987: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1149-1149: Consider moving this statement to an else block
(TRY300)
1150-1150: Do not catch blind exception: Exception
(BLE001)
scripts/subset_selection/__main__.py
1-1: Shebang is present but file is not executable
(EXE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: execute_tests (notebooks/use-cases/document-conversion-standard.ipynb, https://raw.githubusercont...
```python
if self.cfg.use_fp16:
    self.model = self.model.half()

self.model = self.model.to(self.cfg.device)
logger.info(f"Model loaded on device: {self.cfg.device}")
```
Guard fp16 on CPU.
Calling .half() on CPU models can degrade or break ops. Apply fp16 only on CUDA; warn otherwise.
```diff
 if self.cfg.use_fp16:
-    self.model = self.model.half()
+    if self.cfg.device.type == "cuda":
+        self.model = self.model.to(dtype=torch.float16)
+    else:
+        logger.warning("use_fp16=True ignored on non-CUDA device")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if self.cfg.use_fp16:
    if self.cfg.device.type == "cuda":
        self.model = self.model.to(dtype=torch.float16)
    else:
        logger.warning("use_fp16=True ignored on non-CUDA device")

self.model = self.model.to(self.cfg.device)
logger.info(f"Model loaded on device: {self.cfg.device}")
```
@RobuRishabh it looks like you accidentally pushed notebooks that have already been merged into this PR.
Force-pushed from 7a580fa to abb5c87
Actionable comments posted: 0
♻️ Duplicate comments (9)
scripts/subset_selection/encoders/arctic_encoder.py (3)
104-115: Pin HF revision and restrict trust_remote_code (remote code execution risk).

Loading with `trust_remote_code=True` without a pinned revision allows arbitrary code execution and non-reproducible builds. Gate `trust_remote_code` behind an allowlist and require a pinned `revision` (commit SHA) for both tokenizer and model. Also pass `revision` when downloading; keep `local_files_only=True` for local paths.

Apply:

```diff
-# In testing mode, allow direct download from HuggingFace
-if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
+# In testing mode, allow direct download from HuggingFace (with pinned revision and allowlist)
+if hasattr(self.cfg, "testing_mode") and self.cfg.testing_mode:
     logger.warning(
         f"Model not found locally at {model_path}. "
         "Testing mode enabled - downloading from HuggingFace..."
     )
-    self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-    self.model = AutoModel.from_pretrained(
-        self.cfg.model_name,
-        add_pooling_layer=False,
-        trust_remote_code=True,
-    )
+    revision = os.environ.get("ARCTIC_MODEL_REV")
+    allowlist = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+    trust_rc = self.cfg.model_name in allowlist
+    if revision is None:
+        raise RuntimeError("ARCTIC_MODEL_REV env var must be set to a pinned commit SHA in testing_mode")
+    self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+    self.model = AutoModel.from_pretrained(
+        self.cfg.model_name,
+        add_pooling_layer=False,
+        trust_remote_code=trust_rc,
+        revision=revision,
+    )
 else:
     if not os.path.exists(model_path):
         raise ValueError(
             f"Model not found in available models: {self.cfg.model_name}\n"
             "Please run `ilab model download` and download the necessary model"
         )
     self.tokenizer = AutoTokenizer.from_pretrained(model_path)
     self.model = AutoModel.from_pretrained(
         model_path,
         add_pooling_layer=False,
-        trust_remote_code=True,
+        trust_remote_code=True,  # local allowlisted artifact
         local_files_only=True,
     )
```

Also applies to: 117-130
131-135: Guard fp16 on CPU.

`.half()` on CPU can break ops. Apply fp16 only on CUDA; warn otherwise.

```diff
-if self.cfg.use_fp16:
-    self.model = self.model.half()
+if self.cfg.use_fp16:
+    if self.cfg.device.type == "cuda":
+        self.model = self.model.to(dtype=torch.float16)
+    else:
+        logger.warning("use_fp16=True ignored on non-CUDA device")
```
182-200: Batch-tokenize and move per-batch to device to avoid OOM.

Pre-tokenizing the entire corpus and calling `.to(device)` moves all tensors to GPU at once. Tokenize inside the loop and move only that batch.

```diff
-encodings = self.tokenizer(
-    inputs,
-    max_length=self.cfg.model_config["max_length"],
-    padding=True,
-    truncation=True,
-    return_tensors="pt",
-).to(self.cfg.device)
-
 embeddings_list = []
 for i in tqdm(
     range(0, len(inputs), self.cfg.batch_size),
     disable=not show_progress or len(inputs) < 256,
 ):
-    batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
+    batch_texts = inputs[i : i + self.cfg.batch_size]
+    enc = self.tokenizer(
+        batch_texts,
+        max_length=self.cfg.model_config["max_length"],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
+    batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
     outputs = self.model(**batch)
     # Take the first token embedding (CLS) and normalize it
     embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
     embeddings_list.append(embeddings.cpu())
```

scripts/subset_selection/README.md (1)
269-276: Replace developer-specific absolute path with project-relative path.

```diff
-# Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+# Navigate to project root
+cd odh-data-processing
```

scripts/subset_selection/__main__.py (1)
1-1: Drop the shebang; module entry points don't need it.

```diff
-#!/usr/bin/env python3
```

scripts/subset_selection/utils/subset_selection_utils.py (1)
122-129: Clamp RBF scaling to avoid NaNs when avg_dist ≈ 0.

Division by `kw * avg_dist` can be 0/0 for identical embeddings. Clamp the scale away from zero.

```diff
 if metric == "rbf":
     distance = torch.cdist(a, b)
     squared_distance = distance**2
-    avg_dist = torch.mean(squared_distance)
-    torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+    avg_dist = torch.mean(squared_distance)
+    eps = torch.finfo(squared_distance.dtype).eps
+    scale = kw * torch.clamp(avg_dist, min=eps)
+    torch.div(squared_distance, scale, out=squared_distance)
     torch.exp(-squared_distance, out=squared_distance)
     return squared_distance
```

scripts/subset_selection/subset_selection.py (3)
278-287: Fix percentage normalization inconsistency (floats up to 100 allowed in config).

`ProcessingConfig.__post_init__` allows floats in 0–100, but `calculate_subset_size` rejects floats >1. Normalize floats >1 to percentages.

```diff
-if isinstance(size_spec, float):
-    # if not in range of 0 to 1, raise error
-    if size_spec <= 0 or size_spec > 1:
-        raise ValueError(
-            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-        )
-    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-    return max(1, int((size_spec) * total_samples))
+if isinstance(size_spec, float):
+    percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+    if percentage <= 0 or percentage > 1:
+        raise ValueError("Percentage values must be in (0, 100]")
+    return max(1, int(percentage * total_samples))
```
829-887: Skip empty folds and cap per-fold budgets to fold size.

Empty folds and over-budget requests crash `FacilityLocationFunction.maximize`. Skip zero-sized folds and clamp budgets.

```diff
 for fold_idx, fold_indices in gpu_folds_info:
+    fold_size = len(fold_indices)
+    if fold_size == 0:
+        logger.info("Skipping empty fold %d on GPU %d", fold_idx + 1, gpu_id)
+        continue
@@
     for size_spec in subset_sizes:
         if isinstance(size_spec, float):
-            # Percentage-based selection
-            budget = max(
-                1, math.ceil(size_spec * similarity_matrix.shape[0])
-            )
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            budget = max(1, math.ceil(percentage * fold_size))
         else:
             # Absolute number-based selection
-            budget = max(
-                1,
-                math.ceil(
-                    size_spec * (similarity_matrix.shape[0] / total_samples)
-                ),
-            )
+            budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+        budget = min(budget, fold_size)
+        if budget == 0:
+            continue
```
608-615: Write JSON vs JSONL correctly.

Only use `lines=True` for `.jsonl`; for `.json` write a normal JSON array.

```diff
-if extension in ["json", "jsonl"]:
-    subset_data.to_json(output_file, orient="records", lines=True)
+if extension == "jsonl":
+    subset_data.to_json(output_file, orient="records", lines=True)
+elif extension == "json":
+    subset_data.to_json(output_file, orient="records", lines=False)
 elif extension == "csv":
     subset_data.to_csv(output_file, index=False)
 elif extension == "parquet":
     subset_data.to_parquet(output_file)
```
🧹 Nitpick comments (6)
scripts/subset_selection/README.md (1)
99-116: Docs polish: fenced code languages + install note.
- Add language to fenced blocks to satisfy markdownlint (CLI options, tree, traceback).
- Prefer repo-root pinned install as per reviewer guidance; keep CUDA index hint.
````diff
-```
+```text
@@
-```
+```
@@
-```
+```text
@@
-```
+```python
@@
-The original codebase can be found at: [https://github.com/krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
+The original codebase can be found at: [krishnatejakk/DataCurate4LLMs](https://github.com/krishnatejakk/DataCurate4LLMs)
````

Optionally, update Installation to:

```diff
-pip install -r scripts/subset_selection/requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
+pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
```

Would you like me to push the README tweaks as a follow-up commit?
Also applies to: 231-242, 290-299, 353-359
scripts/subset_selection/utils/subset_selection_utils.py (2)
35-53: Prefer logger.exception to retain stack traces on failures.

Switch `logger.error(f"...{str(e)}")` to `logger.exception("...%s", e)` to capture tracebacks; applies to all except blocks in this wrapper.

```diff
-logger.error(f"GPU out of memory on attempt {attempt + 1}: {str(e)}")
+logger.exception("GPU out of memory on attempt %d: %s", attempt + 1, e)
@@
-logger.error(
-    f"PyTorch runtime error on attempt {attempt + 1}: {str(e)}"
-)
+logger.exception("PyTorch runtime error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Value error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Value error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Type error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Type error on attempt %d: %s", attempt + 1, e)
@@
-logger.error(f"Index error on attempt {attempt + 1}: {str(e)}")
+logger.exception("Index error on attempt %d: %s", attempt + 1, e)
```
96-108: Memory caution: NxN result matrix can be large.
resultsisn1 x n2on CPU; for big folds this can be GBs. Consider streaming selection or memory-mapped storage if folds exceed a threshold, or compute sparse top‑k similarities per row to reduce footprint.Would you like a prototype that backs results with a memmap and yields row-chunks to the optimizer?
scripts/subset_selection/subset_selection.py (3)
365-372: Use context manager for multiprocessing.Pool to ensure cleanup.

Replace manual close/join/terminate with `with Pool(...) as pool:` for safer resource management and clearer intent.

```diff
-pool = Pool(processes=num_gpus)
-try:
-    shard_files = pool.map(_process_dataset_shard, args_list)
-finally:
-    pool.close()
-    pool.join()
-    pool.terminate()
+with Pool(processes=num_gpus) as pool:
+    shard_files = pool.map(_process_dataset_shard, args_list)
@@
-pool = Pool(processes=self.config.system.num_gpus)
-try:
-    gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
-finally:
-    pool.close()
-    pool.join()
-    pool.terminate()
+with Pool(processes=self.config.system.num_gpus) as pool:
+    gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
```

Also applies to: 429-436
654-675: Jinja2 autoescape off by default; enable for safety in templating.

Even for CLI utilities, enabling `autoescape` for common text types is safer.

```diff
-from jinja2 import BaseLoader, Environment
+from jinja2 import BaseLoader, Environment, select_autoescape
@@
-self.env = Environment(loader=BaseLoader())
+self.env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))
@@
-env = Environment(loader=BaseLoader())
+env = Environment(loader=BaseLoader(), autoescape=select_autoescape(["html", "xml"]))
```

Also applies to: 206-213
531-534: Preserve stack traces in logs for easier debugging.

Use `logger.exception("...%s", e)` inside except blocks instead of `logger.error(...)`. I can push a quick pass converting these to `logger.exception` if you'd like.

Also applies to: 595-598, 984-986, 891-894, 914-919
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__main__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
main(1093-1174)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
EncoderConfig(107-118)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
ArcticEmbedEncoder (59-205), EncoderConfig (48-56), encode (171-205)

scripts/subset_selection/utils/subset_selection_utils.py (3)

compute_pairwise_dense (87-148), get_default_num_gpus (67-84), retry_on_exception (21-64)
🪛 LanguageTool
scripts/subset_selection/README.md
[style] ~129-~129: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...
(ADVERB_REPETITION_PREMIUM)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
99-99: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
231-231: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
290-290: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
355-355: Bare URL used
(MD034, no-bare-urls)
🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 87-87: Too many positional arguments (7/5)
(R0917)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 60-60: Too many positional arguments (6/5)
(R0917)
[refactor] 59-59: Too few public methods (1/2)
(R0903)
scripts/subset_selection/subset_selection.py
[refactor] 365-365: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 429-429: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 617-617: Too many statements (57/50)
(R0915)
[refactor] 798-798: Too many branches (16/12)
(R0912)
[refactor] 798-798: Too many statements (53/50)
(R0915)
🪛 Ruff (0.14.1)
scripts/subset_selection/__main__.py
1-1: Shebang is present but file is not executable
(EXE001)
scripts/subset_selection/utils/subset_selection_utils.py
35-35: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
35-35: Use explicit conversion flag
Replace with conversion flag
(RUF010)
39-41: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
40-40: Use explicit conversion flag
Replace with conversion flag
(RUF010)
45-45: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
45-45: Use explicit conversion flag
Replace with conversion flag
(RUF010)
49-49: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
49-49: Use explicit conversion flag
Replace with conversion flag
(RUF010)
53-53: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
53-53: Use explicit conversion flag
Replace with conversion flag
(RUF010)
81-83: Avoid specifying long messages outside the exception class
(TRY003)
129-129: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/encoders/arctic_encoder.py
70-72: Avoid specifying long messages outside the exception class
(TRY003)
118-121: Avoid specifying long messages outside the exception class
(TRY003)
149-152: Avoid specifying long messages outside the exception class
(TRY003)
162-165: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/subset_selection.py
54-57: Abstract raise to an inner function
(TRY301)
54-57: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Use explicit conversion flag
Replace with conversion flag
(RUF010)
84-84: Avoid specifying long messages outside the exception class
(TRY003)
180-180: Prefer TypeError exception for invalid type
(TRY004)
180-180: Avoid specifying long messages outside the exception class
(TRY003)
184-184: Prefer TypeError exception for invalid type
(TRY004)
184-184: Avoid specifying long messages outside the exception class
(TRY003)
186-188: Avoid specifying long messages outside the exception class
(TRY003)
190-190: Avoid specifying long messages outside the exception class
(TRY003)
206-206: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
231-231: Avoid specifying long messages outside the exception class
(TRY003)
260-262: Avoid specifying long messages outside the exception class
(TRY003)
281-283: Avoid specifying long messages outside the exception class
(TRY003)
377-377: Avoid specifying long messages outside the exception class
(TRY003)
532-532: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
532-532: Use explicit conversion flag
Replace with conversion flag
(RUF010)
596-596: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
596-596: Use explicit conversion flag
Replace with conversion flag
(RUF010)
654-654: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
682-682: Abstract raise to an inner function
(TRY301)
682-682: Avoid specifying long messages outside the exception class
(TRY003)
739-739: Consider moving this statement to an else block
(TRY300)
744-744: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
744-744: Use explicit conversion flag
Replace with conversion flag
(RUF010)
822-822: Abstract raise to an inner function
(TRY301)
822-822: Avoid specifying long messages outside the exception class
(TRY003)
891-893: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
892-892: Use explicit conversion flag
Replace with conversion flag
(RUF010)
912-912: Consider moving this statement to an else block
(TRY300)
914-914: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
914-914: Use explicit conversion flag
Replace with conversion flag
(RUF010)
985-985: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
985-985: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1116-1118: Avoid specifying long messages outside the exception class
(TRY003)
1128-1130: Avoid specifying long messages outside the exception class
(TRY003)
1136-1136: Avoid specifying long messages outside the exception class
(TRY003)
1171-1171: Consider moving this statement to an else block
(TRY300)
1172-1172: Do not catch blind exception: Exception
(BLE001)
Force-pushed from abb5c87 to 158f73b
Actionable comments posted: 1
♻️ Duplicate comments (9)
scripts/subset_selection/requirements.txt (1)
70-72: Duplicate: torch 2.5.2 remains vulnerable to critical CVEs

This issue was already flagged in previous reviews. OSV continues to report 4 CRITICAL vulnerabilities for torch 2.5.2 (PYSEC-2025-41, GHSA-3749-ghw9-m3mg, GHSA-53q9-r3pm-6pq6, GHSA-887c-mr87-cxwp). Please verify if a patched version is available or document a mitigation strategy.
scripts/subset_selection/encoders/arctic_encoder.py (3)
97-132: Duplicate: Security risk with unrestricted trust_remote_code

This issue was flagged in previous reviews. All three model loading paths (custom path, local cache, HuggingFace download) use `trust_remote_code=True` without pinning a revision or maintaining an allowlist, allowing arbitrary remote code execution.
183-201: Duplicate: Pre-tokenizing entire input list causes OOM on large batches

This issue was flagged in previous reviews. Lines 183-189 tokenize the entire `inputs` list and move all tensors to GPU via `.to(self.cfg.device)` before batching, defeating the purpose of batching and causing OOM on large datasets. Tokenization should happen inside the batch loop.
134-135: Duplicate: FP16 applied without checking device type

This issue was flagged in previous reviews. Calling `.half()` on CPU models can degrade performance or break operations. FP16 should only be applied to CUDA devices.

scripts/subset_selection/utils/subset_selection_utils.py (1)
116-122: Duplicate: RBF scaling can divide by zero causing NaNs

This issue was flagged in previous reviews. When embeddings are identical or nearly so, `avg_dist` becomes zero, making the division at line 120 perform 0/0 and yielding NaNs that propagate through the similarity matrix.

scripts/subset_selection/subset_selection.py (3)
277-286: Duplicate: Float percentage values > 1 are rejected despite config allowing them

This issue was flagged in previous reviews. Lines 279-282 reject float values > 1, but `ProcessingConfig.__post_init__` (line 184) explicitly allows floats up to 100 to represent percentages. A config with `subset_sizes=[10.0]` passes validation but fails here at runtime.
607-608: Duplicate: JSON files written as JSONL

This issue was flagged in previous reviews. Line 608 uses `lines=True` for both `.json` and `.jsonl` extensions, producing JSONL format for `.json` files. Only `.jsonl` should use `lines=True`.
824-882: Duplicate: Empty folds and uncapped budgets cause crashes

This issue was flagged in previous reviews. When `len(embeddings) < num_folds`, empty folds are created. When subset size exceeds dataset size, per-fold budgets can exceed fold cardinality. Both cases cause `FacilityLocationFunction.maximize` to fail. Empty folds should be skipped and budgets capped to fold size.

scripts/subset_selection/__main__.py (1)
1-1: Shebang still present despite being marked as addressed

The shebang on line 1 is unnecessary for module execution via `python -m subset_selection` and triggers Ruff EXE001. Although this was marked as addressed in a previous review, the shebang remains in the code.

Apply this diff to remove it:

```diff
-#!/usr/bin/env python3
 """
 Entry point for subset selection when run as a module.
 """
```

Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- scripts/subset_selection/.gitignore (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__main__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (3)
scripts/subset_selection/subset_selection.py (2)
scripts/subset_selection/encoders/arctic_encoder.py (3)
ArcticEmbedEncoder (58-206), EncoderConfig (48-55), encode (172-206)

scripts/subset_selection/utils/subset_selection_utils.py (3)

compute_pairwise_dense (81-142), get_default_num_gpus (67-78), retry_on_exception (21-64)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
EncoderConfig(107-118)
scripts/subset_selection/__main__.py (1)
scripts/subset_selection/subset_selection.py (1)
main(1090-1173)
🪛 LanguageTool
scripts/subset_selection/README.md
[style] ~130-~130: This adverb was used twice in the sentence. Consider removing one of them or replacing them with a synonym.
Context: ... = exactly 1000 samples - "5000" = exactly 5000 samples - Example: `--subset-si...
(ADVERB_REPETITION_PREMIUM)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
233-233: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
300-300: Bare URL used
(MD034, no-bare-urls)
🪛 OSV Scanner (2.2.3)
scripts/subset_selection/requirements.txt
[CRITICAL] 1-1: torch 2.5.2: undefined
(PYSEC-2025-41)
[CRITICAL] 1-1: torch 2.5.2: PyTorch susceptible to local Denial of Service
[CRITICAL] 1-1: torch 2.5.2: PyTorch: torch.load with weights_only=True leads to remote code execution
[CRITICAL] 1-1: torch 2.5.2: PyTorch Improper Resource Shutdown or Release vulnerability
🪛 Pylint (4.0.1)
scripts/subset_selection/subset_selection.py
[refactor] 364-364: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 427-427: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 615-615: Too many statements (57/50)
(R0915)
[refactor] 796-796: Too many branches (15/12)
(R0912)
[refactor] 796-796: Too many statements (51/50)
(R0915)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 59-59: Too many positional arguments (6/5)
(R0917)
[refactor] 58-58: Too few public methods (1/2)
(R0903)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 81-81: Too many positional arguments (7/5)
(R0917)
🪛 Ruff (0.14.1)
scripts/subset_selection/subset_selection.py
54-57: Abstract raise to an inner function
(TRY301)
54-57: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Avoid specifying long messages outside the exception class
(TRY003)
60-60: Use explicit conversion flag
Replace with conversion flag
(RUF010)
84-84: Avoid specifying long messages outside the exception class
(TRY003)
179-179: Prefer TypeError exception for invalid type
(TRY004)
179-179: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Prefer TypeError exception for invalid type
(TRY004)
183-183: Avoid specifying long messages outside the exception class
(TRY003)
185-187: Avoid specifying long messages outside the exception class
(TRY003)
189-189: Avoid specifying long messages outside the exception class
(TRY003)
205-205: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
230-230: Avoid specifying long messages outside the exception class
(TRY003)
259-261: Avoid specifying long messages outside the exception class
(TRY003)
280-282: Avoid specifying long messages outside the exception class
(TRY003)
376-376: Avoid specifying long messages outside the exception class
(TRY003)
530-530: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
530-530: Use explicit conversion flag
Replace with conversion flag
(RUF010)
594-594: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
594-594: Use explicit conversion flag
Replace with conversion flag
(RUF010)
652-652: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.
(S701)
680-680: Abstract raise to an inner function
(TRY301)
680-680: Avoid specifying long messages outside the exception class
(TRY003)
737-737: Consider moving this statement to an else block
(TRY300)
742-742: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
742-742: Use explicit conversion flag
Replace with conversion flag
(RUF010)
886-888: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
887-887: Use explicit conversion flag
Replace with conversion flag
(RUF010)
907-907: Consider moving this statement to an else block
(TRY300)
909-909: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
909-909: Use explicit conversion flag
Replace with conversion flag
(RUF010)
979-979: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
979-979: Use explicit conversion flag
Replace with conversion flag
(RUF010)
1113-1115: Avoid specifying long messages outside the exception class
(TRY003)
1125-1127: Avoid specifying long messages outside the exception class
(TRY003)
1133-1133: Avoid specifying long messages outside the exception class
(TRY003)
1170-1170: Consider moving this statement to an else block
(TRY300)
1171-1171: Do not catch blind exception: Exception
(BLE001)
scripts/subset_selection/encoders/arctic_encoder.py
69-71: Avoid specifying long messages outside the exception class
(TRY003)
150-153: Avoid specifying long messages outside the exception class
(TRY003)
163-166: Avoid specifying long messages outside the exception class
(TRY003)
scripts/subset_selection/__main__.py
1-1: Shebang is present but file is not executable
(EXE001)
scripts/subset_selection/utils/subset_selection_utils.py
35-35: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
35-35: Use explicit conversion flag
Replace with conversion flag
(RUF010)
39-41: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
40-40: Use explicit conversion flag
Replace with conversion flag
(RUF010)
45-45: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
45-45: Use explicit conversion flag
Replace with conversion flag
(RUF010)
49-49: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
49-49: Use explicit conversion flag
Replace with conversion flag
(RUF010)
53-53: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
53-53: Use explicit conversion flag
Replace with conversion flag
(RUF010)
123-123: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (1)
scripts/subset_selection/README.md (1)
294-303: Credits section properly addresses attribution requirementThe addition of the Credits and Acknowledgements section properly attributes the original author (Krishnateja Killamsetty) and links to the DataCurate4LLMs repository, addressing the PR objective to add attribution.
```python
from tqdm import tqdm

# Local
from encoders.arctic_encoder import ArcticEmbedEncoder
```
Import uses relative path instead of absolute module path
The import on line 22 uses from encoders.arctic_encoder import ArcticEmbedEncoder with a relative path. Based on feedback from alimaredia, this should use an absolute import path to properly reference the module within the package structure.
Consider updating to use the full module path or ensure the package structure supports this relative import pattern consistently.
🤖 Prompt for AI Agents
In scripts/subset_selection/subset_selection.py at line 22, the import uses a
relative/non-absolute form "from encoders.arctic_encoder import
ArcticEmbedEncoder"; change it to the package's absolute module path (for
example "from my_package.encoders.arctic_encoder import ArcticEmbedEncoder" or
whatever the repository's root package is) so imports resolve consistently; if
the project uses a top-level package name adjust the prefix accordingly and
update other imports or __init__.py files to maintain consistent absolute
imports across the package.
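For illustration, assuming the repository is importable as a package rooted at `scripts`, the absolute form would look like this (the exact prefix depends on the final package layout, so treat it as an assumption):

```python
# Hypothetical absolute import; adjust the prefix to the repo's actual root package
from scripts.subset_selection.encoders.arctic_encoder import ArcticEmbedEncoder
```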
Signed-off-by: roburishabh <[email protected]>
Signed-off-by: roburishabh <[email protected]>
Force-pushed from 158f73b to 408e383
Actionable comments posted: 0
♻️ Duplicate comments (9)
scripts/subset_selection/README.md (1)
211-214: Replace developer-specific absolute path with a portable example.

Use a relative path to avoid leaking local paths and to make instructions reusable.

```diff
 # Navigate to project root
-cd /Users/roburishabh/Github/odh-data-processing
+cd odh-data-processing
+# or: cd <project-root>
```
# Navigate to project root -cd /Users/roburishabh/Github/odh-data-processing +cd odh-data-processing +# or: cd <project-root>scripts/subset_selection/cli.py (1)
1-1: Make CLI script executable (shebang present).

Set the executable bit so it can be run directly.

```bash
chmod +x scripts/subset_selection/cli.py
```

scripts/subset_selection/utils/subset_selection_utils.py (1)
121-127: Prevent NaNs in RBF scaling when avg_dist → 0.

avg_dist can be zero for identical embeddings; divide-by-zero yields NaNs that break downstream selection. Clamp the scale.

```diff
 if metric == "rbf":
     distance = torch.cdist(a, b)
     squared_distance = distance**2
-    avg_dist = torch.mean(squared_distance)
-    torch.div(squared_distance, kw * avg_dist, out=squared_distance)
+    avg_dist = torch.mean(squared_distance)
+    eps = torch.finfo(squared_distance.dtype).eps
+    scale = kw * torch.clamp(avg_dist, min=eps)
+    torch.div(squared_distance, scale, out=squared_distance)
     torch.exp(-squared_distance, out=squared_distance)
     return squared_distance
```

(previously raised by bot)
scripts/subset_selection/subset_selection.py (3)
253-263: Normalize percentage floats and align with config (0–100).

Config accepts floats up to 100, but this method rejects >1. Normalize >1 by dividing by 100, then validate and compute.

```diff
-if isinstance(size_spec, float):
-    # if not in range of 0 to 1, raise error
-    if size_spec <= 0 or size_spec > 1:
-        raise ValueError(
-            "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-        )
-    # If between 0 and 1, treat as decimal percentage (0.5 = 50%)
-    return max(1, int((size_spec) * total_samples))
+if isinstance(size_spec, float):
+    # Accept 0–1 (fraction) or 0–100 (percent)
+    percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+    if percentage <= 0 or percentage > 1:
+        raise ValueError("Percentage values must be in (0, 100]")
+    return max(1, int(percentage * total_samples))
```

(previously raised by bot)
738-847: Skip empty folds and clamp per-fold budgets to fold size.

Prevents optimizer crashes when a fold is empty or the budget exceeds the fold size. Also normalize float percentages consistently.

```diff
 for fold_idx, fold_indices in gpu_folds_info:
+    fold_size = len(fold_indices)
+    if fold_size == 0:
+        logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+        continue
     try:
@@
     for size_spec in subset_sizes:
         if isinstance(size_spec, float):
-            # Percentage-based selection
-            budget = max(
-                1, math.ceil(size_spec * similarity_matrix.shape[0])
-            )
+            # Percentage-based selection (accept 0–1 or 0–100)
+            percentage = (size_spec / 100.0) if size_spec > 1 else size_spec
+            budget = max(1, math.ceil(percentage * fold_size))
         else:
             # Absolute number-based selection
-            budget = max(
-                1,
-                math.ceil(
-                    size_spec * (similarity_matrix.shape[0] / total_samples)
-                ),
-            )
+            budget = max(1, math.ceil(size_spec * (fold_size / total_samples)))
+        budget = min(budget, fold_size)
+        if budget == 0:
+            continue
```

(previously raised by bot)
570-577: Write JSON vs JSONL correctly.

Only use lines=True for .jsonl; write standard JSON for .json.

```diff
 extension = input_file.split(".")[-1]
-if extension in ["json", "jsonl"]:
-    subset_data.to_json(output_file, orient="records", lines=True)
+if extension == "jsonl":
+    subset_data.to_json(output_file, orient="records", lines=True)
+elif extension == "json":
+    subset_data.to_json(output_file, orient="records", lines=False)
 elif extension == "csv":
     subset_data.to_csv(output_file, index=False)
 elif extension == "parquet":
     subset_data.to_parquet(output_file)
```

(previously raised by bot)
scripts/subset_selection/encoders/arctic_encoder.py (3)
130-134: Guard fp16 on CPU devices.

Calling half() on CPU can degrade or break operations. Apply fp16 only on CUDA.

```diff
-if self.cfg.use_fp16:
-    self.model = self.model.half()
+if self.cfg.use_fp16:
+    if self.cfg.device.type == "cuda":
+        self.model = self.model.to(dtype=torch.float16)
+    else:
+        logger.warning("use_fp16=True ignored on non-CUDA device")
```
181-199: Critical: avoid pre-tokenizing the entire corpus on GPU; batch-tokenize inside the loop.

Current code tokenizes all inputs and moves them to the device, which will OOM on large datasets. Tokenize per batch and only move that batch.

```diff
-encodings = self.tokenizer(
-    inputs,
-    max_length=self.cfg.model_config["max_length"],
-    padding=True,
-    truncation=True,
-    return_tensors="pt",
-).to(self.cfg.device)
-
 embeddings_list = []
 for i in tqdm(
     range(0, len(inputs), self.cfg.batch_size),
     disable=not show_progress or len(inputs) < 256,
 ):
-    batch = {k: v[i : i + self.cfg.batch_size] for k, v in encodings.items()}
-    outputs = self.model(**batch)
+    batch_texts = inputs[i : i + self.cfg.batch_size]
+    enc = self.tokenizer(
+        batch_texts,
+        max_length=self.cfg.model_config["max_length"],
+        padding=True,
+        truncation=True,
+        return_tensors="pt",
+    )
+    batch = {k: v.to(self.cfg.device) for k, v in enc.items()}
+    outputs = self.model(**batch)
     # Take the first token embedding (CLS) and normalize it
     embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
     embeddings_list.append(embeddings.cpu())
```
103-115: Security posture: gate trust_remote_code and pin revisions.

trust_remote_code=True executes arbitrary repo code. Pin a revision and limit trust to allowlisted repos (even in testing mode).

```diff
-self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name)
-self.model = AutoModel.from_pretrained(
-    self.cfg.model_name,
-    add_pooling_layer=False,
-    trust_remote_code=True,
-)
+revision = os.environ.get("ARCTIC_MODEL_REV")  # optional: pass via config
+allowlisted = {"Snowflake/snowflake-arctic-embed-l-v2.0"}
+trust = self.cfg.model_name in allowlisted
+self.tokenizer = AutoTokenizer.from_pretrained(self.cfg.model_name, revision=revision)
+self.model = AutoModel.from_pretrained(
+    self.cfg.model_name,
+    add_pooling_layer=False,
+    trust_remote_code=trust,
+    revision=revision,
+)
@@
-self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+revision = os.environ.get("ARCTIC_MODEL_REV")
+self.tokenizer = AutoTokenizer.from_pretrained(model_path, revision=revision)
 self.model = AutoModel.from_pretrained(
     model_path,
     add_pooling_layer=False,
-    trust_remote_code=True,
+    trust_remote_code=False,
     local_files_only=True,
+    revision=revision,
 )
```
revisionin config instead of env var for reproducibility. Based on learnings.
🧹 Nitpick comments (8)
scripts/subset_selection/README.md (3)
64-81: Add language to fenced block to satisfy MD040 and improve readability.

Specify a language for the CLI options block (e.g., "text") to fix markdownlint MD040:

```text
Required:
  --input <file> [<file> ...]  Input file(s) to process (JSONL, JSON, CSV, Parquet)
  --subset-sizes <sizes>       Comma-separated sizes (e.g., "0.1,0.5" or "1000,5000")

Optional:
  --output-dir <dir>           Output directory (default: output)
  --batch-size <int>           Batch size for processing (default: 100000)
  --num-folds <int>            Number of folds/partitions (default: 50)
  --epsilon <float>            Optimization parameter (default: 160.0)
  --num-gpus <int>             Number of GPUs to use (default: auto-detect)
  --combine-files              Combine multiple input files before processing
  --testing-mode               Enable CPU mode for testing
  --encoder-type <str>         Encoder type (default: arctic)
  --encoder-model <str>        Model name (default: Snowflake/snowflake-arctic-embed-l-v2.0)
  --template-name <str>        Template name (default: conversation)
  --seed <int>                 Random seed (default: 42)
```

168-183: Add language to directory tree block to satisfy MD040.

Mark the directory tree block as "text":

```text
scripts/
├── __init__.py                 # Top-level package initialization
└── subset_selection/
    ├── __init__.py             # Subset selection package initialization
    ├── subset_selection.py     # Main subset selection logic
    ├── cli.py                  # Command-line interface
    ├── requirements.txt        # Package dependencies
    ├── README.md               # This file
    ├── encoders/
    │   ├── __init__.py         # Encoder registry
    │   └── arctic_encoder.py   # Arctic embedding encoder
    └── utils/
        ├── __init__.py         # Utils initialization
        └── subset_selection_utils.py  # Utility functions
```
`12-18`: **Installation note: pin Python and show single-command install (optional).**

Docs currently install from scripts/subset_selection/requirements.txt. Consider adding:
- Explicit Python version (3.12) as requested by reviewer.
- Single command with CUDA wheel index for PyTorch (if CUDA 12.1 is target).
Proposed addition after the current snippet:
```bash
# Requires Python 3.12
python3.12 -m pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121
```

Please confirm which requirements file is the canonical source (repo root vs scripts/subset_selection) and the intended CUDA toolchain for this project. Based on learnings.
scripts/subset_selection/encoders/__init__.py (1)
`12-23`: **Avoid blanket try/except; raise precise errors.**

The extra try/except re-wraps all errors as ValueError, obscuring root causes. Let the explicit unsupported-type check raise, and return directly otherwise.
```diff
-def get_encoder_class(encoder_type: str):
-    """Get the encoder class based on the encoder type."""
-    try:
-        if encoder_type not in ENCODER_REGISTRY:
-            supported_encoders = list(ENCODER_REGISTRY.keys())
-            raise ValueError(
-                f"Unsupported encoder type: '{encoder_type}'. "
-                f"Supported types are: {supported_encoders}"
-            )
-        return ENCODER_REGISTRY[encoder_type]
-    except Exception as e:
-        raise ValueError(f"Error getting encoder class: {str(e)}") from e
+def get_encoder_class(encoder_type: str):
+    """Get the encoder class based on the encoder type."""
+    if encoder_type not in ENCODER_REGISTRY:
+        supported = list(ENCODER_REGISTRY.keys())
+        raise ValueError(f"Unsupported encoder type: '{encoder_type}'. Supported types: {supported}")
+    return ENCODER_REGISTRY[encoder_type]
```
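To see why the blanket handler hurts, here is a toy registry where an unrelated type bug (a `.lower()` call on None) is re-labeled as a configuration error; everything below is illustrative:

```python
ENCODER_REGISTRY = {"arctic": object}  # stand-in for the real registry

def lookup_wrapped(encoder_type):
    try:
        return ENCODER_REGISTRY[encoder_type.lower()]  # AttributeError if None slips in
    except Exception as e:  # blanket handler re-labels the real bug
        raise ValueError(f"Error getting encoder class: {e}") from e

def lookup_precise(encoder_type):
    if encoder_type not in ENCODER_REGISTRY:
        raise ValueError(f"Unsupported encoder type: {encoder_type!r}")
    return ENCODER_REGISTRY[encoder_type]

for fn in (lookup_wrapped, lookup_precise):
    try:
        fn(None)
    except ValueError as e:
        print(f"{fn.__name__}: {e}")
# lookup_wrapped: Error getting encoder class: 'NoneType' object has no attribute 'lower'
# lookup_precise: Unsupported encoder type: None
```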
scripts/subset_selection/cli.py (1)

`106-114`: **Clarify subset-size parsing; support '%' to avoid ambiguity.**

Current logic treats "1" as absolute size 1, not 100%. Add "%" support and keep decimals for percentages to prevent surprises.
```diff
-    subset_sizes = []
-    for size_str in args.subset_sizes.split(","):
-        size_str = size_str.strip()
-        if "." in size_str:
-            subset_sizes.append(float(size_str))
-        else:
-            subset_sizes.append(int(size_str))
+    subset_sizes = []
+    for raw in args.subset_sizes.split(","):
+        s = raw.strip()
+        if s.endswith("%"):
+            subset_sizes.append(float(s[:-1]) / 100.0)
+            continue
+        # Decimals (e.g., 0.1) are treated as percentages; integers are absolute counts
+        subset_sizes.append(float(s) if "." in s else int(s))
```

Consider documenting the "%" option in README. Based on learnings.
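A quick sanity check of how the proposed parsing reads a mixed specification; the helper name is ours, not the CLI's:

```python
def parse_subset_sizes(spec: str):
    """'%' means percentage, decimals mean fractions, bare integers mean counts."""
    sizes = []
    for raw in spec.split(","):
        s = raw.strip()
        if s.endswith("%"):
            sizes.append(float(s[:-1]) / 100.0)
        else:
            sizes.append(float(s) if "." in s else int(s))
    return sizes

print(parse_subset_sizes("10%, 0.5, 1000"))  # [0.1, 0.5, 1000]
```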
scripts/subset_selection/utils/subset_selection_utils.py (1)
`14-18`: **Do not call logging.basicConfig in library code.**

basicConfig at import time can override app logging. Let the application configure logging; keep a module logger only.
```diff
-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```
scripts/subset_selection/subset_selection.py (2)

`31-36`: **Avoid setting logging.basicConfig in a library module.**

Leave logging configuration to the application/CLI; keep module logger only.
```diff
-# Configure logging
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-)
 logger = logging.getLogger(__name__)
```
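A sketch of the suggested split, with library modules keeping only a module-level logger and the CLI entry point owning the single `basicConfig` call; the `--verbose` flag is illustrative:

```python
# In a library module (e.g., subset_selection.py): no global configuration
import logging

logger = logging.getLogger(__name__)

# In the application entry point (e.g., cli.py): configure once at startup
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args()
    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )
    logger.info("logging configured by the CLI, not the library")

if __name__ == "__main__":
    main()
```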
`275-277`: **Name percentage subsets as whole percentages (e.g., 10.0%).**

Prefer human-friendly names like "percent_10.0" instead of "percent_0.1".
```diff
-    if isinstance(size_spec, float):
-        return f"percent_{size_spec:.1f}"
+    if isinstance(size_spec, float):
+        pct = (size_spec * 100.0) if size_spec <= 1 else size_spec
+        return f"percent_{pct:.1f}"
```
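With that change, fractional and percent-valued floats both map to whole-percent names. A quick check (the integer `samples_` branch is a hypothetical stand-in, since that part of the function is not shown in the diff):

```python
def subset_name(size_spec):
    if isinstance(size_spec, float):
        pct = (size_spec * 100.0) if size_spec <= 1 else size_spec
        return f"percent_{pct:.1f}"
    return f"samples_{size_spec}"  # hypothetical integer branch

print(subset_name(0.1))   # percent_10.0
print(subset_name(25.0))  # percent_25.0
print(subset_name(5000))  # samples_5000
```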
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- scripts/subset_selection/.gitignore (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- scripts/subset_selection/__init__.py (1 hunks)
- scripts/subset_selection/cli.py (1 hunks)
- scripts/subset_selection/encoders/__init__.py (1 hunks)
- scripts/subset_selection/encoders/arctic_encoder.py (1 hunks)
- scripts/subset_selection/requirements.txt (1 hunks)
- scripts/subset_selection/subset_selection.py (1 hunks)
- scripts/subset_selection/utils/__init__.py (1 hunks)
- scripts/subset_selection/utils/subset_selection_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- scripts/subset_selection/requirements.txt
- scripts/subset_selection/.gitignore
🧰 Additional context used
🧬 Code graph analysis (6)
scripts/subset_selection/utils/__init__.py (1)
scripts/subset_selection/utils/subset_selection_utils.py (3)
- `compute_pairwise_dense` (86-147)
- `get_default_num_gpus` (66-83)
- `retry_on_exception` (20-63)
scripts/subset_selection/__init__.py (2)
scripts/subset_selection/subset_selection.py (8)
- `BasicConfig` (39-78)
- `DataProcessor` (168-576)
- `EncoderConfig` (82-93)
- `ProcessingConfig` (127-165)
- `SystemConfig` (112-123)
- `TemplateConfig` (97-108)
- `get_supported_encoders` (853-861)
- `subset_datasets` (864-923)

scripts/subset_selection/encoders/arctic_encoder.py (1)
- `EncoderConfig` (47-55)
scripts/subset_selection/encoders/__init__.py (1)
scripts/subset_selection/encoders/arctic_encoder.py (1)
- `ArcticEmbedEncoder` (58-204)
scripts/subset_selection/encoders/arctic_encoder.py (1)
scripts/subset_selection/subset_selection.py (1)
- `EncoderConfig` (82-93)
scripts/subset_selection/subset_selection.py (3)
scripts/subset_selection/encoders/__init__.py (1)
- `get_encoder_class` (12-23)

scripts/subset_selection/utils/subset_selection_utils.py (3)
- `compute_pairwise_dense` (86-147)
- `get_default_num_gpus` (66-83)
- `retry_on_exception` (20-63)

scripts/subset_selection/encoders/arctic_encoder.py (1)
- `encode` (170-204)
scripts/subset_selection/cli.py (1)
scripts/subset_selection/subset_selection.py (1)
- `subset_datasets` (864-923)
🪛 markdownlint-cli2 (0.18.1)
scripts/subset_selection/README.md
64-64: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
168-168: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Pylint (4.0.1)
scripts/subset_selection/utils/subset_selection_utils.py
[refactor] 86-86: Too many positional arguments (7/5)
(R0917)
scripts/subset_selection/encoders/arctic_encoder.py
[refactor] 59-59: Too many positional arguments (6/5)
(R0917)
[refactor] 58-58: Too few public methods (1/2)
(R0903)
scripts/subset_selection/subset_selection.py
[refactor] 738-738: Too many branches (14/12)
(R0912)
🔇 Additional comments (2)
scripts/subset_selection/utils/__init__.py (1)
`1-16`: LGTM — clean re-exports.

scripts/subset_selection/__init__.py (1)
`1-29`: LGTM — clear public surface and re-exports.
Description
https://issues.redhat.com/browse/RHAIENG-1146
This PR contains subset selection scripts (`.py` files only).

How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
New Features
Documentation