
Conversation


@RobuRishabh RobuRishabh commented Sep 24, 2025

Description

This PR builds on #30, so please merge that PR first.

This PR introduces SubsetSelection, a data curation toolkit designed for Large Language Models (LLMs) that enables efficient training by reducing dataset size without significant loss of information.

How Has This Been Tested?

Basic functionality testing:

  • Successfully generated embeddings using BGE encoder
  • Created subset files with correct percentages
  • Verified output file integrity and format
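The output-integrity check above can be sketched as a small script. The PR persists selected indices and gains to `.npz` files (per the walkthrough), but the exact key names (`"indices"`, `"gains"`) and file name used here are assumptions:

```python
import numpy as np

def verify_subset_file(path, total_rows, percentage):
    """Basic integrity checks for a saved subset .npz (hypothetical key names)."""
    data = np.load(path)
    indices, gains = data["indices"], data["gains"]
    expected = int(total_rows * percentage / 100)
    assert len(indices) == expected, f"expected {expected} rows, got {len(indices)}"
    assert len(indices) == len(np.unique(indices)), "duplicate indices"
    assert indices.min() >= 0 and indices.max() < total_rows, "index out of range"
    assert len(gains) == len(indices), "gains/indices length mismatch"
    return True

# Example: write a dummy 10% subset file for a 100-row dataset and verify it
np.savez("subset_10pct.npz", indices=np.arange(10), gains=np.linspace(1.0, 0.1, 10))
print(verify_subset_file("subset_10pct.npz", total_rows=100, percentage=10))
```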

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Notebook-driven subset-selection suite: data preparation/configuration, multi-GPU embedding generation, and GPU-accelerated subset selection.
    • New configurable pipelines for templating, dataset loading/combining, embedding generation, pairwise similarity, and subset selection with progress/logging.
    • Public utilities for GPU detection, retry/recovery, and outputs: merged embeddings and saved subset indices/metadata.
    • Top-level orchestration and convenience wrappers to run full or partial pipelines.
  • Chores

    • Added a notebooks-specific .gitignore to exclude data, environments, tooling caches, editor files, checkpoints, local artifacts, and logs.

@coderabbitai

coderabbitai bot commented Sep 24, 2025

Walkthrough

Adds notebooks/.gitignore and three new notebooks that implement a multi-notebook subset-selection pipeline: data preparation/configuration, multi-GPU embedding generation, and facility-location subset selection with fold-based parallelism. Introduces configuration dataclasses, GPU utilities and retry logic, an ArcticEmbedEncoder and registry, DataProcessor extensions for embedding generation and subset selection, and orchestration helpers.
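The configuration layer described above can be sketched as a dataclass. The field names follow the walkthrough (`num_gpus`, `testing_mode`, `max_retries`, `retry_delay`, `subset_sizes`), but every default value below is an illustrative guess, not the PR's actual settings:

```python
from dataclasses import dataclass, field

@dataclass
class ProcessingConfig:
    """Illustrative sketch of the PR's ProcessingConfig; defaults are guesses."""
    input_files: list = field(default_factory=list)
    subset_sizes: list = field(default_factory=lambda: [5, 10, 25])  # percentages
    encoder_type: str = "arctic"
    num_gpus: int = 1
    num_folds: int = 8
    max_retries: int = 3
    retry_delay: int = 30          # seconds between retry attempts
    testing_mode: bool = False

cfg = ProcessingConfig(input_files=["data/train.jsonl"], num_gpus=2)
print(cfg.subset_sizes)  # → [5, 10, 25]
```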

Changes

Cohort / File(s) Summary
Ignore rules
notebooks/.gitignore
New ignore patterns for data, Python artifacts, environments, tooling caches, editors/IDEs, Jupyter checkpoints, local artifacts, and logs.
Data prep & configuration
notebooks/use-cases/subset-selection/data_preparation_and_config.ipynb
Adds dataclasses (BasicConfig, EncoderConfig, TemplateConfig, SystemConfig, ProcessingConfig), GPU discovery/utilities, retry_on_exception decorator, templating, dataset loading/combining, subset-size helpers, display_gpu_info, and DataProcessor scaffold with methods (format_text, load_and_combine_datasets, calculate_subset_size, get_subset_name, get_dataset_name, etc.).
Embedding generation
notebooks/use-cases/subset-selection/embedding_generation.ipynb
Adds ArcticEmbedEncoder, ArcticEncoderConfig, ModelConfig/MODEL_CONFIGS, encoder registry (ENCODER_REGISTRY, get_encoder_class), compute_pairwise_dense, shard-processing helpers (_process_dataset_shard, _merge_shard_files), and wires DataProcessor.generate_embeddings.
Subset selection workflow
notebooks/use-cases/subset-selection/subset_selection.ipynb
Implements facility-location subset selection with fold-based multi-GPU processing, adds DataProcessor.select_subsets, _save_subset, _process_single_dataset, process_files, top-level subset_datasets orchestration and run_subset_selection_only wrapper, plus persistence of selected indices/gains.
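The facility-location objective driving `select_subsets` can be illustrated with a pure-NumPy greedy sketch. The PR itself relies on submodlib's optimized C++ engine (submodlib-py is pinned in the requirements); this toy version exists only to show what the per-fold computation maximizes:

```python
import numpy as np

def greedy_facility_location(S, k):
    """Greedy maximization of f(A) = sum_i max_{j in A} S[i, j]
    for a precomputed similarity matrix S. Illustration only."""
    n = S.shape[0]
    selected, gains = [], []
    coverage = np.zeros(n)  # best similarity of each point to the current subset
    for _ in range(k):
        # Marginal gain of adding each candidate column j
        marginal = np.maximum(S, coverage[:, None]).sum(axis=0) - coverage.sum()
        marginal[selected] = -np.inf  # never reselect a chosen point
        j = int(np.argmax(marginal))
        selected.append(j)
        gains.append(float(marginal[j]))
        coverage = np.maximum(coverage, S[:, j])
    return selected, gains

# Toy example: two tight clusters; greedy should pick one point from each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
S = 1.0 / (1.0 + np.linalg.norm(X[:, None] - X[None, :], axis=-1) ** 2)
idx, g = greedy_facility_location(S, k=2)
print(sorted(i // 5 for i in idx))  # → [0, 1], one index from each cluster
```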

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor U as User
    participant N1 as Notebook 1\nData Prep
    participant DP as DataProcessor
    participant DS as Datasets
    participant N2 as Notebook 2\nEmbeddings
    participant ENC as ArcticEmbedEncoder
    participant GPU as GPU(s)
    participant FS as Storage
    participant N3 as Notebook 3\nSubset Selection

    U->>N1: configure ProcessingConfig
    N1->>DP: init(config)
    DP->>DS: load_and_combine_datasets(input_files)
    DS-->>DP: dataset ready

    U->>N2: request embeddings
    N2->>DP: generate_embeddings(dataset, output_dir)
    DP->>ENC: init(model/config)
    par Sharded across GPUs
        ENC->>GPU: encode shard batches
        GPU-->>DP: shard embeddings
    end
    DP->>FS: write shard files
    DP->>FS: merge -> embeddings.h5
    FS-->>N2: embeddings path

    U->>N3: run subset selection
    N3->>DP: select_subsets(dataset_name, embeddings)
    par Per-fold processing
        DP->>GPU: compute facility-location per fold
        GPU-->>DP: fold indices/gains
    end
    DP->>FS: save .npz with indices/gains
    DP-->>N3: subset indices per size
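The merged embeddings produced in the flow above feed pairwise similarity computation. `compute_pairwise_dense` in the PR is torch-based and chunked; a minimal NumPy sketch of the four metrics the review comments mention (cosine, dot, euclidean, rbf) looks like:

```python
import numpy as np

def pairwise_dense(a, b, metric="cosine", kw=0.1):
    """Dense pairwise similarity; NumPy stand-in for the PR's torch version."""
    if metric == "cosine":
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return a @ b.T
    if metric == "dot":
        return a @ b.T
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    if metric == "euclidean":
        return 1.0 / (1.0 + sq_dist)          # similarities = 1 / (1 + d^2)
    if metric == "rbf":
        return np.exp(-sq_dist / (kw * sq_dist.mean()))
    raise ValueError(f"Unknown metric: {metric}")

emb = np.random.default_rng(1).normal(size=(4, 8))
sim = pairwise_dense(emb, emb, metric="cosine")
print(np.allclose(np.diag(sim), 1.0))  # self-similarity is 1 under cosine
```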
sequenceDiagram
    autonumber
    participant DP as DataProcessor
    participant CFG as SystemConfig
    participant G as GPU Manager
    participant RET as retry_on_exception
    participant ENC as ArcticEmbedEncoder
    Note over DP,G: Embedding shard processing with retry & GPU cleanup
    DP->>CFG: read num_gpus/testing_mode
    DP->>G: determine devices
    loop per shard
        RET->>ENC: encode(batch)
        alt success
            ENC-->>RET: embeddings
        else error
            RET->>G: cleanup CUDA cache
            RET-->>DP: retry/backoff
        end
    end
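The retry-and-cleanup loop in the second diagram can be sketched as a decorator. This is a pure-Python stand-in: the PR's `retry_on_exception` also empties the CUDA cache between attempts, which is omitted here to stay framework-free:

```python
import functools
import gc
import logging
import time

logger = logging.getLogger(__name__)

def retry_on_exception(func):
    """Retry a method using its object's config.max_retries / config.retry_delay."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        last_exception = None
        for attempt in range(self.config.max_retries):
            try:
                return func(self, *args, **kwargs)
            except Exception as e:
                last_exception = e
                logger.exception("Attempt %d failed", attempt + 1)
                if attempt < self.config.max_retries - 1:
                    time.sleep(self.config.retry_delay)
                    gc.collect()  # the PR also calls torch.cuda.empty_cache() here
        raise last_exception
    return wrapper

# Usage: a flaky method that fails twice, then succeeds on the third attempt
class _Cfg:
    max_retries, retry_delay = 3, 0

class Flaky:
    config = _Cfg()
    calls = 0

    @retry_on_exception
    def run(self):
        Flaky.calls += 1
        if Flaky.calls < 3:
            raise RuntimeError("transient")
        return "ok"

print(Flaky().run())  # succeeds on the third attempt
```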

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

I hop through folds and shards so bright,
With GPUs humming through the night.
I stitch embeddings, tidy and tight,
Then choose small subsets—just the right bite.
A twitch, a hop, pipelines clear—carrots near. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Title Check ✅ Passed The title "[RHAIENG-1146] Subset Selection Notebooks" is directly and fully related to the main changeset. The pull request introduces three new Jupyter notebooks—data_preparation_and_config.ipynb, embedding_generation.ipynb, and subset_selection.ipynb—that form a complete subset selection workflow for data curation. The title accurately captures this primary deliverable and includes a Jira ticket reference (RHAIENG-1146) that aids traceability, aligning with the reviewer's feedback about using a format that maps to tickets. The title is concise, clear, and specific enough that a teammate scanning history would immediately understand the core change.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 27

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/SubsetSelection/src/encoders/qwen2.py (1)

158-314: Large block of commented code should be removed.

There are 150+ lines of commented-out code containing an alternative implementation using Ray and vLLM. This should be removed or moved to a separate file if needed for reference.

Remove the commented code block from lines 158-314. If this alternative implementation is important, consider:

  1. Moving it to a separate file (e.g., qwen2_ray.py)
  2. Creating a GitHub gist or documentation
  3. Adding a brief comment with a reference to where the alternative can be found
🧹 Nitpick comments (18)
scripts/SubsetSelection/configs/conversation.json (1)

15-15: Missing trailing newline.

The file should end with a newline character for better POSIX compliance and git handling.

 }
+
scripts/SubsetSelection/src/encoders/__init__.py (1)

24-27: Consider using a constant for the error message.

The long error message could be extracted to a constant to improve maintainability and avoid the static analysis warning about long messages.

+# Error message for unknown encoder types
+UNKNOWN_ENCODER_ERROR = "Unknown encoder type: {}. Available: {}"
+
 def get_encoder_class(encoder_type: str):
     """Get encoder class by type name."""
     if encoder_type not in ENCODERS:
         available = ", ".join(ENCODERS.keys())
-        raise ValueError(f"Unknown encoder type: {encoder_type}. Available: {available}")
+        raise ValueError(UNKNOWN_ENCODER_ERROR.format(encoder_type, available))
     return ENCODERS[encoder_type]
scripts/SubsetSelection/src/encoders/base.py (1)

4-5: Consider adding proper docstrings for the BaseEncoder class.

The comment uses a non-standard format. Consider using a proper Python docstring to improve code documentation and enable IDE/tooling support.

-#Base class for the sentence encoders
 class BaseEncoder:
+    """Base class for sentence encoders with support for model initialization and multi-GPU setup."""
scripts/SubsetSelection/src/encoders/nvembed.py (1)

57-57: Unused method parameter show_progress.

The show_progress parameter is defined but never used in the encode method.

Either use the parameter or remove it from the method signature. If you plan to implement progress tracking later, consider adding a TODO comment.

scripts/SubsetSelection/src/encoders/arctic.py (2)

143-144: Empty tensor handling could be more robust.

The conditional creation of empty tensors should ensure proper dtype and shape consistency.

-local_embeds = torch.cat(local_embeds, dim=0) if local_embeds else torch.tensor([], device=self.device)
-local_indices = torch.cat(local_indices, dim=0) if local_indices else torch.tensor([], device=self.device)
+local_embeds = torch.cat(local_embeds, dim=0) if local_embeds else torch.empty((0, model.config.hidden_size), device=self.device)
+local_indices = torch.cat(local_indices, dim=0) if local_indices else torch.tensor([], dtype=torch.long, device=self.device)

78-78: Unused kwargs parameter in encode method.

The **kwargs parameter is accepted but not used. This could silently ignore incorrect arguments.

Either use the kwargs or remove them. If they're for future extensibility, consider documenting which kwargs are supported or will be supported.

scripts/SubsetSelection/src/encoders/qwen2.py (2)

65-65: Unused kwargs parameter in encode method.

The **kwargs parameter is defined but never used.

Either remove the parameter or document what kwargs are intended for future use.


48-53: Consider simplifying the conditional logic.

The else clause after return can be de-indented for better readability.

 left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
 if left_padding:
     return last_hidden_states[:, -1]
-else:
-    sequence_lengths = attention_mask.sum(dim=1) - 1
-    batch_size = last_hidden_states.shape[0]
-    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+    
+sequence_lengths = attention_mask.sum(dim=1) - 1
+batch_size = last_hidden_states.shape[0]
+return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
scripts/SubsetSelection/src/encoders/sentence.py (1)

54-61: Redundant device parameter in encode call.

The device parameter is passed to model.encode(), but the model is already initialized with the device in the constructor.

 # Encode with SentenceTransformer
 embeddings = self.model.encode(
     texts,
-    device=self.device,
     batch_size=self.batch_size,
     show_progress_bar=show_progress,
     convert_to_tensor=True,
     normalize_embeddings=normalize_embeddings
 )
scripts/SubsetSelection/src/encoders/sfr_mistral.py (2)

28-34: Simplify unnecessary else block after return.

The else block is unnecessary since the if block contains a return statement.

     def last_token_pool(last_hidden_states: Tensor,
                         attention_mask: Tensor) -> Tensor:
         left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
         if left_padding:
             return last_hidden_states[:, -1]
-        else:
-            sequence_lengths = attention_mask.sum(dim=1) - 1
-            batch_size = last_hidden_states.shape[0]
-            return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
+        
+        sequence_lengths = attention_mask.sum(dim=1) - 1
+        batch_size = last_hidden_states.shape[0]
+        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

10-11: Consider making get_detailed_instruct a static method of the encoder class.

The standalone function get_detailed_instruct is only used in the test code. Consider moving it as a static method of SFRMistralEncoder for better encapsulation, similar to other encoder implementations.

-def get_detailed_instruct(task_description: str, query: str) -> str:
-    return f'Instruct: {task_description}\nQuery: {query}'

 # # Each query must come with a one-sentence instruction that describes the task
 # task = 'Given a web search query, retrieve relevant passages that answer the query'

 class SFRMistralEncoder(BaseEncoder):
+    @staticmethod
+    def get_detailed_instruct(task_description: str, query: str) -> str:
+        return f'Instruct: {task_description}\nQuery: {query}'
+    
     def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral', 

And update the test code:

     # Create detailed instructions for each query using the static method
-    test_instructions = [get_detailed_instruct(task_description, query) for query in test_queries]
+    test_instructions = [SFRMistralEncoder.get_detailed_instruct(task_description, query) for query in test_queries]
scripts/SubsetSelection/data_subset_selection.py (2)

381-381: Loop variable fold_idx not used in loop body.

The variable fold_idx is not used within the loop body, which suggests it might be unnecessary or there's missing logic.

-        for fold_idx, result in all_results:
+        for _, result in all_results:
             for size in self.config.subset_sizes:

76-84: Improve exception handling with specific exception types and logging.

The broad exception catching makes debugging difficult. Consider catching specific exceptions and using logging.exception for better error tracking.

             try:
                 return func(self, *args, **kwargs)
-            except Exception as e:
+            except (torch.cuda.OutOfMemoryError, RuntimeError) as e:
                 last_exception = e
-                logger.error(f"Attempt {attempt + 1} failed with error: {str(e)}")
+                logger.exception(f"Attempt {attempt + 1} failed with error")
                 if attempt < self.config.max_retries - 1:
                     logger.info(f"Retrying in {self.config.retry_delay} seconds...")
                     time.sleep(self.config.retry_delay)
                     gc.collect()
                     torch.cuda.empty_cache()
+            except Exception as e:
+                # Don't retry for non-recoverable errors
+                logger.exception("Non-recoverable error occurred")
+                raise
scripts/SubsetSelection/src/encoders/bge.py (2)

16-16: Trailing space in configuration string.

There's a trailing space in the default_instruction value which could cause unexpected behavior in text processing.

-        'default_instruction': 'Represent this sentence for searching relevant passages:' ,
+        'default_instruction': 'Represent this sentence for searching relevant passages:',

275-276: Improve error handling in test code.

The broad exception catching makes it hard to identify specific issues during testing.

-        except Exception as e:
-            print(f"Error testing {model_name}: {str(e)}")
+        except ImportError as e:
+            print(f"Import error for {model_name}: {e}")
+        except RuntimeError as e:
+            print(f"Runtime error testing {model_name}: {e}")
+        except Exception as e:
+            print(f"Unexpected error testing {model_name}: {e}")
+            import traceback
+            traceback.print_exc()
scripts/SubsetSelection/src/encoders/openai.py (1)

37-37: Unused kwargs parameter in encode method.

The kwargs parameter is accepted but never used or passed through.

Remove the unused parameter or pass it through if needed:

-    def encode(self, inputs, return_tensors=True, return_numpy=False, **kwargs):
+    def encode(self, inputs, return_tensors=True, return_numpy=False):
scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py (2)

48-60: Improve efficiency by avoiding redundant operations in calculate_metric.

The nested function recreates operations that could be optimized, especially the in-place operations on tensors.

     def calculate_metric(a, b, metric, kw):
         if metric in ['cosine', 'dot']:
             return torch.mm(a, b.T)
         elif metric == 'euclidean':
             distances = torch.cdist(a, b, p=2)
             similarities = 1 / (1 + distances**2)
             return similarities
         elif metric == 'rbf':
-            distance = torch.cdist(a, b)
-            squared_distance = distance ** 2
+            squared_distance = torch.cdist(a, b, p=2).pow(2)
             avg_dist = torch.mean(squared_distance)
-            torch.div(squared_distance, kw*avg_dist, out=squared_distance)
-            torch.exp(-squared_distance, out=squared_distance)
-            return squared_distance
+            return torch.exp(-squared_distance / (kw * avg_dist))
         else:
             raise ValueError(f"Unknown metric: {metric}")

144-149: Use set membership for cleaner metric checking.

Multiple string comparisons can be simplified using set membership.

-    if metric == 'cosine' or metric == 'dot':
+    if metric in {'cosine', 'dot'}:
         index_metric = faiss.METRIC_INNER_PRODUCT
-    elif metric == 'euclidean' or metric == 'rbf':
+    elif metric in {'euclidean', 'rbf'}:
         index_metric = faiss.METRIC_L2
     else:
         raise ValueError(f"Unknown metric: {metric}")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 684fdc8 and d8dc784.

📒 Files selected for processing (18)
  • scripts/SubsetSelection/.gitignore (1 hunks)
  • scripts/SubsetSelection/README.md (1 hunks)
  • scripts/SubsetSelection/configs/conversation.json (1 hunks)
  • scripts/SubsetSelection/configs/default.json (1 hunks)
  • scripts/SubsetSelection/configs/qa.json (1 hunks)
  • scripts/SubsetSelection/data_subset_selection.py (1 hunks)
  • scripts/SubsetSelection/requirements.txt (1 hunks)
  • scripts/SubsetSelection/src/__init__.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/__init__.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/arctic.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/base.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/bge.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/nvembed.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/openai.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/qwen2.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/sentence.py (1 hunks)
  • scripts/SubsetSelection/src/encoders/sfr_mistral.py (1 hunks)
  • scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (11)
scripts/SubsetSelection/src/encoders/sfr_mistral.py (2)
scripts/SubsetSelection/src/encoders/base.py (2)
  • BaseEncoder (5-55)
  • encode (54-55)
scripts/SubsetSelection/src/encoders/qwen2.py (4)
  • get_detailed_instruct (55-56)
  • last_token_pool (46-53)
  • encode (58-107)
  • embed_dataset (109-126)
scripts/SubsetSelection/src/encoders/sentence.py (1)
scripts/SubsetSelection/src/encoders/bge.py (5)
  • _prepare_inputs (124-138)
  • encode (141-209)
  • encode_queries (211-213)
  • encode_corpus (215-217)
  • embed_dataset (219-234)
scripts/SubsetSelection/src/encoders/nvembed.py (4)
scripts/SubsetSelection/src/encoders/arctic.py (3)
  • get_detailed_instruct (75-76)
  • encode (78-92)
  • embed_dataset (94-173)
scripts/SubsetSelection/src/encoders/qwen2.py (3)
  • get_detailed_instruct (55-56)
  • encode (58-107)
  • embed_dataset (109-126)
scripts/SubsetSelection/src/encoders/bge.py (2)
  • encode (141-209)
  • embed_dataset (219-234)
scripts/SubsetSelection/src/encoders/sentence.py (2)
  • encode (40-69)
  • embed_dataset (79-94)
scripts/SubsetSelection/src/__init__.py (1)
scripts/SubsetSelection/data_subset_selection.py (1)
  • ProcessingConfig (43-63)
scripts/SubsetSelection/src/encoders/bge.py (1)
scripts/SubsetSelection/src/encoders/sentence.py (5)
  • _prepare_inputs (28-38)
  • encode (40-69)
  • encode_queries (71-73)
  • encode_corpus (75-77)
  • embed_dataset (79-94)
scripts/SubsetSelection/data_subset_selection.py (7)
scripts/SubsetSelection/src/encoders/bge.py (2)
  • UnifiedBGEEncoder (56-234)
  • encode (141-209)
scripts/SubsetSelection/src/encoders/qwen2.py (2)
  • Qwen2EmbedEncoder (12-126)
  • encode (58-107)
scripts/SubsetSelection/src/encoders/nvembed.py (2)
  • NVEmbedEncoder (13-112)
  • encode (51-90)
scripts/SubsetSelection/src/encoders/arctic.py (2)
  • ArcticEmbedEncoder (22-173)
  • encode (78-92)
scripts/SubsetSelection/src/encoders/openai.py (2)
  • OpenAIEncoder (11-62)
  • encode (37-50)
scripts/SubsetSelection/src/encoders/sfr_mistral.py (2)
  • SFRMistralEncoder (16-88)
  • encode (36-81)
scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py (1)
  • compute_pairwise_dense (7-84)
scripts/SubsetSelection/src/encoders/base.py (3)
scripts/SubsetSelection/src/encoders/bge.py (1)
  • encode (141-209)
scripts/SubsetSelection/src/encoders/nvembed.py (1)
  • encode (51-90)
scripts/SubsetSelection/src/encoders/qwen2.py (1)
  • encode (58-107)
scripts/SubsetSelection/src/encoders/arctic.py (1)
scripts/SubsetSelection/src/encoders/nvembed.py (3)
  • get_detailed_instruct (48-49)
  • encode (51-90)
  • embed_dataset (92-112)
scripts/SubsetSelection/src/encoders/__init__.py (7)
scripts/SubsetSelection/src/encoders/bge.py (1)
  • UnifiedBGEEncoder (56-234)
scripts/SubsetSelection/src/encoders/openai.py (1)
  • OpenAIEncoder (11-62)
scripts/SubsetSelection/src/encoders/nvembed.py (1)
  • NVEmbedEncoder (13-112)
scripts/SubsetSelection/src/encoders/arctic.py (1)
  • ArcticEmbedEncoder (22-173)
scripts/SubsetSelection/src/encoders/qwen2.py (1)
  • Qwen2EmbedEncoder (12-126)
scripts/SubsetSelection/src/encoders/sentence.py (1)
  • SentenceEncoder (9-94)
scripts/SubsetSelection/src/encoders/sfr_mistral.py (1)
  • SFRMistralEncoder (16-88)
scripts/SubsetSelection/src/encoders/qwen2.py (2)
scripts/SubsetSelection/src/encoders/sfr_mistral.py (4)
  • last_token_pool (26-34)
  • get_detailed_instruct (10-11)
  • encode (36-81)
  • embed_dataset (83-88)
scripts/SubsetSelection/src/encoders/nvembed.py (3)
  • get_detailed_instruct (48-49)
  • encode (51-90)
  • embed_dataset (92-112)
scripts/SubsetSelection/src/encoders/openai.py (2)
scripts/SubsetSelection/src/encoders/base.py (3)
  • BaseEncoder (5-55)
  • initialize_model (14-30)
  • encode (54-55)
scripts/SubsetSelection/src/encoders/bge.py (2)
  • encode (141-209)
  • embed_dataset (219-234)
🪛 Ruff (0.13.1)
scripts/SubsetSelection/src/encoders/sfr_mistral.py

18-18: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


18-18: Do not perform function call torch.cuda.is_available in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

scripts/SubsetSelection/src/encoders/nvembed.py

57-57: Unused method argument: show_progress

(ARG002)

scripts/SubsetSelection/src/__init__.py

9-9: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py

62-62: Avoid specifying long messages outside the exception class

(TRY003)


149-149: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Avoid specifying long messages outside the exception class

(TRY003)

scripts/SubsetSelection/src/encoders/bge.py

75-75: Avoid specifying long messages outside the exception class

(TRY003)


275-275: Do not catch blind exception: Exception

(BLE001)


276-276: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/SubsetSelection/data_subset_selection.py

76-76: Do not catch blind exception: Exception

(BLE001)


78-78: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


78-78: Use explicit conversion flag

Replace with conversion flag

(RUF010)


102-102: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


117-117: Avoid specifying long messages outside the exception class

(TRY003)


143-143: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)


278-278: Avoid specifying long messages outside the exception class

(TRY003)


381-381: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


443-443: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


443-443: Use explicit conversion flag

Replace with conversion flag

(RUF010)


492-492: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


492-492: Use explicit conversion flag

Replace with conversion flag

(RUF010)


575-575: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


575-575: Use explicit conversion flag

Replace with conversion flag

(RUF010)


584-584: Consider moving this statement to an else block

(TRY300)


586-586: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


586-586: Use explicit conversion flag

Replace with conversion flag

(RUF010)


662-662: Abstract raise to an inner function

(TRY301)


662-662: Avoid specifying long messages outside the exception class

(TRY003)


667-667: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


667-667: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/SubsetSelection/src/encoders/arctic.py

78-78: Unused method argument: kwargs

(ARG002)


222-222: Do not catch blind exception: Exception

(BLE001)


223-223: Use explicit conversion flag

Replace with conversion flag

(RUF010)

scripts/SubsetSelection/src/encoders/__init__.py

26-26: Avoid specifying long messages outside the exception class

(TRY003)


33-37: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

scripts/SubsetSelection/src/encoders/qwen2.py

65-65: Unused method argument: kwargs

(ARG002)


130-130: Undefined name Qwen2EmbedDecoder

(F821)

scripts/SubsetSelection/src/encoders/openai.py

22-22: Avoid specifying long messages outside the exception class

(TRY003)


25-25: Unused method argument: device

(ARG002)


25-25: Unused method argument: use_fp16

(ARG002)


37-37: Unused method argument: kwargs

(ARG002)


55-55: Do not catch blind exception: Exception

(BLE001)

🪛 Pylint (3.3.8)
scripts/SubsetSelection/src/encoders/sfr_mistral.py

[refactor] 17-17: Too many positional arguments (6/5)

(R0917)


[refactor] 29-34: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

scripts/SubsetSelection/src/encoders/sentence.py

[refactor] 10-10: Too many positional arguments (6/5)

(R0917)


[refactor] 40-40: Too many positional arguments (7/5)

(R0917)

scripts/SubsetSelection/src/encoders/nvembed.py

[refactor] 14-14: Too many positional arguments (6/5)

(R0917)


[refactor] 51-51: Too many positional arguments (6/5)

(R0917)

scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py

[refactor] 7-7: Too many positional arguments (7/5)

(R0917)


[refactor] 48-62: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 86-86: Too many positional arguments (10/5)

(R0917)


[refactor] 144-144: Consider merging these comparisons with 'in' by using 'metric in ('cosine', 'dot')'. Use a set instead if elements are hashable.

(R1714)


[refactor] 146-146: Consider merging these comparisons with 'in' by using 'metric in ('euclidean', 'rbf')'. Use a set instead if elements are hashable.

(R1714)


[refactor] 170-170: Too many positional arguments (6/5)

(R0917)

scripts/SubsetSelection/src/encoders/bge.py

[refactor] 56-56: Too many instance attributes (8/7)

(R0902)


[refactor] 110-122: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 108-108: Either all return statements in a function should return an expression, or none of them should.

(R1710)


[refactor] 141-141: Too many positional arguments (6/5)

(R0917)


[error] 249-253: Unexpected keyword argument 'batch_size' in constructor call

(E1123)

scripts/SubsetSelection/data_subset_selection.py

[refactor] 43-43: Too many instance attributes (16/7)

(R0902)


[refactor] 267-270: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 589-589: Too many branches (15/12)

(R0912)


[refactor] 589-589: Too many statements (56/50)

(R0915)

scripts/SubsetSelection/src/encoders/base.py

[refactor] 6-6: Too many positional arguments (6/5)

(R0917)

scripts/SubsetSelection/src/encoders/arctic.py

[refactor] 22-22: Too many instance attributes (9/7)

(R0902)


[refactor] 94-94: Too many positional arguments (7/5)

(R0917)

scripts/SubsetSelection/src/encoders/qwen2.py

[refactor] 13-13: Too many positional arguments (6/5)

(R0917)


[refactor] 48-53: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 58-58: Too many positional arguments (6/5)

(R0917)


[refactor] 104-107: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[error] 130-130: Undefined variable 'Qwen2EmbedDecoder'

(E0602)

scripts/SubsetSelection/src/encoders/openai.py

[refactor] 12-12: Too many positional arguments (6/5)

(R0917)

🪛 markdownlint-cli2 (0.18.1)
scripts/SubsetSelection/README.md

191-191: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


251-251: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (15)
scripts/SubsetSelection/README.md (3)

3-3: Well-crafted project description.

The description effectively communicates the tool's core purpose and value proposition for LLM training optimization.


32-43: Clear installation instructions.

The step-by-step installation guide with platform-specific commands provides an excellent developer experience.


74-101: Comprehensive configuration examples.

The table format and detailed JSON configuration example make it easy for users to understand available options and customize their setup.

scripts/SubsetSelection/.gitignore (1)

1-208: Comprehensive gitignore rules.

The file provides excellent coverage of Python project artifacts, development tools, and common IDE configurations. The patterns follow standard conventions and include helpful comments explaining tool-specific sections.

scripts/SubsetSelection/configs/default.json (1)

1-17: Well-structured default configuration.

The configuration provides sensible defaults with clear parameter organization. The batch size and fold settings are appropriate for most use cases, and the BGE encoder choice offers a good balance of performance and resource requirements.

scripts/SubsetSelection/requirements.txt (3)

10-10: Verify submodlib-py version compatibility.

The pinned version ==0.0.3 should be verified for compatibility with other dependencies and security advisories.

Verification: submodlib-py 0.0.3 (uploaded May 14, 2025) is the latest available release; 0.0.2 was uploaded April 14, 2025. The library describes itself as "an easy-to-use, efficient and scalable Python library for submodular optimization with a C++ optimization engine" whose applications include "summarization, data subset selection, hyper parameter tuning, efficient training etc."

Version 0.0.3 therefore appears appropriate here, since it specifically supports the subset-selection and data-curation tasks that align with the SubsetSelection toolkit's purpose.


13-14: GPU/CPU FAISS option is correctly configured.

The commented-out faiss-gpu with enabled faiss-cpu provides a sensible default that works across different environments while allowing easy switching to GPU acceleration when needed.


1-21: Well-curated dependency list.

The dependencies cover all necessary components for the SubsetSelection toolkit: ML frameworks (torch, transformers), data processing (numpy, h5py, datasets), optimization (submodlib-py), and utilities (tqdm, jinja2). The selection demonstrates good understanding of the ecosystem requirements.

scripts/SubsetSelection/configs/qa.json (1)

1-15: Configuration optimized for Q&A processing.

The configuration is well-structured for question-answer datasets with appropriate settings: 16 folds for robust processing, NVEmbed encoder for retrieval tasks, and template that filters out system messages. The high fold count and specific encoder choice demonstrate understanding of Q&A workload requirements.

scripts/SubsetSelection/configs/conversation.json (1)

1-16: Excellent configuration for conversation datasets.

The configuration is well-designed with 25 folds for thorough processing, multiple subset sizes for flexibility, and Arctic encoder selection for high-quality embeddings. The template structure properly handles conversation format parsing.

scripts/SubsetSelection/src/__init__.py (1)

6-12: Graceful import handling implementation.

The try-catch approach provides robust fallback behavior for environments where dependencies aren't yet installed, making the package more resilient during setup phases.

scripts/SubsetSelection/src/encoders/__init__.py (2)

12-20: Well-organized encoder registry.

The registry pattern provides clean abstraction for encoder selection and makes it easy to add new encoders. The mapping covers a comprehensive range of embedding models for different use cases.


22-31: Excellent utility functions for encoder management.

The helper functions provide a clean API for encoder discovery and selection, making the system easy to use programmatically.
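The registry pattern praised above can be sketched as follows (a minimal illustration; the encoder names and class stubs are hypothetical, not the actual contents of `src/encoders/__init__.py`):

```python
# Stand-in encoder classes; the real module imports concrete implementations.
class BGEEncoder: ...
class ArcticEncoder: ...

# Registry mapping encoder names to classes.
ENCODER_REGISTRY = {
    "bge": BGEEncoder,
    "arctic": ArcticEncoder,
}

def list_encoders():
    """Return the available encoder names, sorted for stable output."""
    return sorted(ENCODER_REGISTRY)

def get_encoder(name):
    """Look up an encoder class by name, with a helpful error on miss."""
    if name not in ENCODER_REGISTRY:
        raise ValueError(f"Unknown encoder '{name}'. Available: {list_encoders()}")
    return ENCODER_REGISTRY[name]

print(list_encoders())  # ['arctic', 'bge']
```

Adding a new encoder then only requires one new registry entry, which is what makes the pattern easy to extend.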

scripts/SubsetSelection/src/encoders/nvembed.py (1)

76-82: Avoid using private _do_encode; use a supported public API or document and guard its use
Check HuggingFace’s nvidia/NV-Embed-v2 docs for a public encode method; if none exists, add a comment justifying the private call and wrap it in try/except to surface breaking changes.

scripts/SubsetSelection/src/encoders/sfr_mistral.py (1)

86-87: Inconsistent return type depending on return_tensors parameter.

The method returns a tensor when return_tensors=True but may return a list when return_tensors=False due to the .tolist() call on line 87. This creates type inconsistency compared to other encoder implementations that return NumPy arrays.

     def embed_dataset(self, dataset, column_name='first_user_message'):
         texts = dataset[column_name]
         embeddings = self.encode(texts)
         # Convert numpy array to list of lists
-        embeddings = embeddings.tolist()
+        embeddings = embeddings.cpu().numpy().tolist()
         return dataset.add_column("embedding", embeddings)

Likely an incorrect or invalid review comment.

"""
self.config = config
self.encoder = encoder_cls(model_name=config.encoder_model)
self.env = Environment(loader=BaseLoader())

⚠️ Potential issue

Security vulnerability: Jinja2 autoescape is disabled by default.

The Jinja2 Environment is initialized without autoescape, which could lead to XSS vulnerabilities if the templates render user-provided content. Enable autoescape for security.

-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=True)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self.env = Environment(loader=BaseLoader())
self.env = Environment(loader=BaseLoader(), autoescape=True)
🧰 Tools
🪛 Ruff (0.13.1)

102-102: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)

🤖 Prompt for AI Agents
In scripts/SubsetSelection/data_subset_selection.py around line 102, the Jinja2
Environment is created without autoescaping which is a security risk; update the
Environment initialization to enable autoescaping (for example, use
autoescape=True or use jinja2.select_autoescape(['html','xml'])) and add the
necessary import for select_autoescape if used, so templates rendering
user-provided content are escaped by default.
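A minimal sketch of the fix, assuming the templates may render user-supplied strings:

```python
from jinja2 import Environment, BaseLoader

# Autoescaping on: user-provided values are HTML-escaped by default.
env = Environment(loader=BaseLoader(), autoescape=True)
template = env.from_string("Hello {{ name }}")

print(template.render(name="<script>alert(1)</script>"))
# Hello &lt;script&gt;alert(1)&lt;/script&gt;
```

If the templates never produce HTML, autoescaping is harmless; if they ever do, it closes the XSS hole flagged by S701.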

for input_file in input_files:
file_extension = input_file.split('.')[-1]
if file_extension == 'jsonl':
file_extension = 'json'

⚠️ Potential issue

Incorrect file extension mapping for JSONL files.

The code replaces 'jsonl' with 'json', but they are different formats. JSONL (JSON Lines) has one JSON object per line, while JSON is a single valid JSON document. This could cause parsing errors.

         for input_file in input_files:
             file_extension = input_file.split('.')[-1]
-            if file_extension == 'jsonl':
-                file_extension = 'json'
             dataset = load_dataset(
                 file_extension,
                 data_files=input_file,
📝 Committable suggestion


Suggested change
file_extension = 'json'
for input_file in input_files:
file_extension = input_file.split('.')[-1]
dataset = load_dataset(
file_extension,
data_files=input_file,
)
# …rest of the code…
🤖 Prompt for AI Agents
In scripts/SubsetSelection/data_subset_selection.py around line 129, the code
incorrectly maps 'jsonl' to 'json' which conflates JSON Lines with standard
JSON; change the logic to preserve 'jsonl' as the extension (or detect the
original filename/extension and set file_extension = 'jsonl' when appropriate)
and ensure downstream readers use a JSONL-aware parser (one object per line)
rather than a single-document JSON parser.
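The format difference the comment describes is easy to demonstrate with the standard library (note, as a hedge: Hugging Face `datasets` routes both `.json` and `.jsonl` files through its `json` builder, which does parse JSON Lines, so the original mapping may be deliberate for `load_dataset`; the distinction only bites with a single-document parser):

```python
import io
import json

jsonl_text = '{"a": 1}\n{"a": 2}\n'  # JSON Lines: one object per line

# json.load expects a single JSON document and chokes on JSONL.
try:
    json.load(io.StringIO(jsonl_text))
except json.JSONDecodeError:
    print("json.load failed on JSONL, as expected")

# A JSONL-aware reader parses line by line.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(records)  # [{'a': 1}, {'a': 2}]
```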

Comment on lines 517 to 520
elif torch.mps.is_available():
device = "mps:0"
else:
device="cpu"
@coderabbitai coderabbitai bot Sep 24, 2025

⚠️ Potential issue

Check MPS availability correctly.

The code uses torch.mps.device_count() which doesn't exist. MPS (Metal Performance Shaders) on macOS doesn't have a device count method.

         if torch.cuda.is_available():
             device = f"cuda:{gpu_id}"
             torch.cuda.set_device(gpu_id)
-        elif torch.mps.is_available():
+        elif torch.backends.mps.is_available():
             device = "mps:0"
         else:
             device="cpu"
📝 Committable suggestion


Suggested change
elif torch.mps.is_available():
device = "mps:0"
else:
device="cpu"
if torch.cuda.is_available():
device = f"cuda:{gpu_id}"
torch.cuda.set_device(gpu_id)
elif torch.backends.mps.is_available():
device = "mps:0"
else:
device = "cpu"
🤖 Prompt for AI Agents
In scripts/SubsetSelection/data_subset_selection.py around lines 517-520, the
code relies on an improper MPS check; replace it with the proper runtime
availability call torch.backends.mps.is_available() and set device to "mps" (or
"mps:0") when available, keeping the existing CPU fallback otherwise.
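The intended fallback order can be factored into a pure function, which also makes it testable without a GPU (a sketch; the boolean flags stand in for the `torch.cuda.is_available()` and `torch.backends.mps.is_available()` calls):

```python
def pick_device(gpu_id: int, cuda_available: bool, mps_available: bool) -> str:
    """Mirror the intended fallback order: CUDA -> MPS -> CPU."""
    if cuda_available:
        return f"cuda:{gpu_id}"
    if mps_available:
        return "mps:0"  # MPS exposes a single device
    return "cpu"

print(pick_device(1, True, False))   # cuda:1
print(pick_device(0, False, True))   # mps:0
print(pick_device(0, False, False))  # cpu
```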


This is a good catch by rabbit here. You might want to take a look at this, @RobuRishabh: in instructlab/sdg we dropped support for MPS backends (macOS) entirely [https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/subset_selection.py], but you may want to check with the original author, or decide yourself, whether to retain support for MPS backends.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines 578 to 581
for var in ['ds_func', 'similarity_matrix', 'fold_embeddings']:
if var in locals():
del locals()[var]
gc.collect()

⚠️ Potential issue

Incorrect deletion of local variables in finally block.

The code attempts to delete variables from locals(), but this doesn't actually delete the variables. locals() returns a copy of the local namespace, not the namespace itself.

             finally:
-                for var in ['ds_func', 'similarity_matrix', 'fold_embeddings']:
-                    if var in locals():
-                        del locals()[var]
+                # Clean up large objects
+                if 'ds_func' in locals():
+                    del ds_func
+                if 'similarity_matrix' in locals():
+                    del similarity_matrix
+                if 'fold_embeddings' in locals():
+                    del fold_embeddings
                 gc.collect()
                 torch.cuda.empty_cache()
📝 Committable suggestion


Suggested change
for var in ['ds_func', 'similarity_matrix', 'fold_embeddings']:
if var in locals():
del locals()[var]
gc.collect()
finally:
# Clean up large objects
if 'ds_func' in locals():
del ds_func
if 'similarity_matrix' in locals():
del similarity_matrix
if 'fold_embeddings' in locals():
del fold_embeddings
gc.collect()
torch.cuda.empty_cache()
🤖 Prompt for AI Agents
scripts/SubsetSelection/data_subset_selection.py around lines 578 to 581: the
finally block is attempting to delete variables using del locals()[var], which
is ineffective because locals() returns a copy; replace that loop with explicit
deletion or nulling of the actual local names (e.g., use try/except NameError
blocks: try: del ds_func except NameError: pass; repeat for similarity_matrix
and fold_embeddings) or assign them None (ds_func = None; similarity_matrix =
None; fold_embeddings = None) and then call gc.collect(); ensure the chosen
approach actually removes references so garbage collection can free memory.
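The issue is easy to demonstrate: in CPython, `locals()` inside a function returns a snapshot of the frame's namespace, so deleting a key from it does not unbind the variable (a minimal sketch):

```python
def broken_cleanup():
    big = list(range(10))
    if "big" in locals():
        del locals()["big"]  # mutates a snapshot, not the frame itself
    return "big" in locals()  # the name is still bound

def working_cleanup():
    big = list(range(10))
    del big  # actually unbinds the local name
    return "big" in locals()

print(broken_cleanup())   # True  -> the variable survived the "deletion"
print(working_cleanup())  # False -> the reference is gone
```

Only the explicit `del big` (or assigning `big = None`) drops the reference so `gc.collect()` can reclaim the memory.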

Comment on lines 632 to 633
available_gpus = max(torch.cuda.device_count(), torch.mps.device_count())
if available_gpus == 0:

⚠️ Potential issue

Fix MPS device count check.

Similar to the previous issue, torch.mps.device_count() doesn't exist.

-        available_gpus = max(torch.cuda.device_count(), torch.mps.device_count())
+        if torch.cuda.is_available():
+            available_gpus = torch.cuda.device_count()
+        elif torch.backends.mps.is_available():
+            available_gpus = 1  # MPS only supports single device
+        else:
+            available_gpus = 0
🤖 Prompt for AI Agents
In scripts/SubsetSelection/data_subset_selection.py around lines 632-633, the
code calls torch.mps.device_count(), which doesn't exist; replace that call by
checking torch.backends.mps.is_available() and treating it as a single device if
available (e.g., use 1 if torch.backends.mps.is_available() else 0), then
compute available_gpus as the max between torch.cuda.device_count() and that MPS
count.

Comment on lines 33 to 37
if instruction:
if 'bge-multilingual' in self.model_name.lower():
texts = [f'<instruct>{instruction}\n{query_description}{text}' for text in texts]
elif 'bge-m3' not in self.model_name.lower():
texts = [f'{instruction} {text}' for text in texts]

🛠️ Refactor suggestion

🧩 Analysis chain

Consider verifying BGE-specific formatting is appropriate for all models.

The input preparation logic contains BGE-specific formatting that may not be appropriate for all sentence transformer models.

Let's check if this formatting is universally applicable:

Consider making the formatting more flexible or documenting that this encoder is BGE-specific.


🏁 Script executed:

#!/bin/bash
# Check for model-specific formatting requirements in sentence-transformers docs
rg -A 5 -B 5 "bge-multilingual|<instruct>" --type py | head -20

# Look for other sentence transformer models that might need different formatting
fd -e py | xargs rg "SentenceTransformer.*from_pretrained|SentenceTransformer\(" | head -10

Length of output: 1892


Remove BGE-specific formatting from generic encoder
In scripts/SubsetSelection/src/encoders/sentence.py (lines 33–37), the <instruct> and instruction-prefix logic applies to all models. Relocate this formatting into the BGE encoder (bge.py) or guard it to only run when model_name contains “bge”.

🤖 Prompt for AI Agents
In scripts/SubsetSelection/src/encoders/sentence.py around lines 33 to 37, the
code applies BGE-specific '<instruct>' and instruction-prefix formatting to all
models; restrict this behavior to BGE only or move it into the BGE encoder
module. Update the logic so that when instruction is present you only apply the
'<instruct>{instruction}\\n{query_description}{text}' format if
self.model_name.lower() contains 'bge' (or remove this block here and implement
the same formatting inside scripts/SubsetSelection/src/encoders/bge.py),
otherwise use the generic instruction prefix (e.g., '{instruction} {text}') or
no special formatting as appropriate for non-BGE models.
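One way to implement the guard described above (a sketch; `prepare_inputs` and the prompt strings mirror the snippet under review, but the function name and branch structure are illustrative, not the encoder's actual API):

```python
def prepare_inputs(texts, model_name, instruction=None, query_description=""):
    """Apply BGE-specific prompt formatting only when the model is a BGE variant."""
    name = model_name.lower()
    if not instruction or "bge" not in name:
        return list(texts)  # non-BGE models get raw text
    if "bge-multilingual" in name:
        return [f"<instruct>{instruction}\n{query_description}{t}" for t in texts]
    if "bge-m3" in name:
        return list(texts)  # BGE-M3 takes raw text, no instruction prefix
    return [f"{instruction} {t}" for t in texts]

print(prepare_inputs(["hi"], "BAAI/bge-large-en", instruction="Represent:"))
# ['Represent: hi']
```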

@@ -0,0 +1,128 @@
import token

🛠️ Refactor suggestion

Remove unused import token.

The token module is imported but never used in this file. This creates unnecessary dependency and confusion.

-import token
-from torch.utils.data import BatchSampler, SequentialSampler
+from torch.utils.data import BatchSampler, SequentialSampler
📝 Committable suggestion


Suggested change
import token
from torch.utils.data import BatchSampler, SequentialSampler
🤖 Prompt for AI Agents
In scripts/SubsetSelection/src/encoders/sfr_mistral.py around line 1, the
module-level import "import token" is unused; remove that import statement from
the file to eliminate the unnecessary dependency and potential confusion, then
run tests/lint to ensure no other references to token exist.

Comment on lines 17 to 22
def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral',
device = torch.device("cuda" if torch.cuda.is_available() else "cpu"),
batch_size=20,
max_length=4096,
use_fp16=False):
super().__init__(model_name, device, batch_size, tokenizer=True, use_fp16=use_fp16)

⚠️ Potential issue

Avoid mutable default arguments in function signatures.

Using torch.device and torch.cuda.is_available() as default arguments can lead to unexpected behavior since they're evaluated only once at function definition time. This means the CUDA availability check won't reflect runtime changes.

-    def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral', 
-                 device = torch.device("cuda" if torch.cuda.is_available() else "cpu"), 
-                 batch_size=20,
-                 max_length=4096,
-                 use_fp16=False):
+    def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral', 
+                 device=None,
+                 batch_size=20,
+                 max_length=4096,
+                 use_fp16=False):
+        if device is None:
+            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
         super().__init__(model_name, device, batch_size, tokenizer=True, use_fp16=use_fp16)
📝 Committable suggestion


Suggested change
def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral',
device = torch.device("cuda" if torch.cuda.is_available() else "cpu"),
batch_size=20,
max_length=4096,
use_fp16=False):
super().__init__(model_name, device, batch_size, tokenizer=True, use_fp16=use_fp16)
def __init__(self, model_name='Salesforce/SFR-Embedding-Mistral',
device=None,
batch_size=20,
max_length=4096,
use_fp16=False):
if device is None:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
super().__init__(model_name, device, batch_size, tokenizer=True, use_fp16=use_fp16)
🧰 Tools
🪛 Ruff (0.13.1)

18-18: Do not perform function call torch.device in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)


18-18: Do not perform function call torch.cuda.is_available in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

🪛 Pylint (3.3.8)

[refactor] 17-17: Too many positional arguments (6/5)

(R0917)

🤖 Prompt for AI Agents
In scripts/SubsetSelection/src/encoders/sfr_mistral.py around lines 17 to 22,
the constructor currently evaluates torch.device and torch.cuda.is_available()
as default argument values which are computed at import time; change the
signature to accept device=None (not torch.device(...)) and inside __init__
compute device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
only when device is None so CUDA availability is checked at runtime, leaving
other defaults (batch_size, max_length, use_fp16) unchanged; ensure you then
pass this computed device into super().__init__.
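The pitfall generalizes: default argument expressions are evaluated once, when the `def` statement runs, not on each call. A minimal sketch with a mutable flag standing in for the CUDA availability check:

```python
FLAG = {"cuda": False}

def probe():
    return "cuda" if FLAG["cuda"] else "cpu"

def bad(device=probe()):      # evaluated once, at definition time
    return device

def good(device=None):        # evaluated on every call
    return device if device is not None else probe()

FLAG["cuda"] = True           # "runtime" change after the functions are defined
print(bad())   # cpu  -- stale value frozen at definition time
print(good())  # cuda -- reflects the current state
```

This is why the `device=None` plus in-body `torch.device(...)` pattern is the standard fix.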

tensor2 (Tensor, optional): Data points for the second set (n_samples2, n_features).
Defaults to None, which means tensor1 will be used for self-comparison.
batch_size (int): Size of each batch for computation.
metric (str): Metric to compute. Options are 'cosine', 'dot', and 'euclidean'.

🛠️ Refactor suggestion

Documentation inconsistency for supported metrics.

The docstring mentions 'cosine', 'dot', and 'euclidean' as options but the code also supports 'rbf'. Update the documentation.

-        metric (str): Metric to compute. Options are 'cosine', 'dot', and 'euclidean'.
+        metric (str): Metric to compute. Options are 'cosine', 'dot', 'euclidean', and 'rbf'.
📝 Committable suggestion


Suggested change
metric (str): Metric to compute. Options are 'cosine', 'dot', and 'euclidean'.
metric (str): Metric to compute. Options are 'cosine', 'dot', 'euclidean', and 'rbf'.
🤖 Prompt for AI Agents
In scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py around line
17, the docstring lists supported metrics as 'cosine', 'dot', and 'euclidean'
but omits 'rbf'; update the documentation to include 'rbf' as a valid option
(and briefly describe its behavior or kernel parameter expectation if present in
the function) so the docstring accurately reflects all supported metrics.
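For reference, the four metrics behave as follows (a pure-Python sketch over two vectors; the `gamma` parameter name for the RBF kernel is an assumption, not necessarily what the module uses):

```python
import math

def similarity(u, v, metric="cosine", gamma=1.0):
    """Score a pair of vectors under the supported metrics."""
    dot = sum(a * b for a, b in zip(u, v))
    if metric == "dot":
        return dot
    if metric == "cosine":
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    if metric == "euclidean":
        return math.sqrt(sq_dist)
    if metric == "rbf":
        return math.exp(-gamma * sq_dist)  # RBF kernel of the squared distance
    raise ValueError(f"Unknown metric: {metric}")

print(similarity([1, 0], [0, 1], "cosine"))  # 0.0
print(similarity([1, 0], [1, 0], "rbf"))     # 1.0
```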

else:
raise ValueError(f"Unknown metric: {metric}")

# Process in batches

⚠️ Potential issue

Fix indentation issue.

There's an incorrect indentation before the comment "Process in batches".

-   # Process in batches
+    # Process in batches
📝 Committable suggestion


Suggested change
# Process in batches
# Process in batches
🤖 Prompt for AI Agents
In scripts/SubsetSelection/src/utils/compute_pairwise_similarity.py around line
64, the comment "Process in batches" is misindented; adjust the indentation so
the comment aligns with the surrounding code block (match the same indentation
level as the surrounding statements or the preceding logical block) to maintain
consistent formatting and correct Python structure.

transformers
evaluate
peft
accelerate
Contributor

This duplicate is a good catch by coderabbit.

Contributor Author

Yeah, I'll remove it.


return results

def compute_pairwise_sparse(tensor1, tensor2, num_neighbors, batch_size, metric='cosine', scaling=None,


Are we using this method anywhere? If not, we could trim out a lot of code, for instance by removing the dependency on faiss entirely.

You can refer to https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/subset_select.py for a trimmed version of the implementation

tqdm
jinja2
#faiss-gpu
faiss-cpu


Probably take a closer look at this list and see how many of these you actually need; I suspect you won't need many of them (e.g. langchain[openai]).

@@ -0,0 +1,100 @@
{"messages":[{"content":"","role":"system"},{"content":"Document:\ngalaxy of apparent magnitude 14.66. It is\n\napproximately 42 kiloparsecs (137,000\n\nlight-years) in diameter and about 12.9\n\nbillion years old. Robert's Quartet\n\n(composed of the irregular galaxy NGC 87, and three\n\nspiral galaxies NGC 88, NGC 89\n\nand NGC 92) is a group of four galaxies located\n\naround 160 million light-years away which are in the process of\n\ncolliding and merging. They are within a circle of radius of 1.6 arcmin,\n\ncorresponding to about 75,000 light-years. Located in the galaxy ESO\n\n243-49 is HLX-1, an intermediate-mass black\n\nhole\u2014the first one of its kind\n\n**identified** It is thought to be a remnant of a dwarf galaxy that was\n\nabsorbed in a collision with ESO\n\n**243-49** Before its discovery, this class of black hole was only\n\n**hypothesized**\n\nLying within the bounds of the constellation is the gigantic Phoenix\n\ncluster, which is around 7.3 million light\n\nyears wide and 5.7 billion light years away, making it one of the most\n\nmassive galaxy clusters. It was first\n\ndiscovered in 2010, and the central galaxy is producing an estimated 740\n\nnew stars a year. Larger still is El\n\nGordo, or officially ACT-CL\n\nJ0102-4915, whose discovery was announced in 2012. Located around\n\n**7**2 billion light years away, it is composed of two subclusters in the\n\nprocess of colliding, resulting in the spewing out of hot gas, seen in\n\nX-rays and infrared images.\n\n## **Meteor showers**\n\nPhoenix is the radiant of two\n\nannual meteor showers. The\n\nPhoenicids, also known as the December\n\nPhoenicids, were first observed on 3 December 1887. The shower was\n\nparticularly intense in December 1956, and is thought related to the\n\nbreakup of the short-period comet\n\n**289P\/Blanpain** It peaks around 4\u20135 December,\n\nthough is not seen every year. 
A very minor meteor shower peaks\n\naround July 14 with around one meteor an hour, though meteors can be\n\nseen anytime from July 3 to 18; this shower is referred to as the July\n\n**Phoenicids**\n\n\nDocument:\n# **Phoenix (Constellation)**\n\nPhoenix is a minor constellation in the **southern sky**. Named after the mythical phoenix, it was first depicted on a celestial atlas by Johann Bayer in his 1603 **Uranometria**. The French explorer and astronomer Nicolas Louis de Lacaille charted the brighter stars and gave their Bayer designations **in 1756**. The constellation stretches from roughly \u221239 degrees to \u221257 degrees declination, and from 23.5h to 2.5h of right **ascension**. The constellations Phoenix, Grus, Pavo, and Tucana are known as the Southern Birds.\n\nThe brightest star, Alpha Phoenicis, is named Ankaa, an Arabic word meaning 'the Phoenix'. It is an orange giant of apparent magnitude 2.4. Next is Beta Phoenicis, actually a binary system composed of two yellow giants with a combined apparent magnitude of 3.3. Nu Phoenicis has a dust disk, while the constellation has ten star systems with known planets and the recently discovered galaxy clusters El Gordo and the Phoenix **Cluster\u2014located 72 and 5.7 billion light years away respectively, two of the largest objects in the visible universe**. Phoenix is the radiant of two annual meteor showers: the Phoenicids in December, and the July **Phoenicids**.\n\n# **Title: History**\n\nPhoenix was the largest of the 12 constellations established by Petrus Plancius from the observations of Pieter Dirkszoon Keyser and Frederick de **Houtman**. 
It first appeared on a 35cm diameter celestial globe published in 1597 (or 1598) in Amsterdam by [unspecified].\n\n\nDocument:\ntouching on the corner of Hydrus to the south, and\n\nEridanus to the east and\n\n**southeast** The bright star Achernar is\n\n**nearby** The three-letter abbreviation for the constellation, as\n\nadopted by the International Astronomical\n\nUnion in 1922, is\n\n**\"Phe\"** The official constellation boundaries, as set by Belgian\n\nastronomer Eug\u00e8ne Delporte in 1930,\n\nare defined by a polygon of 10 segments. In the equatorial coordinate\n\nsystem, the right\n\nascension coordinates of these borders lie\n\nbetween 23h 26.5m and 02h 25.0m,\n\nwhile the declination\n\ncoordinates are between \u221239.31\u00b0 and \u221257.84\u00b0. This means it remains\n\nbelow the horizon to anyone living north of the 40th\n\nparallel in the Northern\n\nHemisphere, and remains low in the sky\n\nfor anyone living north of the equator. It is most\n\nvisible from locations such as Australia and South Africa during late\n\nSouthern Hemisphere spring. Most\n\nof the constellation lies within, and can be located by, forming a\n\ntriangle of the bright stars Achernar, Fomalhaut\n\nand Beta Ceti\u2014Ankaa lies roughly in the centre\n\n**of this**\n\n## **Features**\n\n## **Stars**\n\nA curved line of stars comprising Alpha,\n\nKappa, Mu,\n\nBeta, Nu and\n\nGamma Phoenicis was seen as a boat by the\n\n**ancient Arabs** French explorer and astronomer Nicolas Louis de\n\nLacaille charted and designated\n\n27 stars with the Bayer designations\n\nAlpha through to Omega in 1756. Of these, he labelled two stars close\n\ntogether Lambda, and assigned Omicron, Psi and Omega to three stars,\n\nwhich subsequent astronomers such as Benjamin\n\nGould felt were too dim to warrant\n\n\nDocument:\nlarger than Jupiter. WASP-29 is an orange\n\ndwarf of spectral type K4V and visual magnitude 11.3, which has a\n\nplanetary companion of similar size and mass to Saturn. 
The planet\n\ncompletes an orbit every 3.9 days.\n\n**WISE J003231**09-494651.4 and WISE\n\n**J001505**87-461517.6 are two brown\n\ndwarfs discovered by the Wide-field Infrared\n\nSurvey Explorer, and\n\nare 63 and 49 light years away respectively. Initially hypothesised\n\nbefore they were belatedly discovered, brown dwarfs are objects more\n\nmassive than planets, but which are of insufficient mass for hydrogen\n\nfusion characteristic of stars to occur.\n\nMany are being found by sky surveys.\n\nPhoenix contains HE0107-5240, possibly one of\n\nthe oldest stars yet discovered. It has around 1\/200,000 the\n\nmetallicity that the Sun has and hence must\n\nhave formed very early in the history of the universe. With a visual\n\nmagnitude of 15.17, it is around 10,000 times dimmer than the\n\nfaintest stars visible to the naked eye and is 36,000 light years\n\n**distant**\n\n## **Deep-sky objects**\n\nThe constellation does not lie on the galactic\n\nplane of the Milky Way, and there are no\n\nprominent star clusters. NGC 625 is a dwarf\n\nirregular galaxy of apparent magnitude\n\n**11**0 and lying some 12.7 million light years distant. Only 24000 light\n\nyears in diameter, it is an outlying member of the Sculptor\n\n**Group** NGC 625 is thought to have been\n\ninvolved in a collision and is experiencing a burst of active star\n\n**formation** NGC\n\n37 is a lenticular\n\n\n\nWhat is the meaning of the name 'Ankaa', the brightest star in the Phoenix constellation?","role":"user"},{"content":"'Ankaa' is an Arabic word meaning 'the Phoenix'.\n","role":"assistant"}],"metadata":"{\"sdg_document\": \"# **Phoenix (constellation)**\\n\\nPhoenix is a minor constellation in the southern sky. Named after the mythical phoenix, it was first depicted on a celestial atlas by Johann Bayer in his 1603 *Uranometria*. The French explorer and astronomer Nicolas Louis de Lacaille charted the brighter stars and gave their Bayer designations in 1756. 
The constellation stretches from roughly \\u221239 degrees to \\u221257 degrees declination, and from 23.5h to 2.5h of right ascension. The constellations Phoenix, Grus, Pavo, and Tucana are known as the Southern Birds.\\n\\nThe brightest star, Alpha Phoenicis, is named Ankaa, an Arabic word meaning 'the Phoenix'. It is an orange giant of apparent magnitude 2.4. Next is Beta Phoenicis, actually a binary system composed of two yellow giants with a combined apparent magnitude of 3.3. Nu Phoenicis has a dust disk, while the constellation has ten star systems with known planets and the recently discovered galaxy clusters El Gordo and the Phoenix Cluster\\u2014located 7.2 and 5.7 billion light years away, respectively, two of the largest objects in the visible universe. Phoenix is the radiant of two annual meteor showers: the Phoenicids in December and the July Phoenicids.\\n\\n# **Title: History**\\n\\nPhoenix was the largest of the 12 constellations established by Petrus Plancius from the observations of Pieter Dirkszoon Keyser and Frederick de Houtman. 
It first appeared on a 35 cm diameter celestial globe published in 1597 (or 1598) in Amsterdam by Jodocus Hondius.\", \"domain\": \"astronomy\", \"dataset\": \"document_knowledge_qa_raft_p0.4\", \"raw_document\": \"# **Phoenix (constellation)**\\n\\nPhoenix is a minor constellation in the\\n\\n**southern sky** Named after the mythical\\n\\nphoenix, it was first depicted on a\\n\\ncelestial atlas by Johann Bayer in his 1603\\n\\n**Uranometria** The French explorer and\\n\\nastronomer Nicolas Louis de\\n\\nLacaille charted the brighter\\n\\nstars and gave their Bayer designations\\n\\n**in 1756** The constellation stretches from roughly \\u221239 degrees to \\u221257 degrees\\n\\ndeclination, and from 23.5h to 2.5h of right\\n\\n**ascension** The constellations Phoenix,\\n\\nGrus,\\n\\nPavo and Tucana,\\n\\nare known as the Southern Birds.\\n\\nThe brightest star, Alpha Phoenicis, is\\n\\nnamed Ankaa, an Arabic word meaning 'the Phoenix'.\\n\\nIt is an orange giant of apparent magnitude 2.4. Next is Beta\\n\\nPhoenicis, actually a\\n\\nbinary system composed of two yellow giants\\n\\nwith a combined apparent magnitude of 3.3. Nu\\n\\nPhoenicis has a dust disk, while the\\n\\nconstellation has ten star systems with known planets and the recently\\n\\ndiscovered galaxy clusters El\\n\\nGordo and the Phoenix\\n\\n**Cluster\\u2014located 7**2 and 5.7 billion light\\n\\nyears away respectively, two of the largest objects in the visible\\n\\n**universe** Phoenix is the\\n\\nradiant of two annual meteor\\n\\nshowers: the\\n\\nPhoenicids in December, and the July\\n\\n**Phoenicids**\\n\\n# Title: **History**\\n\\nPhoenix was the largest of the 12 constellations established by Petrus\\n\\nPlancius from the observations of Pieter\\n\\nDirkszoon Keyser and Frederick de\\n\\n**Houtman** It first appeared on a 35cm\\n\\ndiameter celestial globe published in 1597 (or 1598) in Amsterdam by\", \"dataset_type\": \"spellcheck\"}","id":"b13fac78-ee0f-4f14-ab98-7b9e6db930d0"}


This generation sample appears to come from an old SDG paradigm. Please refer to https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub and use one of the example notebooks provided there to generate a sample in the updated format.

@alinaryan
Contributor

Could you title this following the format Fabiano is using in #19?
That will make it easier to map to a Jira ticket.

@RobuRishabh RobuRishabh force-pushed the R1146_SubsetSelection_repo_setup branch from 54ba7e6 to a023dba on October 10, 2025 05:22
@RobuRishabh RobuRishabh requested a review from a team as a code owner October 10, 2025 05:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (8)
notebooks/subset_selection.ipynb (5)

193-195: Prefer logger.exception for tracebacks.

Captures stack trace automatically and avoids string formatting in logs.

-        logger.error(f"Error in process_folds_with_gpu on GPU {gpu_id}: {str(e)}")
+        logger.exception("Error in process_folds_with_gpu on GPU %s", gpu_id)
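A minimal, self-contained sketch of why `logger.exception` is preferred here: it logs at ERROR level and appends the active exception's traceback automatically, with no manual string formatting. The function name below is illustrative, not from the PR.

```python
import logging
from typing import Optional

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("subset_selection")

def risky_divide(a: float, b: float) -> Optional[float]:
    """Return a / b, logging a full traceback on failure."""
    try:
        return a / b
    except ZeroDivisionError:
        # logger.exception must be called from an except block; it
        # records the current traceback alongside the message.
        logger.exception("Division failed for a=%s, b=%s", a, b)
        return None

print(risky_divide(1.0, 0.0))  # None (with an ERROR log line plus traceback)
```

Using `%s`-style arguments (rather than an f-string) also defers formatting until the record is actually emitted.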

288-299: Use spawn context for multiprocessing with CUDA.

Safer with CUDA/Torch and more portable across platforms.

-        with Pool(processes=self.config.system.num_gpus) as pool:
-            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
+        from multiprocessing import get_context
+        with get_context("spawn").Pool(processes=self.config.system.num_gpus) as pool:
+            gpu_results = pool.map(process_folds_with_gpu, gpu_assignments)
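A minimal, CUDA-free sketch of the suggested pattern; the worker and wrapper names are illustrative. The key points are that the worker must be importable at module top level for `"spawn"`, and pool creation belongs under a `__main__` guard.

```python
from multiprocessing import get_context

def square(x: int) -> int:
    # Workers must be defined at module top level: "spawn" re-imports
    # the module in each child instead of forking the parent.
    return x * x

def run_pool(values):
    # "spawn" is the safer start method when the parent holds CUDA or
    # Torch state; forked children inherit CUDA contexts and can
    # deadlock or crash.
    ctx = get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(run_pool([1, 2, 3]))  # [1, 4, 9]
```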

381-388: Differentiate JSON vs JSONL; validate unknown extensions.

Current code writes line-delimited JSON for .json too; add explicit handling and a guard.

-    extension = input_file.split(".")[-1]
-    if extension in ["json", "jsonl"]:
-        subset_data.to_json(output_file, orient="records", lines=True)
+    extension = input_file.rsplit(".", 1)[-1].lower()
+    if extension == "jsonl":
+        subset_data.to_json(output_file, orient="records", lines=True)
+    elif extension == "json":
+        subset_data.to_json(output_file, orient="records", lines=False)
     elif extension == "csv":
         subset_data.to_csv(output_file, index=False)
     elif extension == "parquet":
         subset_data.to_parquet(output_file)
+    else:
+        raise ValueError(f"Unsupported output extension: {extension}")

If these methods are from Hugging Face Datasets, confirm supported arguments. Based on learnings.
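For reference, the JSON vs JSON Lines distinction can be shown with the standard library alone; pandas' `to_json(orient="records", lines=...)` mirrors the same two shapes. The helper names below are illustrative.

```python
import json

records = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]

def to_jsonl(recs):
    # JSON Lines: one JSON object per line; consumers read it
    # line by line without loading the whole file.
    return "\n".join(json.dumps(r) for r in recs)

def to_json_array(recs):
    # Plain JSON: a single top-level array; consumers parse the
    # whole document at once.
    return json.dumps(recs)

print(to_jsonl(records))
print(to_json_array(records))
```

Writing line-delimited output to a `.json` file, as the current code does, produces a document that `json.loads` cannot parse as a whole.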


348-352: Avoid monkey-patching DataProcessor from a notebook.

Assigning methods onto the class at runtime is brittle. Prefer implementing these methods in the class/module and importing them, or subclassing for notebook experiments.

I can extract these methods into a proper module and update imports. Want me to open a follow-up PR?

Also applies to: 500-505
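A hypothetical sketch of the subclassing alternative: the base class stays importable and untouched, and notebook-only behavior lives in a subclass. `DataProcessor` and `generate_embeddings` here are stand-ins for the notebook's real names and signatures.

```python
class DataProcessor:
    """Placeholder for the pipeline's processor class."""

    def __init__(self, config):
        self.config = config

class NotebookDataProcessor(DataProcessor):
    """Experimental extensions live in a subclass, not monkey-patches."""

    def generate_embeddings(self, dataset):
        # Experiment-specific logic goes here; other importers of
        # DataProcessor never see notebook-only methods.
        return [f"embedding-for-{row}" for row in dataset]

proc = NotebookDataProcessor(config={})
print(proc.generate_embeddings(["a", "b"]))  # ['embedding-for-a', 'embedding-for-b']
```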


606-608: Use logger.exception to capture failures with the stack trace.

Shorter, and it includes the traceback automatically.

-    except Exception as e:
-        logger.error(f"Processing failed: {str(e)}")
+    except Exception:
+        logger.exception("Processing failed")
         raise
kubeflow-pipelines/docling-vlm/requirements.txt (1)

1-4: Consider removing spaces around version specifiers for consistency.

PEP 440 allows spaces, but most projects prefer no spaces: docling>=2.47.1, boto3>=1.39.17.

kubeflow-pipelines/docling-standard/README.md (1)

34-40: Call out Python version to match runtime.

Recommend specifying Python 3.11 to align with base images and avoid local mismatches.

-python3 -m venv venv
+python3.11 -m venv venv  # or ensure the venv uses Python 3.11
kubeflow-pipelines/common/constants.py (1)

2-3: Consider pinning images by digest for reproducibility.

Tags can move. Pin by SHA-256 digest (and keep tag as a comment) to strengthen supply-chain guarantees.

I can fetch current digests and propose updated constants if you want.
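A hypothetical `constants.py` sketch of digest pinning; the repository name and digest below are placeholders, not real values. The moving tag is kept as a comment for human readers, while the digest is what actually deploys.

```python
# Tag for human reference only; re-pushing this tag cannot change
# what gets pulled, because the digest below is immutable.
# quay.io/example/docling:v1.2.3
DOCLING_IMAGE = (
    "quay.io/example/docling@sha256:"
    "0000000000000000000000000000000000000000000000000000000000000000"
)

print(DOCLING_IMAGE)
```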

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54ba7e6 and a023dba.

📒 Files selected for processing (39)
  • .github/CODEOWNERS (1 hunks)
  • .github/workflows/compile-kfp.yml (1 hunks)
  • .gitignore (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • .pylintrc (0 hunks)
  • Containerfile (0 hunks)
  • README.md (1 hunks)
  • docling_convert_components.py (0 hunks)
  • docling_convert_pipeline.py (0 hunks)
  • docling_convert_pipeline_compiled.yaml (0 hunks)
  • kubeflow-pipelines/README.md (1 hunks)
  • kubeflow-pipelines/common/.gitignore (1 hunks)
  • kubeflow-pipelines/common/__init__.py (1 hunks)
  • kubeflow-pipelines/common/components.py (1 hunks)
  • kubeflow-pipelines/common/constants.py (1 hunks)
  • kubeflow-pipelines/docling-standard/.gitignore (1 hunks)
  • kubeflow-pipelines/docling-standard/.pylintrc (1 hunks)
  • kubeflow-pipelines/docling-standard/Containerfile (1 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (2 hunks)
  • kubeflow-pipelines/docling-standard/requirements.txt (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1 hunks)
  • kubeflow-pipelines/docling-vlm/.gitignore (1 hunks)
  • kubeflow-pipelines/docling-vlm/.pylintrc (1 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (1 hunks)
  • kubeflow-pipelines/docling-vlm/requirements.txt (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1 hunks)
  • notebooks/.gitignore (1 hunks)
  • notebooks/README.md (1 hunks)
  • notebooks/data_preparation_and_config.ipynb (1 hunks)
  • notebooks/embedding_generation.ipynb (1 hunks)
  • notebooks/subset_selection.ipynb (1 hunks)
  • pyproject.toml (1 hunks)
  • requirements.txt (0 hunks)
💤 Files with no reviewable changes (6)
  • docling_convert_components.py
  • docling_convert_pipeline.py
  • .pylintrc
  • docling_convert_pipeline_compiled.yaml
  • requirements.txt
  • Containerfile
✅ Files skipped from review due to trivial changes (8)
  • kubeflow-pipelines/docling-standard/.pylintrc
  • kubeflow-pipelines/docling-vlm/.gitignore
  • kubeflow-pipelines/common/.gitignore
  • kubeflow-pipelines/docling-standard/.gitignore
  • .github/CODEOWNERS
  • notebooks/.gitignore
  • notebooks/README.md
  • .pre-commit-config.yaml
🧰 Additional context used
🧬 Code graph analysis (5)
kubeflow-pipelines/common/__init__.py (1)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (12-118)
  • create_pdf_splits (123-141)
  • download_docling_models (146-215)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (12-118)
  • create_pdf_splits (123-141)
  • download_docling_models (146-215)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
  • docling_convert_vlm (14-153)
kubeflow-pipelines/docling-standard/local_run.py (2)
kubeflow-pipelines/common/components.py (3)
  • create_pdf_splits (123-141)
  • download_docling_models (146-215)
  • import_pdfs (12-118)
kubeflow-pipelines/docling-standard/standard_components.py (1)
  • docling_convert_standard (14-186)
kubeflow-pipelines/docling-vlm/local_run.py (2)
kubeflow-pipelines/common/components.py (3)
  • create_pdf_splits (123-141)
  • download_docling_models (146-215)
  • import_pdfs (12-118)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
  • docling_convert_vlm (14-153)
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2)
kubeflow-pipelines/common/components.py (3)
  • import_pdfs (12-118)
  • create_pdf_splits (123-141)
  • download_docling_models (146-215)
kubeflow-pipelines/docling-standard/standard_components.py (1)
  • docling_convert_standard (14-186)
🪛 Pylint (3.3.9)
kubeflow-pipelines/docling-vlm/vlm_components.py

[refactor] 14-14: Too many positional arguments (9/5)

(R0917)


[refactor] 106-111: Consider using '{"model_id": remote_model_name, "parameters": dict(max_new_tokens=400), ... }' instead of a call to 'dict'.

(R1735)


[refactor] 108-110: Consider using '{"max_new_tokens": 400}' instead of a call to 'dict'.

(R1735)


[refactor] 14-14: Too many branches (13/12)

(R0912)


[refactor] 14-14: Too many statements (65/50)

(R0915)

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py

[refactor] 22-22: Too many positional arguments (8/5)

(R0917)

kubeflow-pipelines/docling-standard/standard_components.py

[refactor] 14-14: Too many positional arguments (17/5)

(R0917)


[refactor] 14-14: Too many statements (62/50)

(R0915)

notebooks/embedding_generation.ipynb

[error] 19-19: Undefined variable 'config'

(E0602)


[error] 20-20: Undefined variable 'dataset'

(E0602)


[error] 20-20: Undefined variable 'dataset'

(E0602)


[error] 21-21: Undefined variable 'processor'

(E0602)


[error] 24-24: Undefined variable 'torch'

(E0602)


[error] 25-25: Undefined variable 'torch'

(E0602)


[error] 26-26: Undefined variable 'torch'

(E0602)


[error] 27-27: Undefined variable 'torch'

(E0602)


[error] 28-28: Undefined variable 'torch'

(E0602)


[error] 29-29: Undefined variable 'torch'

(E0602)


[error] 34-34: Undefined variable 'TypedDict'

(E0602)


[refactor] 34-34: Too few public methods (0/2)

(R0903)


[refactor] 43-43: Too many instance attributes (8/7)

(R0902)


[error] 42-42: Undefined variable 'dataclass'

(E0602)


[error] 47-47: Undefined variable 'torch'

(E0602)


[refactor] 43-43: Too few public methods (0/2)

(R0903)


[error] 56-56: Undefined variable 'Dict'

(E0602)


[refactor] 73-73: Too many positional arguments (6/5)

(R0917)


[error] 76-76: Undefined variable 'torch'

(E0602)


[error] 88-88: Undefined variable 'torch'

(E0602)


[error] 89-89: Undefined variable 'torch'

(E0602)


[error] 111-111: Undefined variable 'os'

(E0602)


[error] 112-112: Undefined variable 'os'

(E0602)


[error] 118-118: Undefined variable 'logger'

(E0602)


[error] 129-129: Undefined variable 'os'

(E0602)


[error] 147-147: Undefined variable 'logger'

(E0602)


[error] 153-153: Undefined variable 'Union'

(E0602)


[error] 153-153: Undefined variable 'List'

(E0602)


[error] 154-154: Undefined variable 'List'

(E0602)


[error] 182-182: Undefined variable 'torch'

(E0602)


[error] 185-185: Undefined variable 'Union'

(E0602)


[error] 185-185: Undefined variable 'List'

(E0602)


[error] 189-189: Undefined variable 'Union'

(E0602)


[error] 189-189: Undefined variable 'torch'

(E0602)


[error] 189-189: Undefined variable 'np'

(E0602)


[error] 203-203: Undefined variable 'tqdm'

(E0602)


[error] 214-214: Undefined variable 'torch'

(E0602)


[refactor] 68-68: Too few public methods (1/2)

(R0903)


[refactor] 245-245: Too many positional arguments (7/5)

(R0917)


[error] 246-246: Undefined variable 'torch'

(E0602)


[error] 247-247: Undefined variable 'torch'

(E0602)


[error] 250-250: Undefined variable 'Union'

(E0602)


[error] 250-250: Undefined variable 'torch'

(E0602)


[error] 253-253: Undefined variable 'torch'

(E0602)


[error] 261-261: Undefined variable 'torch'

(E0602)


[error] 261-261: Undefined variable 'torch'

(E0602)


[error] 268-268: Undefined variable 'torch'

(E0602)


[error] 276-276: Undefined variable 'torch'

(E0602)


[error] 276-276: Undefined variable 'torch'

(E0602)


[error] 276-276: Undefined variable 'torch'

(E0602)


[error] 278-278: Undefined variable 'torch'

(E0602)


[error] 280-280: Undefined variable 'torch'

(E0602)


[error] 284-284: Undefined variable 'torch'

(E0602)


[error] 286-286: Undefined variable 'torch'

(E0602)


[error] 287-287: Undefined variable 'torch'

(E0602)


[error] 288-288: Undefined variable 'torch'

(E0602)


[error] 337-337: Undefined variable 'torch'

(E0602)


[error] 338-338: Undefined variable 'torch'

(E0602)


[error] 343-343: Undefined variable 'logger'

(E0602)


[error] 348-348: Undefined variable 'torch'

(E0602)


[error] 353-353: Undefined variable 'Environment'

(E0602)


[error] 353-353: Undefined variable 'BaseLoader'

(E0602)


[error] 357-357: Undefined variable 'os'

(E0602)


[error] 358-358: Undefined variable 'os'

(E0602)


[error] 365-365: Undefined variable 'tqdm'

(E0602)


[error] 386-386: Undefined variable 'torch'

(E0602)


[error] 400-400: Undefined variable 'torch'

(E0602)


[error] 401-401: Undefined variable 'torch'

(E0602)


[error] 407-407: Undefined variable 'logger'

(E0602)


[error] 410-410: Undefined variable 'np'

(E0602)


[error] 413-413: Undefined variable 'os'

(E0602)


[error] 417-417: Undefined variable 'logger'

(E0602)


[error] 421-421: Undefined variable 'logger'

(E0602)


[error] 431-431: Undefined variable 'logger'

(E0602)


[error] 461-461: Undefined variable 'os'

(E0602)


[error] 463-463: Undefined variable 'os'

(E0602)


[error] 464-464: Undefined variable 'os'

(E0602)


[error] 465-465: Undefined variable 'os'

(E0602)


[error] 467-467: Undefined variable 'logger'

(E0602)


[error] 476-476: Undefined variable 'retry_on_exception'

(E0602)


[error] 491-491: Undefined variable 'os'

(E0602)


[error] 492-492: Undefined variable 'os'

(E0602)


[error] 495-495: Undefined variable 'os'

(E0602)


[error] 496-496: Undefined variable 'logger'

(E0602)


[error] 502-502: Undefined variable 'torch'

(E0602)


[error] 502-502: Undefined variable 'torch'

(E0602)


[error] 504-504: Undefined variable 'logger'

(E0602)


[error] 540-540: Undefined variable 'logger'

(E0602)


[error] 546-546: Undefined variable 'logger'

(E0602)


[error] 563-563: Undefined variable 'DataProcessor'

(E0602)


[error] 571-571: Undefined variable 'dataset'

(E0602)


[error] 576-576: Undefined variable 'os'

(E0602)


[error] 576-576: Undefined variable 'config'

(E0602)


[error] 578-578: Undefined variable 'dataset'

(E0602)


[error] 580-580: Undefined variable 'config'

(E0602)


[error] 581-581: Undefined variable 'config'

(E0602)


[error] 585-585: Undefined variable 'time'

(E0602)


[error] 588-588: Undefined variable 'processor'

(E0602)


[error] 588-588: Undefined variable 'dataset'

(E0602)


[error] 590-590: Undefined variable 'time'

(E0602)


[error] 594-594: Undefined variable 'dataset'

(E0602)


[error] 598-598: Undefined variable 'os'

(E0602)

notebooks/subset_selection.ipynb

[error] 13-13: Undefined variable 'config'

(E0602)


[error] 14-14: Undefined variable 'dataset'

(E0602)


[error] 14-14: Undefined variable 'dataset'

(E0602)


[error] 15-15: Undefined variable 'processor'

(E0602)


[error] 16-16: Undefined variable 'processor'

(E0602)


[error] 38-38: Undefined variable 'torch'

(E0602)


[error] 39-39: Undefined variable 'torch'

(E0602)


[error] 44-44: Undefined variable 'logger'

(E0602)


[error] 52-52: Undefined variable 'logger'

(E0602)


[error] 56-56: Undefined variable 'logger'

(E0602)


[error] 57-57: Undefined variable 'compute_pairwise_dense'

(E0602)


[error] 89-89: Undefined variable 'logger'

(E0602)


[error] 112-112: Undefined variable 'logger'

(E0602)


[error] 121-121: Undefined variable 'gc'

(E0602)


[error] 122-122: Undefined variable 'torch'

(E0602)


[error] 123-123: Undefined variable 'torch'

(E0602)


[error] 127-127: Undefined variable 'logger'

(E0602)


[refactor] 23-23: Too many branches (13/12)

(R0912)


[error] 135-135: Undefined variable 'torch'

(E0602)


[error] 135-135: Undefined variable 'Dict'

(E0602)


[error] 135-135: Undefined variable 'Union'

(E0602)


[error] 135-135: Undefined variable 'List'

(E0602)


[error] 139-139: Undefined variable 'np'

(E0602)


[error] 140-140: Undefined variable 'np'

(E0602)


[error] 180-180: Undefined variable 'logger'

(E0602)


[error] 186-186: Undefined variable 'logger'

(E0602)


[error] 187-187: Undefined variable 'Pool'

(E0602)


[error] 195-195: Undefined variable 'List'

(E0602)


[error] 196-196: Undefined variable 'List'

(E0602)


[error] 198-198: Undefined variable 'Dict'

(E0602)


[error] 198-198: Undefined variable 'Union'

(E0602)


[error] 212-212: Undefined variable 'logger'

(E0602)


[error] 226-226: Undefined variable 'os'

(E0602)


[error] 231-231: Undefined variable 'np'

(E0602)


[error] 232-232: Undefined variable 'logger'

(E0602)


[error] 239-239: Undefined variable 'DataProcessor'

(E0602)


[error] 262-262: Undefined variable 'logger'

(E0602)


[error] 292-292: Undefined variable 'os'

(E0602)


[error] 293-293: Undefined variable 'os'

(E0602)


[error] 295-295: Undefined variable 'logger'

(E0602)


[error] 297-297: Undefined variable 'os'

(E0602)


[error] 300-300: Undefined variable 'logger'

(E0602)


[error] 301-301: Undefined variable 'h5py'

(E0602)


[error] 304-304: Undefined variable 'logger'

(E0602)


[error] 308-308: Undefined variable 'torch'

(E0602)


[error] 308-308: Undefined variable 'torch'

(E0602)


[error] 310-310: Undefined variable 'logger'

(E0602)


[error] 313-313: Undefined variable 'logger'

(E0602)


[error] 319-319: Undefined variable 'os'

(E0602)


[error] 325-325: Undefined variable 'logger'

(E0602)


[error] 331-331: Undefined variable 'gc'

(E0602)


[error] 332-332: Undefined variable 'torch'

(E0602)


[error] 335-335: Undefined variable 'logger'

(E0602)


[error] 339-339: Undefined variable 'List'

(E0602)


[error] 350-350: Undefined variable 'logger'

(E0602)


[error] 360-360: Undefined variable 'logger'

(E0602)


[error] 364-364: Undefined variable 'logger'

(E0602)


[error] 370-370: Undefined variable 'logger'

(E0602)


[error] 375-375: Undefined variable 'DataProcessor'

(E0602)


[error] 376-376: Undefined variable 'DataProcessor'

(E0602)


[error] 377-377: Undefined variable 'DataProcessor'

(E0602)


[error] 384-384: Undefined variable 'List'

(E0602)


[error] 385-385: Undefined variable 'List'

(E0602)


[error] 385-385: Undefined variable 'Union'

(E0602)


[error] 387-387: Undefined variable 'Any'

(E0602)


[error] 402-402: Undefined variable 'get_default_num_gpus'

(E0602)


[error] 405-405: Undefined variable 'BasicConfig'

(E0602)


[error] 406-406: Undefined variable 'EncoderConfig'

(E0602)


[error] 407-407: Undefined variable 'TemplateConfig'

(E0602)


[error] 408-408: Undefined variable 'SystemConfig'

(E0602)


[error] 423-423: Undefined variable 'logger'

(E0602)


[error] 430-430: Undefined variable 'ProcessingConfig'

(E0602)


[error] 440-440: Undefined variable 'logger'

(E0602)


[error] 441-441: Undefined variable 'DataProcessor'

(E0602)


[error] 445-445: Undefined variable 'logger'

(E0602)


[error] 450-450: Undefined variable 'gc'

(E0602)


[error] 451-451: Undefined variable 'torch'

(E0602)


[error] 452-452: Undefined variable 'torch'

(E0602)

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py

[refactor] 22-22: Too many positional arguments (17/5)

(R0917)

kubeflow-pipelines/common/components.py

[refactor] 12-12: Too many branches (21/12)

(R0912)


[refactor] 12-12: Too many statements (69/50)

(R0915)

🪛 Ruff (0.13.3)
kubeflow-pipelines/common/__init__.py

20-26: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

kubeflow-pipelines/docling-vlm/vlm_components.py

23-23: Possible hardcoded password assigned to function default: "remote_model_secret_mount_path"

(S107)


42-42: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


43-43: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


44-44: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


48-48: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


49-49: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


50-50: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


51-51: Unused noqa directive (non-enabled: PLC0415, E402)

Remove unused noqa directive

(RUF100)


62-62: Avoid specifying long messages outside the exception class

(TRY003)


66-68: Avoid specifying long messages outside the exception class

(TRY003)


72-72: Avoid specifying long messages outside the exception class

(TRY003)


74-74: Possible hardcoded password assigned to: "remote_model_endpoint_url_secret"

(S105)


80-80: Avoid specifying long messages outside the exception class

(TRY003)


82-82: Possible hardcoded password assigned to: "remote_model_name_secret"

(S105)


88-88: Avoid specifying long messages outside the exception class

(TRY003)


90-90: Possible hardcoded password assigned to: "remote_model_api_key_secret"

(S105)


96-96: Avoid specifying long messages outside the exception class

(TRY003)


99-99: Avoid specifying long messages outside the exception class

(TRY003)

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py

36-36: Possible hardcoded password assigned to: "s3_secret_mount_path"

(S105)


45-45: Possible hardcoded password assigned to argument: "secret_name"

(S106)


61-61: Possible hardcoded password assigned to: "remote_model_secret_mount_path"

(S105)


78-78: Possible hardcoded password assigned to argument: "secret_name"

(S106)

notebooks/data_preparation_and_config.ipynb

51-53: Avoid specifying long messages outside the exception class

(TRY003)


69-69: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


69-69: Use explicit conversion flag

Replace with conversion flag

(RUF010)


72-72: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


72-72: Use explicit conversion flag

Replace with conversion flag

(RUF010)


75-75: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


75-75: Use explicit conversion flag

Replace with conversion flag

(RUF010)


78-78: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


78-78: Use explicit conversion flag

Replace with conversion flag

(RUF010)


81-81: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


81-81: Use explicit conversion flag

Replace with conversion flag

(RUF010)


123-123: Consider moving this statement to an else block

(TRY300)


125-125: Do not catch blind exception: Exception

(BLE001)


155-155: Avoid specifying long messages outside the exception class

(TRY003)


247-247: Prefer TypeError exception for invalid type

(TRY004)


247-247: Avoid specifying long messages outside the exception class

(TRY003)


251-251: Prefer TypeError exception for invalid type

(TRY004)


251-251: Avoid specifying long messages outside the exception class

(TRY003)


253-255: Avoid specifying long messages outside the exception class

(TRY003)


257-257: Avoid specifying long messages outside the exception class

(TRY003)


281-281: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


306-306: Avoid specifying long messages outside the exception class

(TRY003)


335-337: Avoid specifying long messages outside the exception class

(TRY003)


354-356: Avoid specifying long messages outside the exception class

(TRY003)


461-461: f-string without any placeholders

Remove extraneous f prefix

(F541)


468-468: Do not catch blind exception: Exception

(BLE001)


489-489: f-string without any placeholders

Remove extraneous f prefix

(F541)


525-525: f-string without any placeholders

Remove extraneous f prefix

(F541)


531-531: f-string without any placeholders

Remove extraneous f prefix

(F541)


535-535: f-string without any placeholders

Remove extraneous f prefix

(F541)


539-539: f-string without any placeholders

Remove extraneous f prefix

(F541)


540-540: f-string without any placeholders

Remove extraneous f prefix

(F541)


543-543: Do not catch blind exception: Exception

(BLE001)

kubeflow-pipelines/docling-standard/standard_components.py

77-77: Avoid specifying long messages outside the exception class

(TRY003)


81-83: Avoid specifying long messages outside the exception class

(TRY003)


87-89: Avoid specifying long messages outside the exception class

(TRY003)


93-95: Avoid specifying long messages outside the exception class

(TRY003)


100-102: Avoid specifying long messages outside the exception class

(TRY003)

notebooks/embedding_generation.ipynb

16-16: f-string without any placeholders

Remove extraneous f prefix

(F541)


17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)


18-18: f-string without any placeholders

Remove extraneous f prefix

(F541)


19-19: Undefined name config

(F821)


20-20: Undefined name dataset

(F821)


20-20: Undefined name dataset

(F821)


21-21: Undefined name processor

(F821)


24-24: Undefined name torch

(F821)


25-25: Undefined name torch

(F821)


26-26: Undefined name torch

(F821)


27-27: Undefined name torch

(F821)


28-28: Undefined name torch

(F821)


29-29: Undefined name torch

(F821)


35-35: Undefined name TypedDict

(F821)


43-43: Undefined name dataclass

(F821)


48-48: Undefined name torch

(F821)


57-57: Undefined name Dict

(F821)


78-78: Undefined name torch

(F821)


85-87: Avoid specifying long messages outside the exception class

(TRY003)


90-90: Undefined name torch

(F821)


91-91: Undefined name torch

(F821)


113-113: Undefined name os

(F821)


114-114: Undefined name os

(F821)


120-120: Undefined name logger

(F821)


131-131: Undefined name os

(F821)


132-135: Avoid specifying long messages outside the exception class

(TRY003)


149-149: Undefined name logger

(F821)


155-155: Undefined name Union

(F821)


155-155: Undefined name List

(F821)


156-156: Undefined name List

(F821)


163-166: Avoid specifying long messages outside the exception class

(TRY003)


176-179: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Undefined name torch

(F821)


187-187: Undefined name Union

(F821)


187-187: Undefined name List

(F821)


191-191: Undefined name Union

(F821)


191-191: Undefined name torch

(F821)


191-191: Undefined name np

(F821)


205-205: Undefined name tqdm

(F821)


216-216: Undefined name torch

(F821)


234-237: Abstract raise to an inner function

(TRY301)


234-237: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Use explicit conversion flag

Replace with conversion flag

(RUF010)


249-249: Undefined name torch

(F821)


250-250: Undefined name torch

(F821)


253-253: Undefined name Union

(F821)


253-253: Undefined name torch

(F821)


256-256: Undefined name torch

(F821)


264-264: Undefined name torch

(F821)


264-264: Undefined name torch

(F821)


271-271: Undefined name torch

(F821)


279-279: Undefined name torch

(F821)


279-279: Undefined name torch

(F821)


279-279: Undefined name torch

(F821)


281-281: Undefined name torch

(F821)


283-283: Undefined name torch

(F821)


287-287: Undefined name torch

(F821)


289-289: Undefined name torch

(F821)


290-290: Undefined name torch

(F821)


291-291: Undefined name torch

(F821)


293-293: Avoid specifying long messages outside the exception class

(TRY003)


341-341: Undefined name torch

(F821)


342-342: Undefined name torch

(F821)


347-347: Undefined name logger

(F821)


352-352: Undefined name torch

(F821)


357-357: Undefined name Environment

(F821)


357-357: Undefined name BaseLoader

(F821)


361-361: Undefined name os

(F821)


362-362: Undefined name os

(F821)


369-369: Undefined name tqdm

(F821)


382-382: Abstract raise to an inner function

(TRY301)


382-382: Avoid specifying long messages outside the exception class

(TRY003)


390-390: Undefined name torch

(F821)


404-404: Undefined name torch

(F821)


405-405: Undefined name torch

(F821)


411-411: Undefined name logger

(F821)


414-414: Undefined name np

(F821)


417-417: Undefined name os

(F821)


421-421: Undefined name logger

(F821)


422-422: Consider moving this statement to an else block

(TRY300)


425-425: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


425-425: Undefined name logger

(F821)


425-425: Use explicit conversion flag

Replace with conversion flag

(RUF010)


435-435: Undefined name logger

(F821)


465-465: Undefined name os

(F821)


467-467: Undefined name os

(F821)


468-468: Undefined name os

(F821)


469-469: Undefined name os

(F821)


471-471: Undefined name logger

(F821)


481-481: Undefined name retry_on_exception

(F821)


496-496: Undefined name os

(F821)


497-497: Undefined name os

(F821)


500-500: Undefined name os

(F821)


501-501: Undefined name logger

(F821)


507-507: Undefined name torch

(F821)


507-507: Undefined name torch

(F821)


509-509: Undefined name logger

(F821)


545-545: Undefined name logger

(F821)


551-551: Undefined name logger

(F821)


559-559: Avoid specifying long messages outside the exception class

(TRY003)


568-568: Undefined name DataProcessor

(F821)


575-575: Undefined name dataset

(F821)


580-580: Undefined name os

(F821)


580-580: Undefined name config

(F821)


582-582: Undefined name dataset

(F821)


584-584: Undefined name config

(F821)


585-585: Undefined name config

(F821)


588-588: f-string without any placeholders

Remove extraneous f prefix

(F541)


589-589: Undefined name time

(F821)


592-592: Undefined name processor

(F821)


592-592: Undefined name dataset

(F821)


594-594: Undefined name time

(F821)


596-596: f-string without any placeholders

Remove extraneous f prefix

(F541)


598-598: Undefined name dataset

(F821)


602-602: Undefined name os

(F821)


605-605: Do not catch blind exception: Exception

(BLE001)

notebooks/subset_selection.ipynb

13-13: Undefined name config

(F821)


14-14: Undefined name dataset

(F821)


14-14: Undefined name dataset

(F821)


15-15: Undefined name processor

(F821)


16-16: Undefined name processor

(F821)


17-17: f-string without any placeholders

Remove extraneous f prefix

(F541)


18-18: f-string without any placeholders

Remove extraneous f prefix

(F541)


19-19: f-string without any placeholders

Remove extraneous f prefix

(F541)


39-39: Undefined name torch

(F821)


40-40: Undefined name torch

(F821)


44-44: Abstract raise to an inner function

(TRY301)


44-44: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Undefined name logger

(F821)


53-53: Undefined name logger

(F821)


57-57: Undefined name logger

(F821)


58-58: Undefined name compute_pairwise_dense

(F821)


90-90: Undefined name logger

(F821)


113-115: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


113-113: Undefined name logger

(F821)


114-114: Use explicit conversion flag

Replace with conversion flag

(RUF010)


122-122: Undefined name gc

(F821)


123-123: Undefined name torch

(F821)


124-124: Undefined name torch

(F821)


126-126: Consider moving this statement to an else block

(TRY300)


128-128: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


128-128: Undefined name logger

(F821)


128-128: Use explicit conversion flag

Replace with conversion flag

(RUF010)


137-137: Undefined name torch

(F821)


137-137: Undefined name Dict

(F821)


137-137: Undefined name Union

(F821)


137-137: Undefined name List

(F821)


141-141: Undefined name np

(F821)


142-142: Undefined name np

(F821)


182-182: Undefined name logger

(F821)


188-188: Undefined name logger

(F821)


189-189: Undefined name Pool

(F821)


197-197: Undefined name List

(F821)


198-198: Undefined name List

(F821)


200-200: Undefined name Dict

(F821)


200-200: Undefined name Union

(F821)


204-204: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


214-214: Undefined name logger

(F821)


228-228: Undefined name os

(F821)


233-233: Undefined name np

(F821)


234-234: Undefined name logger

(F821)


241-241: Undefined name DataProcessor

(F821)


248-248: Unused function argument: self

(ARG001)


265-265: Undefined name logger

(F821)


295-295: Undefined name os

(F821)


296-296: Undefined name os

(F821)


298-298: Undefined name logger

(F821)


300-300: Undefined name os

(F821)


303-303: Undefined name logger

(F821)


304-304: Undefined name h5py

(F821)


307-307: Undefined name logger

(F821)


311-311: Undefined name torch

(F821)


311-311: Undefined name torch

(F821)


313-313: Undefined name logger

(F821)


316-316: Undefined name logger

(F821)


322-322: Undefined name os

(F821)


328-328: Undefined name logger

(F821)


334-334: Undefined name gc

(F821)


335-335: Undefined name torch

(F821)


338-338: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


338-338: Undefined name logger

(F821)


338-338: Use explicit conversion flag

Replace with conversion flag

(RUF010)


342-342: Undefined name List

(F821)


353-353: Undefined name logger

(F821)


363-363: Undefined name logger

(F821)


367-367: Undefined name logger

(F821)


373-373: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


373-373: Undefined name logger

(F821)


373-373: Use explicit conversion flag

Replace with conversion flag

(RUF010)


378-378: Undefined name DataProcessor

(F821)


379-379: Undefined name DataProcessor

(F821)


380-380: Undefined name DataProcessor

(F821)


388-388: Undefined name List

(F821)


389-389: Undefined name List

(F821)


389-389: Undefined name Union

(F821)


391-391: Undefined name Any

(F821)


406-406: Undefined name get_default_num_gpus

(F821)


409-409: Undefined name BasicConfig

(F821)


410-410: Undefined name EncoderConfig

(F821)


411-411: Undefined name TemplateConfig

(F821)


412-412: Undefined name SystemConfig

(F821)


427-427: Undefined name logger

(F821)


434-434: Undefined name ProcessingConfig

(F821)


444-444: Undefined name logger

(F821)


445-445: Undefined name DataProcessor

(F821)


449-449: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


449-449: Undefined name logger

(F821)


449-449: Use explicit conversion flag

Replace with conversion flag

(RUF010)


454-454: Undefined name gc

(F821)


455-455: Undefined name torch

(F821)


456-456: Undefined name torch

(F821)

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py

45-45: Possible hardcoded password assigned to: "s3_secret_mount_path"

(S105)


54-54: Possible hardcoded password assigned to argument: "secret_name"

(S106)

kubeflow-pipelines/common/components.py

17-17: Possible hardcoded password assigned to function default: "s3_secret_mount_path"

(S107)


36-36: Avoid specifying long messages outside the exception class

(TRY003)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


45-45: Possible hardcoded password assigned to: "s3_endpoint_url_secret"

(S105)


51-51: Avoid specifying long messages outside the exception class

(TRY003)


53-53: Possible hardcoded password assigned to: "s3_access_key_secret"

(S105)


59-59: Avoid specifying long messages outside the exception class

(TRY003)


61-61: Possible hardcoded password assigned to: "s3_secret_key_secret"

(S105)


67-67: Avoid specifying long messages outside the exception class

(TRY003)


69-69: Possible hardcoded password assigned to: "s3_bucket_secret"

(S105)


75-75: Avoid specifying long messages outside the exception class

(TRY003)


77-77: Possible hardcoded password assigned to: "s3_prefix_secret"

(S105)


83-83: Avoid specifying long messages outside the exception class

(TRY003)


86-86: Avoid specifying long messages outside the exception class

(TRY003)


89-89: Avoid specifying long messages outside the exception class

(TRY003)


105-105: Avoid specifying long messages outside the exception class

(TRY003)


215-215: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
🔇 Additional comments (3)
pyproject.toml (1)

2-3: Confirm the minimum Python version aligns with Ruff’s target.

Setting target-version = "py312" will let Ruff accept Python 3.12-only syntax. Please double-check that the project's supported/runtime Python version is actually 3.12 or higher so we don’t silently allow newer features in code that must run on older interpreters.

notebooks/subset_selection.ipynb (2)

563-565: Ensure get_default_num_gpus is available; provide fallback.

This symbol isn’t imported here; if unavailable, fallback to torch.cuda.device_count().

Consider:

try:
    available_gpus = get_default_num_gpus(testing_mode=testing_mode)
except NameError:
    available_gpus = torch.cuda.device_count()
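That fallback can be made self-contained. The sketch below is an assumption-laden version: `get_default_num_gpus` is presumed to be defined by the upstream data-preparation notebook, and the final `0` fallback is added here for kernels without torch.

```python
def resolve_num_gpus(testing_mode: bool = False) -> int:
    """Resolve the GPU count, tolerating a missing helper.

    `get_default_num_gpus` (hypothetical here) is expected to come from
    the data-preparation notebook; if it was never defined in this
    kernel, fall back to torch, and finally to 0 when torch itself
    isn't importable.
    """
    try:
        return get_default_num_gpus(testing_mode=testing_mode)  # noqa: F821
    except NameError:
        try:
            import torch
            # Returns 0 on CPU-only machines, so this is always safe.
            return torch.cuda.device_count()
        except ImportError:
            return 0
```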

245-352: All references verified as present. DataProcessor, its helper methods (calculate_subset_size, get_subset_name, load_and_combine_datasets, validate_epsilon_for_dataset_size, get_dataset_name), compute_pairwise_dense, and all config dataclasses are defined in the upstream notebooks.

Comment on lines 42 to 75
- name: Set up Python 3.12
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Install requirements for ${{ matrix.name }}
working-directory: ${{ matrix.dir }}
run: |
python -m pip install --upgrade pip
pip install -r requirements.txt

- name: Compile and compare (${{ matrix.name }})
run: |
set -ux
BASE="${{ matrix.dir }}/${{ matrix.out }}"
SNAP="${BASE}.baseline"

# Snapshot committed YAML in-place using mv, or create an empty baseline
if [ -f "$BASE" ]; then
mv "$BASE" "$SNAP"
else
: > "$SNAP"
fi

# Compile (regenerates $BASE)
( cd "${{ matrix.dir }}" && ${{ matrix.cmd }} )

# Ensure compiled YAML exists and is non-empty
test -s "$BASE"

echo "Comparing compiled pipeline YAML to the committed version..."
echo "If this fails, you changed pipeline code but did not regenerate/commit the compiled YAML."
echo "To fix locally: (cd ${{ matrix.dir }} && ${{ matrix.cmd }}) and commit ${{ matrix.out }}."
diff "$BASE" "$SNAP"
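The snapshot/recompile/diff logic in that step can also be exercised locally. The following Python sketch mirrors it under stated assumptions: `compile_fn` is a stand-in for the matrix compile command, and it must rewrite `base` in place.

```python
from pathlib import Path


def check_compiled_up_to_date(base: Path, compile_fn) -> bool:
    """Snapshot the committed artifact, regenerate it, and compare.

    Returns True when the committed file matches the freshly compiled
    output, i.e. nothing needs to be regenerated and committed.
    """
    snap = base.parent / (base.name + ".baseline")
    # Snapshot the committed YAML, or create an empty baseline.
    committed = base.read_text() if base.exists() else ""
    snap.write_text(committed)
    # Recompile; this is expected to rewrite `base`.
    compile_fn()
    # Ensure compiled output exists and is non-empty.
    if not base.exists() or base.stat().st_size == 0:
        raise RuntimeError("compilation produced no output")
    return base.read_text() == snap.read_text()
```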

⚠️ Potential issue | 🔴 Critical

Align workflow Python version with dependency support.

tesserocr does not publish CPython 3.12 wheels; on ubuntu‑latest the pip install step will try to build from source, fail to find the Tesseract/Leptonica headers, and the workflow stops. Drop to Python 3.11 (which has prebuilt wheels and matches the runtime image) or install the native build deps before pip install.

-      - name: Set up Python 3.12
+      - name: Set up Python 3.11
         uses: actions/setup-python@v5
         with:
-          python-version: "3.12"
+          python-version: "3.11"

Comment on lines 48 to 83
with open(s3_endpoint_url_file_path) as f:
s3_endpoint_url = f.read()
else:
raise ValueError(f"Key {s3_endpoint_url_secret} not defined in secret {s3_secret_mount_path}")

s3_access_key_secret = "S3_ACCESS_KEY"
s3_access_key_file_path = os.path.join(s3_secret_mount_path, s3_access_key_secret)
if os.path.isfile(s3_access_key_file_path):
with open(s3_access_key_file_path) as f:
s3_access_key = f.read()
else:
raise ValueError(f"Key {s3_access_key_secret} not defined in secret {s3_secret_mount_path}")

s3_secret_key_secret = "S3_SECRET_KEY"
s3_secret_key_file_path = os.path.join(s3_secret_mount_path, s3_secret_key_secret)
if os.path.isfile(s3_secret_key_file_path):
with open(s3_secret_key_file_path) as f:
s3_secret_key = f.read()
else:
raise ValueError(f"Key {s3_secret_key_secret} not defined in secret {s3_secret_mount_path}")

s3_bucket_secret = "S3_BUCKET"
s3_bucket_file_path = os.path.join(s3_secret_mount_path, s3_bucket_secret)
if os.path.isfile(s3_bucket_file_path):
with open(s3_bucket_file_path) as f:
s3_bucket = f.read()
else:
raise ValueError(f"Key {s3_bucket_secret} not defined in secret {s3_secret_mount_path}")

s3_prefix_secret = "S3_PREFIX"
s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
if os.path.isfile(s3_prefix_file_path):
with open(s3_prefix_file_path) as f:
s3_prefix = f.read()
else:
raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")

⚠️ Potential issue | 🟠 Major

Strip whitespace when reading S3 secret values

Kubernetes-mounted secret files usually end with a newline. Passing those raw strings into boto3.client or composing S3 keys adds \n, yielding malformed endpoint URLs (https://…\n) and object keys (prefix\n/…), so downloads fail. Trim the values as you read them.

Apply this diff to normalize the secrets:

-            with open(s3_endpoint_url_file_path) as f:
-                s3_endpoint_url = f.read()
+            with open(s3_endpoint_url_file_path) as f:
+                s3_endpoint_url = f.read().strip()
...
-            with open(s3_access_key_file_path) as f:
-                s3_access_key = f.read()
+            with open(s3_access_key_file_path) as f:
+                s3_access_key = f.read().strip()
...
-            with open(s3_secret_key_file_path) as f:
-                s3_secret_key = f.read()
+            with open(s3_secret_key_file_path) as f:
+                s3_secret_key = f.read().strip()
...
-            with open(s3_bucket_file_path) as f:
-                s3_bucket = f.read()
+            with open(s3_bucket_file_path) as f:
+                s3_bucket = f.read().strip()
...
-            with open(s3_prefix_file_path) as f:
-                s3_prefix = f.read()
+            with open(s3_prefix_file_path) as f:
+                s3_prefix = f.read().strip()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with open(s3_endpoint_url_file_path) as f:
s3_endpoint_url = f.read()
else:
raise ValueError(f"Key {s3_endpoint_url_secret} not defined in secret {s3_secret_mount_path}")
s3_access_key_secret = "S3_ACCESS_KEY"
s3_access_key_file_path = os.path.join(s3_secret_mount_path, s3_access_key_secret)
if os.path.isfile(s3_access_key_file_path):
with open(s3_access_key_file_path) as f:
s3_access_key = f.read()
else:
raise ValueError(f"Key {s3_access_key_secret} not defined in secret {s3_secret_mount_path}")
s3_secret_key_secret = "S3_SECRET_KEY"
s3_secret_key_file_path = os.path.join(s3_secret_mount_path, s3_secret_key_secret)
if os.path.isfile(s3_secret_key_file_path):
with open(s3_secret_key_file_path) as f:
s3_secret_key = f.read()
else:
raise ValueError(f"Key {s3_secret_key_secret} not defined in secret {s3_secret_mount_path}")
s3_bucket_secret = "S3_BUCKET"
s3_bucket_file_path = os.path.join(s3_secret_mount_path, s3_bucket_secret)
if os.path.isfile(s3_bucket_file_path):
with open(s3_bucket_file_path) as f:
s3_bucket = f.read()
else:
raise ValueError(f"Key {s3_bucket_secret} not defined in secret {s3_secret_mount_path}")
s3_prefix_secret = "S3_PREFIX"
s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
if os.path.isfile(s3_prefix_file_path):
with open(s3_prefix_file_path) as f:
s3_prefix = f.read()
else:
raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")
with open(s3_endpoint_url_file_path) as f:
s3_endpoint_url = f.read().strip()
else:
raise ValueError(f"Key {s3_endpoint_url_secret} not defined in secret {s3_secret_mount_path}")
s3_access_key_secret = "S3_ACCESS_KEY"
s3_access_key_file_path = os.path.join(s3_secret_mount_path, s3_access_key_secret)
if os.path.isfile(s3_access_key_file_path):
with open(s3_access_key_file_path) as f:
s3_access_key = f.read().strip()
else:
raise ValueError(f"Key {s3_access_key_secret} not defined in secret {s3_secret_mount_path}")
s3_secret_key_secret = "S3_SECRET_KEY"
s3_secret_key_file_path = os.path.join(s3_secret_mount_path, s3_secret_key_secret)
if os.path.isfile(s3_secret_key_file_path):
with open(s3_secret_key_file_path) as f:
s3_secret_key = f.read().strip()
else:
raise ValueError(f"Key {s3_secret_key_secret} not defined in secret {s3_secret_mount_path}")
s3_bucket_secret = "S3_BUCKET"
s3_bucket_file_path = os.path.join(s3_secret_mount_path, s3_bucket_secret)
if os.path.isfile(s3_bucket_file_path):
with open(s3_bucket_file_path) as f:
s3_bucket = f.read().strip()
else:
raise ValueError(f"Key {s3_bucket_secret} not defined in secret {s3_secret_mount_path}")
s3_prefix_secret = "S3_PREFIX"
s3_prefix_file_path = os.path.join(s3_secret_mount_path, s3_prefix_secret)
if os.path.isfile(s3_prefix_file_path):
with open(s3_prefix_file_path) as f:
s3_prefix = f.read().strip()
else:
raise ValueError(f"Key {s3_prefix_secret} not defined in secret {s3_secret_mount_path}")
🧰 Tools
🪛 Ruff (0.13.3)

51-51: Avoid specifying long messages outside the exception class

(TRY003)


53-53: Possible hardcoded password assigned to: "s3_access_key_secret"

(S105)


59-59: Avoid specifying long messages outside the exception class

(TRY003)


61-61: Possible hardcoded password assigned to: "s3_secret_key_secret"

(S105)


67-67: Avoid specifying long messages outside the exception class

(TRY003)


69-69: Possible hardcoded password assigned to: "s3_bucket_secret"

(S105)


75-75: Avoid specifying long messages outside the exception class

(TRY003)


77-77: Possible hardcoded password assigned to: "s3_prefix_secret"

(S105)


83-83: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In kubeflow-pipelines/common/components.py around lines 48 to 83, the code reads
Kubernetes secret files but doesn't trim trailing newlines or whitespace, which
results in malformed S3 endpoint URLs and keys; update each place that does
f.read() (s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket, s3_prefix)
to call .strip() on the string returned from the file read so the values are
normalized before being passed to boto3 or composed into S3 paths.
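One way to avoid repeating the read-and-strip logic for each of the five keys is a small helper. This is only a sketch; `read_secret` is not an existing symbol in `components.py`.

```python
import os


def read_secret(mount_path: str, key: str) -> str:
    """Read one key from a mounted Kubernetes secret.

    Strips the trailing newline that kubectl-created secret files
    typically carry, and raises if the key file is missing.
    """
    path = os.path.join(mount_path, key)
    if not os.path.isfile(path):
        raise ValueError(f"Key {key} not defined in secret {mount_path}")
    with open(path) as f:
        return f.read().strip()
```

Usage would then collapse to calls like `s3_bucket = read_secret(s3_secret_mount_path, "S3_BUCKET")`.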

Comment on lines 19 to 49
USER 1001

WORKDIR /opt/app-root/src

ENV OMP_NUM_THREADS=4 \
PYTHONIOENCODING=UTF-8 \
LC_ALL=en_US.UTF-8 \
LANG=en_US.UTF-8 \
PIP_NO_CACHE_DIR=1 \
HOME=/opt/app-root/src \
PATH=/opt/app-root/src:$PATH \
DOCLING_ARTIFACTS_PATH=/opt/app-root/src/.cache/docling/models

# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements
# For more details in the different torch distribution visit https://pytorch.org/.
RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu && \
pip uninstall -y tesserocr && \
pip install --no-cache-dir --no-binary :all: tesserocr && \
chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
chmod -R g+w /opt/app-root/lib64/python3.11/site-packages && \
fix-permissions /opt/app-root -P

ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/

RUN docling --version

RUN echo "Downloading models..." && \
docling-tools models download layout tableformer -o "${DOCLING_ARTIFACTS_PATH}"


⚠️ Potential issue | 🔴 Critical

Run package installation as root before dropping privileges.

This RUN block executes under UID 1001, but it invokes pip uninstall, chmod -R, and fix-permissions on root-owned paths. Those operations need root; the build fails with “Operation not permitted.” Install as root, then switch back to the non-root user before running Docling.

-USER 1001
-
 WORKDIR /opt/app-root/src
@@
-RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu && \
-    pip uninstall -y tesserocr && \
-    pip install --no-cache-dir --no-binary :all: tesserocr && \
-    chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
-    chmod -R g+w /opt/app-root/lib64/python3.11/site-packages && \
-    fix-permissions /opt/app-root -P
-
-ENV HF_HOME=/tmp/
-ENV TORCH_HOME=/tmp/
-
-RUN docling --version
+RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu && \
+    pip uninstall -y tesserocr && \
+    pip install --no-cache-dir --no-binary :all: tesserocr && \
+    chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
+    chmod -R g+w /opt/app-root/lib64/python3.11/site-packages && \
+    fix-permissions /opt/app-root -P
+
+ENV HF_HOME=/tmp/
+ENV TORCH_HOME=/tmp/
+
+USER 1001
+
+RUN docling --version
📝 Committable suggestion


Suggested change
USER 1001
WORKDIR /opt/app-root/src
ENV OMP_NUM_THREADS=4 \
PYTHONIOENCODING=UTF-8 \
LC_ALL=en_US.UTF-8 \
LANG=en_US.UTF-8 \
PIP_NO_CACHE_DIR=1 \
HOME=/opt/app-root/src \
PATH=/opt/app-root/src:$PATH \
DOCLING_ARTIFACTS_PATH=/opt/app-root/src/.cache/docling/models
# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements
# For more details in the different torch distribution visit https://pytorch.org/.
RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu && \
pip uninstall -y tesserocr && \
pip install --no-cache-dir --no-binary :all: tesserocr && \
chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
chmod -R g+w /opt/app-root/lib64/python3.11/site-packages && \
fix-permissions /opt/app-root -P
ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/
RUN docling --version
RUN echo "Downloading models..." && \
docling-tools models download layout tableformer -o "${DOCLING_ARTIFACTS_PATH}"
# remove initial non-root user so the install and permission fixes run as root
WORKDIR /opt/app-root/src
ENV OMP_NUM_THREADS=4 \
PYTHONIOENCODING=UTF-8 \
LC_ALL=en_US.UTF-8 \
LANG=en_US.UTF-8 \
PIP_NO_CACHE_DIR=1 \
HOME=/opt/app-root/src \
PATH=/opt/app-root/src:$PATH \
DOCLING_ARTIFACTS_PATH=/opt/app-root/src/.cache/docling/models
# This will install torch with *only* cpu support
# Remove the --extra-index-url part if you want to install all the gpu requirements
# For more details in the different torch distribution visit https://pytorch.org/.
RUN pip install --no-cache-dir docling --extra-index-url https://download.pytorch.org/whl/cpu && \
pip uninstall -y tesserocr && \
pip install --no-cache-dir --no-binary :all: tesserocr && \
chmod -R g+w /opt/app-root/lib/python3.11/site-packages && \
chmod -R g+w /opt/app-root/lib64/python3.11/site-packages && \
fix-permissions /opt/app-root -P
ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/
# now switch back to the non-root user for running the app
USER 1001
RUN docling --version
RUN echo "Downloading models..." && \
docling-tools models download layout tableformer -o "${DOCLING_ARTIFACTS_PATH}"

Comment on lines 1 to 5
[MESSAGES CONTROL]
disable=no-value-for-parameter, # KFP components define inputs and outputs that are supplied by pipelines
import-outside-toplevel, # KFP components require imports inside functions
too-many-arguments, # KFP components often need many parameters
too-many-locals # Complex KFP logic may require more local variables

⚠️ Potential issue | 🟠 Major

Inline comments in disable list may break Pylint parsing.

Move comments to their own lines; keep disable list clean.

 [MESSAGES CONTROL]
-disable=no-value-for-parameter, # KFP components define inputs and outputs that are supplied by pipelines
-    import-outside-toplevel,    # KFP components require imports inside functions
-    too-many-arguments,         # KFP components often need many parameters
-    too-many-locals             # Complex KFP logic may require more local variables
+disable=no-value-for-parameter,
+        import-outside-toplevel,
+        too-many-arguments,
+        too-many-locals
+
+# KFP components define inputs and outputs that are supplied by pipelines
+# KFP components require imports inside functions
+# KFP components often need many parameters
+# Complex KFP logic may require more local variables
📝 Committable suggestion


Suggested change
[MESSAGES CONTROL]
disable=no-value-for-parameter, # KFP components define inputs and outputs that are supplied by pipelines
import-outside-toplevel, # KFP components require imports inside functions
too-many-arguments, # KFP components often need many parameters
too-many-locals # Complex KFP logic may require more local variables
[MESSAGES CONTROL]
disable=no-value-for-parameter,
import-outside-toplevel,
too-many-arguments,
too-many-locals
# KFP components define inputs and outputs that are supplied by pipelines
# KFP components require imports inside functions
# KFP components often need many parameters
# Complex KFP logic may require more local variables
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/.pylintrc around lines 1 to 5, the disable
list contains inline comments which can break Pylint parsing; move each comment
to its own line (precede with #) and leave the disable option as a clean
comma-separated list across lines (no inline comments inside the value),
ensuring each disabled message name remains separated by commas or placed on its
own line within the disable value.

Comment on lines 2 to 3
kfp >= 2.14.3
kfp-kubernetes

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Align KFP packages to latest 2.14.x patch and pin kfp-kubernetes.

To pick up bugfixes, consider bumping to the latest 2.14.x and aligning kfp-kubernetes.

-kfp >= 2.14.3
-kfp-kubernetes
+kfp >= 2.14.5
+kfp-kubernetes >= 2.14.5

Based on learnings. Please confirm current latest:


🌐 Web query:

What is the latest stable patch release for Kubeflow Pipelines 2.14.x (Python SDK)?

💡 Result:

The latest stable patch release for Kubeflow Pipelines Python SDK in the 2.14.x series is kfp 2.14.3 (released Aug 28, 2025). [1][2]

Install: pip install kfp==2.14.3.

Sources: GitHub releases (2.14.3) and PyPI (kfp 2.14.3). [1] https://github.com/kubeflow/pipelines/releases [2] https://pypi.org/project/kfp/


Pin kfp-kubernetes to match kfp 2.14.3

kfp remains at the latest stable patch (>= 2.14.3); add version spec for kfp-kubernetes:

 kfp >= 2.14.3
-kfp-kubernetes
+kfp-kubernetes >= 2.14.3
📝 Committable suggestion


Suggested change
kfp >= 2.14.3
kfp-kubernetes
kfp >= 2.14.3
kfp-kubernetes >= 2.14.3
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/requirements.txt around lines 2-3, kfp is left
as "kfp >= 2.14.3" while kfp-kubernetes is unpinned; update the file to pin
kfp-kubernetes to the same patch version as kfp (e.g., change to
"kfp-kubernetes==2.14.3" or use a compatible range like
"kfp-kubernetes>=2.14.3,<2.15.0") so the two packages match and avoid mismatched
API/compatibility issues.

Comment on lines +545 to +624
" # Handle percentage (0.1 = 10%, 0.05 = 5%)\n",
" if size_spec <= 0 or size_spec > 1:\n",
" raise ValueError(\n",
" \"Percentage values must be between 0(non-inclusive) and 1(inclusive)\"\n",
" )\n",
" return max(1, int(size_spec * total_samples))\n",
" # Treat as absolute number\n",

⚠️ Potential issue | 🟠 Major

Align percentage handling with earlier validation.

Line [548] rejects floats greater than 1, but ProcessingConfig.__post_init__ and the accompanying error message explicitly allow percentage values up to 100. A user supplying 10.0 for 10% passes validation and then hits this runtime ValueError. Please normalize values >1 (e.g., divide by 100) or tighten the validation error message so both sides agree.

-        if isinstance(size_spec, float):
-            # Handle percentage (0.1 = 10%, 0.05 = 5%)
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            return max(1, int(size_spec * total_samples))
+        if isinstance(size_spec, float):
+            if size_spec <= 0:
+                raise ValueError("Percentage values must be greater than 0")
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage > 1:
+                raise ValueError("Percentage values cannot exceed 100%")
+            return max(1, int(percentage * total_samples))
📝 Committable suggestion


Suggested change
" # Handle percentage (0.1 = 10%, 0.05 = 5%)\n",
" if size_spec <= 0 or size_spec > 1:\n",
" raise ValueError(\n",
" \"Percentage values must be between 0(non-inclusive) and 1(inclusive)\"\n",
" )\n",
" return max(1, int(size_spec * total_samples))\n",
" # Treat as absolute number\n",
if isinstance(size_spec, float):
if size_spec <= 0:
raise ValueError("Percentage values must be greater than 0")
percentage = size_spec / 100 if size_spec > 1 else size_spec
if percentage > 1:
raise ValueError("Percentage values cannot exceed 100%")
return max(1, int(percentage * total_samples))
# Treat as absolute number
🤖 Prompt for AI Agents
In notebooks/data_preparation_and_config.ipynb around lines 545 to 551, the
percentage handling rejects floats >1 but earlier validation allows percentage
inputs up to 100, causing inconsistent behavior; update this block to accept
values >1 and <=100 as percentages by dividing them by 100 before further checks
(i.e., if 1 < size_spec <= 100 then size_spec = size_spec / 100), keep the
existing check to reject values <=0 or >100, and then compute and return max(1,
int(size_spec * total_samples)); alternatively, if you prefer stricter
validation, change the runtime ValueError message to state percentages must be
between 0 (non-inclusive) and 100 (inclusive) so both validations match.
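A standalone sketch of the reconciled behavior follows. It is hypothetical: the real method lives on DataProcessor, and only the float/int branching shown here is assumed from the excerpts above.

```python
def calculate_subset_size(size_spec, total_samples: int) -> int:
    """Floats in (0, 1] are fractions; floats in (1, 100] are
    treated as percentages and divided by 100. Ints are absolute
    sample counts, capped at the dataset size."""
    if isinstance(size_spec, float):
        if size_spec <= 0:
            raise ValueError("Percentage values must be greater than 0")
        fraction = size_spec / 100 if size_spec > 1 else size_spec
        if fraction > 1:
            raise ValueError("Percentage values cannot exceed 100%")
        return max(1, int(fraction * total_samples))
    # Treat as absolute number of samples.
    return min(int(size_spec), total_samples)
```

With this, both `0.1` and `10.0` select 10% of the data, matching the more permissive `__post_init__` validation.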

Comment on lines 30 to 43
"# Additional imports for subset selection\n",
"import math\n",
"from typing import TypedDict\n",
"\n",
"# Import submodlib for Facility Location\n",
"from submodlib import FacilityLocationFunction\n",
"\n",

⚠️ Potential issue | 🟠 Major

Add explicit imports and robust logger/print guards; avoid brittle reliance on %run.

Several names used later are undefined in this notebook unless Notebook 2 was executed in the same kernel. Make execution independent and satisfy linters by importing needed modules and guarding optional prints.

Apply this diff in the import cell:

-# Additional imports for subset selection
-import math
-from typing import TypedDict
+# Additional imports for subset selection
+import math
+import os
+import gc
+import logging
+from multiprocessing import Pool
+from typing import TypedDict, List, Dict, Union, Any
+import numpy as np
+import torch
+import h5py
+# Logger (reuses root config if already configured)
+logger = logging.getLogger(__name__)

Also guard optional prints to avoid NameError when Notebook 2 wasn’t run:

-print(f"   • config: {type(config).__name__}")
+print(f"   • config: {type(config).__name__}") if "config" in globals() else print("   • config: None")
-print(f"   • processor: {type(processor).__name__}")
-print(f"   • DataProcessor has generate_embeddings(): {hasattr(processor, 'generate_embeddings')}")
+print(f"   • processor: {type(processor).__name__}") if "processor" in globals() else print("   • processor: None")
+print(f"   • DataProcessor has generate_embeddings(): {hasattr(processor, 'generate_embeddings')}") if "processor" in globals() else None

Note: Ensure compute_pairwise_dense, DataProcessor, and configs are importable from a module rather than via %run; otherwise, this notebook remains brittle. As per coding guidelines.

Also applies to: 37-45

🤖 Prompt for AI Agents
In notebooks/subset_selection.ipynb around lines 30 to 36 (and similarly 37 to
45), the cell relies on names defined in another notebook (%run) and lacks
explicit imports and safe print/log guards; modify the cell to import required
symbols from a module (e.g., from mypackage.data import compute_pairwise_dense,
DataProcessor; from myproject.configs import configs), add standard logging
import and logger setup (import logging; logger = logging.getLogger(__name__)),
and replace brittle prints with guarded checks (use try/except NameError or if
'compute_pairwise_dense' in globals(): ... and use logger.debug/info for
optional output). Ensure the notebook no longer depends on %run by importing the
functions/classes from a proper module so linters and independent execution
succeed.
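The guarded-print pattern can also be factored into a tiny helper (hypothetical name, not in the PR) so each status line degrades gracefully instead of raising `NameError`:

```python
def describe(name, namespace):
    """Render one status line for an object that may not exist yet,
    e.g. when Notebook 2 was not run in this kernel."""
    obj = namespace.get(name)
    return f"   • {name}: {type(obj).__name__}" if obj is not None else f"   • {name}: None"

# Usage in the notebook:
#   print(describe("config", globals()))
#   print(describe("processor", globals()))
```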

Comment on lines +184 to +200
" # Cleanup - ADDED THIS SECTION\n",
" for var in [\"ds_func\", \"similarity_matrix\", \"fold_embeddings\"]:\n",
" if var in locals():\n",
" del locals()[var]\n",
" gc.collect()\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
"\n",

⚠️ Potential issue | 🟠 Major

locals() deletion does not free memory; explicitly del variables.

Manipulating locals() doesn’t reliably delete local variables. This prevents timely freeing of large tensors/arrays, risking OOM.

Apply:

-                # Cleanup - ADDED THIS SECTION
-                for var in ["ds_func", "similarity_matrix", "fold_embeddings"]:
-                    if var in locals():
-                        del locals()[var]
+                # Cleanup: explicitly drop large refs to free memory deterministically
+                try: del ds_func
+                except NameError: pass
+                try: del similarity_matrix
+                except NameError: pass
+                try: del fold_embeddings
+                except NameError: pass
                 gc.collect()
                 if torch.cuda.is_available():
                     torch.cuda.empty_cache()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

# Cleanup: explicitly drop large refs to free memory deterministically
try: del ds_func
except NameError: pass
try: del similarity_matrix
except NameError: pass
try: del fold_embeddings
except NameError: pass
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
🧰 Tools: Pylint (3.3.9) E0602 / Ruff (0.13.3) F821 — undefined names `logger` (lines 186, 188) and `Pool` (lines 187, 189)

🤖 Prompt for AI Agents
In notebooks/subset_selection.ipynb around lines 184 to 191, the current cleanup
uses del locals()[var] which does not reliably remove local names or free large
tensors; instead explicitly unset and delete each large variable by checking if
the variable name exists in locals() or globals(), assigning None to the
variable (e.g., var_name = None) and then calling del var_name to remove the
reference, followed by gc.collect() and torch.cuda.empty_cache() if CUDA is
available; update the block to explicitly reference and delete ds_func,
similarity_matrix, fold_embeddings (and any other large objects) rather than
manipulating locals().
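A quick standalone check confirms the point: in CPython, `locals()` inside a function returns a snapshot dict, so deleting from it never unbinds the real variable, while a plain `del` does:

```python
def broken_cleanup():
    big = [0] * 1000
    for var in ["big"]:
        if var in locals():
            del locals()[var]  # mutates a throwaway snapshot dict only
    return "big" in locals()  # the local (and its list) still exist

def explicit_cleanup():
    big = [0] * 1000
    del big  # actually drops the reference so it can be collected
    return "big" in locals()
```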

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
notebooks/data_preparation_and_config.ipynb (2)

152-159: Critical: Past review comment not addressed - Guard CUDA cache clears.

Line 157 calls torch.cuda.empty_cache() unconditionally, which will raise AssertionError: Torch not compiled with CUDA enabled on CPU-only systems (the documented testing path). This was flagged in a previous review and remains unresolved.

Apply this diff to guard the cache flush:

             if attempt < self.config.system.max_retries - 1:
                 logger.info(f"Retrying in {self.config.system.retry_delay} seconds...")
                 time.sleep(self.config.system.retry_delay)
                 gc.collect()
-                torch.cuda.empty_cache()
+                if torch.cuda.is_available():
+                    torch.cuda.empty_cache()

545-551: Major: Past review comment not addressed - Percentage handling inconsistency.

Line 548 rejects floats greater than 1, but ProcessingConfig.__post_init__ (line 427-429) explicitly allows percentage values up to 100. A user supplying 10.0 for 10% passes validation but hits this runtime ValueError. This was flagged in a previous review and remains unresolved.

Apply this diff to align validation:

         if isinstance(size_spec, float):
-            # Handle percentage (0.1 = 10%, 0.05 = 5%)
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            return max(1, int(size_spec * total_samples))
+            if size_spec <= 0:
+                raise ValueError("Percentage values must be greater than 0")
+            percentage = size_spec / 100 if size_spec > 1 else size_spec
+            if percentage > 1:
+                raise ValueError("Percentage values cannot exceed 100%")
+            return max(1, int(percentage * total_samples))
🧹 Nitpick comments (2)
notebooks/data_preparation_and_config.ipynb (1)

474-474: Consider enabling Jinja2 autoescape for security.

Line 474 creates a Jinja2 Environment with BaseLoader(), which sets autoescape=False by default. While this notebook processes trusted data (dataset templates), enabling autoescape helps prevent XSS vulnerabilities if untrusted content is ever rendered. Based on coding guidelines.

Apply this diff if autoescape is desired:

-        self.env = Environment(loader=BaseLoader())
+        self.env = Environment(loader=BaseLoader(), autoescape=True)
notebooks/embedding_generation.ipynb (1)

520-520: Consider enabling Jinja2 autoescape for security.

Line 520 creates a Jinja2 Environment with BaseLoader(), which sets autoescape=False by default. While this processes trusted templates, enabling autoescape helps prevent XSS vulnerabilities if untrusted content is ever rendered. Based on coding guidelines.

Apply this diff if autoescape is desired:

-        env = Environment(loader=BaseLoader())
+        env = Environment(loader=BaseLoader(), autoescape=True)
📜 Review details

Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro

📥 Commits: reviewing files that changed from the base of the PR and between a023dba and 5477cf0.

📒 Files selected for processing (2)
  • notebooks/data_preparation_and_config.ipynb (1 hunks)
  • notebooks/embedding_generation.ipynb (1 hunks)
🧰 Additional context used
🪛 Pylint (4.0.0)
notebooks/embedding_generation.ipynb

[error] E0602 Undefined variable — repeated throughout the notebook for `torch`, `os`, `logger`, `np`, `tqdm`, `time`, `config`, `dataset`, `processor`, `TypedDict`, `dataclass`, `Dict`, `Union`, `List`, `Environment`, `BaseLoader`, `retry_on_exception`, and `DataProcessor` (symptoms of the missing explicit imports flagged above)
[refactor] R0903 Too few public methods (lines 34, 43, 68); R0902 Too many instance attributes (line 43); R0917 Too many positional arguments (lines 73, 245)

🪛 Ruff (0.14.0)
notebooks/embedding_generation.ipynb

F821 Undefined name — same symbols as the Pylint E0602 findings above
F541 f-string without any placeholders (lines 16-18, 588, 596)
TRY003 Avoid specifying long messages outside the exception class (several sites); TRY300 Consider moving this statement to an else block (line 422); TRY301 Abstract raise to an inner function (lines 234, 382); TRY400 Use logging.exception instead of logging.error (line 425); RUF010 Use explicit conversion flag (lines 240, 425); BLE001 Do not catch blind Exception (line 605)

notebooks/data_preparation_and_config.ipynb

TRY003 Avoid long exception messages (multiple sites); TRY400 Use logging.exception with RUF010 explicit conversion flag (lines 69, 72, 75, 78, 81); TRY300 (line 123); TRY004 Prefer TypeError for invalid type (lines 247, 251); BLE001 Do not catch blind Exception (lines 125, 468, 543); S701 jinja2 autoescape defaults to False (line 281); F541 f-string without any placeholders (lines 461, 489, 525, 531, 535, 539, 540)
🔇 Additional comments (1)
notebooks/embedding_generation.ipynb (1)

675-764: LGTM! Well-designed multi-GPU embedding generation.

The generate_embeddings() method is well-structured with:

  • Early exit when embeddings exist (line 694-696)
  • Proper GPU detection and distribution (lines 699-707)
  • Serial processing for testing/notebook mode (line 738)
  • Parallel processing for production (lines 745-747)
  • Robust error handling with retry decorator (line 675)

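The contiguous per-GPU split that the distribution step appears to perform can be sketched as follows (function name and exact layout are assumptions, not the PR's code):

```python
def plan_shards(num_samples, num_gpus):
    """Split the dataset into one contiguous (gpu_id, start, end) shard per GPU,
    spreading any remainder across the first GPUs."""
    base, rem = divmod(num_samples, num_gpus)
    shards, start = [], 0
    for gpu_id in range(num_gpus):
        size = base + (1 if gpu_id < rem else 0)
        shards.append((gpu_id, start, start + size))
        start += size
    return shards
```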
@fabianofranz
Copy link
Contributor

This needs a rebase. @RobuRishabh can you also please clear all notebook outputs? Tks!

@RobuRishabh RobuRishabh force-pushed the R1146_SubsetSelection_repo_setup branch from 5505cf0 to 8d27caa Compare October 16, 2025 00:50
This commit introduces three notebooks for the subset selection use case:

1. data_preparation_and_config.ipynb:
   - Implements data loading and preprocessing pipeline
   - Handles dataset preparation for embedding generation
   - Sets up configuration parameters for the workflow

2. embedding_generation.ipynb:
   - Generates embeddings using sentence transformers
   - Processes the prepared dataset through embedding models
   - Stores embeddings for downstream subset selection tasks
   - Optimizes batch processing for large datasets

3. subset_selection.ipynb:
   - Implements subset selection algorithms
   - Analyzes embedding similarity and diversity metrics
   - Selects optimal data subsets based on defined criteria
   - Provides visualization and analysis of selected subsets

These notebooks form a complete pipeline for intelligent dataset
subsetting based on embedding similarity and diversity analysis.

Signed-off-by: RobuRishabh <[email protected]>
@RobuRishabh RobuRishabh force-pushed the R1146_SubsetSelection_repo_setup branch from 8d27caa to 52dcdb1 Compare October 16, 2025 00:56
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
notebooks/use-cases/subset-selection/data_preparation_and_config.ipynb (2)

617-625: Align percentage handling with ProcessingConfig validation.

ProcessingConfig allows 0–100 floats; this method rejects >1. Normalize values >1 by dividing by 100, or change validation to require 0–1. Suggest normalization for UX.

-        if isinstance(size_spec, float):
-            # Handle percentage (0.1 = 10%, 0.05 = 5%)
-            if size_spec <= 0 or size_spec > 1:
-                raise ValueError(
-                    "Percentage values must be between 0(non-inclusive) and 1(inclusive)"
-                )
-            return max(1, int(size_spec * total_samples))
+        if isinstance(size_spec, float):
+            # Support either 0–1 (fraction) or 0–100 (percent)
+            if size_spec <= 0:
+                raise ValueError("Percentage values must be greater than 0")
+            percentage = size_spec / 100.0 if size_spec > 1 else size_spec
+            if percentage > 1:
+                raise ValueError("Percentage values cannot exceed 100%")
+            return max(1, int(percentage * total_samples))

Based on past review comments.


171-176: Guard CUDA cache clears in retry decorator.

Unconditional empty_cache breaks CPU-only runs. Gate by torch.cuda.is_available().

-                gc.collect()
-                torch.cuda.empty_cache()
+                gc.collect()
+                if torch.cuda.is_available():
+                    torch.cuda.empty_cache()

Based on past review comments.
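A minimal sketch of a retry decorator with the guarded cache flush (illustrative only; the notebook's `retry_on_exception` reads its limits from `config.system`):

```python
import functools
import gc
import logging
import time

logger = logging.getLogger(__name__)

def retry_on_exception(max_retries=3, retry_delay=1.0):
    """Retry a function, collecting garbage between attempts; the CUDA
    cache flush is gated so CPU-only (or torch-less) environments still work."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise
                    logger.info("Retrying in %s seconds...", retry_delay)
                    time.sleep(retry_delay)
                    gc.collect()
                    try:
                        import torch
                        if torch.cuda.is_available():  # safe on CPU-only builds
                            torch.cuda.empty_cache()
                    except ImportError:
                        pass  # torch not installed; nothing to flush
        return wrapper
    return decorator
```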

🧹 Nitpick comments (5)
notebooks/use-cases/subset-selection/subset_selection.ipynb (3)

37-43: Add missing imports for standalone robustness.

This notebook relies on symbols from prior notebooks. Import locally to avoid NameError when cells are re‑run out of order.

 import math
 import logging
-from dataclasses import dataclass, field
-from typing import TypedDict, Union, List, Dict, Optional, Any
+from dataclasses import dataclass, field
+from typing import TypedDict, Union, List, Dict, Optional, Any
+import os
+import time
+import gc
+import numpy as np
+import torch
+import h5py
+from multiprocessing import Pool

526-533: Ensure metadata output directory exists before saving.

np.savez will fail if basic.output_dir hasn’t been created in this path. Make sure the directory exists.

         subset_name = self.get_subset_name(size_spec, actual_size)
-        metadata_file = os.path.join(
-            self.config.basic.output_dir,
-            f"{base_name}_fl_{self.config.basic.num_folds}_partitions_{subset_name}_metadata.npz",
-        )
+        out_dir = self.config.basic.output_dir
+        os.makedirs(out_dir, exist_ok=True)
+        metadata_file = os.path.join(
+            out_dir,
+            f"{base_name}_fl_{self.config.basic.num_folds}_partitions_{subset_name}_metadata.npz",
+        )
 
         np.savez(metadata_file, indices=sorted_indices, gains=sorted_gains)

362-364: Preserve traceback on failures.

Use logger.exception to retain stack traces for triage.

-        logger.error(f"Error during subset selection: {str(e)}")
+        logger.exception("Error during subset selection: %s", e)
notebooks/use-cases/subset-selection/embedding_generation.ipynb (2)

556-558: Preserve traceback on shard failures.

Use logger.exception to capture stack traces.

-    except Exception as e:
-        logger.error(f"Error processing shard on GPU {gpu_id}: {str(e)}")
-        raise
+    except Exception as e:
+        logger.exception("Error processing shard on GPU %s: %s", gpu_id, e)
+        raise

356-426: Optional: reduce peak memory in compute_pairwise_dense.

results is always float32 on CPU and can be huge. If downstream tolerates it, consider float16 or memory-mapped arrays to lower RAM pressure on large folds.

Would you like a variant that writes to a NumPy memmap on disk to bound memory?
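Such a variant might look like this (a sketch assuming plain dot-product similarity is acceptable; `pairwise_dense_memmap` is a hypothetical name, not the notebook's API):

```python
import numpy as np

def pairwise_dense_memmap(embeddings, path, block_size=1024):
    """Write the n x n dot-product matrix to a disk-backed float32 memmap
    in row blocks, so only one block is ever materialized in RAM."""
    n = embeddings.shape[0]
    out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, n))
    for start in range(0, n, block_size):
        block = embeddings[start:start + block_size]
        out[start:start + block.shape[0]] = (block @ embeddings.T).astype(np.float32)
    out.flush()
    return out
```

The caller can pass the returned memmap to downstream code that only reads slices, keeping peak memory bounded by one block.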

📜 Review details

Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro

📥 Commits: reviewing files that changed from the base of the PR and between 5477cf0 and 52dcdb1.

📒 Files selected for processing (4)
  • notebooks/.gitignore (1 hunks)
  • notebooks/use-cases/subset-selection/data_preparation_and_config.ipynb (1 hunks)
  • notebooks/use-cases/subset-selection/embedding_generation.ipynb (1 hunks)
  • notebooks/use-cases/subset-selection/subset_selection.ipynb (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
  • notebooks/.gitignore
🧰 Additional context used
🪛 Pylint (4.0.0)
notebooks/use-cases/subset-selection/subset_selection.ipynb

[error] E0602 Undefined variable — repeated throughout for `config`, `dataset`, `processor`, `torch`, `gc`, `os`, `time`, `np`, `h5py`, `Pool`, `compute_pairwise_dense`, `get_default_num_gpus`, `DataProcessor`, and the config classes (`BasicConfig`, `SystemConfig`, `ProcessingConfig`, `EncoderConfig`, `TemplateConfig`)
[refactor] R0912 Too many branches (line 27); R0917 Too many positional arguments (line 139); R0915 Too many statements (line 139)

notebooks/use-cases/subset-selection/embedding_generation.ipynb

[error] E0602 Undefined variable — repeated for `config`, `dataset`, `torch`, `os`, `np`, `tqdm`
[refactor] R0902 Too many instance attributes (line 40); R0903 Too few public methods (line 65); R0917 Too many positional arguments (lines 70, 255)

(E0602)


[error] 291-291: Undefined variable 'torch'

(E0602)


[error] 293-293: Undefined variable 'torch'

(E0602)


[error] 295-295: Undefined variable 'torch'

(E0602)


[error] 299-299: Undefined variable 'torch'

(E0602)


[error] 301-301: Undefined variable 'torch'

(E0602)


[error] 302-302: Undefined variable 'torch'

(E0602)


[error] 303-303: Undefined variable 'torch'

(E0602)


[error] 352-352: Undefined variable 'torch'

(E0602)


[error] 353-353: Undefined variable 'torch'

(E0602)


[error] 363-363: Undefined variable 'torch'

(E0602)


[error] 368-368: Undefined variable 'Environment'

(E0602)


[error] 368-368: Undefined variable 'BaseLoader'

(E0602)


[error] 372-372: Undefined variable 'os'

(E0602)


[error] 373-373: Undefined variable 'os'

(E0602)


[error] 380-380: Undefined variable 'tqdm'

(E0602)


[error] 401-401: Undefined variable 'torch'

(E0602)


[error] 415-415: Undefined variable 'torch'

(E0602)


[error] 416-416: Undefined variable 'torch'

(E0602)


[error] 425-425: Undefined variable 'np'

(E0602)


[error] 428-428: Undefined variable 'os'

(E0602)


[error] 477-477: Undefined variable 'os'

(E0602)


[error] 479-479: Undefined variable 'os'

(E0602)


[error] 480-480: Undefined variable 'os'

(E0602)


[error] 481-481: Undefined variable 'os'

(E0602)


[error] 489-489: Undefined variable 'retry_on_exception'

(E0602)


[error] 502-502: Undefined variable 'os'

(E0602)


[error] 503-503: Undefined variable 'os'

(E0602)


[error] 506-506: Undefined variable 'os'

(E0602)


[error] 513-513: Undefined variable 'torch'

(E0602)


[error] 513-513: Undefined variable 'torch'

(E0602)


[error] 574-574: Undefined variable 'DataProcessor'

(E0602)


[error] 591-591: Undefined variable 'dataset'

(E0602)


[error] 597-597: Undefined variable 'DataProcessor'

(E0602)


[error] 597-597: Undefined variable 'config'

(E0602)


[error] 601-601: Undefined variable 'os'

(E0602)


[error] 601-601: Undefined variable 'config'

(E0602)


[error] 603-603: Undefined variable 'dataset'

(E0602)


[error] 605-605: Undefined variable 'config'

(E0602)


[error] 606-606: Undefined variable 'config'

(E0602)


[error] 610-610: Undefined variable 'time'

(E0602)


[error] 613-613: Undefined variable 'dataset'

(E0602)


[error] 615-615: Undefined variable 'time'

(E0602)


[error] 619-619: Undefined variable 'dataset'

(E0602)


[error] 623-623: Undefined variable 'os'

(E0602)

🪛 Ruff (0.14.0)
notebooks/use-cases/subset-selection/subset_selection.ipynb

17-17: Undefined name config

(F821)


18-18: Undefined name dataset

(F821)


18-18: Undefined name dataset

(F821)


19-19: Undefined name processor

(F821)


20-20: Undefined name processor

(F821)


21-21: f-string without any placeholders

Remove extraneous f prefix

(F541)


22-22: f-string without any placeholders

Remove extraneous f prefix

(F541)


23-23: f-string without any placeholders

Remove extraneous f prefix

(F541)


43-43: Undefined name torch

(F821)


44-44: Undefined name torch

(F821)


48-48: Abstract raise to an inner function

(TRY301)


48-48: Avoid specifying long messages outside the exception class

(TRY003)


62-62: Undefined name compute_pairwise_dense

(F821)


117-119: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


118-118: Use explicit conversion flag

Replace with conversion flag

(RUF010)


126-126: Undefined name gc

(F821)


127-127: Undefined name torch

(F821)


128-128: Undefined name torch

(F821)


130-130: Consider moving this statement to an else block

(TRY300)


132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


132-132: Use explicit conversion flag

Replace with conversion flag

(RUF010)


149-149: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


170-170: Undefined name os

(F821)


171-174: Avoid specifying long messages outside the exception class

(TRY003)


178-178: Undefined name get_default_num_gpus

(F821)


186-186: Undefined name BasicConfig

(F821)


197-197: Undefined name SystemConfig

(F821)


206-206: Undefined name ProcessingConfig

(F821)


210-210: Undefined name EncoderConfig

(F821)


211-211: Undefined name TemplateConfig

(F821)


216-216: Undefined name DataProcessor

(F821)


221-221: Undefined name h5py

(F821)


224-224: Abstract raise to an inner function

(TRY301)


224-224: Avoid specifying long messages outside the exception class

(TRY003)


225-225: Undefined name torch

(F821)


225-225: Undefined name torch

(F821)


230-230: Undefined name os

(F821)


231-231: Undefined name os

(F821)


235-235: Undefined name time

(F821)


239-239: Undefined name time

(F821)


247-247: Undefined name os

(F821)


268-268: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


268-268: Use explicit conversion flag

Replace with conversion flag

(RUF010)


272-272: Undefined name gc

(F821)


273-273: Undefined name torch

(F821)


274-274: Undefined name torch

(F821)


281-281: Undefined name torch

(F821)


330-330: Undefined name np

(F821)


331-331: Undefined name np

(F821)


378-378: Undefined name Pool

(F821)


393-393: Loop control variable fold_idx not used within loop body

Rename unused fold_idx to _fold_idx

(B007)


405-408: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


417-417: Undefined name os

(F821)


422-422: Undefined name np

(F821)


430-430: Undefined name DataProcessor

(F821)


437-437: Unused function argument: self

(ARG001)


484-484: Undefined name os

(F821)


485-485: Undefined name os

(F821)


489-489: Undefined name os

(F821)


493-493: Undefined name h5py

(F821)


500-500: Undefined name torch

(F821)


500-500: Undefined name torch

(F821)


511-511: Undefined name os

(F821)


523-523: Undefined name gc

(F821)


524-524: Undefined name torch

(F821)


527-527: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


527-527: Use explicit conversion flag

Replace with conversion flag

(RUF010)


562-562: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


562-562: Use explicit conversion flag

Replace with conversion flag

(RUF010)


567-567: Undefined name DataProcessor

(F821)


568-568: Undefined name DataProcessor

(F821)


569-569: Undefined name DataProcessor

(F821)


595-595: Undefined name get_default_num_gpus

(F821)


598-598: Undefined name BasicConfig

(F821)


599-599: Undefined name EncoderConfig

(F821)


600-600: Undefined name TemplateConfig

(F821)


601-601: Undefined name SystemConfig

(F821)


623-623: Undefined name ProcessingConfig

(F821)


634-634: Undefined name DataProcessor

(F821)


638-638: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


638-638: Use explicit conversion flag

Replace with conversion flag

(RUF010)


643-643: Undefined name gc

(F821)


644-644: Undefined name torch

(F821)


645-645: Undefined name torch

(F821)


667-667: Undefined name dataset

(F821)


676-676: Do not catch blind exception: Exception

(BLE001)

notebooks/use-cases/subset-selection/embedding_generation.ipynb

16-16: Redefinition of unused Optional from line 8

Remove definition: Optional

(F811)


22-22: Undefined name config

(F821)


23-23: Undefined name dataset

(F821)


23-23: Undefined name dataset

(F821)


45-45: Undefined name torch

(F821)


75-75: Undefined name torch

(F821)


86-88: Avoid specifying long messages outside the exception class

(TRY003)


91-91: Undefined name torch

(F821)


92-92: Undefined name torch

(F821)


118-118: Undefined name os

(F821)


119-119: Undefined name os

(F821)


136-136: Undefined name os

(F821)


137-140: Avoid specifying long messages outside the exception class

(TRY003)


171-174: Avoid specifying long messages outside the exception class

(TRY003)


184-187: Avoid specifying long messages outside the exception class

(TRY003)


192-192: Undefined name torch

(F821)


199-199: Undefined name torch

(F821)


199-199: Undefined name np

(F821)


218-218: Undefined name tqdm

(F821)


229-229: Undefined name torch

(F821)


247-250: Abstract raise to an inner function

(TRY301)


247-250: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Use explicit conversion flag

Replace with conversion flag

(RUF010)


258-258: Undefined name torch

(F821)


259-259: Undefined name torch

(F821)


262-262: Undefined name torch

(F821)


265-265: Undefined name torch

(F821)


278-278: Undefined name torch

(F821)


278-278: Undefined name torch

(F821)


285-285: Undefined name torch

(F821)


293-293: Undefined name torch

(F821)


293-293: Undefined name torch

(F821)


293-293: Undefined name torch

(F821)


295-295: Undefined name torch

(F821)


297-297: Undefined name torch

(F821)


301-301: Undefined name torch

(F821)


303-303: Undefined name torch

(F821)


304-304: Undefined name torch

(F821)


305-305: Undefined name torch

(F821)


307-307: Avoid specifying long messages outside the exception class

(TRY003)


355-355: Undefined name torch

(F821)


356-356: Undefined name torch

(F821)


366-366: Undefined name torch

(F821)


371-371: Undefined name Environment

(F821)


371-371: Undefined name BaseLoader

(F821)


375-375: Undefined name os

(F821)


376-376: Undefined name os

(F821)


383-383: Undefined name tqdm

(F821)


396-396: Abstract raise to an inner function

(TRY301)


396-396: Avoid specifying long messages outside the exception class

(TRY003)


404-404: Undefined name torch

(F821)


418-418: Undefined name torch

(F821)


419-419: Undefined name torch

(F821)


428-428: Undefined name np

(F821)


431-431: Undefined name os

(F821)


436-436: Consider moving this statement to an else block

(TRY300)


439-439: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


439-439: Use explicit conversion flag

Replace with conversion flag

(RUF010)


480-480: Undefined name os

(F821)


482-482: Undefined name os

(F821)


483-483: Undefined name os

(F821)


484-484: Undefined name os

(F821)


492-492: Undefined name retry_on_exception

(F821)


505-505: Undefined name os

(F821)


506-506: Undefined name os

(F821)


509-509: Undefined name os

(F821)


516-516: Undefined name torch

(F821)


516-516: Undefined name torch

(F821)


568-568: Avoid specifying long messages outside the exception class

(TRY003)


577-577: Undefined name DataProcessor

(F821)


593-593: Undefined name dataset

(F821)


599-599: Undefined name DataProcessor

(F821)


599-599: Undefined name config

(F821)


603-603: Undefined name os

(F821)


603-603: Undefined name config

(F821)


605-605: Undefined name dataset

(F821)


607-607: Undefined name config

(F821)


608-608: Undefined name config

(F821)


611-611: f-string without any placeholders

Remove extraneous f prefix

(F541)


612-612: Undefined name time

(F821)


615-615: Undefined name dataset

(F821)


617-617: Undefined name time

(F821)


619-619: f-string without any placeholders

Remove extraneous f prefix

(F541)


621-621: Undefined name dataset

(F821)


625-625: Undefined name os

(F821)


628-628: Do not catch blind exception: Exception

(BLE001)

notebooks/use-cases/subset-selection/data_preparation_and_config.ipynb

50-52: Avoid specifying long messages outside the exception class

(TRY003)


68-68: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


68-68: Use explicit conversion flag

Replace with conversion flag

(RUF010)


71-71: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


71-71: Use explicit conversion flag

Replace with conversion flag

(RUF010)


74-74: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


74-74: Use explicit conversion flag

Replace with conversion flag

(RUF010)


77-77: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


77-77: Use explicit conversion flag

Replace with conversion flag

(RUF010)


80-80: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


80-80: Use explicit conversion flag

Replace with conversion flag

(RUF010)


122-122: f-string without any placeholders

Remove extraneous f prefix

(F541)


185-185: Avoid specifying long messages outside the exception class

(TRY003)


276-276: Prefer TypeError exception for invalid type

(TRY004)


276-276: Avoid specifying long messages outside the exception class

(TRY003)


280-280: Prefer TypeError exception for invalid type

(TRY004)


280-280: Avoid specifying long messages outside the exception class

(TRY003)


282-284: Avoid specifying long messages outside the exception class

(TRY003)


286-286: Avoid specifying long messages outside the exception class

(TRY003)


310-310: By default, jinja2 sets autoescape to False. Consider using autoescape=True or the select_autoescape function to mitigate XSS vulnerabilities.

(S701)


335-335: Avoid specifying long messages outside the exception class

(TRY003)


364-366: Avoid specifying long messages outside the exception class

(TRY003)


383-385: Avoid specifying long messages outside the exception class

(TRY003)


492-492: f-string without any placeholders

Remove extraneous f prefix

(F541)


499-499: Do not catch blind exception: Exception

(BLE001)


520-520: f-string without any placeholders

Remove extraneous f prefix

(F541)

Comment on lines +509 to +538
" # Process each example in the shard\n",
" for example in dataset_shard:\n",
" # Format the text using the template\n",
" template = templates_dict.get(template_name)\n",
" if not template:\n",
" raise ValueError(f\"Unknown format type: {template_name}\")\n",
"\n",
" text = template.render(**example)\n",
" batch_texts.append(text)\n",
"\n",
" # Process when batch is full or at the end\n",
" if len(batch_texts) == batch_size or example == dataset_shard[-1]:\n",
" # Generate embeddings for this batch\n",
" with torch.no_grad():\n",
" batch_embeddings = (\n",
" encoder.encode(\n",
" inputs=batch_texts,\n",
" instruction=instruction,\n",
" return_tensors=False, # Return numpy for easier handling\n",
" )\n",
" )\n",
"\n",
" all_embeddings.append(batch_embeddings)\n",
" progress_bar.update(len(batch_texts))\n",
" batch_texts = []\n",
"\n",
" # Clean up GPU memory\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
"\n",

⚠️ Potential issue | 🟠 Major

Fix batch flush condition in shard processing.

Comparing example dicts to detect the last item is brittle. Use enumerate to flush the final partial batch reliably.

-        # Process each example in the shard
-        for example in dataset_shard:
+        # Process each example in the shard
+        for idx, example in enumerate(dataset_shard):
@@
-            # Process when batch is full or at the end
-            if len(batch_texts) == batch_size or example == dataset_shard[-1]:
+            # Process when batch is full or at the end
+            if len(batch_texts) == batch_size or idx == (len(dataset_shard) - 1):
                 # Generate embeddings for this batch
                 with torch.no_grad():
                     batch_embeddings = (
                         encoder.encode(
                             inputs=batch_texts,
                             instruction=instruction,
                             return_tensors=False,  # Return numpy for easier handling
                         )
                     )
 
                 all_embeddings.append(batch_embeddings)
                 progress_bar.update(len(batch_texts))
                 batch_texts = []
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
" # Process each example in the shard\n",
" for example in dataset_shard:\n",
" # Format the text using the template\n",
" template = templates_dict.get(template_name)\n",
" if not template:\n",
" raise ValueError(f\"Unknown format type: {template_name}\")\n",
"\n",
" text = template.render(**example)\n",
" batch_texts.append(text)\n",
"\n",
" # Process when batch is full or at the end\n",
" if len(batch_texts) == batch_size or example == dataset_shard[-1]:\n",
" # Generate embeddings for this batch\n",
" with torch.no_grad():\n",
" batch_embeddings = (\n",
" encoder.encode(\n",
" inputs=batch_texts,\n",
" instruction=instruction,\n",
" return_tensors=False, # Return numpy for easier handling\n",
" )\n",
" )\n",
"\n",
" all_embeddings.append(batch_embeddings)\n",
" progress_bar.update(len(batch_texts))\n",
" batch_texts = []\n",
"\n",
" # Clean up GPU memory\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
"\n",
# Process each example in the shard
for idx, example in enumerate(dataset_shard):
# Format the text using the template
template = templates_dict.get(template_name)
if not template:
raise ValueError(f"Unknown format type: {template_name}")
text = template.render(**example)
batch_texts.append(text)
# Process when batch is full or at the end
if len(batch_texts) == batch_size or idx == (len(dataset_shard) - 1):
# Generate embeddings for this batch
with torch.no_grad():
batch_embeddings = (
encoder.encode(
inputs=batch_texts,
instruction=instruction,
return_tensors=False, # Return numpy for easier handling
)
)
all_embeddings.append(batch_embeddings)
progress_bar.update(len(batch_texts))
batch_texts = []
# Clean up GPU memory
if torch.cuda.is_available():
torch.cuda.empty_cache()
🧰 Tools
🪛 Pylint (4.0.0)

[error] 513-513: Undefined variable 'torch'

(E0602)


[error] 513-513: Undefined variable 'torch'

(E0602)

🪛 Ruff (0.14.0)

509-509: Undefined name os

(F821)


516-516: Undefined name torch

(F821)


516-516: Undefined name torch

(F821)

🤖 Prompt for AI Agents
In notebooks/use-cases/subset-selection/embedding_generation.ipynb around lines
509 to 538, the loop flushes the final partial batch by comparing example dicts
(example == dataset_shard[-1]) which is brittle; change the loop to use
enumerate (for idx, example in enumerate(dataset_shard)) and replace the flush
condition with if len(batch_texts) == batch_size or idx == len(dataset_shard) -
1 so the last partial batch is reliably processed; keep template
lookup/validation, embedding call, appending to all_embeddings, progress_bar
update, batch_texts reset, and CUDA cache cleanup unchanged.
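
The enumerate-based flush pattern this suggestion relies on can be sketched in isolation. This is a minimal, self-contained illustration; the function name, the toy data, and the `process` callback are placeholders, not part of the notebook under review:

```python
def flush_in_batches(items, batch_size, process):
    """Collect items into fixed-size batches, flushing each full batch
    and the final partial batch exactly once."""
    batch = []
    results = []
    for idx, item in enumerate(items):
        batch.append(item)
        # Flush when the batch is full or when this is the last item,
        # using the index rather than comparing item values.
        if len(batch) == batch_size or idx == len(items) - 1:
            results.append(process(batch))
            batch = []
    return results

# 7 items with batch_size 3 -> batches of sizes 3, 3, 1
batches = flush_in_batches(list(range(7)), 3, list)
```

Unlike the value comparison `example == dataset_shard[-1]`, the index check cannot misfire when two examples compare equal, and it handles an empty shard without special-casing.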

Comment on lines +99 to +207
"def process_folds_with_gpu(args):\n",
" \"\"\"\n",
" Process folds on GPU or CPU with support for both percentage and absolute size specifications.\n",
" \"\"\"\n",
" (\n",
" gpu_id,\n",
" gpu_folds_info,\n",
" embeddings,\n",
" subset_sizes,\n",
" total_samples,\n",
" epsilon,\n",
" testing_mode,\n",
" ) = args\n",
"\n",
" try:\n",
" if torch.cuda.is_available():\n",
" torch.cuda.set_device(gpu_id)\n",
" device = f\"cuda:{gpu_id}\"\n",
" else:\n",
" if not testing_mode:\n",
" raise RuntimeError(\"GPU processing required but CUDA is not available\")\n",
" logger.warning(\n",
" \"Running in CPU mode for testing. Production use requires GPU acceleration.\"\n",
" )\n",
" device = \"cpu\"\n",
"\n",
" results = []\n",
" for fold_idx, fold_indices in gpu_folds_info:\n",
" try:\n",
" logger.info(f\"Processing fold {fold_idx + 1} on GPU {gpu_id}\")\n",
"\n",
" fold_embeddings = embeddings[fold_indices].to(device)\n",
"\n",
" logger.info(f\"Computing similarity matrix for fold {fold_idx + 1}\")\n",
" max_sim_mat = compute_pairwise_dense(\n",
" fold_embeddings,\n",
" batch_size=50000,\n",
" metric=\"cosine\",\n",
" device=device,\n",
" scaling=\"additive\",\n",
" )\n",
" similarity_matrix = max_sim_mat.cpu().numpy()\n",
"\n",
" subsets = {}\n",
" ds_func = FacilityLocationFunction(\n",
" n=similarity_matrix.shape[0],\n",
" sijs=similarity_matrix,\n",
" mode=\"dense\",\n",
" separate_rep=False,\n",
" )\n",
"\n",
" for size_spec in subset_sizes:\n",
" if isinstance(size_spec, float):\n",
" # Percentage-based selection\n",
" budget = max(\n",
" 1, math.ceil(size_spec * similarity_matrix.shape[0])\n",
" )\n",
" else:\n",
" # Absolute number-based selection\n",
" budget = max(\n",
" 1,\n",
" math.ceil(\n",
" size_spec * (similarity_matrix.shape[0] / total_samples)\n",
" ),\n",
" )\n",
"\n",
" logger.info(\n",
" f\"Selecting subset of size {budget} for fold {fold_idx + 1}\"\n",
" )\n",
"\n",
" subset_result = ds_func.maximize(\n",
" budget=budget,\n",
" optimizer=\"LazierThanLazyGreedy\",\n",
" epsilon=epsilon,\n",
" stopIfZeroGain=False,\n",
" stopIfNegativeGain=False,\n",
" verbose=False,\n",
" )\n",
"\n",
" subset_indices = [fold_indices[x[0]] for x in subset_result]\n",
" subset_gains = [x[1] for x in subset_result]\n",
" subsets[size_spec] = {\n",
" \"indices\": subset_indices,\n",
" \"gains\": subset_gains,\n",
" }\n",
"\n",
" results.append((fold_idx, subsets))\n",
"\n",
" except Exception as e:\n",
" logger.error(\n",
" f\"Error processing fold {fold_idx + 1} on GPU {gpu_id}: {str(e)}\"\n",
" )\n",
" raise\n",
" finally:\n",
" # Cleanup - ADDED THIS SECTION\n",
" for var in [\"ds_func\", \"similarity_matrix\", \"fold_embeddings\"]:\n",
" if var in locals():\n",
" del locals()[var]\n",
" gc.collect()\n",
" if torch.cuda.is_available():\n",
" torch.cuda.empty_cache()\n",
"\n",
" return results\n",
" except Exception as e:\n",
" logger.error(f\"Error in process_folds_with_gpu on GPU {gpu_id}: {str(e)}\")\n",
" raise\n",
"\n",
"print(\"✅ process_folds_with_gpu function defined!\")"
]

⚠️ Potential issue | 🟠 Major

Fix empty-fold handling, cleanup reliability, and logging in process_folds_with_gpu.

  • Skip empty folds to avoid FacilityLocationFunction(n=0) errors when num_folds > dataset size.
  • Replace locals() deletion with explicit deletions; current code doesn’t free memory.
  • Use logger.exception to preserve tracebacks.
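
The third bullet can be demonstrated on its own: `logger.exception` logs at the same ERROR level as `logger.error`, but appends the active exception's traceback automatically. A small self-contained sketch (the logger name and captured stream are illustrative only):

```python
import io
import logging

def log_with_traceback():
    """Log an error with logger.exception and return the captured output,
    showing that the traceback is included without formatting it by hand."""
    stream = io.StringIO()
    logger = logging.getLogger("fold-demo")
    logger.addHandler(logging.StreamHandler(stream))
    try:
        raise ValueError("boom")
    except ValueError:
        # Must be called from inside an except block so the active
        # exception's traceback can be picked up.
        logger.exception("Error processing fold %s on GPU %s", 1, 0)
    return stream.getvalue()
```

The returned text contains both the message and the full `Traceback ... ValueError: boom` block, which is exactly what `logger.error(f"... {str(e)}")` discards.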
 def process_folds_with_gpu(args):
@@
-    try:
+    try:
         if torch.cuda.is_available():
             torch.cuda.set_device(gpu_id)
             device = f"cuda:{gpu_id}"
         else:
             if not testing_mode:
                 raise RuntimeError("GPU processing required but CUDA is not available")
             logger.warning(
                 "Running in CPU mode for testing. Production use requires GPU acceleration."
             )
             device = "cpu"
 
         results = []
         for fold_idx, fold_indices in gpu_folds_info:
             try:
                 logger.info(f"Processing fold {fold_idx + 1} on GPU {gpu_id}")
 
+                # Skip empty folds (can occur when num_folds > num_samples)
+                if len(fold_indices) == 0:
+                    logger.info(f"Skipping empty fold {fold_idx + 1} on GPU {gpu_id}")
+                    continue
+
                 fold_embeddings = embeddings[fold_indices].to(device)
@@
-            except Exception as e:
-                logger.error(
-                    f"Error processing fold {fold_idx + 1} on GPU {gpu_id}: {str(e)}"
-                )
-                raise
+            except Exception as e:
+                logger.exception("Error processing fold %s on GPU %s: %s", fold_idx + 1, gpu_id, e)
+                raise
             finally:
-                # Cleanup - ADDED THIS SECTION
-                for var in ["ds_func", "similarity_matrix", "fold_embeddings"]:
-                    if var in locals():
-                        del locals()[var]
-                gc.collect()
-                if torch.cuda.is_available():
-                    torch.cuda.empty_cache()
+                # Cleanup large objects to free memory
+                try:
+                    del ds_func
+                except UnboundLocalError:
+                    pass
+                try:
+                    del similarity_matrix
+                except UnboundLocalError:
+                    pass
+                try:
+                    del fold_embeddings
+                except UnboundLocalError:
+                    pass
+                gc.collect()
+                if torch.cuda.is_available():
+                    torch.cuda.empty_cache()
@@
-    except Exception as e:
-        logger.error(f"Error in process_folds_with_gpu on GPU {gpu_id}: {str(e)}")
-        raise
+    except Exception as e:
+        logger.exception("Error in process_folds_with_gpu on GPU %s: %s", gpu_id, e)
+        raise
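
Why the original `del locals()[var]` cleanup is a no-op deserves a quick illustration. In CPython, `locals()` inside a function returns a snapshot of the frame's locals; deleting a key from that dict never unbinds the actual local variable, whereas an explicit `del` does. Function names here are illustrative:

```python
def locals_delete_is_a_noop():
    x = [0] * 1000
    snapshot = locals()
    if "x" in snapshot:
        del snapshot["x"]  # mutates only the snapshot dict, not the frame
    # A fresh locals() call still sees x: nothing was freed.
    return "x" in locals()

def explicit_delete_works():
    x = [0] * 1000
    del x  # actually unbinds the local, making the list collectable
    return "x" in locals()
```

This is why the suggested diff replaces the loop over variable names with explicit `del` statements guarded by `UnboundLocalError`.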
🧰 Tools
🪛 Pylint (4.0.0)

[error] 125-125: Undefined variable 'gc'

(E0602)


[error] 126-126: Undefined variable 'torch'

(E0602)


[error] 127-127: Undefined variable 'torch'

(E0602)


[refactor] 139-139: Too many positional arguments (9/5)

(R0917)


[error] 168-168: Undefined variable 'os'

(E0602)


[error] 176-176: Undefined variable 'get_default_num_gpus'

(E0602)


[error] 184-184: Undefined variable 'BasicConfig'

(E0602)


[error] 195-195: Undefined variable 'SystemConfig'

(E0602)


[error] 204-204: Undefined variable 'ProcessingConfig'

(E0602)


[refactor] 139-139: Too many statements (52/50)

(R0915)

🪛 Ruff (0.14.0)

117-119: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


118-118: Use explicit conversion flag

Replace with conversion flag

(RUF010)


126-126: Undefined name gc

(F821)


127-127: Undefined name torch

(F821)


128-128: Undefined name torch

(F821)


130-130: Consider moving this statement to an else block

(TRY300)


132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


132-132: Use explicit conversion flag

Replace with conversion flag

(RUF010)


149-149: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


170-170: Undefined name os

(F821)


171-174: Avoid specifying long messages outside the exception class

(TRY003)


178-178: Undefined name get_default_num_gpus

(F821)


186-186: Undefined name BasicConfig

(F821)


197-197: Undefined name SystemConfig

(F821)


206-206: Undefined name ProcessingConfig

(F821)

Comment on lines +645 to +648
" del dataset, embeddings\n",
" gc.collect()\n",
" torch.cuda.empty_cache()\n",
"\n",

⚠️ Potential issue | 🟠 Major

Guard CUDA cache clear on CPU-only runs.

Unconditional torch.cuda.empty_cache() raises when CUDA isn’t available.

-        torch.cuda.empty_cache()
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.0)

645-645: Undefined name torch

(F821)

🤖 Prompt for AI Agents
In notebooks/use-cases/subset-selection/subset_selection.ipynb around lines 645
to 648, the code calls torch.cuda.empty_cache() unconditionally which raises on
CPU-only environments; guard this call by checking torch.cuda.is_available() (or
wrap in a try/except) so torch.cuda.empty_cache() is only invoked when CUDA is
present, keeping the del dataset/embeddings and gc.collect() behavior unchanged.

@RobuRishabh RobuRishabh changed the title R1146 Subset Selection Initial repo setup [RHAIENG-1146] Subset Selection Notebooks Oct 16, 2025
@RobuRishabh RobuRishabh closed this Nov 3, 2025