Safety Polytope (SaP) is a geometric safety layer that defines a set of half-space constraints ("facets") in the hidden-state representation space of a language model (see *Learning Safety Constraints for Large Language Models* for the technical details). `sap-interpreter` provides a lightweight, post-hoc analysis toolkit for understanding what these facets / edges capture and whether specialization emerges inside the polytope.
The library focuses on three complementary questions:
- Which inputs violate which facets (at whole-input and token level)? (`compute-edge-violations`)
- Which concept encoder features are activated by a given input (at whole-input and token level)? (`extract-sae-activations`)
Taken together, these scripts let you trace a single safety facet all the way from hidden state → activation → violation → natural-language example.
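To make the facet idea concrete, here is a minimal NumPy sketch of a half-space check. It is an illustration under stated assumptions, not the code behind `compute-edge-violations`: the sign convention (a facet counts as violated when the projection onto its normal exceeds its threshold), the clipping at zero, and the names `facet_violations`, `normals` and `thresholds` are all hypothetical.

```python
import numpy as np


def facet_violations(hidden, normals, thresholds):
    """Return max(0, w_k . h - b_k) for every hidden state h and facet k.

    hidden:     (d,) or (T, d) array of hidden states (hypothetical layout)
    normals:    (K, d) array, one facet normal w_k per row
    thresholds: (K,) array of facet offsets b_k
    Assumed convention: a facet is violated when w_k . h > b_k.
    """
    hidden = np.atleast_2d(hidden)             # promote (d,) to (1, d)
    scores = hidden @ normals.T - thresholds   # signed margin per facet, shape (T, K)
    return np.maximum(scores, 0.0)             # keep only the violated part


# Synthetic data purely for illustration: 4 facets in an 8-dimensional space.
rng = np.random.default_rng(0)
normals = rng.normal(size=(4, 8))
thresholds = np.ones(4)
tokens = rng.normal(size=(5, 8))               # 5 per-token hidden states

viol = facet_violations(tokens, normals, thresholds)
violated = viol > 0                            # boolean mask, tokens x facets
print(viol.shape)
print(violated.sum(axis=0))                    # number of violating tokens per facet
```

Under these assumptions, passing a stack of per-token vectors gives a token-level view, while passing a single vector gives one row of scores for that state.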
Below is the pipeline we use in our mechanistic-interpretability studies.
1. Extract hidden states from your dataset (`crlhf`).
2. Train SaP (`crlhf`).
3. Compute sample- / token-level facet violations.

        compute-edge-violations \
            --model_path /path/to/model \
            --trained_weights_path /path/to/weights \
            --hidden_states_path /path/to/hidden_states \
            --output_dir outputs \
            --token_level \
            --save_separate_datasets

   - Outputs compressed NPZ files with raw violation scores and a CSV with per-facet statistics.
   - When `--token_level` is active, we additionally store `token_level/*.npz` containing the token sequence, per-token violations and the original text.
4. Extract SaP Concept Encoder activations.

        extract-sae-activations \
            --model_path /path/to/model \
            --trained_weights_path /path/to/weights \
            --hidden_states_path /path/to/hidden_states \
            --output_dir outputs \
            --token_level \
            --save_separate_datasets

   This mirrors step 3 but saves activations on features (before applying facet normals/thresholds) instead of violations on facets.
5. Aggregate results across multiple runs (optional).

        python -m sap_interpreter.combine_violations \
            --input_dirs outputs/run-* \
            --output_dir outputs/combined \
            --token_level --edge_level

6. Inspect & visualise.
   - Use the companion Streamlit app in `sap-interpret-frontend` to explore `edge_violations.npz`, `sae_activations.npz` and token-level files.
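If you prefer to look at the NPZ outputs from the pipeline above without the frontend, plain NumPy is enough. The sketch below lists the arrays stored in an archive rather than assuming their names, since the key names inside `edge_violations.npz` are not documented here. The helper `summarize_npz` and the stand-in array `violations` are hypothetical; the temporary file exists only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np


def summarize_npz(path):
    """Map each array name in an NPZ archive to its (shape, dtype string)."""
    summary = {}
    # allow_pickle stays False (the NumPy default); archives that store
    # Python objects would need allow_pickle=True, only for trusted files.
    with np.load(path, allow_pickle=False) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary


# Stand-in archive: the key "violations" is made up for this demo and is
# not necessarily what compute-edge-violations writes.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "edge_violations.npz"
    np.savez_compressed(demo_path, violations=np.zeros((3, 4), dtype=np.float32))
    summary = summarize_npz(demo_path)

for name, (shape, dtype) in summary.items():
    print(name, shape, dtype)
```

Pointing `summarize_npz` at a real file under `outputs/` should show which arrays are available before you write any analysis code against them.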
1. Clone the repository:

        git clone [repository-url]
        cd sap-interpreter

2. Install the package in development mode:

        pip install -e .

   For a regular (non-editable) install, use instead:

        pip install .

Safety Polytope dependency
All scripts expect the upstream Safety Polytope package (which still registers itself under the `crlhf` namespace for backward compatibility) to be importable. Install it once per environment:
git clone https://github.com/lasgroup/SafetyPolytope.git
cd SafetyPolytope
pip install -e .
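Before running the scripts, it can help to confirm that the dependency actually resolves in the active environment. The following is a small sketch using only the standard library. The helper name `is_importable` is hypothetical, and the check only verifies that a top-level `crlhf` module can be found, not that its version matches what `sap-interpreter` expects.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("crlhf"):
    print("crlhf found: Safety Polytope appears to be installed.")
else:
    print("crlhf not found: install Safety Polytope in this environment first.")
```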