SAP Interpreter Overview

Safety Polytope (SaP) is a geometric safety layer that defines a set of half-space constraints ("facets") in the hidden-state representation space of a language model (see Learning Safety Constraints for Large Language Models for the technical details).

sap-interpreter is a lightweight, post-hoc analysis toolkit for understanding what these facets (called edges in the script names) capture and whether specialization emerges inside the polytope.

The library focuses on the following complementary questions:

• Which inputs violate which facets, at whole-input and token level? (compute-edge-violations)
• Which concept encoder features are activated by a given input, at whole-input and token level? (extract-sae-activations)

Taken together, these scripts let you trace a single safety facet all the way from hidden state → activation → violation → natural-language example.
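The hidden state → activation → violation chain can be sketched in a few lines of NumPy. This is an illustration only, not the package's implementation: the ReLU encoder, the sign convention (a facet is violated when the projection exceeds its threshold), and all names (`encoder_w`, `normals`, `thresholds`) are assumptions made for the example.

```python
import numpy as np

def facet_violations(hidden, encoder_w, encoder_b, normals, thresholds):
    """Return (features, violations) for a batch of hidden states.

    hidden:     (n, d) hidden states
    encoder_w:  (d, k) hypothetical concept-encoder weights
    encoder_b:  (k,)   hypothetical concept-encoder bias
    normals:    (m, k) one facet normal per row
    thresholds: (m,)   one offset per facet
    """
    # Stage 1: feature activations (assumed here to be a ReLU encoder).
    features = np.maximum(hidden @ encoder_w + encoder_b, 0.0)
    # Stage 2: signed distance to each half-space, clipped at zero.
    signed = features @ normals.T - thresholds
    violations = np.maximum(signed, 0.0)
    return features, violations

# Toy data: two inputs, two features, two facets.
hidden = np.array([[1.0, 0.0], [0.0, 2.0]])
encoder_w = np.eye(2)
encoder_b = np.zeros(2)
normals = np.array([[1.0, 1.0], [-1.0, 0.0]])
thresholds = np.array([1.5, 0.0])

features, violations = facet_violations(
    hidden, encoder_w, encoder_b, normals, thresholds
)
print(violations)
```

In this toy setup only the second input crosses the first facet; every other entry is zero.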

End-to-End Workflow

Below is the pipeline we use in our mechanistic-interpretability studies.

1. Extract hidden states from your dataset (using crlhf).
2. Train SaP (using crlhf).
3. Compute sample- / token-level facet violations.
```
compute-edge-violations \
 --model_path /path/to/model \
 --trained_weights_path /path/to/weights \
 --hidden_states_path /path/to/hidden_states \
 --output_dir outputs \
 --token_level \
 --save_separate_datasets
```
• Outputs compressed NPZ files with raw violation scores and a CSV file with per-facet statistics.
• When --token_level is set, token_level/*.npz files are also written, containing the token sequence, the per-token violations, and the original text.
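Because the array names inside these NPZ files are not documented here, a safe first step is to list what an archive contains before relying on any key. The helper below is a minimal sketch; `describe_npz` and the demo key `violations` are hypothetical, and the demo runs on a synthetic file rather than real output.

```python
import os
import tempfile

import numpy as np

def describe_npz(path):
    """Map every array name in an NPZ archive to its (shape, dtype)."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }

# Demo on a synthetic archive; real files may use different key names.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_violations.npz")
np.savez_compressed(demo_path, violations=np.zeros((4, 3), dtype=np.float32))

summary = describe_npz(demo_path)
print(summary)
```

If an archive stores Python objects (for example, ragged token lists), `np.load` needs `allow_pickle=True`; only enable that for files you trust.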

4. Extract SaP Concept Encoder activations.
```
extract-sae-activations \
 --model_path /path/to/model \
 --trained_weights_path /path/to/weights \
 --hidden_states_path /path/to/hidden_states \
 --output_dir outputs \
 --token_level \
 --save_separate_datasets
```

This mirrors step 3 but saves activations on features (before applying facet normals/thresholds) instead of violations on facets.
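Once token-level activations are loaded as a 2-D array (tokens × features), a common first look is to ask which features fire most strongly on each token. The sketch below assumes that layout; `top_k_features` is a hypothetical helper, not part of sap-interpreter.

```python
import numpy as np

def top_k_features(activations, k=3):
    """For each row, return the indices of the k largest activations,
    strongest first."""
    k = min(k, activations.shape[1])
    # Negate so that argsort (ascending) yields a descending ranking.
    order = np.argsort(-activations, axis=1)
    return order[:, :k]

# Toy example: two tokens, four features.
token_acts = np.array([
    [0.1, 0.9, 0.0, 0.4],
    [0.7, 0.2, 0.8, 0.0],
])
top = top_k_features(token_acts, k=2)
print(top)
```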

5. Aggregate results across multiple runs (optional).
```
python -m sap_interpreter.combine_violations \
    --input_dirs outputs/run-* \
    --output_dir outputs/combined \
    --token_level --edge_level
```
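For intuition about what aggregating runs involves, the sketch below stacks one array from every run directory matching a glob pattern. It is not a description of how combine_violations works internally: the key name `violations`, the row-wise concatenation, and the helper `concat_runs` are assumptions made for the example, which runs on synthetic files.

```python
import glob
import os
import tempfile

import numpy as np

def concat_runs(pattern, filename, key):
    """Concatenate one array across every run directory matching `pattern`."""
    parts = []
    for run_dir in sorted(glob.glob(pattern)):
        with np.load(os.path.join(run_dir, filename)) as archive:
            parts.append(archive[key])
    return np.concatenate(parts, axis=0)

# Build two synthetic runs with 2 and 3 samples over 4 facets.
root = tempfile.mkdtemp()
for i, rows in enumerate([2, 3]):
    run_dir = os.path.join(root, f"run-{i}")
    os.makedirs(run_dir)
    np.savez_compressed(
        os.path.join(run_dir, "edge_violations.npz"),
        violations=np.full((rows, 4), float(i)),  # hypothetical key name
    )

combined = concat_runs(
    os.path.join(root, "run-*"), "edge_violations.npz", "violations"
)
print(combined.shape)
```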

6. Inspect & visualise.
• Use the companion Streamlit app in sap-interpret-frontend to explore edge_violations.npz, sae_activations.npz and token-level files.

Installation

Local Development Installation

Clone the repository:
```
git clone [repository-url]
cd sap-interpreter
```

Install the package in development mode:
```
pip install -e .
```

Regular Installation

```
pip install .
```

Safety Polytope dependency

All scripts expect the upstream Safety Polytope package (which still registers itself under the crlhf namespace for backward-compatibility) to be importable. Install it once per environment:

```
git clone https://github.com/lasgroup/SafetyPolytope.git
cd SafetyPolytope
pip install -e .
```
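To confirm that the dependency is visible in the current environment before running any script, a quick check along these lines may help. It uses only the standard library; `is_importable` is a hypothetical helper name.

```python
import importlib.util

def is_importable(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

if not is_importable("crlhf"):
    print("crlhf not found; install the Safety Polytope package first.")
```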
