Longterm Wiki

Activation Monitoring

Interpretability, emerging

Using probes and monitors on model internals to detect deception, harmful intent, or anomalous reasoning in real time.
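As a minimal sketch of what a probe-based monitor can look like, the example below fits a difference-of-means linear probe and flags activations that project past a threshold. It assumes activations are available as plain lists of floats and uses synthetic data in place of a real model. The names `fit_probe`, `monitor`, and `sample`, the dimension, and the shift size are all hypothetical choices for illustration, not a method described by this entry.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(flagged, benign):
    """Fit a difference-of-means probe (an assumed, simple probe type).

    Returns a direction and a threshold placed midway between the
    projected class means.
    """
    mu_f = mean_vector(flagged)
    mu_b = mean_vector(benign)
    direction = [f - b for f, b in zip(mu_f, mu_b)]
    threshold = (dot(direction, mu_f) + dot(direction, mu_b)) / 2
    return direction, threshold


def monitor(activation, direction, threshold):
    """Return True when the activation projects past the threshold."""
    return dot(activation, direction) > threshold


# Synthetic stand-in for model activations (assumption: the two classes
# differ only by a shift along the first coordinate).
rng = random.Random(0)
DIM = 16


def sample(shift):
    return [rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(DIM)]


flagged_train = [sample(4.0) for _ in range(200)]
benign_train = [sample(-4.0) for _ in range(200)]
direction, threshold = fit_probe(flagged_train, benign_train)

# Held-out check: fraction of fresh samples the monitor labels correctly.
held_out = [(sample(4.0), True) for _ in range(100)]
held_out += [(sample(-4.0), False) for _ in range(100)]
accuracy = sum(
    monitor(x, direction, threshold) == label for x, label in held_out
) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real deployment the vectors would come from a chosen layer of the model at inference time, and the probe would need labelled examples of the behaviour being monitored; both are outside what this sketch covers.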

Organizations: 3
Cluster: Interpretability
Parent Area: Interpretability

Tags

interpretability, monitoring, probes