SARM: Interpretable Reward Model Demo
This is an interactive demo for the SARM-4B model (Sparse Autoencoder-enhanced Reward Model).
SARM is a novel reward model architecture that enhances interpretability by integrating a pretrained Sparse Autoencoder (SAE). It maps the internal hidden states of a large language model into a sparse and human-understandable feature space, making the resulting reward scores transparent and conceptually meaningful.
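To make the idea concrete, here is a minimal PyTorch sketch of the architecture described above: a hidden state from the base LLM is encoded by an SAE into sparse features, and a linear head maps those features to a scalar reward. All module and dimension names here (e.g., `SparseAutoencoderRewardHead`, `d_model`, `d_sae`) are illustrative assumptions, not the actual SARM implementation.

```python
# Minimal sketch of the SARM idea (illustrative, not the official implementation).
import torch
import torch.nn as nn

class SparseAutoencoderRewardHead(nn.Module):
    """Encode an LLM hidden state into sparse SAE features, then score it."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        # SAE-style encoder: the ReLU keeps feature activations sparse.
        self.enc = nn.Linear(d_model, d_sae)
        # Linear reward head over the sparse, human-inspectable features.
        self.reward = nn.Linear(d_sae, 1)

    def forward(self, hidden_state: torch.Tensor):
        z = torch.relu(self.enc(hidden_state))  # sparse feature activations
        score = self.reward(z)                  # scalar reward per sequence
        return score.squeeze(-1), z

# Example: score one stand-in "hidden state" and inspect the most active features.
head = SparseAutoencoderRewardHead(d_model=4096, d_sae=65536)
h = torch.randn(1, 4096)                # placeholder for a real LLM hidden state
score, z = head(h)
top_features = z.topk(5).indices        # features that drive the score
print(score.item(), top_features.tolist())
```

Because the reward is linear in the sparse activations, each feature's contribution to the final score can be read off directly, which is what makes the resulting score interpretable.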
How to use this Demo:
- Enter a Prompt (e.g., a question) in the left textbox below.
- Enter a corresponding Response in the right textbox.
- Click the "Calculate Reward Score" button.
The model outputs a scalar score evaluating the quality of the response; a higher score means SARM judges the response to be of higher quality.
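If you want to reproduce the demo's scoring step outside this page, the sketch below shows one plausible way to query the model with `transformers`. It assumes SARM-4B exposes a standard sequence-classification (reward) interface and that its custom code loads via `trust_remote_code`; treat both as assumptions rather than the documented loading procedure.

```python
# Sketch of scoring a (prompt, response) pair. Assumes a standard
# sequence-classification reward interface; the real SARM loading code may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "schrieffer/SARM-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumption: custom SARM code ships with the repo
)
model.eval()

prompt = "What is the capital of France?"
response = "The capital of France is Paris."

# Format the pair with the chat template, mirroring the demo's two textboxes.
chat = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
inputs = tokenizer.apply_chat_template(chat, return_tensors="pt")

with torch.no_grad():
    # The scalar reward: higher means the model rates the response as better.
    score = model(inputs).logits[0][0].item()
print(f"Reward score: {score:.4f}")
```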
SARM Architecture
Authors (* indicates equal contribution)
Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Model: schrieffer/SARM-4B
- Fine-tuned from: Llama-3.1-8B-Instruct
Code Repository: https://github.com/schrieffer-z/sarm
| Prompt / Question | Response to be Evaluated |
| --- | --- |