SARM: Interpretable Reward Model Demo

This is an interactive demo for the SARM-4B model (Sparse Autoencoder-enhanced Reward Model).

SARM is a novel reward model architecture that enhances interpretability by integrating a pretrained Sparse Autoencoder (SAE). It maps the internal hidden states of a large language model into a sparse and human-understandable feature space, making the resulting reward scores transparent and conceptually meaningful.
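The idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SARM implementation: it assumes a ReLU encoder with a top-k sparsity rule and a linear scoring head over the sparse features, none of which is specified in the text above. The names `topk_sparse_encode`, `reward_from_features`, and `feature_contributions`, and all weights, are hypothetical.

```python
# Illustrative sketch only: toy weights, assumed top-k SAE and linear head.

def topk_sparse_encode(hidden, enc_weights, enc_bias, k):
    """Project a hidden state into the SAE feature space, keep the k largest activations."""
    pre = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(enc_weights, enc_bias)]
    acts = [max(0.0, p) for p in pre]  # ReLU
    # Indices of the k largest activations; everything else is zeroed out.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]

def reward_from_features(features, value_weights):
    """Scalar reward as a weighted sum of sparse feature activations (assumed linear head)."""
    return sum(f * w for f, w in zip(features, value_weights))

def feature_contributions(features, value_weights):
    """Per-feature share of the reward, restricted to the active features."""
    return {i: f * w
            for i, (f, w) in enumerate(zip(features, value_weights))
            if f != 0.0}

# Toy example: a 3-dimensional hidden state and 4 SAE features.
hidden = [1.0, -2.0, 0.5]
enc_weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
enc_bias = [0.0, 0.0, 0.0, 0.0]
value_weights = [2.0, 5.0, -1.0, 3.0]

features = topk_sparse_encode(hidden, enc_weights, enc_bias, k=2)
reward = reward_from_features(features, value_weights)
contributions = feature_contributions(features, value_weights)
print(features, reward, contributions)
```

Under these assumptions the score decomposes into the contributions of the few active features, which is what makes it possible to inspect why a response received a given reward.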

How to use this Demo:

1. Enter a Prompt (e.g., a question) in the left textbox below.
2. Enter a corresponding Response in the right textbox.
3. Click the "Calculate Reward Score" button.

The model outputs a single scalar score for the response. A higher score means that SARM judges the response to be of better quality.
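Because the output is a single number where higher means better, scores for several responses to the same prompt can be compared directly. The sketch below ranks candidate responses by score. `rank_responses` and `toy_score` are hypothetical names; `toy_score` is a placeholder that only counts words shared with the prompt and stands in for a real call to the reward model, which is not shown here.

```python
from typing import Callable, List, Tuple

def rank_responses(
    prompt: str,
    responses: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score each response for the prompt and sort from highest to lowest score."""
    scored = [(score_fn(prompt, r), r) for r in responses]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_score(prompt: str, response: str) -> float:
    """Placeholder scorer (NOT SARM): counts response words that also occur in the prompt."""
    prompt_words = set(prompt.lower().split())
    return float(sum(1 for w in response.lower().split() if w in prompt_words))

prompt = "what is the capital of france"
candidates = [
    "i do not know",
    "the capital of france is paris",
]
ranked = rank_responses(prompt, candidates, toy_score)
best_score, best_response = ranked[0]
print(best_score, best_response)
```

Replacing `toy_score` with a function that returns the model's reward for a prompt and response pair would leave the ranking logic unchanged.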

SARM Architecture

Authors (* indicates equal contribution)

Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang
Paper: Interpretable Reward Model via Sparse Autoencoder
Model: schrieffer/Llama-SARM-4B
- Finetuned from model: Llama-3.1-8B-Instruct
Code Repository: https://github.com/schrieffer-z/sarm