SARM: Interpretable Reward Model Demo

This is an interactive demo for the SARM-4B model (Sparse Autoencoder-enhanced Reward Model).

SARM is a novel reward model architecture that enhances interpretability by integrating a pretrained Sparse Autoencoder (SAE). It maps the internal hidden states of a large language model into a sparse and human-understandable feature space, making the resulting reward scores transparent and conceptually meaningful.
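As a rough illustration of the idea above, the sketch below encodes a hidden state with a toy sparse autoencoder encoder and computes a scalar reward as a weighted sum of the features that remain active. This is a minimal sketch, not the SARM-4B implementation: the ReLU plus top-k sparsity rule, the tiny dimensions, the weight values, and the names `sae_encode`, `reward_from_features`, and `explain` are all assumptions made for illustration.

```python
from typing import List, Tuple


def sae_encode(hidden: List[float], w_enc: List[List[float]],
               b_enc: List[float], k: int) -> List[float]:
    """Toy SAE encoder (assumed form): linear map, ReLU, keep the k largest activations."""
    pre = [sum(w * h for w, h in zip(row, hidden)) + b
           for row, b in zip(w_enc, b_enc)]
    acts = [max(0.0, p) for p in pre]
    # Indices of the k largest activations; every other feature is zeroed out.
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def reward_from_features(features: List[float], w_value: List[float]) -> float:
    """Scalar reward as a weighted sum of sparse feature activations."""
    return sum(f * w for f, w in zip(features, w_value))


def explain(features: List[float], w_value: List[float]) -> List[Tuple[int, float]]:
    """Contribution of each active feature to the score, largest magnitude first."""
    contribs = [(i, f * w) for i, (f, w) in enumerate(zip(features, w_value))
                if f != 0.0]
    return sorted(contribs, key=lambda t: abs(t[1]), reverse=True)


# Hypothetical toy values: a 3-dimensional hidden state and 4 SAE features.
hidden = [1.0, -2.0, 0.5]
w_enc = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 4.0],
    [1.0, 1.0, 1.0],
]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_value = [0.5, 1.0, -1.0, 2.0]

features = sae_encode(hidden, w_enc, b_enc, k=2)
score = reward_from_features(features, w_value)
print(features)                    # sparse feature vector
print(score)                       # scalar reward
print(explain(features, w_value))  # (feature index, contribution) pairs
```

Because only a few features are non-zero, the score in this sketch decomposes into a short list of per-feature contributions, which is the sense in which a sparse feature space can make a reward easier to inspect.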

How to use this Demo:

  1. Enter a Prompt (e.g., a question) in the left textbox below.
  2. Enter a corresponding Response in the right textbox.
  3. Click the "Calculate Reward Score" button.

The model outputs a scalar score that evaluates the quality of the response. A higher score means that SARM considers the response to be of better quality.
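One common way to use such scalar scores is to rank several candidate responses to the same prompt. The sketch below shows that pattern under stated assumptions: `rank_responses` is a hypothetical helper, and `placeholder_score` is a stand-in that merely counts words so the example runs offline. It is not the SARM model and says nothing about how SARM actually scores text.

```python
from typing import Callable, List, Tuple


def rank_responses(prompt: str, responses: List[str],
                   score_fn: Callable[[str, str], float]) -> List[Tuple[float, str]]:
    """Score every response for one prompt and sort from highest to lowest score."""
    scored = [(score_fn(prompt, r), r) for r in responses]
    return sorted(scored, key=lambda t: t[0], reverse=True)


def placeholder_score(prompt: str, response: str) -> float:
    # Stand-in scorer (assumption): word count only, NOT a real reward model.
    return float(len(response.split()))


ranked = rank_responses(
    "What is 2 + 2?",
    ["4", "2 + 2 equals 4."],
    placeholder_score,
)
best_score, best_response = ranked[0]
print(best_response, best_score)
```

In practice `score_fn` would be replaced by a call that returns the reward score for a prompt and response pair; the ranking logic stays the same.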


Figure: SARM Architecture
