LLM Evaluation Suite

Configure a test model and an LLM-as-judge evaluator, run the suite on the 10 loaded dataset rows, then upload the results to LangSmith.
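
The pipeline itself is small: send each row's user message to the test model, have the judge score the answer against the reference response, and collect the verdicts. The sketch below is illustrative only; call_model, the Row fields, and the result keys are hypothetical names, not the suite's actual API.

```python
# Minimal sketch of the evaluation pipeline; names are hypothetical.
from dataclasses import dataclass


@dataclass
class Row:
    user_message: str  # sent to the test model
    reference: str     # ideal response, shown only to the judge


def call_model(system: str, user: str) -> str:
    """Placeholder for the configured chat-completion client."""
    raise NotImplementedError


def run_suite(rows: list[Row], test_system: str, judge_system: str) -> list[dict]:
    results = []
    for i, row in enumerate(rows, start=1):
        answer = call_model(test_system, row.user_message)
        # The judge sees the question, the candidate answer, and the reference.
        verdict = call_model(
            judge_system,
            f"Question: {row.user_message}\n"
            f"Answer: {answer}\n"
            f"Reference: {row.reference}",
        )
        results.append({"row": i, "answer": answer, "verdict": verdict})
    return results
```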

Test model

Choose a profile to load a base system prompt and sample prompts, or select Custom to write your own.

Edit the base instructions. Enabled evaluation criteria (right panel) are appended as constraints under // Active Evaluation Constraints: in the preview below.
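
The preview is plain string assembly: base instructions first, then the // Active Evaluation Constraints: header, then one line per enabled criterion. A minimal sketch, assuming criteria arrive as name-to-enabled pairs and that each constraint is rendered as a //-prefixed line (the exact formatting is an assumption):

```python
def build_effective_prompt(base_instructions: str, criteria: dict[str, bool]) -> str:
    """Append enabled criteria under the constraints header (formatting assumed)."""
    enabled = [name for name, on in criteria.items() if on]
    if not enabled:
        return base_instructions
    lines = [base_instructions, "", "// Active Evaluation Constraints:"]
    lines += [f"// - {name}" for name in enabled]
    return "\n".join(lines)
```

For example, build_effective_prompt(base, {"No markdown": True, "Cite sources": False}) returns the base prompt plus a single "// - No markdown" constraint line.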

Effective test system prompt (live preview)

Evaluator (judge)

The criteria below are appended to the judge prompt automatically; a run passes when the judge's score reaches 70%.

Full evaluator system prompt (live preview)

Evaluation criteria

Toggle individual criteria on or off for the judge prompt and for the test-model instruction injection (see Test model → effective prompt preview). Pass threshold: ≥70%.
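
The pass/fail decision is a plain threshold on the judge's score. A sketch, assuming the judge is prompted to emit a 0-100 score somewhere in its reply; the number-extraction step is an assumption:

```python
import re

PASS_THRESHOLD = 70  # percent; a run passes at 70% or above


def parse_score(judge_reply: str) -> float:
    """Extract the first number from the judge's reply (assumed 0-100 scale)."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {judge_reply!r}")
    return float(match.group())


def passes(judge_reply: str) -> bool:
    return parse_score(judge_reply) >= PASS_THRESHOLD
```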

Load dataset from file

Accepted formats: a .json array, a .json object wrapping the rows under one of the keys items, dataset, data, examples, or rows, or a .jsonl file with one JSON object per line. Limits: 1 MiB per file, 200 rows, 8 KiB per field.
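
Loading reduces to unwrapping plus limit checks. This sketch mirrors the rules above; anything beyond the stated formats and limits is left out:

```python
import json

MAX_BYTES = 1 * 1024 * 1024   # 1 MiB per file
MAX_ROWS = 200
MAX_FIELD_BYTES = 8 * 1024    # 8 KiB per field
WRAPPER_KEYS = ("items", "dataset", "data", "examples", "rows")


def load_dataset(path: str) -> list[dict]:
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        raise ValueError("file exceeds 1 MiB")
    text = raw.decode("utf-8")

    if path.endswith(".jsonl"):
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        data = json.loads(text)
        if isinstance(data, dict):
            # Unwrap the first recognized wrapper key.
            for key in WRAPPER_KEYS:
                if key in data:
                    data = data[key]
                    break
        if not isinstance(data, list):
            raise ValueError("expected a JSON array or a recognized wrapper object")
        rows = data

    if len(rows) > MAX_ROWS:
        raise ValueError("more than 200 rows")
    for row in rows:
        # Rows are assumed to be flat JSON objects.
        for value in row.values():
            if len(str(value).encode("utf-8")) > MAX_FIELD_BYTES:
                raise ValueError("a field exceeds 8 KiB")
    return rows
```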

Drop a .json or .jsonl file here.

Test input dataset

Each row supplies one user message to the test model; its reference (ideal) response is what the judge scores against.
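
Concretely, a row might look like the following. The field names are hypothetical; the actual schema depends on how the dataset file was authored:

```python
# Hypothetical row shape; actual field names may differ.
example_row = {
    "input": "What is the capital of France?",  # user message for the test model
    "reference": "Paris.",                      # ideal answer, used only by the judge
}
```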

  • Row 1
  • Row 2
  • Row 3
  • Row 4
  • Row 5
  • Row 6
  • Row 7
  • Row 8
  • Row 9
  • Row 10

Run

Status: Idle · 0%
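
When a run finishes, the results are uploaded to LangSmith. A minimal sketch using the langsmith Python SDK (assumes `pip install langsmith` and a LANGSMITH_API_KEY in the environment); the dataset name and field layout are placeholders, reusing the result keys from the pipeline sketch above:

```python
# Sketch of the upload step; assumes `pip install langsmith` and a
# LANGSMITH_API_KEY set in the environment.
from langsmith import Client


def upload_results(results: list[dict], dataset_name: str = "llm-eval-suite-run") -> None:
    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"row": r["row"]} for r in results],
        outputs=[{"answer": r["answer"], "verdict": r["verdict"]} for r in results],
        dataset_id=dataset.id,
    )
```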