LLM Evaluation Suite

Configure a test model and an LLM-as-judge evaluator, run the suite on the 10 loaded dataset rows, then upload the results to LangSmith.
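
The pipeline itself is small: send each row's user message to the test model, have the judge score the answer against the reference response, and collect the verdicts. The sketch below is illustrative only; call_model, the Row fields, and the result keys are hypothetical names, not the suite's actual API.

```python
# Minimal sketch of the evaluation pipeline; names are hypothetical.
from dataclasses import dataclass


@dataclass
class Row:
    user_message: str  # sent to the test model
    reference: str     # ideal response, shown only to the judge


def call_model(system: str, user: str) -> str:
    """Placeholder for the configured chat-completion client."""
    raise NotImplementedError


def run_suite(rows: list[Row], test_system: str, judge_system: str) -> list[dict]:
    results = []
    for i, row in enumerate(rows, start=1):
        answer = call_model(test_system, row.user_message)
        # The judge sees the question, the candidate answer, and the reference.
        verdict = call_model(
            judge_system,
            f"Question: {row.user_message}\n"
            f"Answer: {answer}\n"
            f"Reference: {row.reference}",
        )
        results.append({"row": i, "answer": answer, "verdict": verdict})
    return results
```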

Test model

Choose a profile to load a base system prompt and sample prompts, or select Custom to write your own.

Edit the base instructions. Enabled evaluation criteria (right panel) are appended as constraints under // Active Evaluation Constraints: in the preview below.
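
The preview is plain string assembly: base instructions first, then the // Active Evaluation Constraints: header, then one line per enabled criterion. A minimal sketch, assuming criteria arrive as name-to-enabled pairs and that each constraint is rendered as a //-prefixed line (the exact formatting is an assumption):

```python
def build_effective_prompt(base_instructions: str, criteria: dict[str, bool]) -> str:
    """Append enabled criteria under the constraints header (formatting assumed)."""
    enabled = [name for name, on in criteria.items() if on]
    if not enabled:
        return base_instructions
    lines = [base_instructions, "", "// Active Evaluation Constraints:"]
    lines += [f"// - {name}" for name in enabled]
    return "\n".join(lines)
```

For example, build_effective_prompt(base, {"No markdown": True, "Cite sources": False}) returns the base prompt plus a single "// - No markdown" constraint line.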

Effective test system prompt (live preview)

Evaluator (judge)

The criteria below are appended to the judge prompt automatically; a run passes when the judge's score reaches 70%.

Full evaluator system prompt (live preview)

Evaluation criteria

Toggle individual criteria on or off for the judge prompt and for the test-model instruction injection (see Test model → effective prompt preview). Pass threshold: ≥70%.
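
The pass/fail decision is a plain threshold on the judge's score. A sketch, assuming the judge is prompted to emit a 0-100 score somewhere in its reply; the number-extraction step is an assumption:

```python
import re

PASS_THRESHOLD = 70  # percent; a run passes at 70% or above


def parse_score(judge_reply: str) -> float:
    """Extract the first number from the judge's reply (assumed 0-100 scale)."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {judge_reply!r}")
    return float(match.group())


def passes(judge_reply: str) -> bool:
    return parse_score(judge_reply) >= PASS_THRESHOLD
```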

Load dataset from file

Accepted formats: a .json array, a .json object wrapping the rows under one of the keys items, dataset, data, examples, or rows, or a .jsonl file with one JSON object per line. Limits: 1 MiB per file, 200 rows, 8 KiB per field.
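
Loading reduces to unwrapping plus limit checks. This sketch mirrors the rules above; anything beyond the stated formats and limits is left out:

```python
import json

MAX_BYTES = 1 * 1024 * 1024   # 1 MiB per file
MAX_ROWS = 200
MAX_FIELD_BYTES = 8 * 1024    # 8 KiB per field
WRAPPER_KEYS = ("items", "dataset", "data", "examples", "rows")


def load_dataset(path: str) -> list[dict]:
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        raise ValueError("file exceeds 1 MiB")
    text = raw.decode("utf-8")

    if path.endswith(".jsonl"):
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        data = json.loads(text)
        if isinstance(data, dict):
            # Unwrap the first recognized wrapper key.
            for key in WRAPPER_KEYS:
                if key in data:
                    data = data[key]
                    break
        if not isinstance(data, list):
            raise ValueError("expected a JSON array or a recognized wrapper object")
        rows = data

    if len(rows) > MAX_ROWS:
        raise ValueError("more than 200 rows")
    for row in rows:
        # Rows are assumed to be flat JSON objects.
        for value in row.values():
            if len(str(value).encode("utf-8")) > MAX_FIELD_BYTES:
                raise ValueError("a field exceeds 8 KiB")
    return rows
```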

Drop a .json or .jsonl file here.

Test input dataset

Each row supplies one user message to the test model; its reference (ideal) response is what the judge scores against.
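
Concretely, a row might look like the following. The field names are hypothetical; the actual schema depends on how the dataset file was authored:

```python
# Hypothetical row shape; actual field names may differ.
example_row = {
    "input": "What is the capital of France?",  # user message for the test model
    "reference": "Paris.",                      # ideal answer, used only by the judge
}
```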

  • Row 1
  • Row 2
  • Row 3
  • Row 4
  • Row 5
  • Row 6
  • Row 7
  • Row 8
  • Row 9
  • Row 10

Run

Status: Idle · 0%
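
When a run finishes, the results are uploaded to LangSmith. A minimal sketch using the langsmith Python SDK (assumes `pip install langsmith` and a LANGSMITH_API_KEY in the environment); the dataset name and field layout are placeholders, reusing the result keys from the pipeline sketch above:

```python
# Sketch of the upload step; assumes `pip install langsmith` and a
# LANGSMITH_API_KEY set in the environment.
from langsmith import Client


def upload_results(results: list[dict], dataset_name: str = "llm-eval-suite-run") -> None:
    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"row": r["row"]} for r in results],
        outputs=[{"answer": r["answer"], "verdict": r["verdict"]} for r in results],
        dataset_id=dataset.id,
    )
```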