LLM Evaluation Suite
Configure a test model and an LLM-as-judge, run the evaluation on 10 dataset rows, then upload the results to LangSmith.
Test model
Choose a profile to load a base system prompt and sample prompts, or Custom to use your own.
Edit the base instructions. Evaluation criteria selected in the right panel are appended as constraints under // Active Evaluation Constraints: in the preview below.
Effective test system prompt (live preview)
Evaluator (judge)
The criteria below are appended automatically; a row passes when its judge score reaches 70%.
Full evaluator system prompt (live preview)
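The 70% pass/fail rule can be sketched as follows. This is a hedged illustration, not the suite's actual implementation: it assumes the judge emits a numeric score on a 0–100 scale, and the function name `verdict` is hypothetical.

```python
PASS_THRESHOLD = 70  # percent; assumed cutoff stated in the UI

def verdict(score: float) -> str:
    """Map an assumed 0-100 judge score to pass/fail at the 70% cutoff."""
    return "pass" if score >= PASS_THRESHOLD else "fail"
```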
Load dataset from file
Accepts a .json array, a wrapped object (under an items / dataset / data / examples / rows key), or .jsonl (one JSON object per line). Limits: 1 MiB file size, 200 rows, 8 KiB per field.
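A loader for the accepted formats and limits above can be sketched like this. It is a minimal sketch under stated assumptions: the function name `load_dataset` and the error messages are illustrative, and field size is checked against the stringified value.

```python
import json

MAX_BYTES = 1 * 1024 * 1024   # 1 MiB file limit
MAX_ROWS = 200                # row limit
MAX_FIELD_BYTES = 8 * 1024    # 8 KiB per field
WRAPPER_KEYS = ("items", "dataset", "data", "examples", "rows")

def load_dataset(path: str) -> list[dict]:
    """Parse a .json or .jsonl dataset file, enforcing the stated limits."""
    with open(path, "rb") as f:
        raw = f.read()
    if len(raw) > MAX_BYTES:
        raise ValueError("file exceeds 1 MiB")
    text = raw.decode("utf-8")
    if path.endswith(".jsonl"):
        # JSON Lines: one JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            # Wrapped object: unwrap the first recognised key.
            for key in WRAPPER_KEYS:
                if key in parsed:
                    parsed = parsed[key]
                    break
        if not isinstance(parsed, list):
            raise ValueError("expected a JSON array of row objects")
        rows = parsed
    if len(rows) > MAX_ROWS:
        raise ValueError("more than 200 rows")
    for row in rows:
        for value in row.values():
            if len(str(value).encode("utf-8")) > MAX_FIELD_BYTES:
                raise ValueError("field exceeds 8 KiB")
    return rows
```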
Drop a .json or .jsonl file here, or
Test input dataset
Each row is one user message sent to the test model. The reference / ideal response is used by the judge for scoring.
- Row 1
- Row 2
- Row 3
- Row 4
- Row 5
- Row 6
- Row 7
- Row 8
- Row 9
- Row 10
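A single dataset row might look like the sketch below. The field names ("input", "reference") are hypothetical placeholders, not the suite's documented schema; only the roles of the two fields (user message vs. judge reference) come from the description above.

```python
# Hypothetical row shape: field names are illustrative only.
row = {
    "input": "Summarize the water cycle in one sentence.",  # sent to the test model
    "reference": "Water evaporates, condenses into clouds, "
                 "and returns as precipitation.",           # used by the judge for scoring
}
```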
Run
Idle · 0%