llm evals

April 11, 2025


notes on "How to Construct Domain Specific LLM Evaluation Systems" by Hamel Husain and Emil Sedgh (YouTube)

a systematic approach to improving AI:

call LLMs with synthetic or human inputs, then:

  1. unit tests (see the first sketch after this list)
    • assertions and unit tests (look at your data = finding dumb failure modes)
      • testing whether actions work
      • invalid placeholders
      • details repeated
    • use CI; use what you have when you begin, don't jump straight into tools
    • log results to a DB to see if you're making progress on the dumb failure modes
  2. log traces / human review (use a tool!)
    • instrument your app with a tracing tool (e.g. OpenLLMetry)
    • look at and evaluate the traces
      • build your own data viewing and annotation tools: your data and annotations are very specific to your domain (see the second sketch after this list)
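
a minimal sketch of step 1, assuming Python and SQLite; the two checks (unrendered placeholders, repeated details) come straight from the bullets above, while the regex, function names, and `results` table are illustrative:

```python
# Assertion-style unit tests for LLM outputs, with results logged to a DB
# so progress on dumb failure modes is visible over time.
import re
import sqlite3
from datetime import datetime, timezone

# Illustrative patterns for unrendered template variables, e.g. {{name}} or [CLIENT_NAME]
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\[[A-Z_]+\]")

def no_invalid_placeholders(output: str) -> bool:
    """Fail if the model leaked an unrendered template variable."""
    return PLACEHOLDER.search(output) is None

def no_repeated_sentences(output: str) -> bool:
    """Fail if the model repeated the same sentence verbatim (crude split on '.')."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))

ASSERTIONS = [no_invalid_placeholders, no_repeated_sentences]

def run_assertions(db_path: str, trace_id: str, output: str) -> None:
    """Run every assertion on one output and log pass/fail rows to SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(ts TEXT, trace_id TEXT, test TEXT, passed INTEGER)"
    )
    for check in ASSERTIONS:
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), trace_id,
             check.__name__, int(check(output))),
        )
    conn.commit()
    conn.close()
```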

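and a minimal sketch of a homegrown annotation tool for step 2: a CLI loop over logged traces that records a human pass/fail label; the `traces` table and its columns are assumed to exist from your logging step:

```python
# A bare-bones data viewing and annotation loop. A real tool would render
# domain-specific context alongside each trace; this only shows the idea.
import sqlite3

def annotate(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations "
        "(trace_id TEXT PRIMARY KEY, label TEXT, note TEXT)"
    )
    # Show only traces that have not been annotated yet.
    rows = conn.execute(
        "SELECT trace_id, output FROM traces "
        "WHERE trace_id NOT IN (SELECT trace_id FROM annotations)"
    ).fetchall()
    for trace_id, output in rows:
        print(f"\n--- trace {trace_id} ---\n{output}\n")
        label = input("pass/fail (or q to quit): ").strip().lower()
        if label == "q":
            break
        note = input("note (optional): ").strip()
        conn.execute("INSERT INTO annotations VALUES (?, ?, ?)",
                     (trace_id, label, note))
        conn.commit()
    conn.close()
```
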
these two feed into evals and curation

where evals include:

  1. human review
  2. model-based (LLM-as-judge; see the sketch after this list)
  3. a/b tests
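
a hedged sketch of a model-based eval using the OpenAI Python client; the model name and rubric are placeholders, and per the notes the judge still has to be aligned with a human (see below):

```python
# LLM-as-judge: grade an assistant reply with a simple PASS/FAIL rubric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's reply.
Question: {question}
Reply: {reply}
Answer with exactly one word, PASS or FAIL, judging whether the reply
fully and correctly answers the question."""

def judge(question: str, reply: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
        temperature=0,  # keep the judge deterministic
    )
    return resp.choices[0].message.content.strip()
```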

here you can use LLMs to synthetically generate inputs to your system
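
for example, a sketch that crosses a few scenario dimensions and asks an LLM to write one realistic input per combination; the personas, intents, and model name are illustrative assumptions:

```python
# Synthetic input generation by crossing scenario dimensions.
from itertools import product
from openai import OpenAI

client = OpenAI()

PERSONAS = ["first-time user", "power user"]      # illustrative dimensions
INTENTS = ["ask about pricing", "report a bug"]

def synthetic_inputs() -> list[str]:
    inputs = []
    for persona, intent in product(PERSONAS, INTENTS):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content":
                       f"Write one realistic message a {persona} would send "
                       f"to an AI assistant to {intent}."}],
        )
        inputs.append(resp.choices[0].message.content)
    return inputs
```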

test your system by prompt engineering first -> improve the model -> go back to invocation, testing with synthetic data

the upshot of having an eval system is that you get fine-tuning data for free

the more comprehensive your eval framework, the lower the cost of human eval

to align an LLM judge with a human: use a spreadsheet of paired labels and optimize for judge-vs-human agreement
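
a minimal sketch of that loop, assuming the spreadsheet is exported as a CSV with illustrative `human_label` / `judge_label` columns:

```python
# Measure how often the LLM judge agrees with the human on the same examples.
import csv

def judge_human_agreement(csv_path: str) -> float:
    """Fraction of examples where the judge's label matches the human's."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))  # columns: example, human_label, judge_label
    if not rows:
        return 0.0
    matches = sum(r["human_label"] == r["judge_label"] for r in rows)
    return matches / len(rows)

# Iterate: tweak the judge prompt, re-run the judge, and re-compute
# agreement until the judge is a trustworthy stand-in for the human.
```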

common mistakes with LLM evals:

  1. not looking at your data
  2. focusing on tools, not processes
  3. using generic off-the-shelf evals
  4. using LLM-as-judge too early and not aligning it with a human