llm evals

April 11, 2025


notes on "How to Construct Domain Specific LLM Evaluation Systems" by Hamel Husain and Emil Sedgh (YouTube)

a systematic approach to improving AI:

call LLMs with synthetic or human inputs, then:

  1. unit tests (see the first sketch after this list)
    • assertions and unit tests (look at your data = finding dumb failure modes)
      • testing whether actions work
      • invalid placeholders
      • details repeated
    • use CI; use what you have when you begin, don't jump straight into tools
    • log results to a DB to see if you're making progress on the dumb failure modes
  2. log traces / human review (use a tool!)
    • instrument your app with a tracing tool (e.g. OpenLLMetry)
    • look at and evaluate the traces
      • build your own data viewing and annotation tools: your data and annotations are very specific to your domain (see the second sketch after this list)
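
a minimal sketch of step 1, assuming Python and SQLite; the two checks (unrendered placeholders, repeated details) come straight from the bullets above, while the regex, function names, and `results` table are illustrative:

```python
# Assertion-style unit tests for LLM outputs, with results logged to a DB
# so progress on dumb failure modes is visible over time.
import re
import sqlite3
from datetime import datetime, timezone

# Illustrative patterns for unrendered template variables, e.g. {{name}} or [CLIENT_NAME]
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\[[A-Z_]+\]")

def no_invalid_placeholders(output: str) -> bool:
    """Fail if the model leaked an unrendered template variable."""
    return PLACEHOLDER.search(output) is None

def no_repeated_sentences(output: str) -> bool:
    """Fail if the model repeated the same sentence verbatim (crude split on '.')."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    return len(sentences) == len(set(sentences))

ASSERTIONS = [no_invalid_placeholders, no_repeated_sentences]

def run_assertions(db_path: str, trace_id: str, output: str) -> None:
    """Run every assertion on one output and log pass/fail rows to SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(ts TEXT, trace_id TEXT, test TEXT, passed INTEGER)"
    )
    for check in ASSERTIONS:
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), trace_id,
             check.__name__, int(check(output))),
        )
    conn.commit()
    conn.close()
```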

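and a minimal sketch of a homegrown annotation tool for step 2: a CLI loop over logged traces that records a human pass/fail label; the `traces` table and its columns are assumed to exist from your logging step:

```python
# A bare-bones data viewing and annotation loop. A real tool would render
# domain-specific context alongside each trace; this only shows the idea.
import sqlite3

def annotate(db_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS annotations "
        "(trace_id TEXT PRIMARY KEY, label TEXT, note TEXT)"
    )
    # Show only traces that have not been annotated yet.
    rows = conn.execute(
        "SELECT trace_id, output FROM traces "
        "WHERE trace_id NOT IN (SELECT trace_id FROM annotations)"
    ).fetchall()
    for trace_id, output in rows:
        print(f"\n--- trace {trace_id} ---\n{output}\n")
        label = input("pass/fail (or q to quit): ").strip().lower()
        if label == "q":
            break
        note = input("note (optional): ").strip()
        conn.execute("INSERT INTO annotations VALUES (?, ?, ?)",
                     (trace_id, label, note))
        conn.commit()
    conn.close()
```
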
these two feed into evals and curation

where evals include:

  1. human review
  2. model-based (LLM-as-judge; see the sketch after this list)
  3. a/b tests
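
a hedged sketch of a model-based eval using the OpenAI Python client; the model name and rubric are placeholders, and per the notes the judge still has to be aligned with a human (see below):

```python
# LLM-as-judge: grade an assistant reply with a simple PASS/FAIL rubric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's reply.
Question: {question}
Reply: {reply}
Answer with exactly one word, PASS or FAIL, judging whether the reply
fully and correctly answers the question."""

def judge(question: str, reply: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, reply=reply)}],
        temperature=0,  # keep the judge deterministic
    )
    return resp.choices[0].message.content.strip()
```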

here you can use LLMs to synthetically generate inputs to your system
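
for example, a sketch that crosses a few scenario dimensions and asks an LLM to write one realistic input per combination; the personas, intents, and model name are illustrative assumptions:

```python
# Synthetic input generation by crossing scenario dimensions.
from itertools import product
from openai import OpenAI

client = OpenAI()

PERSONAS = ["first-time user", "power user"]      # illustrative dimensions
INTENTS = ["ask about pricing", "report a bug"]

def synthetic_inputs() -> list[str]:
    inputs = []
    for persona, intent in product(PERSONAS, INTENTS):
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content":
                       f"Write one realistic message a {persona} would send "
                       f"to an AI assistant to {intent}."}],
        )
        inputs.append(resp.choices[0].message.content)
    return inputs
```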

test your system by prompt engineering first -> improve the model -> go back to invocation, testing with synthetic data

the upshot of having an eval system is that you get fine-tuning data for free

the more comprehensive your eval framework, the lower the cost of human eval

to align an LLM judge with a human: use a spreadsheet of paired labels and optimize for judge-vs-human agreement
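
a minimal sketch of that loop, assuming the spreadsheet is exported as a CSV with illustrative `human_label` / `judge_label` columns:

```python
# Measure how often the LLM judge agrees with the human on the same examples.
import csv

def judge_human_agreement(csv_path: str) -> float:
    """Fraction of examples where the judge's label matches the human's."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))  # columns: example, human_label, judge_label
    if not rows:
        return 0.0
    matches = sum(r["human_label"] == r["judge_label"] for r in rows)
    return matches / len(rows)

# Iterate: tweak the judge prompt, re-run the judge, and re-compute
# agreement until the judge is a trustworthy stand-in for the human.
```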

common mistakes with LLM evals:

  1. not looking at your data
  2. focusing on tools, not processes
  3. using generic off-the-shelf evals
  4. using LLM-as-judge too early and not aligning it with a human