notes for How to Construct Domain Specific LLM Evaluation Systems: Hamel Husain and Emil Sedgh - YouTube
a systematic approach to improving AI:
call LLMs with synthetic or human inputs, then check the outputs:
- assertions and unit tests (look at your data = finding dumb failure modes), e.g.:
	- testing if actions work
	- invalid placeholders left in the output
	- details repeated
- use CI; when you begin, use what you already have rather than jumping straight into tools
- log results to a DB, see if you're making progress on dumb failure modes (sketch after this list)
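a minimal sketch of the assertions + DB logging loop, assuming outputs are plain strings; `run_llm`, the regex checks, and the table schema are hypothetical stand-ins, not from the talk:

```python
# minimal sketch: assertion checks for dumb failure modes, results logged
# to SQLite so you can see whether they recur over time.
# run_llm(), the regexes, and the schema are hypothetical stand-ins.
import re
import sqlite3
from datetime import datetime, timezone

def run_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your actual LLM call

def has_invalid_placeholder(output: str) -> bool:
    # an unreplaced template variable like {{CONTACT_NAME}} leaking through
    return re.search(r"\{\{\w+\}\}", output) is not None

def has_repeated_details(output: str) -> bool:
    # crude proxy: the same sentence appears twice verbatim
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    return len(sentences) != len(set(sentences))

CHECKS = {
    "invalid_placeholder": has_invalid_placeholder,
    "repeated_details": has_repeated_details,
}

def run_checks(db_path: str, test_inputs: list[str]) -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_results "
        "(ts TEXT, input TEXT, check_name TEXT, passed INTEGER)"
    )
    for prompt in test_inputs:
        output = run_llm(prompt)
        for name, detects_failure in CHECKS.items():
            conn.execute(
                "INSERT INTO eval_results VALUES (?, ?, ?, ?)",
                (
                    datetime.now(timezone.utc).isoformat(),
                    prompt,
                    name,
                    int(not detects_failure(output)),
                ),
            )
    conn.commit()
    conn.close()
```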
- log traces / human review (use a tool!)
- instrument with a tracing tool, e.g. OpenLLMetry
- look at / evaluate the traces
- build your own data viewing and annotation tools: your data and annotation needs are very specific to your domain (minimal sketch below)
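a tiny homegrown version of trace logging plus a terminal annotation loop, assuming traces are stored as JSONL; the file name and trace schema are illustrative:

```python
# homegrown trace log + terminal annotation loop.
# traces append as JSON lines; a reviewer labels each one good/bad.
# the file name and trace schema are illustrative, not from the talk.
import json
from pathlib import Path

TRACE_FILE = Path("traces.jsonl")

def log_trace(prompt: str, output: str, metadata: dict) -> None:
    record = {"prompt": prompt, "output": output, "metadata": metadata, "label": None}
    with TRACE_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def annotate() -> None:
    traces = [json.loads(line) for line in TRACE_FILE.read_text().splitlines()]
    for t in traces:
        if t["label"] is not None:
            continue  # already reviewed
        print("PROMPT:", t["prompt"])
        print("OUTPUT:", t["output"])
        t["label"] = input("good/bad? ").strip()
    TRACE_FILE.write_text("".join(json.dumps(t) + "\n" for t in traces))
```

even something this crude works at first: the labels accumulate into data you can curate for evals and fine-tuning.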
these two (assertions/unit tests and trace logging/human review) feed into evals and data curation
where evals include:
- human review
- model-based (LLM-as-judge; see the sketch after this list)
- a/b tests
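what a model-based eval could look like in code; this assumes the OpenAI Python client, and the rubric and model name are illustrative choices, not from the talk:

```python
# model-based eval: an LLM judge grading outputs against a rubric.
# assumes the OpenAI Python client; the rubric and model name are
# illustrative assumptions, not from the talk.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's reply.
Rubric: the reply must answer the question, contain no unfilled
placeholders, and not repeat itself.
Question: {question}
Reply: {reply}
Answer with exactly one word: PASS or FAIL."""

def judge(question: str, reply: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable judge model works
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reply=reply),
        }],
    )
    return resp.choices[0].message.content.strip().upper() == "PASS"
```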
here you can use LLMs to synthetically generate inputs to your system
test your system by prompt engineering first -> improve the model -> go back to invocation and test again with synthetic data
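a sketch of the synthetic-input step: enumerate a few input dimensions and ask an LLM to write a realistic user message for each combination; the persona/scenario dimensions here are illustrative assumptions, and it again assumes the OpenAI Python client:

```python
# synthetic input generation: enumerate persona x scenario combinations
# and ask an LLM to write a realistic user message for each.
# the dimensions below are illustrative, not from the talk.
import itertools
from openai import OpenAI

client = OpenAI()

PERSONAS = ["new user", "power user"]
SCENARIOS = ["asking a basic question", "reporting a problem"]

def generate_inputs() -> list[str]:
    inputs = []
    for persona, scenario in itertools.product(PERSONAS, SCENARIOS):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": f"Write one realistic message a {persona} "
                           f"would send when {scenario}.",
            }],
        )
        inputs.append(resp.choices[0].message.content)
    return inputs
```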
the upshot of having an eval system is that you get fine-tuning data for free
the more comprehensive your eval framework, the lower the cost of human eval
aligning an LLM judge to a human: use a spreadsheet, optimize for judge vs human agreement
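computing that agreement number from a spreadsheet export could be as simple as the following; the CSV file name and the "human_label" / "judge_label" columns are assumptions:

```python
# judge-vs-human agreement from a spreadsheet export.
# assumes a CSV with "human_label" and "judge_label" columns (PASS/FAIL);
# the file and column names are assumptions, not from the talk.
import csv

def agreement_rate(path: str) -> float:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    matches = sum(1 for r in rows if r["human_label"] == r["judge_label"])
    return matches / len(rows) if rows else 0.0

# iterate on the judge prompt until this number is acceptably high
print(f"judge/human agreement: {agreement_rate('labels.csv'):.1%}")
```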
common mistakes with LLM evals:
- not looking at data
- focusing on tools instead of processes
- using generic evals off the shelf
- using LLM-as-judge too early, without aligning it to a human