got to finetune llama 8b, llama 70b, and gemma models for the task.
i'm actually doing finetuning now rather than just writing prompts. it's really fun.
also looked into guided decoding, which is how you can guarantee llms output valid structured data.
traditional decoding samples from the full vocab:
p(token_t | context) -> softmax over entire vocab
this means the model can generate anything: sometimes you get valid json, sometimes not
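a toy sketch of what unconstrained sampling looks like (made-up logits standing in for a real model's forward pass, torch assumed):

```python
# unconstrained decoding: softmax over the ENTIRE vocab, so nothing stops
# the model from sampling a token that breaks your json
import torch

vocab_size = 32000                       # e.g. llama-ish vocab size
logits = torch.randn(vocab_size)         # stand-in for the model's next-token logits
probs = torch.softmax(logits, dim=-1)    # p(token_t | context) over all tokens
next_token = torch.multinomial(probs, num_samples=1)  # any token can win
```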
the solution: guided decoding masks invalid tokens at each step:
p(token_t | context, constraints) -> softmax over valid_tokens
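in code, that masking is just adding -inf to the logits of everything outside the valid set before the softmax (toy example, hypothetical token ids):

```python
# same toy setup, but invalid tokens get their logits pushed to -inf
# before the softmax, so they end up with probability exactly 0
import torch

vocab_size = 32000
logits = torch.randn(vocab_size)
valid_tokens = torch.tensor([42, 1337, 2048])  # hypothetical ids the constraint allows

mask = torch.full_like(logits, float("-inf"))
mask[valid_tokens] = 0.0
probs = torch.softmax(logits + mask, dim=-1)   # softmax over valid tokens only
next_token = torch.multinomial(probs, num_samples=1)
assert next_token.item() in valid_tokens.tolist()  # guaranteed valid
```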
how it works: a finite state automaton (FSA), basically a lookup table that says: in state x, these are the valid tokens (see the sketch after this list).
- compile constraints -> FSA (one time, cached). your json schema becomes a state machine
- during generation:
  - check the current FSA state
  - look up which tokens are valid
  - mask everything else
  - sample from valid tokens only
- after each token:
  - update the FSA state
  - repeat
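a minimal sketch of that loop, with a tiny hand-written FSA instead of one compiled from a real json schema (all states and token ids here are made up for illustration):

```python
# toy guided-decoding loop. the FSA is a lookup table:
# state -> {valid_token_id: next_state}
# pretend it was compiled (once, then cached) from a schema like {"ok": bool}
import torch

fsa = {
    "start": {10: "open"},                # 10 = '{'
    "open":  {20: "key"},                 # 20 = '"ok":'
    "key":   {30: "value", 31: "value"},  # 30 = 'true', 31 = 'false'
    "value": {40: "done"},                # 40 = '}'
}

def guided_sample(logits, state):
    valid = list(fsa[state].keys())             # 1. look up valid tokens for this state
    mask = torch.full_like(logits, float("-inf"))
    mask[valid] = 0.0                           # 2. mask everything else
    probs = torch.softmax(logits + mask, dim=-1)
    token = torch.multinomial(probs, 1).item()  # 3. sample from valid tokens only
    return token, fsa[state][token]             # 4. transition to the next FSA state

state, out = "start", []
while state != "done":
    logits = torch.randn(100)                   # stand-in for a real model forward pass
    token, state = guided_sample(logits, state)
    out.append(token)
print(out)  # always a sequence the FSA accepts, e.g. [10, 20, 30, 40]
```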
vLLM supports three backends for this (usage sketch after the list):
- outlines: good for regex
- lm-format-enforcer: character level
- xgrammar: optimized for nested structures
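a rough usage sketch of vLLM's offline guided decoding api, assuming a recent release (this api has moved between versions, so check the docs; model name and schema are just examples):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# any json schema works; this one is just for illustration
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=schema),  # schema -> compiled constraints
)

out = llm.generate(["extract the user from: 'bob is 42'"], params)
print(out[0].outputs[0].text)  # guaranteed to parse as json matching the schema
```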
the con? overhead. generation is roughly 5-15% slower.