got to finetune llama 8b, llama 70b, and gemma models for the task.
i'm actually doing finetuning now rather than just writing prompts. it's really fun.
also looked into guided decoding, which is how you can guarantee llms output valid structured data.
traditional decoding samples from the full vocab:
p(token_t | context) -> softmax over entire vocab
this means the model can generate anything: sometimes you get valid json, sometimes not
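a toy sketch of what unconstrained sampling looks like (made-up logits standing in for a real model's forward pass, torch assumed):

```python
# unconstrained decoding: softmax over the ENTIRE vocab, so nothing stops
# the model from sampling a token that breaks your json
import torch

vocab_size = 32000                       # e.g. llama-ish vocab size
logits = torch.randn(vocab_size)         # stand-in for the model's next-token logits
probs = torch.softmax(logits, dim=-1)    # p(token_t | context) over all tokens
next_token = torch.multinomial(probs, num_samples=1)  # any token can win
```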
the solution: guided decoding masks invalid tokens at each step:
p(token_t | context, constraints) -> softmax over valid_tokens
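in code, that masking is just adding -inf to the logits of everything outside the valid set before the softmax (toy example, hypothetical token ids):

```python
# same toy setup, but invalid tokens get their logits pushed to -inf
# before the softmax, so they end up with probability exactly 0
import torch

vocab_size = 32000
logits = torch.randn(vocab_size)
valid_tokens = torch.tensor([42, 1337, 2048])  # hypothetical ids the constraint allows

mask = torch.full_like(logits, float("-inf"))
mask[valid_tokens] = 0.0
probs = torch.softmax(logits + mask, dim=-1)   # softmax over valid tokens only
next_token = torch.multinomial(probs, num_samples=1)
assert next_token.item() in valid_tokens.tolist()  # guaranteed valid
```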
how it works: a finite state automaton (FSA), basically a lookup table that says: in state x, these are the valid tokens (see the sketch after this list).
- compile constraints -> FSA (one time, cached). your json schema becomes a state machine
- during generation:
  - check the current FSA state
  - look up which tokens are valid
  - mask everything else
  - sample from valid tokens only
- after each token:
  - update the FSA state
  - repeat
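a minimal sketch of that loop, with a tiny hand-written FSA instead of one compiled from a real json schema (all states and token ids here are made up for illustration):

```python
# toy guided-decoding loop. the FSA is a lookup table:
# state -> {valid_token_id: next_state}
# pretend it was compiled (once, then cached) from a schema like {"ok": bool}
import torch

fsa = {
    "start": {10: "open"},                # 10 = '{'
    "open":  {20: "key"},                 # 20 = '"ok":'
    "key":   {30: "value", 31: "value"},  # 30 = 'true', 31 = 'false'
    "value": {40: "done"},                # 40 = '}'
}

def guided_sample(logits, state):
    valid = list(fsa[state].keys())             # 1. look up valid tokens for this state
    mask = torch.full_like(logits, float("-inf"))
    mask[valid] = 0.0                           # 2. mask everything else
    probs = torch.softmax(logits + mask, dim=-1)
    token = torch.multinomial(probs, 1).item()  # 3. sample from valid tokens only
    return token, fsa[state][token]             # 4. transition to the next FSA state

state, out = "start", []
while state != "done":
    logits = torch.randn(100)                   # stand-in for a real model forward pass
    token, state = guided_sample(logits, state)
    out.append(token)
print(out)  # always a sequence the FSA accepts, e.g. [10, 20, 30, 40]
```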
vLLM supports three backends for this (usage sketch after the list):
- outlines: good for regex
- lm-format-enforcer: character level
- xgrammar: optimized for nested structures
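a rough usage sketch of vLLM's offline guided decoding api, assuming a recent release (this api has moved between versions, so check the docs; model name and schema are just examples):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# any json schema works; this one is just for illustration
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=schema),  # schema -> compiled constraints
)

out = llm.generate(["extract the user from: 'bob is 42'"], params)
print(out[0].outputs[0].text)  # guaranteed to parse as json matching the schema
```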
the con? overhead. generation is roughly 5-15% slower.