guided decoding

October 2, 2025


got to finetune llama 8b, llama 70b, and gemma models for the task.

i'm actually doing finetuning rather than just writing prompts now. it's really fun.

also looked into guided decoding, which is how you can guarantee llms output valid structured data.

traditional decoding samples from the full vocab:

p(token_t | context) -> softmax over entire vocab

this means the model can generate anything: sometimes you get valid json, sometimes not.
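
a minimal sketch of plain sampling (the toy vocab and random logits are stand-ins for a real model's next-token scores):

```python
import torch

# toy vocab; random logits stand in for a real model's output
vocab = ['{', '"name"', ':', '"bob"', '}', 'oops', '<eos>']
logits = torch.randn(len(vocab))

# traditional decoding: softmax over the ENTIRE vocab, then sample
probs = torch.softmax(logits, dim=-1)
next_token = vocab[torch.multinomial(probs, 1).item()]
# nothing stops next_token from being 'oops' in the middle of json
```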

the solution: guided decoding masks invalid tokens at each decoding step:

p(token_t | context, constraints) -> softmax over valid_tokens
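
same toy setup, but with the mask applied first (the valid-token set here is made up; a real backend derives it from the constraint):

```python
import torch

vocab = ['{', '"name"', ':', '"bob"', '}', 'oops', '<eos>']
logits = torch.randn(len(vocab))

# pretend the constraint says a json object must start with '{'
valid_ids = [0]  # hypothetical: ids allowed in the current state

# push invalid logits to -inf so softmax gives them zero probability
mask = torch.full_like(logits, float('-inf'))
mask[valid_ids] = 0.0
probs = torch.softmax(logits + mask, dim=-1)
next_token = vocab[torch.multinomial(probs, 1).item()]  # always '{'
```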

how it works: a finite state automaton (FSA), basically a lookup table that says, in state x, these are the valid tokens (toy sketch after the steps below).

  1. compile constraints -> FSA (one time, cached). your json schema becomes a state machine
  2. during generation:
    • check the current FSA state
    • look up which tokens are valid
    • mask everything else
    • sample from valid tokens only
  3. after each token:
    • update the FSA state
    • repeat
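
the promised toy sketch of the full loop. the FSA here is hand-built to accept exactly `{"a": 1}` (a real backend compiles one from your schema or regex, then caches it):

```python
import torch

vocab = ['{', '"a"', ':', ' ', '1', '}', '<eos>']

# state -> set of valid token ids (this is the cached lookup table)
fsa = {
    0: {0},     # start: only '{'
    1: {1},     # after '{': only the key '"a"'
    2: {2},     # after the key: only ':'
    3: {3, 4},  # after ':': optional space or the value
    4: {4},     # after the space: the value
    5: {5},     # after the value: only '}'
    6: {6},     # after '}': only <eos>
}
# (state, token id) -> next state
step = {(0, 0): 1, (1, 1): 2, (2, 2): 3, (3, 3): 4, (3, 4): 5,
        (4, 4): 5, (5, 5): 6, (6, 6): 7}

state, out = 0, []
while state != 7:
    logits = torch.randn(len(vocab))           # stand-in for a model call
    mask = torch.full_like(logits, float('-inf'))
    mask[list(fsa[state])] = 0.0               # look up valid tokens, mask the rest
    probs = torch.softmax(logits + mask, dim=-1)
    tok = torch.multinomial(probs, 1).item()   # sample from valid tokens only
    if vocab[tok] != '<eos>':
        out.append(vocab[tok])
    state = step[(state, tok)]                 # update FSA state, repeat

print(''.join(out))  # always {"a": 1} (or {"a":1}), never invalid json
```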

vLLM supports three backends for this:

  • outlines: good for regex
  • lm-format-enforcer: character level
  • xgrammar: optimized for nested structures
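
rough sketch of the offline vllm api. `GuidedDecodingParams` and its `backend` field match recent vllm versions, but exact names shift between releases, so check your version's docs; the model name and schema are just placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# placeholder schema: force {"name": str, "age": int}
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(json=schema, backend="xgrammar"),
)
out = llm.generate("give me a person as json: ", params)
print(out[0].outputs[0].text)  # parses against the schema by construction
```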

the con? the masking overhead makes generation 5-15% slower.
