How can you calculate AUROC from LLMs? I will be presenting my research in a few days and I had a big bug in my code. The solution in the end was to use token log probs. My professor suggested this package: EleutherAI/lm-evaluation-harness (a framework for few-shot evaluation of language models).
Here's the procedure:
- Call the endpoint with `logprobs=True`, `top_logprobs` >= the number of classes, `max_tokens=1`, `temperature=0`
- Ensure `choices[0].logprobs.content[0].top_logprobs` is present; if it is `None`, record NaN for that sample
- Constrain the output to a single token per class: prompt with labeled options (e.g. A: apples, B: bananas) and instruct "answer with a single capital letter"
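A minimal sketch of building such a request, assuming an OpenAI-style chat completions endpoint; the model name, question, and helper function are placeholders, not part of the original write-up:

```python
def build_request(question, options):
    # Label each option with a capital letter (A: apples, B: bananas, ...)
    # and pin the answer format so the first generated token is the letter.
    lettered = "\n".join(f"{chr(65 + i)}: {opt}" for i, opt in enumerate(options))
    prompt = f"{question}\n{lettered}\nAnswer with a single capital letter."
    return {
        "model": "gpt-4o-mini",       # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "logprobs": True,             # return token log probs
        "top_logprobs": len(options), # >= number of classes
        "max_tokens": 1,              # only the answer letter
        "temperature": 0,             # deterministic output
    }

req = build_request("Which fruit is yellow?", ["apples", "bananas"])
```

The returned dict can be splatted into the client call, e.g. `client.chat.completions.create(**req)`.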
- Extract token log probs -> class probabilities
  - From the `top_logprobs` of the first generated token, build a map for the option letters {A, B, C, D}
  - Clean tokens (strip leading spaces / byte-pair artifacts)
  - If any option letter is missing, mark the sample's probabilities as invalid
  - Convert logprobs to probabilities with a softmax (normalize to sum = 1)
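The extraction steps above can be sketched as follows; the function name and the dict shape of `top_logprobs` (token -> logprob) are assumptions, so adapt the parsing to your client's response objects:

```python
import math

def letters_to_probs(top_logprobs, letters=("A", "B", "C", "D")):
    """Map the top_logprobs of the first generated token to class
    probabilities. Returns None if any option letter is missing."""
    logprob_by_letter = {}
    for token, lp in top_logprobs.items():
        # Clean tokens: strip whitespace and the BPE leading-space marker.
        cleaned = token.strip().lstrip("\u0120")
        if cleaned in letters and cleaned not in logprob_by_letter:
            logprob_by_letter[cleaned] = lp
    if any(letter not in logprob_by_letter for letter in letters):
        return None  # invalid sample: an option letter never appeared
    # Softmax over the letter logprobs so the probabilities sum to 1.
    exps = [math.exp(logprob_by_letter[letter]) for letter in letters]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `letters_to_probs({"A": -0.1, " B": -2.0}, ("A", "B"))` yields a normalized two-class distribution, while a response missing a letter yields `None`.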
- Compute AUROC
  - Binary tasks: `roc_auc_score(y_true, P[:, pos_idx])` (use the probability of the positive class)
  - Multiclass: `roc_auc_score(y_true, P, multi_class='ovr', average='macro')`
  - Compute only on the rows with valid probabilities
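A short sketch of the binary case with scikit-learn, using made-up probabilities purely for illustration; invalid samples are marked with NaN and filtered out before scoring:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical per-sample class probabilities (columns: class 0, class 1).
P = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [np.nan, np.nan],  # invalid sample (missing option letter)
    [0.6, 0.4],
    [0.1, 0.9],
])
y_true = np.array([0, 1, 1, 0, 1])

# Keep only rows with valid probabilities.
valid = ~np.isnan(P).any(axis=1)
# Binary AUROC uses the positive-class probability (column 1 here).
auroc = roc_auc_score(y_true[valid], P[valid, 1])
print(auroc)  # -> 1.0 for this toy data
```

For multiclass, pass the full probability matrix instead: `roc_auc_score(y_true[valid], P[valid], multi_class='ovr', average='macro')`.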