limitations of embeddings

October 16, 2025


here's a scenario:

you have a vector db of disease terms mapping to information you want, say journal_id, and given a user query, you want to find all the journal_ids for that disease term.

say the disease term is "autoimmune polyendocrine syndrome type 1"

and you use text-embedding-ada-002 and get this back:

0.6462 │ Type 1 Diabetes Mellitus         │ ID_001, ID_003 (+92 more)
0.6452 │ Type 1 Diabetes                  │ ID_002, ID_004 (+95 more)
0.6443 │ Autoimmune Polyendocrinopathy... │ ID_099
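for context, the lookup behind a result like this is roughly a nearest-neighbor search by cosine similarity. here's a minimal sketch, assuming made-up 3-d vectors in place of real ada-002 embeddings (which are 1536-d). `cosine`, `top_k`, the `index` layout, and the "Chest Pain" row with `ID_X` are all hypothetical, not my actual setup:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the k highest-scoring (score, term, journal_ids) rows."""
    scored = [
        (cosine(query_vec, vec), term, journal_ids)
        for term, (vec, journal_ids) in index.items()
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]

# toy vectors, invented for illustration only (not real embeddings)
index = {
    "Type 1 Diabetes": ([0.9, 0.4, 0.1], ["ID_002", "ID_004"]),
    "Autoimmune Polyendocrinopathy...": ([0.5, 0.8, 0.2], ["ID_099"]),
    "Chest Pain": ([0.0, 0.1, 0.9], ["ID_X"]),  # hypothetical extra row
}

# stand-in for the embedding of "autoimmune polyendocrine syndrome type 1"
query_vec = [0.8, 0.6, 0.1]

for score, term, journal_ids in top_k(query_vec, index, k=2):
    print(f"{score:.4f} | {term} | {', '.join(journal_ids)}")
```

the toy numbers are picked so the diabetes row edges out the syndrome row, which is the same shape as the real result above.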

my first thought, being medically illiterate, was: oh, the model saw "Type 1" in both terms and latched onto that heavily. these general embeddings are bad at medical concepts. but it understood more than i gave it credit for.

here's the gist:

type 1 diabetes is actually a component disease of APS-1. APS-1 (aka APECED) is caused by mutations in the AIRE gene.

The AIRE protein normally teaches your immune system not to attack your own organs, but when it's broken, the immune system goes rogue and attacks multiple organs.

diagnosis requires at least 2 of the 3 classic triad symptoms:

  • chronic candidiasis (yeast infections)
  • hypoparathyroidism (low calcium)
  • addison's disease (adrenal failure)

but only 45-67% of patients actually develop all three. many patients develop other autoimmune conditions, instead of or in addition to the triad, including: type 1 diabetes (18% of APS-1 patients), autoimmune hepatitis, vitiligo (skin), alopecia (hair loss), and thyroid problems.

so APS-1 is a syndrome that can include T1D as one of its many possible manifestations, and the embedding model picked up on this relationship.

APS-1 (the syndrome - broken AIRE gene)
├── Chronic candidiasis (73-100%)
├── Hypoparathyroidism (76-93%)
├── Addison's disease (72-100%)
├── Type 1 diabetes (~18%)
├── Alopecia (29-40%)
└── Other manifestations

so searching for APS-1 and getting T1D back is like searching for "heart attack" and getting "chest pain" back. yes, chest pain is a symptom of a heart attack, but if someone needs info on heart attacks, giving them chest pain resources misses the point. they want the specific condition, not one of its symptoms.

the solution?

medical-specific embeddings, two-stage retrieval (common in industry), and contrastive finetuning w/ triplet loss, where positive pairs are synonyms and exact matches, and hard negatives are manifestations and siblings.
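to make the triplet idea a bit more concrete, here's a rough sketch of how the triplets and the loss could look. it's not a training loop, just the shape of the data and the objective. `build_triplets`, `triplet_loss`, the 0.2 margin, and the 2-d toy vectors are all my own assumptions for illustration; a real setup would embed the strings with the model being finetuned and backprop through the loss:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def build_triplets(anchor, synonyms, hard_negatives):
    """Pair every synonym (positive) with every hard negative for one anchor term."""
    return [(anchor, pos, neg) for pos in synonyms for neg in hard_negatives]

# positives: other names for the same condition
# hard negatives: manifestations of it, which the general model ranked too high
triplets = build_triplets(
    "autoimmune polyendocrine syndrome type 1",
    synonyms=["APS-1", "APECED"],
    hard_negatives=["type 1 diabetes", "addison's disease"],
)

# toy 2-d vectors (invented): the negative sits closer to the anchor than
# the positive does, so the loss is positive and training would push them apart
anchor_vec, positive_vec, negative_vec = [1.0, 0.0], [0.6, 0.8], [0.8, 0.6]
print(len(triplets), triplet_loss(anchor_vec, positive_vec, negative_vec))
```

the point of using manifestations as hard negatives is that they're exactly the terms the general model already scores as near-duplicates, so they carry the most training signal.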