Publications

My research focuses on foundational NLP methods and evaluation frameworks for high-stakes decision making, including in healthcare. Example topics include diagnostic reasoning, uncertainty estimation, knowledge integration, and the safe deployment of large language models (LLMs).

A complete and up-to-date list of publications is available on Google Scholar.

Last updated: January 2026.


Selected Journal Publications

Gao, Yanjun, Myers S, Chen S, Dligach D, Miller T, Bitterman DS, Chen G, Mayampurath A, Churpek MM, Afshar M.
Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.
JAMIA Open, 2025.
→ Introduces a principled analysis of why token probabilities from LLMs do not correspond to clinical risk estimates.

Croxford E, Gao, Yanjun, First E, Pellegrino N, et al.
Evaluating clinical AI summaries with large language models as judges.
npj Digital Medicine, 2025.
→ Proposes and validates LLM-as-judge evaluation aligned with clinician judgment.

Croxford E, Gao, Yanjun, Pellegrino N, et al.
Development and validation of a provider documentation summarization quality instrument for large language models.
Journal of the American Medical Informatics Association (JAMIA), 2025.
→ Introduces a clinician-grounded evaluation instrument for medical summarization.

Gao, Yanjun, Li R, Croxford E, et al.
Leveraging a medical knowledge graph into large language models for diagnosis prediction: design and application study.
JMIR AI, 2025.
→ Demonstrates knowledge-graph–augmented LLMs for explainable diagnostic prediction.

Afshar M, Gao, Yanjun, Gupta D, Croxford E, Demner-Fushman D.
On the role of the UMLS in supporting diagnosis generation proposed by large language models.
Journal of Biomedical Informatics, 2024.
→ Analyzes how structured medical knowledge supports faithful diagnosis generation.

Yoon W, Chen S, Gao, Yanjun, Zhao Z, Dligach D, Bitterman DS, Afshar M, Miller T.
LCD benchmark: long clinical document benchmark on mortality prediction for language models.
Journal of the American Medical Informatics Association (JAMIA), 2025.
→ Establishes a large-scale benchmark for long-context clinical reasoning.

Gao, Yanjun, Mahajan D, Uzuner Ö, Yetisgen M.
Clinical natural language processing for secondary uses.
Journal of Biomedical Informatics, 2024.
→ A comprehensive review framing clinical NLP beyond information extraction.


Selected Conference Publications

Kruse M, Afshar M, Khatwani S, Mayampurath A, Chen G, Gao, Yanjun.
Simple yet effective: an information-theoretic approach to multi-LLM uncertainty quantification.
EMNLP (Main), 2025.
→ Introduces an information-theoretic framework to quantify uncertainty across multiple LLMs without requiring model retraining.

Kruse M, Hu S, Derby N, Wu Y, Stonbraker S, Yao B, Wang D, Goldberg E, Gao, Yanjun.
Large language models with temporal reasoning for longitudinal clinical summarization and prediction.
EMNLP Findings, 2025.
→ Develops temporal reasoning strategies enabling LLMs to model longitudinal patient trajectories from clinical narratives.

Li R, Chen C, Hu Y, Gao, Yanjun, Wang X, Yilmaz E.
Attributing response to context: a Jensen–Shannon divergence–driven mechanistic study of context attribution in retrieval-augmented generation.
COLM Workshop (Best Paper), 2025.
→ Proposes a divergence-based mechanistic method to quantify and analyze how retrieved context influences LLM responses.

Gao, Yanjun, Myers S, Chen S, Dligach D, Miller T, Bitterman DS, Afshar M.
When raw data prevails: are large language model embeddings effective in numerical data representation for medical machine learning?
EMNLP Findings, 2024.
→ Provides a systematic evaluation showing when LLM embeddings succeed or fail for numerical clinical prediction tasks.

Chen X, Huang H, Gao, Yanjun, Wang Y, Zhao J, Ding K.
Learning to maximize mutual information for chain-of-thought distillation.
ACL Findings, 2024.
→ Proposes a mutual-information objective for distilling chain-of-thought reasoning while improving reasoning faithfulness.

Gao, Yanjun, Dligach D, Miller T, Xu D, Churpek MM, Afshar M.
Summarizing patients’ problems from hospital progress notes using pre-trained sequence-to-sequence models.
COLING, 2022.
→ Introduces one of the first end-to-end neural models for problem list summarization from real-world EHR notes.

Gao, Yanjun, Huang T-H, Passonneau RJ.
ABCD: A graph framework to convert complex sentences to a covering set of simple sentences.
ACL (Main), 2021.
→ Presents a graph-based decomposition framework for controllable sentence simplification and meaning preservation.

Gao, Yanjun, Sun C, Passonneau RJ.
Automated pyramid summarization evaluation.
CoNLL, 2019.
→ Introduces automated pyramid-based summarization evaluation with strong correlation to human judgment.


Benchmarks and Community Resources (Selected)

Gao, Yanjun, Dligach D, Miller T, et al.
DR.BENCH: Diagnostic reasoning benchmark for clinical natural language processing.
Journal of Biomedical Informatics, 2023.

Gao, Yanjun, Dligach D, Christensen L, et al.
A scoping review of publicly available language tasks in clinical natural language processing.
Journal of the American Medical Informatics Association (JAMIA), 2022.