I work on making AI systems honest about their own certainty, so when they're confident, you can be too.
My research develops confidence calibration methods for language models (LLMs, LRMs, and VLMs), with a focus on high-stakes domains like oncology and finance, where a model's overconfidence can cause serious harm.
Education
Aug 2022 - Present
Ph.D.
Aug 2022 - Apr 2024
M.Sc.
Sep 2017 - Aug 2021
B.Sc.
Leadership
Rasht School of AI Leader
University of Guilan
Dec 2020 - Aug 2022
Brain and Cognition Association AI Head
University of Guilan
Oct 2020 - Oct 2021
CE Scientific Association Head of Research Affairs
University of Guilan
Oct 2020 - Oct 2021
Awards
NSF-EMBS-Google Young Professional NextGen Scholar
IEEE BHI 2025, Atlanta, USA
Recognized for outstanding accomplishments in biomedical AI and health informatics.
Sigma Xi Full Member
Sigma Xi Scientific Research Honor Society
Full membership by invitation only, recognizing excellence in scientific research
Selected Speaker
3rd Henry Ford + MSU Cancer Research Symposium (2023)
Poster Award Winner at "Cancer Control & Prevention"
Selected Tutorial
4th ACM International Conference on AI in Finance (2023)
Large Language Models for NLP in Finance
Teaching Assistantship
Certifications
Language
English
Proficient

German
Intermediate

Persian
Native
Experience
Aug 2022 - Present
East Lansing, USA
AI Research Assistant
Human Augmentation and Artificial Intelligence Laboratory
Introduced CCPS, a state-of-the-art LLM confidence estimation method published at EMNLP 2025, and led the first systematic investigation of confidence estimation in large reasoning models, published at EACL 2026.
Applied LLMs to cancer care in collaboration with Henry Ford Health and Cedars-Sinai, with work on oncology QA, toxicity extraction, and toxicity grading published in IJROBP and presented at AAPM, ASTRO, and IEEE BHI.
Partnered with JPMorgan AI Research on financial AI, building a 70M+ node temporal graph for investment prediction and mapping science-to-industry funding pipelines, published in IEEE TCSS.
Sep 2018 - Jul 2022
Rasht, Iran
NLP Research Assistant
Guilan NLP Group
Developed models and datasets for Persian NLP from the ground up, addressing the challenges of a critically low-resource language setting.
Key contributions include Prose2Poem, a prose-to-poetry translation model; COPER, a semantic search engine with the accompanying PerSICK similarity dataset; and PGST, a Persian text style transfer method.
Techstack
Presentations
Publications
How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
Khanmohammadi, R., Miahi, E., Kaur, S., Brugere, I., Smiley, C. H., Thind, K., & Ghassemi, M. M.
Published at the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL'26) - doi:10.18653/v1/2026.eacl-long.78
Introduced RMCB, a large-scale benchmark of 347K+ reasoning traces across six large reasoning models and five high-stakes domains.
Evaluated 10+ architectures and uncovered a fundamental calibration-discrimination trade-off with no existing method dominating both.
Showed that structural awareness of the reasoning trace improves calibration by a relative 7.5% without affecting discrimination.
In collaboration with:
JPMorgan Chase
Henry Ford Health
Calibrating LLM Confidence by Probing Perturbed Representation Stability
Khanmohammadi, R., Miahi, E., Mardikoraem, M., Kaur, S., Brugere, I., Smiley, C. H., Thind, K., & Ghassemi, M. M.
Published at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP'25) - doi:10.18653/v1/2025.emnlp-main.530
Introduced CCPS, a confidence calibration method for LLMs using perturbation-based representation stability features.
Reduces ECE by 55% and improves Brier score by 21% without modifying model weights.
Outperforms all prior approaches on MMLU and MMLU-Pro benchmarks.
Nominated for the Outstanding Paper Award, placing in the top 0.4% of 8,174 submissions.
In collaboration with:
JPMorgan Chase
Henry Ford Health
Efficient CTCAE Grading for Post-Radiotherapy Toxicities Using Large Language Models: A Privacy-Preserving Approach Using Instruction Fine-Tuning
Khanmohammadi, R., Ghanem, A. I., Bhatnagar, A., Turfa, J., Siddiqui, S., Elshaikh, M., Bagher-Ebadian, H., Movsas, B., Chetty, I. J., Ghassemi, M. M., & Thind, K.
Published in International Journal of Radiation Oncology, Biology, Physics (2025) - doi:10.1016/j.ijrobp.2025.06.3177
In collaboration with:
Henry Ford Health
Cedars-Sinai
Hybrid student-teacher large language model refinement for cancer toxicity symptom extraction
Khanmohammadi, R., Ghanem, A. I., Verdecchia, K., Hall, R., Elshaikh, M., Movsas, B., Bagher-Ebadian, H., Luo, B., Chetty, I. J., Alhanai, T., Thind, K., & Ghassemi, M. M.
Published at the 2025 IEEE International Conference on Biomedical and Health Informatics (BHI'25) - doi:10.1109/BHI67747.2025.11269490
Applied a student-teacher framework to improve compact LLMs for symptom extraction.
Used GPT-4o as the teacher to guide compact LLMs through prompt refinement, RAG, and fine-tuning.
Achieved F1 improvements of 26% for Phi3 and 13% for Zephyr.
Reduced costs: Phi3 was 48x and Zephyr 30x cheaper than GPT-4o.
Demonstrated an efficient, cost-effective approach for using LLMs in clinical settings.
In collaboration with:
Henry Ford Health
Cedars-Sinai
NYU-AD
Bridging Scientific Research, Innovation, and Finance: A Temporal Heterogeneous Graph Dataset for Financial Investment Prediction
Khanmohammadi, R., Singh, K., Maheshwari, P., Panda, V., Kaur, S., Brugere, I., Smiley, C. H., Nourbakhsh, A., Alhanai, T., & Ghassemi, M. M. - Under review
Built a 70M+ node graph dataset linking papers, patents, and financial data (2001–2022).
Developed ML models and an advanced TGNN model, Durendal++, for investment predictions.
Durendal++ achieved the top performance, with micro-F1 scores up to 89% in 2022.
Showcased the benefits of diverse data integration in financial predictions.
In collaboration with:
JPMorgan Chase
NYU-AD
Investigating the Temporal Association of Biomedical Research on Small Business Funding: A Bibliometric and Data Analytic Approach
Khanmohammadi, R., Kaur, S., Smiley, C. H., Alhanai, T., Brugere, I., Nourbakhsh, A., & Ghassemi, M. M.
Published in IEEE Transactions on Computational Social Systems (2024) - doi:10.1109/TCSS.2024.3466010
Analyzed 10,873 biomedical topics to link scientific innovation with small business funding.
Combined bibliometric analysis with SBIR data to assess science’s industrial impact.
Measured time-lagged effects of scientific advances on industry funding (2010-2021).
Found that impactful scientific topics are predictors of future funding (p < 0.05).
Revealed strong contextual overlap between scientific papers and industry projects.
In collaboration with:
JPMorgan Chase
NYU-AD
Iterative Prompt Refinement for Radiation Oncology Symptom Extraction Using Teacher-Student Large Language Models
Khanmohammadi, R., Ghanem, A. I., Verdecchia, K., Hall, R., Elshaikh, M., Movsas, B., Bagher-Ebadian, H., Chetty, I., Ghassemi, M. M., & Thind, K.
Published at the 2024 International Conference on the use of Computers in Radiation therapy (ICCR'24) - HAL ID: hal-04720234
Automated prompt optimization through a teacher-student model setup.
Improved model performance using zero-shot learning, avoiding additional training.
Ensured local data processing to protect sensitive clinical information.
Improved domain-specific concept extraction accuracy through iterative refinement.
In collaboration with:
Henry Ford Health
Cedars-Sinai
A Novel Localized Student-Teacher LLM for Enhanced Toxicity Extraction in Radiation Oncology
Khanmohammadi, R., Ghanem, A. I., Verdecchia, K., Hall, R., Elshaikh, M. A., Movsas, B., Bagher-Ebadian, H., Chetty, I. J., Ghassemi, M. M., & Thind, K.
Published in International Journal of Radiation Oncology, Biology, Physics (2024) - doi:10.1016/j.ijrobp.2024.07.1392
Developed a student-teacher LLM system to improve toxicity extraction in radiation oncology.
Tested on prostate cancer notes, focusing on key symptoms and treatments from 177 patients.
Achieved statistically significant improvements in accuracy, precision, recall, and F1 score on single- and multi-symptom notes as well as single- and multi-treatment notes (p < 0.05).
Demonstrated potential for local, privacy-preserving NLP in clinical environments.
In collaboration with:
Henry Ford Health
Cedars-Sinai
Integrating Natural Language Processing into Radiation Oncology: A Practical Guide to Transformer Architecture and Large Language Models
Khanmohammadi, R., Ghassemi, M. M., Verdecchia, K., Ghanem, A. I., Bing, L., Chetty, I. J., Bagher-Ebadian, H., Siddiqui, F., Elshaikh, M., Movsas, B., & Thind, K. (2023).
Published in BJR|Artificial Intelligence (2025) - doi:10.1093/bjrai/ubaf010
Introduced NLP's role in converting clinical text to structured data for radiation oncology.
Reviewed major advancements in NLP, focusing on applications in radiation oncology.
Proposed a comprehensive evaluation framework for assessing NLP models' readiness for clinical use, focusing on purpose, technical performance, bias, ethics, and quality assurance.
Identified current challenges with LLMs, including hallucinations, bias, and issues in clinical deployment.
Outlined a checklist for clinical implementation, providing practical guidance for researchers and clinicians to evaluate NLP models for safe and effective use.
In collaboration with:
Henry Ford Health
MambaNet: A Hybrid Neural Network for Predicting the NBA Playoffs
Khanmohammadi, R., Saba-Sadiya, S., Esfandiarpour, S., Alhanai, T., & Ghassemi, M. M.
Published in SN Computer Science (2024) - doi:10.1007/s42979-024-02977-0
Introduced MambaNet for NBA playoff prediction with advanced neural layers.
Leveraged Feature Imitating Networks (FINs) for improved statistical feature representation.
Outperformed baseline models, achieving AUC up to 0.82.
Demonstrated model generalizability with NBA and Iranian Super League data.
In collaboration with:
Hudl Instat
NYU-AD
The Broad Impact of Feature Imitation: Neural Enhancements Across Financial, Speech, and Physiological Domains
Khanmohammadi, R., Alhanai, T., & Ghassemi, M. M.
Under review - https://arxiv.org/abs/2309.12279
FINs with Tsallis entropy boosted performance in finance, speech, and physiology tasks.
FIN-ENN improved Bitcoin prediction accuracy by reducing RMSE and MAPE.
Enhanced speech emotion recognition by 2.65% with FIN.
Improved Chronic Neck Pain detection accuracy to 62.5%, outperforming traditional models.
In collaboration with:
NYU-AD
Fetal Biological Sex Identification using Machine and Deep Learning Algorithms on Phonocardiogram Signals
Khanmohammadi, R., Mirshafiee, M. S., Alhanai, T., & Ghassemi, M. M.
Under review - https://arxiv.org/abs/2110.06131
Developed a method to identify fetal biological sex from fetal phonocardiogram (FPCG) signals.
Achieved 91% accuracy, surpassing previous baselines by 10%.
Analyzed a dataset of 1000 FPCG samples, balanced across male and female fetuses.
Combined statistical and sound features to improve classification over individual models.
In collaboration with:
NYU-AD
COPER: a Query-Adaptable Semantics-based Search Engine for Persian COVID-19 Articles
Khanmohammadi, R., Mirshafiee, M. S., Allahyari, M. (2021)
Published at the 2021 International Conference on Web Research (ICWR'21) - doi:10.1109/ICWR51868.2021.9443151
Built COPER, a search engine with 3,500 Persian COVID-19 articles.
Used BM25, TF-IDF, and BERT/SBERT for query-adaptive re-ranking.
Developed PerSICK, the first Persian semantic textual similarity dataset with 3,000 pairs.
Fine-tuned SBERT, achieving 97% STS accuracy.
Prose2Poem: The Blessing of Transformers in Translating Prose to Persian Poetry
Khanmohammadi, R., Mirshafiee, M. S., Rezaee, Y., Mirroshandel, S. A.
Published in ACM Transactions on Asian and Low-Resource Language Information Processing (2023) - doi:10.1145/359279
Created the first Persian prose-to-poem translation system using a new low-resource NMT method.
Released a unique prose-poem and synonym-antonym dataset in Persian.
PGST: A Persian gender style transfer method
Khanmohammadi, R., Mirroshandel, S. A.
Published in Natural Language Engineering (2023) - doi:10.1017/S1351324923000426
PGST is the first Persian text style transfer method for gender-based language differences.
A benchmark compares PGST with models using word and character embeddings.
PGST is extended to English and evaluated against top models with various metrics.