Artificial intelligence meets medical rarity: evaluating ChatGPT’s responses on post-orgasmic illness syndrome


Sökmen D., Albayrak A. T., Sertkaya Z., Başağa Y., Serefoglu E. C.

International Journal of Impotence Research, 2025 (SCI-Expanded, Scopus)

Abstract

Post-Orgasmic Illness Syndrome (POIS) is a rare and debilitating condition characterized by systemic and cognitive symptoms following ejaculation. Because patients with underrecognized conditions such as POIS often have limited access to specialist care and evidence-based educational resources, they increasingly turn to artificial intelligence (AI) tools such as ChatGPT for health information, underscoring the need to evaluate the accuracy, consistency, and readability of these outputs. This study assessed the performance of ChatGPT version 4o (ChatGPT-4o) in generating patient-directed responses to POIS-related questions. Sixteen real-world questions were selected across four content domains: epidemiology, treatment, treatment risks, and counseling. Each question was submitted to ChatGPT-4o on two different days using separate accounts. Responses were independently graded by three English-speaking urologists with expertise in men’s sexual health and andrology using a validated 4-point scale: “correct and comprehensive,” “correct but inadequate,” “mixed correct and incorrect,” and “completely incorrect.” Reproducibility was defined as the two responses receiving the same grading category, and Cohen’s kappa coefficient (κ) was calculated to measure inter-rater agreement. Readability was assessed using the Gunning Fog Index (GFI). ChatGPT-4o demonstrated high performance in the epidemiology and counseling domains, achieving 100% accuracy and 100% reproducibility (κ = 1.00). However, accuracy dropped to 50% in the treatment and risk domains, with lower reproducibility (κ = 0.25). Readability scores worsened significantly from Day 1 to Day 2 across all domains (p < 0.05), indicating a shift toward more linguistically complex, less accessible language. While ChatGPT-4o shows potential in supporting patient education for rare conditions like POIS, its variability in treatment content and elevated language complexity limit its reliability as a stand-alone medical resource. These findings underscore the need for expert oversight and further model refinement before large language models can be safely integrated into clinical patient communication.
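
The two quantitative measures named in the abstract follow standard definitions: Cohen’s kappa is κ = (P_o − P_e) / (1 − P_e), where P_o is observed agreement and P_e is agreement expected by chance, and the Gunning Fog Index is 0.4 × (average words per sentence + 100 × proportion of words with three or more syllables). The Python sketch below illustrates both computations; it is not the study’s analysis code, the example gradings are hypothetical, and the vowel-group syllable counter is a rough stand-in for the rules used by published readability tools.

    # Illustrative sketch only (not the study's analysis code): standard
    # computations of Cohen's kappa and the Gunning Fog Index. The syllable
    # counter is a crude vowel-group heuristic, assumed here for brevity.
    import re
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa between two equally long lists of categorical labels."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                       for c in set(freq_a) | set(freq_b))
        return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

    def count_syllables(word):
        """Rough syllable estimate: number of contiguous vowel groups."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def gunning_fog(text):
        """GFI = 0.4 * (avg words per sentence + 100 * share of 3+ syllable words)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        complex_words = [w for w in words if count_syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100 * len(complex_words) / len(words))

    # Hypothetical Day 1 vs. Day 2 gradings for four questions in one domain
    day1 = ["correct and comprehensive", "correct but inadequate",
            "correct and comprehensive", "mixed correct and incorrect"]
    day2 = ["correct and comprehensive", "correct and comprehensive",
            "correct and comprehensive", "mixed correct and incorrect"]
    print(round(cohens_kappa(day1, day2), 2))   # agreement corrected for chance
    print(round(gunning_fog("POIS symptoms typically appear within minutes "
                            "of ejaculation and can persist for days."), 1))

A GFI in the high teens, as typical medical prose produces under this formula, corresponds to text requiring college-level reading; the significant Day 1 to Day 2 worsening reported in the study reflects an increase in this score.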