2Health Sciences University, Van Training and Research Hospital, Van, Türkiye
3İstanbul Aydın University, Medical Park Florya Hospital, İstanbul, Türkiye
4Department of Radiology, Machine and Hybrid Intelligence Lab, Northwestern University, Chicago, IL, USA
5Department of Cardiology, Health Sciences University, Sultan Abdulhamid Han Training and Research Hospital, İstanbul, Türkiye
Abstract
Background: Interpreting intracardiac electrograms (EGMs) requires expertise that many cardiologists lack. Artificial intelligence models like ChatGPT-4o may improve diagnostic accuracy. This study evaluates ChatGPT-4o’s performance in EGM interpretation across 4 scenarios (A-D) with increasing contextual information.
Methods: Twenty EGM cases from The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting were analyzed using ChatGPT-4o. Ten predefined features were assessed in Scenarios A and B, while Scenarios C and D required 20 correct responses per scenario across all cases. Performance was evaluated over 2 months using McNemar’s test, Cohen’s Kappa, and Prevalence- and Bias-Adjusted Kappa (PABAK).
Results: Providing clinical context improved ChatGPT-4o’s accuracy from 57% (Scenario A) to 66% (Scenario B). “No Answer” rates decreased from 19.5% to 8%, while false responses increased from 8.5% to 11%, suggesting occasional misinterpretation. In Scenario A, agreement was high for atrial activity (κ = 0.7) and synchronization (κ = 0.7) but poor for chamber (κ = −0.26). In Scenario B, understanding achieved near-perfect agreement (PABAK = 1), while ventricular activity remained unreliable (κ = −0.11). Accuracy was lower in Scenarios C (30%) and D (25%), and agreement between baseline and second-month responses was only fair (κ = 0.285 and 0.3, respectively), indicating limited consistency in complex decision-making tasks.
Conclusion: This study provides the first systematic evaluation of ChatGPT-4o in EGM interpretation, demonstrating promising accuracy and reliability in structured tasks. While the model integrated contextual data well, its adaptability to complex cases was limited. Further optimization and validation are needed before clinical use.
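The agreement statistics named in the Methods (McNemar’s test, Cohen’s Kappa, and PABAK) can be illustrated for paired binary ratings with a minimal Python sketch. The function name `agreement_stats`, the 0/1 coding, and the ten example ratings are hypothetical and are not the study’s data or code; the exact (binomial) form of McNemar’s test is assumed here, since the abstract does not state which variant was used. The example also shows how κ can be negative even when raw agreement is high, which is the situation PABAK is designed to adjust for.

```python
from math import comb


def agreement_stats(baseline, followup):
    """Agreement between two paired sets of binary ratings (1 = correct, 0 = not)."""
    pairs = list(zip(baseline, followup))
    n = len(pairs)
    # 2x2 table of paired outcomes
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # correct both times
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # correct, then not
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # not correct, then correct
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # not correct both times

    p_observed = (a + d) / n
    # Chance agreement from the marginal totals
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (
        (p_observed - p_expected) / (1 - p_expected)
        if p_expected != 1
        else float("nan")
    )
    # PABAK for two categories depends only on observed agreement
    pabak = 2 * p_observed - 1

    # Exact two-sided McNemar test on the discordant pairs (b, c)
    m = b + c
    if m == 0:
        p_value = 1.0
    else:
        tail = sum(comb(m, i) for i in range(min(b, c) + 1)) / 2**m
        p_value = min(1.0, 2 * tail)

    return {"kappa": kappa, "pabak": pabak, "mcnemar_p": p_value}


# Hypothetical ratings for one feature across ten cases (illustrative only)
baseline = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
followup = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
stats = agreement_stats(baseline, followup)
print(stats)
```

With these illustrative ratings, 8 of 10 pairs agree, yet κ is slightly negative because almost all ratings fall in one category, while PABAK remains positive.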