2Health Sciences University, Van Training and Research Hospital, Van, Türkiye
3İstanbul Aydın University, Medical Park Florya Hospital, İstanbul, Türkiye
4Department of Radiology, Machine and Hybrid Intelligence Lab, Northwestern University, Chicago, IL, USA
5Department of Cardiology, Health Sciences University, Sultan Abdulhamid Han Training and Research Hospital, İstanbul, Türkiye
Abstract
Background: Interpreting intracardiac electrograms (EGMs) requires expertise that many cardiologists lack. Artificial intelligence models like ChatGPT-4o may improve diagnostic accuracy. This study evaluates ChatGPT-4o’s performance in EGM interpretation across 4 scenarios (A-D) with increasing contextual information.
Methods: Twenty EGM cases from The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting were analyzed using ChatGPT-4o. Ten predefined features were assessed per case in Scenarios A and B, while Scenarios C and D were each scored as a single answer per case (20 per scenario). Performance was evaluated over 2 months using McNemar’s test, Cohen’s Kappa, and Prevalence- and Bias-Adjusted Kappa (PABAK).
Results: Providing clinical context enhanced ChatGPT-4o’s accuracy, improving from 57% (Scenario A) to 66% (Scenario B). “No Answer” rates decreased from 19.5% to 8%, while false responses increased from 8.5% to 11%, suggesting occasional misinterpretation. Agreement in Scenario A showed high reliability for atrial activity (κ = 0.7) and synchronization (κ = 0.7), but poor agreement for chamber (κ = −0.26). In Scenario B, understanding achieved almost perfect agreement (PABAK = 1), while ventricular activity remained unreliable (κ = −0.11). In Scenarios C (30%) and D (25%), accuracy was lower, and agreement between baseline and second-month responses remained fair (κ = 0.285 and 0.3, respectively), indicating limited consistency in complex decision-making tasks.
Conclusion: This study provides the first systematic evaluation of ChatGPT-4o in EGM interpretation, demonstrating promising accuracy and reliability in structured tasks. While the model integrated contextual data well, its adaptability to complex cases was limited. Further optimization and validation are needed before clinical use.
Highlights
- ChatGPT-4o was systematically evaluated for the first time in EGM interpretation and demonstrated promising accuracy rates.
- The model’s accuracy was 57% when using only EGM data, increasing to 66% when clinical context and device parameters were included.
- While high reliability was observed in critical variables, the model’s consistency remained limited, particularly in complex decision-making processes and certain rhythm parameters.
- Advanced optimization and large-scale validation studies are required for the integration of large language models (LLMs) into clinical practice.
Introduction
Intracardiac electrograms (EGMs) offer a highly detailed view of cardiac electrical activity, serving as a critical tool in the management of pacemaker (PM) patients. However, their intricate nature often necessitates specialized expertise, which may not always be readily available among clinicians. This gap in knowledge and expertise poses a significant challenge to achieving accurate and timely diagnoses, potentially impacting patient outcomes. Moreover, the global surge in cardiac implantable electronic device (CIED) procedures has placed increasing strain on healthcare systems, both in terms of clinical capacity and financial resources. These combined factors underscore the urgent need for innovative solutions to streamline EGM interpretation and improve the efficiency of CIED management in modern healthcare settings.
Approximately 25% of patients miss follow-ups within the first year, with access challenges particularly affecting elderly, disabled, and rural populations. Remote device management for CIEDs, including routine remote follow-up and event-triggered remote monitoring (RM), has improved access and follow-up compliance. The coronavirus disease 2019 (COVID-19) pandemic underscored the importance of remote healthcare for overcoming logistical barriers.1-5
Artificial intelligence (AI), particularly large language models (LLMs), holds transformative potential for addressing these challenges. While extensively used in electrocardiogram (ECG) interpretation, LLMs’ application to the more complex EGMs remains underexplored.6-9
This study evaluates ChatGPT-4o’s ability to interpret intracardiac EGMs from PM patients using a scenario-based framework. The assessment focuses on its accuracy, reliability, and adaptability across varying clinical complexities. Findings provide insights into how LLMs could be integrated into clinical workflows to support diagnostics and improve patient care.
Methods
Twenty PM EGM examples were sourced from The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting.10
The first 20 EGM cases were selected sequentially without specific criteria, avoiding randomization or stratification by difficulty, which may impact generalizability. ChatGPT-4o, utilizing image processing, interpreted these cases across 4 scenarios of increasing complexity, ranging from isolated EGM analysis to clinical context and multiple-choice formats.
Ten predefined features were assessed per case in Scenarios A and B, whereas Scenarios C and D were each scored as a single answer per case (20 per scenario).
Four scenarios based on the same case were presented to the model at 1-week intervals:
- Scenario A: the EGM image alone, with no accompanying information;
- Scenario B: the EGM image together with clinical context and device parameters;
- Scenario C: the EGM image, clinical context, and a specific question requiring a direct answer;
- Scenario D: the EGM image, clinical context, and a multiple-choice question with answer options.
Intracardiac electrogram images were presented in their original format, as extracted from the source material, without modifications such as preprocessing, resizing, or filtering. The resolution and clarity of these images were consistent with those in the EHRA book. This ensured that the model’s performance was evaluated under realistic conditions.
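For illustration, the following minimal sketch shows how an EGM image and optional clinical context could be submitted to ChatGPT-4o through the OpenAI Python SDK; the file name, prompt wording, and device parameters are hypothetical and do not reproduce the exact prompts used in this study.

```python
# Minimal sketch: sending an EGM image plus optional clinical context to
# ChatGPT-4o. File name, prompt text, and parameters are illustrative,
# not the study's actual inputs.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, clinical_context: str | None = None) -> str:
    """Submit one EGM tracing (Scenario A if no context, Scenario B otherwise)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = "Interpret this intracardiac electrogram."
    if clinical_context:
        prompt += f"\nClinical context and device parameters: {clinical_context}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example call (hypothetical file and context):
# print(ask_gpt4o("case01_egm.png", "DDD pacemaker, base rate 60 bpm, AV delay 180 ms"))
```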
The procedures detailed above were repeated over a second month, with each scenario again spaced 1 week apart, to evaluate the consistency of the model’s responses. This repeated evaluation allowed for an assessment of intra-model variability across sessions.
Evaluation Method
In the first 2 scenarios, an assessment was conducted to determine whether the features in the EGM were accurately identified. This analysis was based on 10 specific criteria.
In the third and fourth scenarios, the assessment aimed to determine whether the model (ChatGPT-4o) could provide (Scenario C) or select from the given choices (Scenario D) the correct answer; each case was accordingly graded as True or False. This grading system ensured a detailed and structured analysis of the model’s performance in interpreting EGM data. The responses to all 4 scenarios were evaluated by 2 independent cardiac device specialists. In cases of disagreement, a third expert was consulted. The inter-rater reliability between evaluators was not quantified, but consensus-based adjudication was employed to resolve discrepancies.
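As a minimal sketch of the consensus-based adjudication described above, the snippet below reconciles 2 raters’ labels and defers to a third expert only on disagreement; the function name and control flow are illustrative, not the study’s actual grading software.

```python
# Hedged sketch of consensus-based adjudication (function name hypothetical).
# Labels follow the four-level grading used in Scenarios A and B.
CATEGORIES = {"True", "False", "Non-Relevant", "No Answer"}

def adjudicate(rater1: str, rater2: str, rater3: str | None = None) -> str:
    """Return the consensus label; a third expert resolves disagreements."""
    assert rater1 in CATEGORIES and rater2 in CATEGORIES
    if rater1 == rater2:
        return rater1
    if rater3 is None:
        raise ValueError("Disagreement: a third expert's label is required.")
    return rater3

# Example: the two raters disagree, so the third expert decides.
# adjudicate("True", "Non-Relevant", rater3="True")  # -> "True"
```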
ChatGPT-4o
ChatGPT-4o, developed by OpenAI, is an advanced LLM designed to process and generate human-like text. Built on deep learning techniques, it supports a wide range of tasks, including natural language understanding, text generation, and image analysis. Its multimodal capabilities allow it to integrate and analyze both visual and textual data, making it particularly suitable for complex, data-rich applications.
In this study, ChatGPT-4o was tasked with interpreting EGM visuals alongside accompanying textual information, such as clinical context and device parameters. This dual-input approach leveraged the model’s ability to synthesize diverse data types, providing a comprehensive framework for evaluating diagnostic accuracy. By combining visual signal interpretation with textual analysis, ChatGPT-4o demonstrated its potential as a versatile tool for improving the interpretation of EGMs and supporting clinical workflows.
Repeated testing over a second month was performed to evaluate intra-model consistency. As ChatGPT-4o does not retain memory between sessions, this design does not reflect model learning or adaptation. While the model itself was the subject of analysis, it was not used to generate any scientific content or interpret study results. Language polishing was performed with AI support, but all intellectual and analytical contributions were made by the authors.
Statistical Analysis
All statistical analyses were conducted using Python version 3.1.4 (Python Software Foundation, USA). The distributions of responses in Scenarios A and B were presented as percentages of answers (True, False, Non-Relevant, No Answer) to evaluate the model’s baseline performance across categories. Responses were categorized using the classification system described previously, and percentages were displayed with pie charts and stacked bar charts. For comparisons and agreement analyses between scenarios, True responses were compared against all other response categories. Percentage differences between Scenario A and Scenario B were calculated for each variable, and the results were visualized using a heatmap to provide a clear comparison of performance variations across categories.
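A minimal sketch of how such distributions and the Scenario A versus B percentage differences could be computed is shown below, assuming responses are tabulated as 1 row per case-feature pair; the example data, column names, and use of pandas/seaborn are illustrative assumptions, not the study’s actual pipeline.

```python
# Sketch: response distributions and A-vs-B percentage differences
# (illustrative toy data; the study's full dataset is not reproduced here).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# One row per (case, feature) pair, one answer column per scenario.
df = pd.DataFrame({
    "feature":    ["atrial activity", "atrial activity", "chamber", "chamber"],
    "scenario_a": ["True", "No Answer", "False", "True"],
    "scenario_b": ["True", "True", "True", "True"],
})

# Overall answer distribution per scenario, as percentages.
for col in ("scenario_a", "scenario_b"):
    print(df[col].value_counts(normalize=True).mul(100).round(1))

# Per-feature percentage of True answers, and the B-minus-A difference.
pct_true = (
    df.assign(a=df.scenario_a.eq("True"), b=df.scenario_b.eq("True"))
      .groupby("feature")[["a", "b"]].mean().mul(100)
)
pct_true["diff_b_minus_a"] = pct_true["b"] - pct_true["a"]

# Heatmap of percentage differences across features.
sns.heatmap(pct_true[["diff_b_minus_a"]], annot=True, cmap="coolwarm")
plt.tight_layout()
plt.show()
```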
Because the answers were paired (dependent) and categorical, McNemar’s test was used for comparisons between Scenarios A-B and C-D, as well as between baseline and second-month percentages. Agreement between baseline and second-month responses was assessed using Cohen’s Kappa and PABAK (Prevalence- and Bias-Adjusted Kappa) analyses, with interpretations provided to classify the degree of agreement (e.g., poor, slight, fair, moderate, substantial, and almost perfect).
Cohen’s Kappa has a prominent limitation in data with markedly skewed prevalence (e.g., very high or very low) and in unbalanced data; PABAK was therefore reported alongside it.11,12
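The following sketch illustrates how these paired comparisons and agreement statistics could be computed with standard Python libraries; because PABAK is not implemented in common packages, it is derived directly from the observed agreement as PABAK = 2p_o - 1. The example vectors are illustrative, not study data.

```python
# Sketch: McNemar's test, Cohen's Kappa, and PABAK for paired binary answers
# (1 = True response, 0 = any other response). Example data are illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

baseline     = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
second_month = np.array([1, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# McNemar's test on the paired 2x2 contingency table.
table = np.zeros((2, 2), dtype=int)
for b, s in zip(baseline, second_month):
    table[b, s] += 1
print(mcnemar(table, exact=True))  # exact binomial version for small samples

# Cohen's Kappa: chance-corrected agreement.
kappa = cohen_kappa_score(baseline, second_month)

# PABAK corrects Kappa's prevalence/bias sensitivity: PABAK = 2 * p_o - 1,
# where p_o is the observed proportion of agreement.
p_o = np.mean(baseline == second_month)
pabak = 2 * p_o - 1
print(f"kappa = {kappa:.2f}, PABAK = {pabak:.2f}")
```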
Use of Artificial Intelligence
This study employed the AI-based LLM ChatGPT-4o (OpenAI) for the interpretation of EGMs. Additionally, AI-assisted tools were used for language editing to improve the clarity and readability of the manuscript. However, all scientific content, analysis, and conclusions were generated by the authors without AI influence.
Transparency Statement
ChatGPT-4o was evaluated solely as the subject of analysis in this study. It was not used to generate, interpret, or revise any scientific content related to study design, data analysis, or conclusions. All scientific reasoning, methodology development, and diagnostic interpretation were performed independently by the authors. Minor language editing was conducted using external AI-based proofreading tools (ChatGPT-4o and Grammarly), limited to stylistic and grammatical refinement only.
Results
Scenarios A and B: Distribution of Answers and Variable Performance
The distributions of answers across Scenarios A and B highlight key differences in performance and engagement. In Scenario A, the majority of responses (57%) were True, while 19.5% were categorized as “No Answer,” 15% as “Non-Relevant,” and 8.5% as “False.” Conversely, in Scenario B, the percentage of True answers improved to 66%. The proportion of “No Answer” responses decreased to 8%, while “False” responses increased slightly to 11%.
The percentage differences between Scenarios A and B were visualized as a heatmap, providing a clear comparison of performance variations across categories.
Accuracy Comparison Between Scenarios C and D
In Scenario C, 30% of responses were True, while in Scenario D, this rate decreased to 25%.
Agreement Metrics for Baseline and Second-Month Responses
In Scenario A, agreement levels ranged from poor to substantial, with substantial agreement observed for atrial activity (kappa = 0.7) and synchronizations (kappa = 0.7), indicating high reliability. Moderate agreement was noted for pseudomalfunction (kappa = 0.47), while fair or slight agreement was seen for timing intervals (kappa = 0.4) and pacing mode (kappa = 0.08). Poor agreement was identified for chamber (kappa = −0.26) and ventricular activity (kappa = −0.25).
In Scenario B, agreement ranged from poor to almost perfect, with understanding (PABAK = 1) achieving almost perfect agreement, reflecting high consistency. Moderate agreement was noted for pseudomalfunction (kappa = 0.6) and timing intervals (kappa = 0.4), while fair agreement was observed for pacing mode (kappa = 0.3) and synchronizations (kappa = 0.3). Poor agreement was evident for ventricular activity (kappa = −0.11).
Overall, Scenarios A and B highlighted strong reliability in certain features, such as understanding and atrial activity, but also significant variability in features like chamber and ventricular activity.
Discussion
Our study evaluates the performance of ChatGPT-4o in interpreting intracardiac EGMs and provides important findings on how the model can support physicians working with cardiac devices. The accuracy rate in Scenario A was 57%, increasing to 66% in Scenario B when additional contextual information and device parameters were provided. This improvement highlights the role of enriched contextual data in enhancing the model’s diagnostic performance. Additionally, the proportion of “No Answer” responses decreased from 19.5% to 8%, indicating the model’s increased ability to generate responses when given additional information. However, the rise in “False” responses (from 8.5% to 11%) suggests that the model sometimes misinterprets contextual data, leading to incorrect predictions.
The model demonstrated high accuracy in specific variables, particularly in pacing mode, which reached 85% in Scenario B, suggesting that contextual support can significantly enhance diagnostic precision. However, in the second-month evaluation, the accuracy rate declined to 60%, indicating potential consistency issues in long-term performance. Similarly, the chamber variable showed a significant improvement from 55% to 90% at baseline, but by the second month, this difference was no longer statistically significant.
The model’s low accuracy in more complex tasks, particularly in Scenarios C and D (30% and 25%, respectively, at baseline), is noteworthy. This decline in performance can be attributed to several factors, including the requirement for direct answer selection and the complexity of the clinical context. Despite the addition of visual EGM data, text-based clinical information, and specific questions, accuracy in Scenario C remained stagnant at 30% (6 correct answers) between the baseline and second-month evaluations. While Scenario D demonstrated an improvement to 50% (10 correct answers) in the second month, this change was not statistically significant.
In Scenario D, the model exhibited persistent difficulties in EGM interpretation, particularly in timing analysis, sensing detection, and PM algorithm recognition. It frequently misclassified pacing behaviors, such as AV crosstalk, mode switching, and noise reversion, leading to incorrect conclusions about pacing inhibition and atrial tracking. Additionally, sensing issues, particularly ventricular and atrial undersensing, resulted in misdiagnoses of AV block, device malfunction, or loss of capture. The model also struggled with pacing mode classification and threshold determination, affecting its ability to assess capture consistency and pacing behavior. These findings suggest that the model still lacks the ability to correlate programmed device settings with real-time EGM findings and distinguish between similar pacing abnormalities. While its capacity to process EGM data is evident, improving its integration of device-specific algorithms and refining structured decision pathways are essential to optimize its accuracy in complex pacing scenarios. Taken together, these results indicate that despite encouraging performance in simpler settings, the model’s current reliability remains insufficient for clinical decision support—particularly in complex or high-risk pacing scenarios. Further refinement and device-specific training are likely required for safe and effective clinical deployment.
The reliance on physician-centered approaches for EGM interpretation in PM evaluations poses challenges such as variability in accuracy, time constraints, and potential delays in patient care; however, the adoption of remote device management for CIEDs has significantly improved access and follow-up compliance. Remote monitoring systems are recommended to reduce in-office visits and extend follow-up intervals to up to 24 months. However, RM faces several challenges, including staff shortages, organizational inefficiencies, and data overload. PMs generate a high volume of non-urgent alerts, many of which result from false positives due to arrhythmia misclassification, while implantable cardioverter-defibrillators (ICDs) produce fewer but predominantly critical alerts, often caused by oversensing or lead noise.13-15
Although previous studies have demonstrated the potential of AI and LLMs in cardiac diagnostics, their approaches and foci vary significantly.
Günay et al9 evaluated GPT-4-based ChatGPT for ECG interpretation using 40 multiple-choice questions derived from clinical cases, reporting an impressive 90.8% accuracy. However, since their model relied on textual descriptions rather than direct analysis of ECG images, its applicability to visual signal interpretation remained limited.9 Another study trained LLMs on ECG-report alignment tasks to improve heart failure (HF) risk prediction from 12-lead ECG data. By correlating ECG signals with corresponding clinical reports, the model effectively identified risk markers, offering promising insights for long-term risk assessment. However, this approach focused on population-level predictions rather than immediate diagnostic applications, making it less relevant to real-time EGM interpretation.8
Our study is the first to evaluate ChatGPT-4o’s performance in interpreting intracardiac EGMs across varying clinical contexts, highlighting its potential as a diagnostic aid. The model demonstrated reasonable accuracy and reliability, particularly in visualized EGM analysis, providing a strong foundation for addressing diagnostic gaps in cardiology. The authors’ approach differs from previous studies by integrating both visual and contextual data, assessing the model’s adaptability across diverse clinical settings, and identifying critical areas for improvement.
AI-assisted EGM analysis should integrate both visual and structured text-based data to enhance accuracy. The ideal input format (raw waveforms vs. structured parameters) remains uncertain, as different manufacturers use distinct sensing and pacing algorithms. Instead of a universal model, models fine-tuned per manufacturer may improve reliability. The scope of AI involvement is also crucial: should it analyze full device data or focus on specific tasks such as capture loss detection? While full autonomy is a long-term goal, a structured approach incorporating contextual data (e.g., ECGs, patient exams, imaging) may enhance interpretation. A recent study has explored the role of extended passive and active EGM recordings in optimizing device diagnostics, highlighting the importance of systematic data processing for improved detection of anomalies.16 In future clinical workflows, models like ChatGPT-4o could be integrated into remote monitoring systems to assist in alert triage or used as point-of-care diagnostic aids during ambulatory device evaluations. These applications would benefit from further optimization, including manufacturer-specific fine-tuning and structured feedback loops.
Our design also has methodological implications. First, the EGM cases were selected sequentially from the EHRA book without randomization or balancing by difficulty. Although this approach ensured real-world representativeness, it may have introduced selection bias and limited generalizability. Second, the repetition of the same cases over time was intended solely to assess intra-model decision consistency. Since ChatGPT-4o lacks session memory, this process cannot be interpreted as fine-tuning or learning. These design elements should be refined in future research. This study evaluated the model without fine-tuning, yet it demonstrated self-correction with feedback, highlighting its potential for structured training (Supplementary File 1). With further optimization, AI models could surpass human performance, particularly in electrophysiology and cardiac device troubleshooting, where expertise varies. For LLMs to integrate into clinical workflows, continuous learning and real-time clinician feedback are essential. Artificial intelligence-driven adaptive systems can refine performance, enhance RM, and automate data interpretation to reduce diagnostic errors. Future research should focus on fine-tuning models with diverse datasets to improve generalizability. As AI advances, its role in cardiology is expected to transform diagnostics, optimize precision, and improve patient outcomes.
Study Limitations
This study has several limitations that warrant consideration. The model was evaluated on a relatively small dataset of 20 EGM cases, which may not fully capture the variability and complexity of real-world clinical scenarios. Moreover, it was not tested in real-time clinical settings, where factors such as noise, incomplete data, and variation across device types could significantly impact performance. The dataset also lacked representation of all PM brands and models, limiting the model’s generalizability to a wider range of devices. Furthermore, the absence of a direct comparison with human experts restricts the ability to benchmark its diagnostic utility. Another limitation is the sequential selection of EGM cases without randomization or difficulty stratification, which may affect generalizability. Repeated testing assessed decision stability, not learning, as ChatGPT-4o lacks session memory. Finally, while AI tools aided in language editing, all scientific content and interpretations were author-generated.
Conclusion
This study represents the first systematic evaluation of ChatGPT-4o’s ability to interpret intracardiac EGMs, demonstrating promising potential as a diagnostic support tool. The model exhibited reasonable accuracy and reliability, particularly in scenarios involving visualized EGM data. However, further optimization and large-scale validation are necessary to fully integrate LLM-based models into clinical practice.
Supplementary Materials
Footnotes
References
- O’Shea CJ, Middeldorp ME, Hendriks JM. Remote monitoring alert burden: an analysis of transmission in >26,000 patients. JACC Clin Electrophysiol. 2021;7(2):226-234.
- Hess PL, Mi X, Curtis LH, Wilkoff BL, Hegland DD, Al-Khatib SM. Follow-up of patients with new cardiovascular implantable electronic devices: is adherence to the experts’ recommendations associated with improved outcomes?. Heart Rhythm. 2013;10(8):1127-1133.
- Çay S, Duygu KB, Bozyel S. Remote follow-up/monitoring of cardiac implantable electronic devices. Turk Kardiyol Dern Ars. 2025;53(1):53-61.
- Varma N, Epstein AE, Irimpen A, Schweikert R, Love C. Efficacy and safety of automatic remote monitoring for implantable cardioverter-defibrillator follow-up: the Lumos-T Safely Reduces Routine Office Device Follow-up (TRUST) trial. Circulation. 2010;122(4):325-332.
- Crossley GH, Boyle A, Vitense H, Chang Y, Mead RH. The CONNECT (Clinical Evaluation of Remote Notification to Reduce Time to Clinical Decision) trial: the value of wireless remote monitoring with automatic clinician alerts. J Am Coll Cardiol. 2011;57(10):1181-1189.
- Ferreira HDS, Ferrer F, Cabanelas N. Artificial intelligence: an attempt to automate remote device follow-up. EP Europace. 2021;23(Suppl 3).
- Pang X, Zhang Y, Li J, Wang Z. Innovative application of large language models for ECG analysis: a CSSOTP-based feature extraction approach. J Biomed Signal Process. 2023;12(3):123-135.
- Chen C. Large Language Model-Informed ECG Dual Attention Network for Heart Failure Risk Prediction. 2024.
- Günay S, Öztürk A, Özerol H, Yiğit Y, Erenler AK. Comparison of emergency medicine specialist, cardiologist, and Chat-GPT in electrocardiography assessment. Am J Emerg Med. 2024;80:51-60.
- Burri H, Israel C, Deharo JC, eds. The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting: Case-Based Learning with Multiple Choice Questions. Oxford University Press; 2015.
- Chen G, Faris P, Hemmelgarn B, Walker RL, Quan H. Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa. BMC Med Res Methodol. 2009;9:5.
- Tan KS, Yeh YC, Adusumilli PS, Travis WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep. 2024;5(1).
- Glikson M, Nielsen JC, Kronborg MB. 2021 ESC Guidelines on cardiac pacing and cardiac resynchronization therapy. Eur Heart J. 2021;42(35):3427-3520.
- Ferrick AM, Raj SR, Deneke T. 2023 HRS/EHRA/APHRS/LAHRS expert consensus statement on practical management of the remote device clinic. Heart Rhythm. 2023;20(9):e92-e144.
- Hindricks G, Varma N, Kacet S. Daily remote monitoring of implantable cardioverter-defibrillators: insights from the pooled patient-level data from three randomized controlled trials (IN-TIME, ECOST, TRUST). Eur Heart J. 2017;38(22):1749-1755.
- Quinonero C, Strik M, Catalan PA. Improving diagnostic yield for analyzing periodic electrograms in the remote detection of pacemaker lead issues. Sensors (Basel). 2025;25(3).