2Health Sciences University, Van Training and Research Hospital, Van, Türkiye
3İstanbul Aydın University, Medical Park Florya Hospital, İstanbul, Türkiye
4Department of Radiology, Machine and Hybrid Intelligence Lab, Northwestern University, Chicago, IL, USA
5Department of Cardiology, Health Sciences University, Sultan Abdulhamid Han Training and Research Hospital, İstanbul, Türkiye
Abstract
Background: Interpreting intracardiac electrograms (EGMs) requires expertise that many cardiologists lack. Artificial intelligence models like ChatGPT-4o may improve diagnostic accuracy. This study evaluates ChatGPT-4o’s performance in EGM interpretation across 4 scenarios (A-D) with increasing contextual information.
Methods: Twenty EGM cases from The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting were analyzed using ChatGPT-4o. Ten predefined features were assessed per case in Scenarios A and B, while Scenarios C and D were each scored as a single answer per case (20 per scenario). Performance was evaluated over 2 months using McNemar’s test, Cohen’s Kappa, and Prevalence- and Bias-Adjusted Kappa (PABAK).
Results: Providing clinical context enhanced ChatGPT-4o’s accuracy, improving from 57% (Scenario A) to 66% (Scenario B). “No Answer” rates decreased from 19.5% to 8%, while false responses increased from 8.5% to 11%, suggesting occasional misinterpretation. Agreement in Scenario A showed high reliability for atrial activity (κ = 0.7) and synchronization (κ = 0.7), but poor agreement for chamber (κ = −0.26). In Scenario B, understanding achieved almost perfect agreement (PABAK = 1), while ventricular activity remained unreliable (κ = −0.11). In Scenarios C (30%) and D (25%), accuracy was lower, and agreement between baseline and second-month responses remained fair (κ = 0.285 and 0.3, respectively), indicating limited consistency in complex decision-making tasks.
Conclusion: This study provides the first systematic evaluation of ChatGPT-4o in EGM interpretation, demonstrating promising accuracy and reliability in structured tasks. While the model integrated contextual data well, its adaptability to complex cases was limited. Further optimization and validation are needed before clinical use.
Highlights
- ChatGPT-4o was systematically evaluated for the first time in EGM interpretation and demonstrated promising accuracy rates.
- The model’s accuracy was 57% when using only EGM data, increasing to 66% when clinical context and device parameters were included.
- While high reliability was observed in critical variables, the model’s consistency remained limited, particularly in complex decision-making processes and certain rhythm parameters.
- Advanced optimization and large-scale validation studies are required for the integration of large language models (LLMs) into clinical practice.
Introduction
Intracardiac electrograms (EGMs) offer a highly detailed view of cardiac electrical activity, serving as a critical tool in the management of pacemaker (PM) patients. However, their intricate nature often necessitates specialized expertise, which may not always be readily available among clinicians. This gap in knowledge and expertise poses a significant challenge to achieving accurate and timely diagnoses, potentially impacting patient outcomes. Moreover, the global surge in cardiac implantable electronic device (CIED) procedures has placed increasing strain on healthcare systems, both in terms of clinical capacity and financial resources. These combined factors underscore the urgent need for innovative solutions to streamline EGM interpretation and improve the efficiency of CIED management in modern healthcare settings.
Approximately 25% of patients miss follow-ups within the first year, with access challenges particularly affecting elderly, disabled, and rural populations. Remote device management for CIEDs, including routine remote follow-up and event-triggered remote monitoring (RM), has improved access and follow-up compliance. The coronavirus disease 2019 (COVID-19) pandemic underscored the importance of remote healthcare for overcoming logistical barriers.1-5
Artificial intelligence (AI), particularly large language models (LLMs), holds transformative potential for addressing these challenges. While extensively used in electrocardiogram (ECG) interpretation, LLMs’ application to the more complex EGMs remains underexplored.6-9
This study evaluates ChatGPT-4o’s ability to interpret intracardiac EGMs from PM patients using a scenario-based framework. The assessment focuses on its accuracy, reliability, and adaptability across varying clinical complexities. Findings provide insights into how LLMs could be integrated into clinical workflows to support diagnostics and improve patient care.
Methods
Twenty PM EGM examples were sourced from The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting.10
The first 20 EGM cases were selected sequentially without specific criteria, avoiding randomization or stratification by difficulty, which may impact generalizability. ChatGPT-4o, utilizing image processing, interpreted these cases across 4 scenarios of increasing complexity, ranging from isolated EGM analysis to clinical context and multiple-choice formats.
Ten predefined features were assessed per case in Scenarios A and B, whereas Scenarios C and D were each scored as a single answer per case (20 per scenario).
Four scenarios based on the same case were presented to the model at 1-week intervals:
- Scenario A: the EGM image alone, with no accompanying information;
- Scenario B: the EGM image together with clinical context and device parameters;
- Scenario C: the EGM image, clinical context, and a specific question requiring a direct answer;
- Scenario D: the EGM image, clinical context, and a multiple-choice question with answer options.
Intracardiac electrogram images were presented in their original format, as extracted from the source material, without modifications such as preprocessing, resizing, or filtering. The resolution and clarity of these images were consistent with those in the EHRA book. This ensured that the model’s performance was evaluated under realistic conditions.
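For illustration, the following minimal sketch shows how an EGM image and optional clinical context could be submitted to ChatGPT-4o through the OpenAI Python SDK; the file name, prompt wording, and device parameters are hypothetical and do not reproduce the exact prompts used in this study.

```python
# Minimal sketch: sending an EGM image plus optional clinical context to
# ChatGPT-4o. File name, prompt text, and parameters are illustrative,
# not the study's actual inputs.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, clinical_context: str | None = None) -> str:
    """Submit one EGM tracing (Scenario A if no context, Scenario B otherwise)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = "Interpret this intracardiac electrogram."
    if clinical_context:
        prompt += f"\nClinical context and device parameters: {clinical_context}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example call (hypothetical file and context):
# print(ask_gpt4o("case01_egm.png", "DDD pacemaker, base rate 60 bpm, AV delay 180 ms"))
```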
The procedures detailed above were repeated over a second month, with each scenario again spaced 1 week apart, to evaluate the consistency of the model’s responses. This repeated evaluation allowed for an assessment of intra-model variability across sessions.
Evaluation Method
In the first 2 scenarios, an assessment was conducted to determine whether the features in the EGM were accurately identified. This analysis was based on 10 specific criteria.
In the third and fourth scenarios, the assessment aimed to determine whether the model (ChatGPT-4o) could provide (Scenario C) or select from the given choices (Scenario D) the correct answer; each case was accordingly graded as True or False. This grading system ensured a detailed and structured analysis of the model’s performance in interpreting EGM data. The responses to all 4 scenarios were evaluated by 2 independent cardiac device specialists. In cases of disagreement, a third expert was consulted. The inter-rater reliability between evaluators was not quantified, but consensus-based adjudication was employed to resolve discrepancies.
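As a minimal sketch of the consensus-based adjudication described above, the snippet below reconciles 2 raters’ labels and defers to a third expert only on disagreement; the function name and control flow are illustrative, not the study’s actual grading software.

```python
# Hedged sketch of consensus-based adjudication (function name hypothetical).
# Labels follow the four-level grading used in Scenarios A and B.
CATEGORIES = {"True", "False", "Non-Relevant", "No Answer"}

def adjudicate(rater1: str, rater2: str, rater3: str | None = None) -> str:
    """Return the consensus label; a third expert resolves disagreements."""
    assert rater1 in CATEGORIES and rater2 in CATEGORIES
    if rater1 == rater2:
        return rater1
    if rater3 is None:
        raise ValueError("Disagreement: a third expert's label is required.")
    return rater3

# Example: the two raters disagree, so the third expert decides.
# adjudicate("True", "Non-Relevant", rater3="True")  # -> "True"
```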
ChatGPT-4o
ChatGPT-4o, developed by OpenAI, is an advanced LLM designed to process and generate human-like text. Built on deep learning techniques, it supports a wide range of tasks, including natural language understanding, text generation, and image analysis. Its multimodal capabilities allow it to integrate and analyze both visual and textual data, making it particularly suitable for complex, data-rich applications.
In this study, ChatGPT-4o was tasked with interpreting EGM visuals alongside accompanying textual information, such as clinical context and device parameters. This dual-input approach leveraged the model’s ability to synthesize diverse data types, providing a comprehensive framework for evaluating diagnostic accuracy. By combining visual signal interpretation with textual analysis, ChatGPT-4o demonstrated its potential as a versatile tool for improving the interpretation of EGMs and supporting clinical workflows.
Repeated testing over a second month was performed to evaluate intra-model consistency. As ChatGPT-4o does not retain memory between sessions, this design does not reflect model learning or adaptation. While the model itself was the subject of analysis, it was not used to generate any scientific content or interpret study results. Language polishing was performed with AI support, but all intellectual and analytical contributions were made by the authors.
Statistical Analysis
All statistical analyses were conducted using Python version 3.1.4 (Python Software Foundation, USA). The distributions of responses in Scenarios A and B were presented as percentages of answers (True, False, Non-Relevant, No Answer) to evaluate the model’s baseline performance across categories. Responses were categorized using the classification system described previously, and percentages were displayed with pie charts and stacked bar charts. For comparisons and agreement analyses between scenarios, True responses were compared against all other response categories. Percentage differences between Scenario A and Scenario B were calculated for each variable, and the results were visualized using a heatmap to provide a clear comparison of performance variations across categories.
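A minimal sketch of how such distributions and the Scenario A versus B percentage differences could be computed is shown below, assuming responses are tabulated as 1 row per case-feature pair; the example data, column names, and use of pandas/seaborn are illustrative assumptions, not the study’s actual pipeline.

```python
# Sketch: response distributions and A-vs-B percentage differences
# (illustrative toy data; the study's full dataset is not reproduced here).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# One row per (case, feature) pair, one answer column per scenario.
df = pd.DataFrame({
    "feature":    ["atrial activity", "atrial activity", "chamber", "chamber"],
    "scenario_a": ["True", "No Answer", "False", "True"],
    "scenario_b": ["True", "True", "True", "True"],
})

# Overall answer distribution per scenario, as percentages.
for col in ("scenario_a", "scenario_b"):
    print(df[col].value_counts(normalize=True).mul(100).round(1))

# Per-feature percentage of True answers, and the B-minus-A difference.
pct_true = (
    df.assign(a=df.scenario_a.eq("True"), b=df.scenario_b.eq("True"))
      .groupby("feature")[["a", "b"]].mean().mul(100)
)
pct_true["diff_b_minus_a"] = pct_true["b"] - pct_true["a"]

# Heatmap of percentage differences across features.
sns.heatmap(pct_true[["diff_b_minus_a"]], annot=True, cmap="coolwarm")
plt.tight_layout()
plt.show()
```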
Because the answers were paired (dependent) and categorical, McNemar’s test was used for comparisons between Scenarios A-B and C-D, as well as between baseline and second-month percentages. Agreement between baseline and second-month responses was assessed using Cohen’s Kappa and PABAK (Prevalence- and Bias-Adjusted Kappa) analyses, with interpretations provided to classify the degree of agreement (e.g., poor, slight, fair, moderate, substantial, and almost perfect).
Cohen’s Kappa has a prominent limitation in data with markedly skewed prevalence (e.g., very high or very low) and in unbalanced data; PABAK was therefore reported alongside it.11,12
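The following sketch illustrates how these paired comparisons and agreement statistics could be computed with standard Python libraries; because PABAK is not implemented in common packages, it is derived directly from the observed agreement as PABAK = 2p_o - 1. The example vectors are illustrative, not study data.

```python
# Sketch: McNemar's test, Cohen's Kappa, and PABAK for paired binary answers
# (1 = True response, 0 = any other response). Example data are illustrative.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

baseline     = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
second_month = np.array([1, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# McNemar's test on the paired 2x2 contingency table.
table = np.zeros((2, 2), dtype=int)
for b, s in zip(baseline, second_month):
    table[b, s] += 1
print(mcnemar(table, exact=True))  # exact binomial version for small samples

# Cohen's Kappa: chance-corrected agreement.
kappa = cohen_kappa_score(baseline, second_month)

# PABAK corrects Kappa's prevalence/bias sensitivity: PABAK = 2 * p_o - 1,
# where p_o is the observed proportion of agreement.
p_o = np.mean(baseline == second_month)
pabak = 2 * p_o - 1
print(f"kappa = {kappa:.2f}, PABAK = {pabak:.2f}")
```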
Use of Artificial Intelligence
This study employed the AI-based LLM ChatGPT-4o (OpenAI) for the interpretation of EGMs. Additionally, AI-assisted tools were used for language editing to improve the clarity and readability of the manuscript. However, all scientific content, analysis, and conclusions were generated by the authors without AI influence.
Transparency Statement
ChatGPT-4o was evaluated solely as the subject of analysis in this study. It was not used to generate, interpret, or revise any scientific content related to study design, data analysis, or conclusions. All scientific reasoning, methodology development, and diagnostic interpretation were performed independently by the authors. Minor language editing was conducted using external AI-based proofreading tools (ChatGPT-4o and Grammarly), limited to stylistic and grammatical refinement only.
Results
Scenarios A and B: Distribution of Answers and Variable Performance
The distributions of answers across Scenarios A and B highlight key differences in performance and engagement. In Scenario A, the majority of responses (57%) were True, while 19.5% were categorized as “No Answer,” 15% as “Non-Relevant,” and 8.5% as “False.” Conversely, in Scenario B, the percentage of True answers improved to 66%. The proportion of “No Answer” responses decreased to 8%, while “False” responses increased slightly to 11%.
The percentage differences between Scenarios A and B were visualized as a heatmap, providing a clear comparison of performance variations across categories.
Accuracy Comparison Between Scenarios C and D
In Scenario C, 30% of responses were True, while in Scenario D, this rate decreased to 25%.
Agreement Metrics for Baseline and Second-Month Responses
In Scenario A, agreement levels ranged from poor to substantial, with substantial agreement observed for atrial activity (kappa = 0.7) and synchronizations (kappa = 0.7), indicating high reliability. Moderate agreement was noted for pseudomalfunction (kappa = 0.47), while fair or slight agreement was seen for timing intervals (kappa = 0.4) and pacing mode (kappa = 0.08). Poor agreement was identified for chamber (kappa = −0.26) and ventricular activity (kappa = −0.25).
In Scenario B, agreement ranged from poor to almost perfect, with understanding (PABAK = 1) achieving almost perfect agreement, reflecting high consistency. Moderate agreement was noted for pseudomalfunction (kappa = 0.6) and timing intervals (kappa = 0.4), while fair agreement was observed for pacing mode (kappa = 0.3) and synchronizations (kappa = 0.3). Poor agreement was evident for ventricular activity (kappa = −0.11).
Overall, Scenarios A and B highlighted strong reliability in certain features, such as understanding and atrial activity, but also significant variability in features like chamber and ventricular activity.
Discussion
Our study evaluates the performance of ChatGPT-4o in interpreting intracardiac EGMs and provides important findings on how the model can support physicians working with cardiac devices. The accuracy rate in Scenario A was 57%, increasing to 66% in Scenario B when additional contextual information and device parameters were provided. This improvement highlights the role of enriched contextual data in enhancing the model’s diagnostic performance. Additionally, the proportion of “No Answer” responses decreased from 19.5% to 8%, indicating the model’s increased ability to generate responses when given additional information. However, the rise in “False” responses (from 8.5% to 11%) suggests that the model sometimes misinterprets contextual data, leading to incorrect predictions.
The model demonstrated high accuracy in specific variables, particularly in pacing mode, which reached 85% in Scenario B, suggesting that contextual support can significantly enhance diagnostic precision. However, in the second-month evaluation, the accuracy rate declined to 60%, indicating potential consistency issues in long-term performance. Similarly, the chamber variable showed a significant improvement from 55% to 90% at baseline, but by the second month, this difference was no longer statistically significant.
The model’s low accuracy in more complex tasks, particularly in Scenarios C and D (30% and 25%, respectively, at baseline), is noteworthy. This decline in performance can be attributed to several factors, including the requirement for direct answer selection and the complexity of the clinical context. Despite the addition of visual EGM data, text-based clinical information, and specific questions, accuracy in Scenario C remained stagnant at 30% (6 correct answers) between the baseline and second-month evaluations. While Scenario D demonstrated an improvement to 50% (10 correct answers) in the second month, this change was not statistically significant.
In Scenario D, the model exhibited persistent difficulties in EGM interpretation, particularly in timing analysis, sensing detection, and PM algorithm recognition. It frequently misclassified pacing behaviors, such as AV crosstalk, mode switching, and noise reversion, leading to incorrect conclusions about pacing inhibition and atrial tracking. Additionally, sensing issues, particularly ventricular and atrial undersensing, resulted in misdiagnoses of AV block, device malfunction, or loss of capture. The model also struggled with pacing mode classification and threshold determination, affecting its ability to assess capture consistency and pacing behavior. These findings suggest that the model still lacks the ability to correlate programmed device settings with real-time EGM findings and distinguish between similar pacing abnormalities. While its capacity to process EGM data is evident, improving its integration of device-specific algorithms and refining structured decision pathways are essential to optimize its accuracy in complex pacing scenarios. Taken together, these results indicate that despite encouraging performance in simpler settings, the model’s current reliability remains insufficient for clinical decision support—particularly in complex or high-risk pacing scenarios. Further refinement and device-specific training are likely required for safe and effective clinical deployment.
The reliance on physician-centered approaches for EGM interpretation in PM evaluations poses challenges such as variability in accuracy, time constraints, and potential delays in patient care; however, the adoption of remote device management for CIEDs has significantly improved access and follow-up compliance. Remote monitoring systems are recommended to reduce in-office visits and extend follow-up intervals to up to 24 months. However, RM faces several challenges, including staff shortages, organizational inefficiencies, and data overload. PMs generate a high volume of non-urgent alerts, many of which result from false positives due to arrhythmia misclassification, while implantable cardioverter-defibrillators (ICDs) produce fewer but predominantly critical alerts, often caused by oversensing or lead noise.13-15
Although previous studies have demonstrated the potential of AI and LLMs in cardiac diagnostics, their approaches and foci vary significantly.
Günay et al9 evaluated GPT-4-based ChatGPT for ECG interpretation using 40 multiple-choice questions derived from clinical cases, reporting an impressive 90.8% accuracy. However, since their model relied on textual descriptions rather than direct analysis of ECG images, its applicability to visual signal interpretation remained limited.9 Another study trained LLMs on ECG-report alignment tasks to improve heart failure (HF) risk prediction from 12-lead ECG data. By correlating ECG signals with corresponding clinical reports, the model effectively identified risk markers, offering promising insights for long-term risk assessment. However, this approach focused on population-level predictions rather than immediate diagnostic applications, making it less relevant to real-time EGM interpretation.8
Our study is the first to evaluate ChatGPT-4o’s performance in interpreting intracardiac EGMs across varying clinical contexts, highlighting its potential as a diagnostic aid. The model demonstrated reasonable accuracy and reliability, particularly in visualized EGM analysis, providing a strong foundation for addressing diagnostic gaps in cardiology. The authors’ approach differs from previous studies by integrating both visual and contextual data, assessing the model’s adaptability across diverse clinical settings, and identifying critical areas for improvement.
AI-assisted EGM analysis should integrate both visual and structured text-based data to enhance accuracy. The ideal input format (raw waveforms vs. structured parameters) remains uncertain, as different manufacturers use distinct sensing and pacing algorithms. Instead of a universal model, models fine-tuned per manufacturer may improve reliability. The scope of AI involvement is also crucial: should it analyze full device data or focus on specific tasks such as capture loss detection? While full autonomy is a long-term goal, a structured approach incorporating contextual data (e.g., ECGs, patient exams, imaging) may enhance interpretation. A recent study has explored the role of extended passive and active EGM recordings in optimizing device diagnostics, highlighting the importance of systematic data processing for improved detection of anomalies.16 In future clinical workflows, models like ChatGPT-4o could be integrated into remote monitoring systems to assist in alert triage or used as point-of-care diagnostic aids during ambulatory device evaluations. These applications would benefit from further optimization, including manufacturer-specific fine-tuning and structured feedback loops.
Our design also has methodological implications. First, the EGM cases were selected sequentially from the EHRA book without randomization or balancing by difficulty. Although this approach ensured real-world representativeness, it may have introduced selection bias and limited generalizability. Second, the repetition of the same cases over time was intended solely to assess intra-model decision consistency. Since ChatGPT-4o lacks session memory, this process cannot be interpreted as fine-tuning or learning. These design elements should be refined in future research. This study evaluated the model without fine-tuning, yet it demonstrated self-correction with feedback, highlighting its potential for structured training (Supplementary File 1). With further optimization, AI models could surpass human performance, particularly in electrophysiology and cardiac device troubleshooting, where expertise varies. For LLMs to integrate into clinical workflows, continuous learning and real-time clinician feedback are essential. Artificial intelligence-driven adaptive systems can refine performance, enhance RM, and automate data interpretation to reduce diagnostic errors. Future research should focus on fine-tuning models with diverse datasets to improve generalizability. As AI advances, its role in cardiology is expected to transform diagnostics, optimize precision, and improve patient outcomes.
Study Limitations
This study has several limitations that warrant consideration. The model was evaluated on a relatively small dataset of 20 EGM cases, which may not fully capture the variability and complexity of real-world clinical scenarios. Moreover, it was not tested in real-time clinical settings, where factors such as noise, incomplete data, and variation across device types could significantly impact performance. The dataset also lacked representation of all PM brands and models, limiting the model’s generalizability to a wider range of devices. Furthermore, the absence of a direct comparison with human experts restricts the ability to benchmark its diagnostic utility. Another limitation is the sequential selection of EGM cases without randomization or difficulty stratification, which may affect generalizability. Repeated testing assessed decision stability, not learning, as ChatGPT-4o lacks session memory. Finally, while AI tools aided in language editing, all scientific content and interpretations were author-generated.
Conclusion
This study represents the first systematic evaluation of ChatGPT-4o’s ability to interpret intracardiac EGMs, demonstrating promising potential as a diagnostic support tool. The model exhibited reasonable accuracy and reliability, particularly in scenarios involving visualized EGM data. However, further optimization and large-scale validation are necessary to fully integrate LLM-based models into clinical practice.
Supplementary Materials
Footnotes
References
- O’Shea CJ, Middeldorp ME, Hendriks JM. Remote monitoring alert burden: an analysis of transmission in >26,000 patients. JACC Clin Electrophysiol. 2021;7(2):226-234.
- Hess PL, Mi X, Curtis LH, Wilkoff BL, Hegland DD, Al-Khatib SM. Follow-up of patients with new cardiovascular implantable electronic devices: is adherence to the experts’ recommendations associated with improved outcomes?. Heart Rhythm. 2013;10(8):1127-1133.
- Çay S, Duygu KB, Bozyel S. Remote follow-up/monitoring of cardiac implantable electronic devices. Turk Kardiyol Dern Ars. 2025;53(1):53-61.
- Varma N, Epstein AE, Irimpen A, Schweikert R, Love C. Efficacy and safety of automatic remote monitoring for implantable cardioverter-defibrillator follow-up: the Lumos-T Safely Reduces Routine Office Device Follow-up (TRUST) trial. Circulation. 2010;122(4):325-332.
- Crossley GH, Boyle A, Vitense H, Chang Y, Mead RH. The CONNECT (Clinical Evaluation of Remote Notification to Reduce Time to Clinical Decision) trial: the value of wireless remote monitoring with automatic clinician alerts. J Am Coll Cardiol. 2011;57(10):1181-1189.
- Ferreira HDS, Ferrer F, Cabanelas N. Artificial intelligence: an attempt to automate remote device follow-up. EP Europace. 2021;23(Suppl 3).
- Pang X, Zhang Y, Li J, Wang Z. Innovative application of large language models for ECG analysis: a CSSOTP-based feature extraction approach. J Biomed Signal Process. 2023;12(3):123-135.
- Chen C. Large Language Model-Informed ECG Dual Attention Network for Heart Failure Risk Prediction. 2024.
- Günay S, Öztürk A, Özerol H, Yiğit Y, Erenler AK. Comparison of emergency medicine specialist, cardiologist, and Chat-GPT in electrocardiography assessment. Am J Emerg Med. 2024;80:51-60.
- Burri H, Israel C, Deharo JC, eds. The EHRA Book of Pacemaker, ICD, and CRT Troubleshooting: Case-Based Learning with Multiple Choice Questions. Oxford University Press; 2015.
- Chen G, Faris P, Hemmelgarn B, Walker RL, Quan H. Measuring agreement of administrative data with chart data using prevalence unadjusted and adjusted kappa. BMC Med Res Methodol. 2009;9:5.
- Tan KS, Yeh YC, Adusumilli PS, Travis WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep. 2024;5(1).
- Glikson M, Nielsen JC, Kronborg MB. 2021 ESC Guidelines on cardiac pacing and cardiac resynchronization therapy. Eur Heart J. 2021;42(35):3427-3520.
- Ferrick AM, Raj SR, Deneke T. 2023 HRS/EHRA/APHRS/LAHRS expert consensus statement on practical management of the remote device clinic. Heart Rhythm. 2023;20(9):e92-e144.
- Hindricks G, Varma N, Kacet S. Daily remote monitoring of implantable cardioverter-defibrillators: insights from the pooled patient-level data from three randomized controlled trials (IN-TIME, ECOST, TRUST). Eur Heart J. 2017;38(22):1749-1755.
- Quinonero C, Strik M, Catalan PA. Improving diagnostic yield for analyzing periodic electrograms in the remote detection of pacemaker lead issues. Sensors (Basel). 2025;25(3).