King, Bailey, Warren, Garrigues, Hammond, and Danilkowicz: Evaluating large language model responses to patient questions on ulnar collateral ligament repair

Abstract

Background

The incidence of ulnar collateral ligament (UCL) repair continues to increase, so evaluating the accuracy and readability of information about this procedure that is produced by artificial intelligence (AI) models is important. This study assesses AI-generated responses to common patient questions about UCL repair.

Methods

Twenty patient questions frequently encountered in clinical practice were submitted to ChatGPT, Gemini, and Grok. Three fellowship-trained orthopedic surgeons independently rated answer accuracy using the ChatGPT Response Rating System (CRRS; scored 1–4) and AI Response Metric (AIRM; scored 1–5), with lower scores indicating better accuracy. Responses with CRRS >2 were classified as requiring more than minimal clarification. Readability was evaluated using the Flesch-Kincaid Reading Ease (FKRE) and Grade Level (FKGL) metrics. Responses with an FKGL >6 exceeded the American Medical Association (AMA) and National Institutes of Health (NIH) recommended 6th grade reading level for patient education materials.

Results

More than minimal clarification was required for 15% (3/20) of ChatGPT, 5% (1/20) of Gemini, and 40% (8/20) of Grok responses. Gemini (CRRS, 1.5±0.5; AIRM, 1.6±0.5) demonstrated significantly better accuracy than ChatGPT (CRRS, 2.0±0.4; P=0.0002; AIRM, 2.2±0.5; P=0.0001) and Grok (CRRS, 2.1±0.7; P=0.005; AIRM, 2.4±0.8; P=0.002). All responses exceeded the AMA/NIH 6th grade reading level threshold (FKGL >6). Gemini produced the highest FKGL (16.3±2.2), significantly higher than ChatGPT (14.4±1.6, P=0.005) and Grok (14.6±1.7, P=0.017). FKRE did not differ significantly among models (P=0.14).

Conclusions

AI models generated generally accurate information about UCL repair but at reading levels far above the AMA/NIH recommendations. In this study, Gemini was the most accurate model and produced the least readable content.

Level of evidence

III.

INTRODUCTION

The repetitive forces exerted on the ulnar collateral ligament (UCL) by the overhead throwing motion place it at significant risk for injury [1-5]. Although UCL reconstruction has long been considered the gold standard treatment for these injuries and produces excellent return to sport (RTS) rates, typical rehabilitation timelines exceed 12 months [6]. In recent years, UCL primary repair with suture augmentation has gained traction [7-10]. In a recent comparison of UCL primary repair versus reconstruction, Dugas et al. [2] reported no difference in RTS rates or patient-reported outcomes between groups but found that UCL repair patients were able to return to practice (6.7 months vs. 10.2 months) and competition (9.2 months vs. 13.4 months) significantly faster than those who underwent UCL reconstruction. Given the potential for an expedited return to competition after UCL repair, patients are increasingly interested in pursuing this procedure following injury.
UCL injuries primarily affect young, active patient populations, and prior research has shown that such patients often use the internet to find health information [11]. Recent years have seen a dramatic advance in the commercial availability of generative artificial intelligence (AI) large language models (LLMs) such as ChatGPT (OpenAI, San Francisco, California), Gemini (Google, Mountain View, California), and Grok (xAI, Palo Alto, California) [12]. These AI LLMs can rapidly provide information about a wide array of topics, including medicine [13]. However, those responses are not vetted by subject-matter experts and can include inaccuracies or potentially misleading statements [14]. A flurry of investigations evaluating the accuracy and readability of generative AI responses to common patient questions has recently been published about a range of orthopedic topics, including UCL reconstruction [13,15,16]. Those studies have reported that although AI responses to questions about UCL reconstruction are generally accurate, nuanced questions might be answered insufficiently, and the level of readability generally exceeds what is recommended for patient education materials. To our knowledge, no studies have yet evaluated AI-generated responses to common patient questions specifically about UCL repair. Given the increasing interest in this procedure and the nuances surrounding its indications, it is essential to assess whether generative AI models provide accurate and readable information to patients.
Our objective in this prospective study was to evaluate AI-generated responses to common patient questions about UCL repair. It was hypothesized that the AI LLMs would generate reasonably accurate answers to general questions but that their answers to more nuanced questions would lack the detail needed to make clinical decisions and that the language used would be more complex than what is recommended for patient education materials.

METHODS

This study did not involve human participants or identifiable patient data; therefore, institutional review board approval and informed consent were not required.

Artificial Intelligence and Question Input

Twenty common patient questions about UCL primary repair were generated and reviewed by three fellowship-trained sports medicine/upper extremity orthopedic surgeons. The questions were developed collaboratively by the expert reviewer panel and were intended to represent the most common questions about UCL repair encountered in clinical practice. Several other peer-reviewed studies have used expert-chosen questions to assess the accuracy and readability of AI LLM responses [15,17]. The full list of questions is presented in Table 1, and the full list of AI-generated responses is available in Supplementary Material 1. In June 2025, the questions were presented to three freely accessible online AI LLMs: ChatGPT-4o, Gemini 2.5 Flash, and Grok 3. Those three models were chosen as the most up-to-date versions of the most popular and widely used options on the market at the time of data collection [18]. Several previous peer-reviewed studies also assessed the quality of healthcare-related output specifically from ChatGPT, Gemini, and Grok [19-21]. The responses to each question from each model were recorded and assessed for accuracy and readability.
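The study presented the questions through the freely accessible versions of each model. Purely as an illustration of how this type of data collection could be reproduced at scale, the sketch below shows an equivalent single-turn workflow using the OpenAI Python SDK (the other vendors' SDKs are analogous); the question list and helper function are hypothetical and were not part of the study protocol.

```python
# Illustrative sketch only: programmatic equivalent of submitting each
# question once to a model and recording the reply verbatim.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    "What is the difference between elbow UCL reconstruction and elbow UCL repair?",
    # ...the remaining questions from Table 1
]

def collect_responses(model: str = "gpt-4o") -> dict[str, str]:
    """Submit each question in a fresh, single-turn chat and record the reply."""
    responses = {}
    for question in QUESTIONS:
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        responses[question] = completion.choices[0].message.content
    return responses
```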

Accuracy Analysis

Three fellowship-trained sports medicine/upper extremity orthopedic surgeons (RMD, KH, GG) assessed the accuracy of each response using two separate evidence-based rating systems. The reviewers were blinded to the AI LLM of origin and to each other's ratings while completing their accuracy assessments. The average score among the three raters for each response from each rating system was analyzed. The first rating system used was the ChatGPT Response Rating System (CRRS) [22]. CRRS grading is based on the degree of clarification needed for the information presented in an AI response (Table 2).
The second rating system used was the AI Response Metric (AIRM) [23]. AIRM grading evaluates whether patients can adequately understand and use the information provided by AI and is based on the following 4 criteria: (1) surgeon comfortability with the patient reading the response, (2) response alignment with the current literature, (3) grammatical and syntactical clarity, and (4) response completeness (Table 3).
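As a concrete illustration of the pooling step described above, the short sketch below (hypothetical data and column names) averages the three blinded ratings per response and applies the CRRS >2 threshold used in the Results.

```python
# Minimal sketch, assuming a long-format table of blinded ratings.
import pandas as pd

ratings = pd.DataFrame({
    "model": ["ChatGPT"] * 3,
    "question": [7, 7, 7],
    "rater": ["R1", "R2", "R3"],
    "crrs": [2, 2, 2],  # hypothetical example values
})

# Average the three raters for each response, then flag responses
# requiring more than minimal clarification (mean CRRS >2).
avg = ratings.groupby(["model", "question"], as_index=False)["crrs"].mean()
avg["needs_clarification"] = avg["crrs"] > 2
print(avg)
```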

Readability Analysis

The readability of each response was assessed using the publicly available online calculator WebFX (https://www.webfx.com/tools/read-able/) [15]. The Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) metrics were recorded for each response. FKRE provides a numerical score on a scale of 0 to 100, with 0 being "unreadable" and 100 being "easy to read" [17]. The FKGL is an adapted version of the FKRE that correlates with educational level, with higher scores indicating that information is more difficult to read [17]. FKGL scores correspond to United States educational grade levels as follows: 0–3, early elementary (kindergarten to 3rd grade); 3–6, elementary school (3rd to 6th grade); 6–9, middle school (6th to 9th grade); 9–12, high school (9th to 12th grade); and 12 or higher, college and beyond. The American Medical Association (AMA) and National Institutes of Health (NIH) recommend that patient education materials be written at a 6th grade level or lower to ensure patient comprehension [24]. Therefore, responses with an FKGL >6 were deemed unsuitable for patient education.
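Both metrics are simple functions of mean sentence length and mean syllables per word. The sketch below implements the standard Flesch-Kincaid formulas with a naive syllable heuristic; WebFX's exact tokenization is proprietary to that tool, so scores may deviate slightly from the calculator used in this study.

```python
# Standard Flesch-Kincaid formulas; the syllable counter is a rough
# heuristic (groups of consecutive vowels), adequate for illustration.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as groups of consecutive vowels (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(text: str) -> tuple[float, float]:
    """Return (FKRE, FKGL) for a block of text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    wps = len(words) / sentences                          # words per sentence
    spw = sum(map(count_syllables, words)) / len(words)   # syllables per word
    fkre = 206.835 - 1.015 * wps - 84.6 * spw  # 0 (hard) to 100 (easy)
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # U.S. school grade level
    return fkre, fkgl

fkre, fkgl = flesch_kincaid("The ulnar collateral ligament stabilizes the elbow.")
print(fkgl > 6)  # True -> exceeds the AMA/NIH 6th grade threshold
```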

Statistical Analysis

Statistical analyses were conducted using R software (R Core Team). Statistical significance was set at P<0.05 for each test. Inter-rater reliability of the accuracy scores among the three independent raters was assessed using two-way random-effects intraclass correlation coefficients (ICCs). Reliability was classified as described by Landis and Koch: <0.0, poor; 0.0–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; and 0.81–1.00, almost perfect [25]. Differences between the average CRRS and AIRM scores across the three models were assessed using one-way repeated measures analysis of variance (ANOVA) for each rating scale. The assumption of sphericity was tested using Mauchly's test, and Greenhouse-Geisser corrections were applied when sphericity was violated. Post hoc pairwise comparisons were performed using Bonferroni-adjusted estimated marginal means to identify differences between specific models. A multivariate analysis of variance (MANOVA) was used to compare the readability, measured as FKRE and FKGL, of the three AI models. Because the MANOVA was significant, individual one-way ANOVAs were conducted for each readability metric, followed by Tukey's honestly significant difference tests to evaluate direct differences between models. All data were assessed for normality using the Shapiro-Wilk test and for homogeneity of covariance matrices using Box's M test. The correlation between the two readability measures was assessed with Pearson's correlation coefficient.
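The analyses were run in R; purely as an illustration of the same pipeline, the sketch below mirrors the main steps in Python using pingouin, statsmodels, and SciPy. The column names (question, rater, model, score, FKRE, FKGL) are hypothetical stand-ins for the study data, and Box's M is omitted for brevity.

```python
# Illustrative Python analogue of the R analysis pipeline.
import pandas as pd
import pingouin as pg
from scipy import stats
from statsmodels.multivariate.manova import MANOVA
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze(acc: pd.DataFrame, read: pd.DataFrame) -> None:
    # Inter-rater reliability: two-way random-effects ICC, per model
    for model, grp in acc.groupby("model"):
        icc = pg.intraclass_corr(data=grp, targets="question",
                                 raters="rater", ratings="score")
        print(model, icc.loc[icc["Type"] == "ICC2", ["ICC", "pval"]])

    # Average the three raters, then repeated-measures ANOVA across models;
    # correction=True applies the Greenhouse-Geisser adjustment
    avg = acc.groupby(["question", "model"], as_index=False)["score"].mean()
    print(pg.sphericity(data=avg, dv="score", within="model",
                        subject="question"))  # Mauchly's test
    print(pg.rm_anova(data=avg, dv="score", within="model",
                      subject="question", correction=True))
    print(pg.pairwise_tests(data=avg, dv="score", within="model",
                            subject="question", padjust="bonf"))

    # Readability: MANOVA on (FKRE, FKGL), then a univariate follow-up
    # with Tukey HSD on FKGL (the metric that differed in this study)
    print(MANOVA.from_formula("FKRE + FKGL ~ model", data=read).mv_test())
    print(pairwise_tukeyhsd(read["FKGL"], read["model"]))

    # Normality check and correlation between the two readability metrics
    print(stats.shapiro(read["FKGL"]))
    print(stats.pearsonr(read["FKRE"], read["FKGL"]))
```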

RESULTS

Inter-rater Reliability

Inter-rater reliability across all three models was strong for both accuracy rating systems, with all ICC values being considered almost perfect (ICC >0.80) (Table 4) [25].

Accuracy

More than minimal clarification (CRRS >2) was required for 15% (3/20) of ChatGPT, 5% (1/20) of Gemini, and 40% (8/20) of Grok responses (Figs. 1 and 2, Supplementary Table 1). Mauchly's test indicated a violation of sphericity for both CRRS (P=0.03) and AIRM (P=0.03), so Greenhouse-Geisser corrections were applied to the repeated-measures ANOVAs. CRRS scores revealed a significant difference in accuracy across the three models (P<0.001). Post hoc comparisons of Bonferroni-adjusted estimated marginal means showed that Gemini (CRRS=1.5±0.5) scored significantly lower (greater accuracy) than both ChatGPT (CRRS=2.0±0.4; P=0.0002) and Grok (CRRS=2.1±0.7; P=0.005), with no significant difference between ChatGPT and Grok (P=1.0) (Table 5). Similarly, AIRM scores differed significantly among the models (P<0.001), with Gemini (AIRM=1.6±0.5) scoring significantly lower (greater accuracy) than both ChatGPT (AIRM=2.2±0.5; P=0.0001) and Grok (AIRM=2.4±0.8; P=0.002). The AIRM scores did not differ significantly between ChatGPT and Grok (P=0.73).

Readability

The FKGLs of all responses from all three AI models were above the AMA/NIH 6th grade reading level threshold, with the average FKGL for all three models at a collegiate level or higher (Figs. 3 and 4, Supplementary Table 2). MANOVA testing indicated that the AI model had a significant overall effect on readability (P=0.001). Univariate ANOVAs revealed significant differences in FKGL (P=0.004) but no significant difference in FKRE (P=0.14) across models. Post hoc Tukey tests revealed that Gemini (FKGL, 16.3±2.2) responses were written at a significantly higher grade level (decreased readability) than both ChatGPT (FKGL, 14.4±1.6; P=0.005) and Grok (FKGL, 14.6±1.7; P=0.017), with no significant differences between ChatGPT and Grok (P=0.90) (Table 6).

DISCUSSION

The most significant finding of this study is that all three AI LLMs produced generally accurate information about UCL repair, but at a reading level well above the recommended standard for patient education. Additionally, comparisons between models revealed that Gemini produced more accurate information than the other AI LLMs but with a tradeoff of reduced readability. Gemini was also the only AI LLM included in this study that routinely cited reputable sources for each response. To our knowledge, this is the first study to evaluate AI-generated responses to questions about UCL primary repair with suture augmentation, including how it differs from reconstruction.
These findings are consistent with previous studies that evaluated AI-generated responses to UCL reconstruction questions. In an analysis of ChatGPT 3.5 and ChatGPT 4's responses to UCL reconstruction–related questions, Shaari et al. [15] found that although most AI responses were generally accurate, the newer ChatGPT 4 model produced more accurate answers than ChatGPT 3.5, indicating that AI-generated information about UCL injuries might be improving over time. In a 2024 study, Johns et al. [16] evaluated ChatGPT 3.5 responses to ten common patient questions about UCL reconstruction and reported that its response to the question "Should I have a UCL reconstruction or repair?" was "unsatisfactory requiring substantial clarification" (CRRS=4) due to a significant oversimplification of the complexities of UCL repair indications. In this study, both ChatGPT 4o (Q7: CRRS=2, Q8: CRRS=2) and Gemini 2.5 Flash (Q7: CRRS=1.67, Q8: CRRS=2) provided satisfactory information requiring no more than minimal clarification about the difference between UCL reconstruction and repair and the indications for UCL repair. It is possible that AI-generated responses to questions about the nuances of UCL repair are rapidly improving over time.
However, a major concern about the utility of AI LLMs as patient education tools remains the readability of responses. In this study, every response from all three models was written far above the AMA/NIH 6th grade reading level recommendation, with most responses being written at a college level or higher. Previous studies have also found that AI-generated responses to orthopedic questions are written above what many patients can comprehend [15,17]. Although it is encouraging that the accuracy of AI-generated responses seems to be improving, that improvement is of little value to patients if they cannot comprehend the information as it is presented. Additionally, there appears to be a tradeoff between accuracy and readability. This study found that although Gemini was the most accurate AI LLM tested, its responses were also written at the most complex reading level. Caution and an individualized approach must be taken when recommending the use of AI LLMs because some patients could be overwhelmed by the complexity of the information provided by these models. Clinicians should offer patients guidance on the strengths and limitations of AI and serve as the final expert resource who provides clarification as needed to prevent misinformation and frustration.
In terms of comparison among the three tested AI LLMs, the findings of this study are largely consistent with the literature. Other studies evaluating Gemini’s performance on orthopedic topics, including anterior cruciate ligament reconstruction and Achilles tendon rupture, similarly found it to outperform ChatGPT and Grok in terms of accuracy and clarity [20,21]. In this study, Gemini not only produced the most accurate responses but was also the only AI LLM that included sources for each response. The sources cited by Gemini were generally reputable, such as Massachusetts General Hospital, the American Academy of Orthopedic Surgeons, and peer-reviewed medical journals [26-28]. Citation of sources is imperative because it not only provides interested patients with a roadmap for source information but can also give clinicians an idea of where AI is falling short by allowing them to reference alternative trustworthy sources [13]. Although readability remains a concern, Gemini might have the greatest current utility for patient education due to its superior accuracy and consistent citation of reputable sources in this study.
This study is not without limitations. First and foremost, the accuracy ratings were the subjective opinions of the raters. To limit bias, all raters were fellowship trained in either sports medicine or shoulder/elbow surgery and blinded to each other’s ratings, and the three independent ratings were averaged for each question during the data analysis. Additionally, AI LLMs are developing rapidly. They continue to change over time and might not provide the same responses if prompted at the time of publication. Therefore, further advances in AI could limit the long-term applicability of these findings. Finally, this study did not assess patient comprehension directly, which would be an important direction for future work.
AI models generated generally accurate information about UCL repair but at reading levels far above the AMA/NIH recommendations. In this study, Gemini was the most accurate model, but it produced the least readable content.

NOTES

Author contributions

Conceptualization: BWK, EPB, EW, RMD. Data curation: BWK, GG, KH, RMD. Formal Analysis: BWK. Methodology: BWK. Project administration: BWK. Supervision: RMD. Visualization: BWK. Writing – original draft: BWK, EPB, EW. Writing – review & editing: GG, KH, RMD. All authors read and agreed to the published version of the manuscript.

Conflict of interest

None.

Funding

None.

Data availability

Contact the corresponding author for data availability.

Acknowledgments

None.

Supplementary materials

Supplementary materials can be found via https://doi.org/10.5397/cise.2025.01214.
Supplementary Material 1.
Supplementary Table 1. AI model accuracy for each response.
Supplementary Table 2. AI model readability for each response.

Fig. 1. ChatGPT Response Rating System (CRRS) score by question. This line graph depicts CRRS scores (y-axis) for each model for each question (x-axis).
Fig. 2. AI Response Metric (AIRM) score by question. This line graph depicts AIRM scores (y-axis) for each model for each question (x-axis).
Fig. 3. Flesch-Kincaid Reading Ease (FKRE) score by question. This line graph depicts FKRE scores (y-axis) for each model for each question (x-axis).
Fig. 4. Flesch-Kincaid Grade Level (FKGL) score by question. This line graph depicts FKGL scores (y-axis) for each model for each question (x-axis). The dashed line at y=6 depicts the 6th grade reading level threshold recommended by the American Medical Association and National Institutes of Health.
Table 1.
Full question list
Question
1. I am a baseball pitcher, and I just felt a painful pop on the inside of my elbow. What did I injure?
2. What is the function of the elbow UCL in baseball pitchers?
3. What caused me to tear my elbow UCL as a baseball pitcher?
4. What are the treatment options following an elbow UCL tear?
5. Does the UCL heal on its own?
6. Can injections like PRP help heal my UCL tear?
7. What is the difference between elbow UCL reconstruction and elbow UCL repair?
8. Who is a candidate for elbow UCL repair surgery?
9. Is elbow UCL repair surgery an acceptable treatment option following a UCL tear?
10. How is elbow UCL repair surgery performed and what devices are used?
11. How long does elbow UCL repair surgery take?
12. How painful is elbow UCL repair surgery?
13. What could go wrong during elbow UCL repair surgery?
14. Do I have to do physical therapy after UCL repair surgery?
15. What is the success rate of elbow UCL repair surgery?
16. How long until I can return to pitching following elbow UCL repair surgery vs. reconstruction?
17. How likely am I to return to pitching at my prior level following elbow UCL repair surgery?
18. How likely am I to re-tear my UCL following elbow UCL repair surgery?
19. Will elbow UCL repair surgery make me throw harder when I return?
20. Give me three high quality resources to learn more about elbow UCL repair surgery.

UCL: ulnar collateral ligament, PRP: platelet-rich plasma.

Table 2.
ChatGPT Response Rating System
Response accuracy score Response accuracy description
1 Excellent response not requiring clarification
2 Satisfactory response requiring minimal clarification
3 Satisfactory response requiring moderate clarification
4 Unsatisfactory response requiring substantial clarification
Table 3.
AI Response Metric
Response score Score description
1 This response is something I would be comfortable with my patient reading. This response is clearly in line with the current literature consensus regarding this topic. This response is clear with regard to grammar and syntax. This response is complete and covers the topic in an appropriately nuanced manner.
2 This response is something I would be comfortable with my patient reading. This response is mostly in line with the current literature consensus regarding this topic. This response is clear with regard to grammar and syntax. This response is mostly complete and covers the topic in a somewhat appropriately nuanced manner.
3 This response is potentially something I would be comfortable with my patient reading. This response is somewhat in line with the current literature consensus regarding this topic. This response is somewhat clear in regard to grammar and syntax. This response is partially complete and covers the topic in a basic manner.
4 This response is something I would not be comfortable with my patient reading. This response is partially in line with the current literature consensus regarding this topic. This response is not clear with regard to grammar and syntax. This response is incomplete and covers the topic in a basic manner.
5 This response is something I would not be comfortable with my patient reading. This response is clearly not in line with the current literature consensus regarding this topic. This response is not clear with regard to grammar and syntax. This response is incomplete and does not cover the topic in an appropriately nuanced manner.

AI: artificial intelligence.

Table 4.
Inter-rater reliability
ChatGPT Gemini Grok
CRRS intraclass correlation coefficient 0.825 (P<0.001*) 0.813 (P<0.001*) 0.926 (P<0.001*)
AIRM intraclass correlation coefficient 0.854 (P<0.001*) 0.846 (P<0.001*) 0.949 (P<0.001*)

CRRS: ChatGPT Response Rating System, AIRM: Artificial Intelligence Response Metric.

*Indicates a statistically significant value (P<0.05).

Table 5.
Average AI model accuracy
ChatGPT Gemini Grok
CRRS score 2.0±0.4 1.45±0.5 2.13±0.7
 P-value ChatGPT vs. Gemini: <0.001* Gemini vs. Grok: 0.005* Grok vs. ChatGPT: 1.00
AIRM score 2.2±0.5 1.63±0.5 2.42±0.8
 P-value ChatGPT vs. Gemini: <0.001* Gemini vs. Grok: 0.002* Grok vs. ChatGPT: 0.73

Values are presented as mean±standard deviation.

AI: artificial intelligence, CRRS: ChatGPT Response Rating System, AIRM: AI Response Metric.

*Indicates a statistically significant value (P<0.05).

Table 6.
Average AI model readability
ChatGPT Gemini Grok
FKRE 31.7±7.4 26.3±10.0 28.2±8.6
FKGL score 14.4±1.6 16.3±2.2 14.6±1.7
P-value ChatGPT vs. Gemini: 0.005* Gemini vs. Grok: 0.02* Grok vs. ChatGPT: 0.90

Values are presented as mean±standard deviation.

AI: artificial intelligence, FKRE: Flesch-Kincaid Reading Ease, FKGL: Flesch-Kincaid Grade Level.

*Indicates a statistically significant value (P<0.05).

Univariate analysis of variance revealed no significant difference in FKRE score (P=0.14) across models.

REFERENCES

1. Bruce JR, Andrews JR. Ulnar collateral ligament injuries in the throwing athlete. J Am Acad Orthop Surg 2014;22:315-25.
2. Dugas JR, Froom RJ, Mussell EA, et al. Clinical outcomes of ulnar collateral ligament repair with internal brace versus ulnar collateral ligament reconstruction in competitive athletes. Am J Sports Med 2025;53:525-36.
3. Camp CL, Tubbs TG, Fleisig GS, et al. The relationship of throwing arm mechanics and elbow varus torque: within-subject variation for professional baseball pitchers across 82,000 throws. Am J Sports Med 2017;45:3030-5.
4. Cohen AD, Garibay EJ, Solomito MJ. The association among trunk rotation, ball velocity, and the elbow varus moment in collegiate-level baseball pitchers. Am J Sports Med 2019;47:2816-20.
5. Fleisig GS, Escamilla RF. Biomechanics of the elbow in the throwing athlete. Oper Tech Sports Med 1996;4:62-8.
6. Cain EL, Andrews JR, Dugas JR, et al. Outcome of ulnar collateral ligament reconstruction of the elbow in 1281 athletes: results in 743 athletes with minimum 2-year follow-up. Am J Sports Med 2010;38:2426-34.
7. Dugas JR, Looze CA, Capogna B, et al. Ulnar collateral ligament repair with collagen-dipped FiberTape augmentation in overhead-throwing athletes. Am J Sports Med 2019;47:1096-102.
8. Danilkowicz RM, O'Connell RS, Satalich J, O'Donnell JA, Flamant E, Vap AR. Increase in use of medial ulnar collateral ligament repair of the elbow: a large database analysis. Arthrosc Sports Med Rehabil 2021;3:e527-33.
9. Solomito MJ, Kostyun RO, Sabitsky JT, Nissen CW. Trends in ulnar collateral ligament injuries and surgery from 2010 to 2019: an analysis of a national medical claims database. Orthop J Sports Med 2024;12:23259671241290532.
10. Willenbring TJ, Epner EC, Warth RJ, Gregory JM. Increasing rates of ulnar collateral ligament repair outpace reconstruction in isolated injuries: review of a Texas surgical database. JSES Int 2023;7:192-7.
11. Krempec J, Hall J, Biermann JS. Internet use by patients in orthopaedic surgery. Iowa Orthop J 2003;23:80-2.
12. Alhur A. Redefining healthcare with Artificial Intelligence (AI): the contributions of ChatGPT, Gemini, and Co-pilot. Cureus 2024;16:e57795.
13. Varady NH, Lu AZ, Mazzucco M, et al. Understanding how ChatGPT may become a clinical administrative tool through an investigation on the ability to answer common patient questions concerning ulnar collateral ligament injuries. Orthop J Sports Med 2024;12:23259671241257516.
14. Sparks CA, Fasulo SM, Windsor JT, et al. ChatGPT is moderately accurate in providing a general overview of orthopaedic conditions. JB JS Open Access 2024;9:e23.00129.
15. Shaari AL, Fano AN, Anakwenze O, Klifto C. Appraisal of ChatGPT's responses to common patient questions regarding Tommy John surgery. Shoulder Elbow 2024;16:429-35.
16. Johns WL, Kellish A, Farronato D, Ciccotti MG, Hammoud S. ChatGPT can offer satisfactory responses to common patient questions regarding elbow ulnar collateral ligament reconstruction. Arthrosc Sports Med Rehabil 2024;6:100893.
17. Hurley ET, Crook BS, Lorentz SG, et al. Evaluation high-quality of information from ChatGPT (artificial intelligence-large language model) artificial intelligence on shoulder stabilization surgery. Arthroscopy 2024;40:726-31.e6.
18. Marcaccini G, Corradini L, Shadid O, et al. From prompts to practice: evaluating ChatGPT, Gemini, and Grok against plastic surgeons in local flap decision-making. Diagnostics (Basel) 2025;15:2646.
19. Taşyürek M, Adıgüzel Ö, Ortaç H. Comparative evaluation of responses from ChatGPT-5, Gemini 2.5 Flash, Grok 4, and Claude Sonnet-4 chatbots to questions about endodontic iatrogenic events. Healthcare (Basel) 2025;13:2615.
20. Quinn M, Milner JD, Schmitt P, et al. Artificial intelligence large language models address anterior cruciate ligament reconstruction: superior clarity and completeness by Gemini compared with ChatGPT-4 in response to American Academy of Orthopaedic Surgeons clinical practice guidelines. Arthroscopy 2025;41:2002-8.
21. Collins CE, Giammanco PA, Guirgus M, et al. Evaluating the quality and readability of generative Artificial Intelligence (AI) chatbot responses in the management of Achilles tendon rupture. Cureus 2025;17:e78313.
22. Ottomanelli DI, Sweeney PG, Silver SG, Bassora R, Kohan EM, Vazquez O. ChatGPT can provide satisfactory answers to patient questions regarding biceps tenodesis. JSES Int 2025;9:1378-84.
23. Anastasio AT, Mills FB, Karavan MP, Adams SB. Evaluating the quality and usability of artificial intelligence-generated responses to common patient questions in foot and ankle surgery. Foot Ankle Orthop 2023;8:24730114231209919.
24. Gulbrandsen MT, O'Reilly OC, Gao B, et al. Health literacy in rotator cuff repair: a quantitative assessment of the understandability of online patient education material. JSES Int 2023;7:2344-8.
25. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.
26. Podesta L, Crow SA, Volkmer D, Bert T, Yocum LA. Treatment of partial ulnar collateral ligament tears in the elbow with platelet-rich plasma. Am J Sports Med 2013;41:1689-94.
27. Mass General Brigham. What is a UCL injury? [Internet]. Mass General Brigham; 2025 [cited 2025 Dec 1]. Available from: https://www.massgeneralbrigham.org/en/patient-care/services-and-specialties/sports-medicine/conditions/hand-arm/ucl-injuries
28. OrthoInfo. Ulnar collateral ligament (UCL) injury [Internet]. American Academy of Orthopaedic Surgeons; 2025 [cited 2025 Dec 1]. Available from: https://orthoinfo.aaos.org/en/diseases--conditions/ulnar-collateral-ligament-ucl-injury
