Asian Spine J > Online first
Saturno, Mejia, Ahmed, Yu, Duey, Zaidat, Hijji, Markowitz, Kim, and Cho: Can generative artificial intelligence provide accurate medical advice?: a case of ChatGPT versus Congress of Neurological Surgeons management of acute cervical spine and spinal cord injuries clinical guidelines

Abstract

Study Design

An experimental study.

Purpose

To explore the concordance of ChatGPT responses with established national guidelines for the management of cervical spine and spinal cord injuries.

Overview of Literature

ChatGPT-4.0 is an artificial intelligence model that can synthesize large volumes of data and may provide surgeons with recommendations for the management of spinal cord injuries. However, no available literature has quantified ChatGPT’s capacity to provide accurate recommendations for the management of cervical spine and spinal cord injuries.

Methods

Referencing the “Management of acute cervical spine and spinal cord injuries” guidelines published by the Congress of Neurological Surgeons (CNS), a total of 36 questions were formulated. Questions were stratified into therapeutic, diagnostic, or clinical assessment categories as seen in the guidelines. Questions were secondarily grouped according to whether the corresponding recommendation contained level I evidence (highest quality) versus only level II/III evidence (moderate and low quality). ChatGPT-4.0 was prompted with each question, and its responses were assessed by two independent reviewers as “concordant” or “nonconcordant” with the CNS clinical guidelines. “Nonconcordant” responses were rationalized into “insufficient” and “contradictory” categories.

Results

In this study, 22/36 (61.1%) of ChatGPT’s responses were concordant with the CNS guidelines. ChatGPT’s responses aligned with 17/24 (70.8%) of the therapeutic questions and 4/7 (57.1%) of the diagnostic questions. ChatGPT’s responses aligned with only one of the five clinical assessment questions. Notably, the recommendations supported by level I evidence were the least likely to be replicated by ChatGPT. ChatGPT’s responses agreed with 80.8% of the recommendations supported exclusively by level II/III evidence.

Conclusions

ChatGPT-4 was moderately accurate when generating recommendations that aligned with the clinical guidelines. The model frequently aligned with low evidence and therapeutic recommendations but exhibited inferior performance on topics that contained high-quality evidence or pertained to diagnostic and clinical assessment strategies. Medical practitioners should monitor its usage until further models can be rigorously trained on medical data.

Introduction

The cervical spine is a specialized region of the vertebral column that is highly susceptible to traumatic injury [1]. Cervical spine injuries involve a primary traumatic insult to the vertebral column and the spinal cord within, resulting in a secondary disruption of the local nervous tissue, blood vessels, and cell membranes [2]. Acute cervical spine injuries often require immediate intervention because delayed treatment may lead to detrimental outcomes, including complete loss of upper-limb function [3], respiratory failure [4], and autonomic dysreflexia [5]. These interventions are appreciably nuanced given the complex innervations found within the cervical region and the unique atlanto-occipital bone structure. Spine surgeons take on a critical role in the management of cervical spine injuries, taking care to correlate mechanism(s) of injury, patient anatomy, and symptoms with the appropriate clinical assessments and diagnostic tests [6].
To aid with the complex clinical decision-making required in the management of cervical spine injuries, the Congress of Neurological Surgeons (CNS) developed the comprehensive “Management of acute cervical spine and spinal cord injuries” in 2013 [7]. The guidelines present a series of evidence-based recommendations formulated from a critical evaluation of the medical literature by a working group of experts in spinal surgery and neurotrauma.
Artificial intelligence (AI) has garnered public interest as it continues to be successfully integrated into several industries. A large language model (LLM) is a type of AI model designed to understand and generate human language text. Chat Generative Pretrained Transformer, version 4.0 (ChatGPT-4.0; OpenAI, San Francisco, CA, USA) is the latest version of an LLM and demonstrates a commendable capacity to synthesize large volumes of data into responses that are reproducible and simple to understand. Physicians have begun utilizing LLMs in the medical context because recent evidence showed that ChatGPT can pass the USMLE (United States Medical Licensing Examination) examinations [8–11]. As such, clinicians and patients may seek to explore the potential application of ChatGPT as a clinical recommendation and support tool.
Considering the complexity of acute cervical spine and spinal cord injury management, ChatGPT may be a valuable tool for clinicians and patients to consolidate information and understand the appropriate course of action. Thus, this study aimed to explore ChatGPT’s concordance with the “Management of acute cervical spine and spinal cord injuries” published by the CNS [7]. We hypothesized that ChatGPT-4.0 would provide recommendations that were technically accurate but often lacked specificity and thus generally nonconcordant with the guidelines.

Materials and Methods

Ethics statement

Institutional review board approval was not required because ChatGPT is a publicly available resource, and no clinical data or patient information was used in this study.

Study design

This is an original experimental study. The guidelines for the “Management of acute cervical spine and spinal cord injuries” developed by the CNS contain recommendations for 21 relevant topics [7]. The CNS working group also rated the quality of evidence for each recommendation using a modified scale based on the North American Spine Society Schema. Level I is the highest quality evidence, including randomized controlled trials (RCTs) and systematic reviews of RCTs; level II indicates lesser-quality evidence; and level III represents the lowest quality evidence from case series and expert opinions.
All distinct CNS recommendations across the 21 topics were collected and stratified into “clinical assessment,” “diagnostic,” and “treatment” categories. Referencing these categories, a total of 36 questions were generated and validated by the senior author, a board-certified spine surgeon, to ensure clinical relevance. These questions were posed to ChatGPT-4.0 on April 7, 2023. A complete list of these questions, associated CNS recommendations, and subsequent responses is provided in Supplement 1. To prevent the model’s stored memory from biasing future responses, a new window was created when prompting ChatGPT with each question. The questions were prompted to the LLMs only once, with no follow-up questions, simulating a “zero-shot” scenario to assess their baseline capabilities without prior training or learning biases. After compiling ChatGPT’s responses, two reviewers (M.S. and W.A.) graded each response independently as “concordant” or “nonconcordant” to the guidelines. Both reviewers were medical students working as clinical research associates under the training and guidance of the senior author. They had 1 year of research experience in validating AI models with clinical practice guidelines [12–20]. ChatGPT responses and corresponding first-pass grades from the two reviewers were then presented to the senior author for further evaluation (S.C., a board-certified spine surgeon with >20 years of experience). If any grades remained contested or uncertain, relevant ChatGPT responses were presented to select members of the research team, upon which an extensive discussion with the entire team was held until a unanimous agreement was achieved for each prompt.
These specific members of the research team, tasked with evaluating and resolving any ChatGPT responses with conflicting grades, consisted of five additional members: three medical students (M.R.M., A.D., and B.Z.), a sixth-year orthopedic surgery resident actively undergoing an accredited spine surgery fellowship (F.H.), and a second board-certified spine surgeon with 4 years of experience (J.K.). ChatGPT responses were scored using the following grading method: a generated response that faithfully replicated all key aspects of the associated CNS recommendation was classified as “concordant.” If the response failed to sufficiently replicate the key points in the guidelines or directly contradicted the guidelines, the response was graded as “nonconcordant.” To more clearly specify the underlying rationale, nonconcordant responses were further stratified into the following: (1) Insufficient: ChatGPT failed to include one or more key aspects of the recommendation or did not provide adequate specificity. (2) Contradictory: ChatGPT presented a recommendation that was contrary to those put forth in the guidelines.
Finally, associations between the concordance of ChatGPT responses and guideline recommendation strength were assessed, comparing those containing at least one article with level I evidence and those containing only level II or III evidence.
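For illustration, the two-tier grading scheme described above can be sketched as a small decision function. The function name and boolean inputs are hypothetical, introduced only to mirror the Methods text; they are not part of the study protocol:

```python
def grade_response(replicates_all_key_points: bool,
                   contradicts_guideline: bool) -> str:
    """Sketch of the study's grading scheme for a single ChatGPT response.

    A response that faithfully replicates all key aspects of the CNS
    recommendation is 'concordant'; otherwise it is 'nonconcordant',
    further labeled 'contradictory' if it opposes the guideline or
    'insufficient' if it merely omits key aspects or specificity.
    """
    if contradicts_guideline:
        return "nonconcordant: contradictory"
    if replicates_all_key_points:
        return "concordant"
    return "nonconcordant: insufficient"
```

Under this scheme, for example, the corticosteroid response discussed later would map to “nonconcordant: contradictory,” while a response that merely omitted the timing of thromboprophylaxis would map to “nonconcordant: insufficient.”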

Results

In this study, 22/36 (61.1%) of ChatGPT’s responses were concordant with the CNS guidelines (Table 1). Of the remaining 14 (38.9%) nonconcordant responses, 11 (78.6%) were insufficient, and 3 (21.4%) were contradictory. ChatGPT’s responses more frequently aligned with the guidelines regarding questions in the treatment and diagnostic categories, providing 17/24 (70.8%) and 4/7 (57.1%) concordant responses for treatment and diagnostics, respectively. The model demonstrated an inferior performance on questions related to clinical assessments, aligning with the guidelines on only one (20%) of the five questions presented. A majority (66.7%) of the responses within the diagnostic category were nonconcordant because of insufficient details. The three contradictory nonconcordant responses were dispersed evenly among the treatment, diagnostic, and clinical assessment categories.
Notably, the CNS recommendations supported by the highest quality of evidence (level I) were the least likely to be replicated by ChatGPT (20%) (Table 2). Conversely, ChatGPT’s responses were concordant with 80.8% of the recommendations supported exclusively by lower-quality (level II/III) evidence, such as those from case series and suboptimal RCTs. All nonconcordant responses in the level II/III evidence group were considered insufficient, whereas the nonconcordant responses in the level I evidence group were 62.5% insufficient and 37.5% contradictory. Table 3 displays the concordance grade that ChatGPT received for each question, as well as a commentary explaining the rationale for the given grade. A complete list of the questions and their corresponding CNS and ChatGPT recommendations is provided in Supplement 1.
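The headline percentages follow directly from the raw counts; a minimal sketch (the grade labels are illustrative) reproduces the totals reported above:

```python
from collections import Counter

# Per-question grades matching the reported totals (n=36):
# 22 concordant, 11 insufficient, 3 contradictory.
grades = (["concordant"] * 22
          + ["nonconcordant: insufficient"] * 11
          + ["nonconcordant: contradictory"] * 3)

counts = Counter(grades)
n = len(grades)

# Overall concordance: concordant responses over all 36 questions.
concordance_pct = round(100 * counts["concordant"] / n, 1)

# Share of nonconcordant responses that were insufficient (11 of 14).
nonconcordant = n - counts["concordant"]
insufficient_pct = round(
    100 * counts["nonconcordant: insufficient"] / nonconcordant, 1)

print(concordance_pct)   # 61.1
print(insufficient_pct)  # 78.6
```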

Discussion

Acute cervical spine and spinal cord injuries necessitate a multifaceted approach, requiring a comprehensive understanding of the multiple assessment, diagnostic, and treatment modalities [21]. Prompt and effective prehospital care, along with thorough clinical assessment and decision-making, is essential to improve the outcomes of patients with cervical spine injuries [6]. Evidence-based guidelines, such as those published by the CNS, serve as vital references for surgeons to optimally counsel and treat their patients. However, considering that referencing these lengthy resources can be time-consuming and overwhelming, physicians may benefit from utilizing emerging AI technologies that can synthesize data and provide up-to-date management recommendations. This study revealed that ChatGPT was moderately effective in generating recommendations that aligned with the clinical guidelines; however, it frequently provided recommendations that lacked sufficient evidence.

Treatment

ChatGPT performed well when prompted with questions relating to the prehospital and emergency management of cervical spine and spinal cord injuries. It accurately emphasized the need for immediate cervical spine immobilization using a rigid cervical collar, head blocks with straps or tape, and a long board with straps. ChatGPT cited evidence from the Royal College of Surgeons of Edinburgh [22], which presents recommendations that closely resemble the CNS guidelines. However, ChatGPT also vaguely alluded to “a number of organizations” that support the use of a long spinal board, Sked stretcher, or vacuum mattress, without explicitly citing these organizations. Apart from the long backboard, these modalities are not included in the CNS guidelines and remain debatable. For example, a recent study reported that the use of a vacuum mattress leads to significantly higher angular motion in the axial plane that may induce harmful secondary injuries [23]. Although the responses of ChatGPT were considered concordant with the guidelines because they properly synthesized and summarized the various recommendations currently available, they did not delineate those that are considered “gold standard” from those that still lack a universal consensus. Physicians should remain vigilant when referring to ChatGPT and intentionally contextualize its recommendations to topics of academic contention.
Excluding one question, ChatGPT generated concordant recommendations with respect to the treatment of atlantoaxial injuries. Specifically, ChatGPT recommendations aligned with the CNS guidelines regarding the treatment of occipital condyle fractures, atlanto-occipital dislocation injuries, hangman fractures, isolated atlas fractures, and combined atlas–axis fractures, often citing the guidelines themselves. These concepts are inherently narrower in scope than the topics with which ChatGPT did not align, indicating that ChatGPT may generate more accurate responses when prompted with highly specific questions. In contrast, when prompted with a broader question, such as “What is the recommended protocol for treating subaxial ankylosing spondylitis following cervical spinal injury?”, ChatGPT’s response was insufficient: it failed to specify that patients requiring surgical stabilization should undergo posterior long-segment instrumentation and fusion or a combined dorsal and anterior procedure rather than standalone anterior instrumentation and fusion. This is a significant omission considering that standalone anterior instrumentation is associated with a failure rate of up to 50% in this patient population [24]. Given that patients with ankylosing subaxial cervical spine injuries are at high risk for potentially life-threatening recurrent fractures following even minor trauma [25–27], such a reductive recommendation from ChatGPT is dangerously nonspecific. This finding illustrates that ChatGPT is liable to miss critical details when prompted with more generally worded questions.
ChatGPT’s responses were considered nonconcordant for seven questions within the treatment category, six of which were graded as insufficient.
For example, on the topic of thromboembolic prophylaxis, ChatGPT lacked a detailed description of the timing (within 72 hours) and duration (3 months) of pharmacological therapy, as well as the pertinent contraindicated measures recommended by the CNS, such as low-dose heparin therapy, standalone oral anticoagulants, or vena cava filters in select cases. Arnold et al. [28] revealed that expedient timing is of the utmost importance for effective prophylaxis, demonstrating a significant decrease in the risk of deep vein thrombosis when therapy was initiated within the recommended 72 hours. ChatGPT failed to include this recent high-impact study in its response, which may be because most of ChatGPT’s training data comes from an open web-based repository that does not include PubMed and instead prioritizes open-access publications.
A single nonconcordant response from ChatGPT in the treatment category was attributed to a contradictory recommendation. When asked about pharmacological therapy options, ChatGPT referenced FlintReha, a nonacademic neurology blog, when promoting the utility of corticosteroids in minimizing damage following spinal cord injury. This conflicted with the CNS guidelines, which determined that corticosteroids, specifically methylprednisolone and GM-1 ganglioside, are not recommended for the management of acute spinal cord injuries and can even be associated with harmful side effects [29–32]. Currently, no guidelines advocate for corticosteroids as a definitive treatment for acute spinal cord injuries because the most recent evidence fails to prove a significant improvement in outcomes following methylprednisolone therapy [33]. Compared with insufficient recommendations, contradictory recommendations such as this one pose an even greater clinical threat because they advocate for a medical action that is invalid or directly opposes the evidence-based recommendation.

Diagnostic

ChatGPT struggled to generate concordant responses when prompted with questions pertaining to diagnostic recommendations compared with treatment recommendations. Only 57.1% of the responses relating to diagnostic recommendations were graded as concordant, compared with 70.8% relating to treatment recommendations. However, the model’s response was concordant with the CNS guidelines when asked to determine the appropriate methods of diagnosing vertebral arterial injuries following nonsurgical spine trauma. This finding is in contrast to its nonconcordant recommendation for treatments of the same condition. Between the two prompt topics, only one word was changed, i.e., “treatment” to “diagnostic”; however, ChatGPT generated responses of dramatically different accuracy. This level of inquiry sensitivity is concerning and could prove challenging in the clinical setting.
When prompted to provide diagnostic recommendations, ChatGPT generated three nonconcordant responses. Two responses were deemed insufficient because they omitted key details, and one response was contradictory to the CNS guidelines. ChatGPT omitted vital diagnostic recommendations, including the use of computed tomography (CT) to assess the condyle-C1 interval (CCI). In addition, ChatGPT did not mention the contraindications for cervical spine imaging in children aged <3 years. Limiting unnecessary imaging is highly recommended given the increasing incidence of cancer in pediatric populations exposed to ionizing radiation [34,35]. This is one of multiple instances in which ChatGPT failed to discuss pertinent contraindications; thus, a more targeted query would be needed to assess contraindications separately. When providing diagnostic recommendations for atlanto-occipital dislocation injuries, ChatGPT starkly contradicted the CNS guidelines. Instead of recommending CT for CCI determination, ChatGPT proposed X-ray imaging. Although Shim et al. [36] showed that U-Net can be used for image segmentation of X-ray scans with an accuracy of 99%, which can potentially be used to evaluate cervical injuries, this modality must be further evaluated to validate its use in atlanto-occipital dislocations. This demonstrates how ChatGPT may incorporate emerging research without validating the strength of evidence.

Clinical assessments

ChatGPT demonstrated the worst performance when answering questions pertaining to clinical assessments. Of the five questions presented, ChatGPT only generated one concordant recommendation. The model generated a contradictory recommendation when asked about radiographic assessments for awake and asymptomatic patients. ChatGPT suggested that CT, X-ray imaging, and magnetic resonance imaging are indicated for these cases depending on the available equipment, whereas the guidelines posit that awake and asymptomatic patients with normal neurological examinations and functional ranges of motion do not need radiographic evaluations. This recommendation is based on the landmark National Emergency X-Radiography Utilization Study [37], a decision-making protocol defined by five criteria: no midline cervical tenderness, no focal neurologic deficit, normal alertness, no intoxication, and no painful distracting injury. This tool boasts a sensitivity of 99.6% and is routinely used as a safe and effective means of avoiding unnecessary imaging in patients with cervical spine trauma [38,39].

Recommendation strength

Surprisingly, ChatGPT was far less concordant with level I recommendations (20%) than with level II/III recommendations (80.8%). This runs contrary to expectation, given that high-quality research is more likely to appear in high-impact publications with more citations. Accordingly, the much higher volume of lower-quality evidence may bias the model toward concordance with strictly level II/III recommendations. This underscores a serious constraint of ChatGPT as a clinical decision-making tool: it cannot thoroughly evaluate the credibility of the scientific literature that served as its training dataset. Thus, ChatGPT cannot properly weigh the saliency of its reference literature or delineate low- versus high-quality evidence.

Implications

The use of AI models such as ChatGPT in medical contexts introduces important considerations for patients seeking health information. Although ChatGPT and similar tools can offer quick, accessible information, patients must critically evaluate AI-sourced health recommendations. Patients should be encouraged to treat AI responses as preliminary information rather than definitive and consider that AI-generated outputs may lack the context-specific details, up-to-date knowledge, or evidence quality required for sound clinical guidance. Key guidelines for evaluating AI-sourced health information include checking for reputable sources or references, identifying specific evidence behind any recommendations, and recognizing any lack of clarity or ambiguity in responses. Furthermore, patients must verify AI-generated information with healthcare professionals who can contextualize advice based on individual health needs and the latest evidence-based standards. By viewing AI tools as adjuncts to, rather than replacements for, professional medical advice given ChatGPT’s current state, patients can make more informed decisions and avoid potential risks associated with unverified recommendations.

Future directions

This study provides valuable insights into the performance of ChatGPT-4.0 in spinal trauma clinical guidance; however, several areas require further investigation to enhance the clinical utility of LLMs in healthcare. Future research should prioritize training LLMs on up-to-date medical literature. If AI models are not regularly updated with the latest evidence, they will fail to maintain their accuracy and precision. Integrating updated medical information could also involve adapting LLMs such as ChatGPT to align with specific preestablished guidelines, enabling these models to operate within a specific context and deliver complex, highly specialized recommendations. Future research should examine how tailored datasets—such as clinical guidelines for spinal trauma or pain management—might improve the ability of LLMs to generate precise, contextually appropriate advice. Tools such as OpenAI’s “Create a GPT” feature may provide promising avenues for customizing LLMs with specific prompts, uploaded resources, and additional functionalities that address particular clinical needs.
Furthermore, to broaden the applicability of these findings, AI performance must be evaluated within more specialized medical subfields. Future studies could focus on assessing LLMs in areas such as neurosurgery, pediatric orthopedics, or trauma surgery, where clinical needs and decision-making criteria may differ significantly from those in general practice. Exploring the effectiveness of LLMs in these subdomains would not only validate the adaptability of AI in various medical landscapes but also provide insights into necessary model adjustments to enhance accuracy and relevance for distinct clinical subspecialties. These studies may identify unique datasets, guidelines, and case scenarios that better optimize AI performance for more targeted needs.
Lastly, the ethical implications of integrating AI models such as ChatGPT into the healthcare space must be considered. If AI systems are to inevitably become a facet of clinical decision-making, concerns surrounding patient safety, accountability, and data privacy must be addressed. AI models currently lack transparency in their decision-making processes, raising questions about the ability to trace and justify recommendations in high-stakes medical scenarios. In addition, AI may inadvertently perpetuate biases present in training data, which could propagate preexisting disparities in care. Ensuring that AI is used responsibly requires establishing clear protocols for clinician oversight, limitations on autonomous decision-making, and continuous updates to the AI training datasets that reflect current standards of care. Promoting ethical awareness, transparency, and regulatory frameworks will enable healthcare networks to integrate AI responsibly, enhancing patient care while mitigating associated risks.

Limitations

This study has several limitations. First, the CNS clinical guidelines used as a reference were published in 2013. Although this makes them liable to be outdated, it also served as an opportunity to assess ChatGPT’s capability to generate updated recommendations. ChatGPT-4.0 was trained on publicly available data, the vast majority of which predates September 2021. Thus, the model may have excluded more recent literature in its responses. Moreover, the lack of follow-up or clarifying questions for nonconcordant responses may not accurately reflect real-world AI interactions, where users typically prompt additional follow-up queries to obtain more complete information.
Second, this study focuses exclusively on ChatGPT, which restricts the generalizability of our findings within the broader landscape of AI-assisted medical decision-making. Evaluating the responses of other prominent language models, such as Claude, Google Gemini, and Perplexity, would provide a more comprehensive understanding of the capabilities and limitations of various AI platforms in interpreting and applying medical guidelines. Such a comparative analysis could help clarify whether the “nonconcordant” responses observed in this study are specific to ChatGPT-4.0 or reflect a common challenge among multiple models. Future studies incorporating additional AI models are warranted to assess the consistency of guideline adherence across different platforms.
Third, this study exclusively relied on CNS clinical guidelines as a reference. Although these guidelines were chosen as a proxy for gold-standard clinical decision-making, they do not incorporate real-world data or patient outcomes to validate the applicability of ChatGPT’s recommendations in an evolving clinical setting. Assessing the practical relevance and potential effect of AI-generated responses on patient care is challenging if they are not evaluated using actual clinical data. Thus, future studies should prioritize integrating clinical scenarios and patient-reported outcomes as more authentic standards to comprehensively evaluate the utility and reliability of AI-based decision-making tools.
Finally, the procedure used to score ChatGPT’s concordance with the guidelines categories was subjective and did not serve as a precise quantitative assessment of the model’s accuracy. However, this study offers an insightful analysis of ChatGPT-4.0 and its ability to construct evidence-based recommendations for the management of acute cervical spine and spinal cord injuries.

Conclusions

This study demonstrates that although the concordance of ChatGPT-4.0 responses with the evidence-based guidelines is highly inconsistent and unpredictable, it is moderately capable of generating recommendations for the assessment, diagnosis, and treatment of acute cervical spine and spinal cord injuries. It frequently missed critical details, omitted pertinent contraindications, and even offered contradictory recommendations, illustrating its limitations in comprehensively processing the complexities of surgical cervical spine care. To optimize the performance of ChatGPT and other AI models in the clinical setting, they must be further refined and rigorously trained on extensive medical datasets. To utilize AI models such as ChatGPT as clinical decision-making tools, physicians must be engaged with the evolution of such technologies and should exercise caution when evaluating the responses of AI tools.

Key Points

  • ChatGPT-4.0 demonstrated a moderate level of accuracy, with 61.1% of its responses aligning with the Congress of Neurological Surgeons (CNS) guidelines for managing acute cervical spine and spinal cord injuries.

  • ChatGPT’s responses were more frequently concordant with treatment-related questions (70.8%) than with diagnostic (57.1%) or clinical assessment questions (20%).

  • The artificial intelligence (AI) model was significantly less likely to align with recommendations supported by high-quality (level I) evidence, achieving only 20% concordance, while it aligned with 80.8% of recommendations based on lower-quality (level II/III) evidence.

  • The AI model often provided responses that lacked specificity, omitted key details, or contradicted established guidelines, particularly regarding diagnostic imaging and pharmacological recommendations.

  • While ChatGPT may be a useful supplementary tool, it is not reliable as a standalone clinical decision-making resource. Physicians should exercise caution and verify AI-generated recommendations with evidence-based guidelines.

Notes

Conflict of Interest

Samuel Cho has disclosures including roles as a board or committee member for the American Academy of Orthopaedic Surgeons, the American Orthopaedic Association, AOSpine North America, the Cervical Spine Research Society, the North American Spine Society, and the Scoliosis Research Society; fellowship support from Cerapedics and Globus Medical; IP royalties from Globus Medical; and a paid consultant position with SI-Bone. Jun Kim has disclosures as a paid consultant for Stryker and ATEC. No other potential conflict of interest relevant to this article was reported.

Author Contributions

Conceptualization: JK, SC. Methodology: MS, AD, BZ. Formal analysis: MS. Writing: MS, MR, WA, AY. Editing: MS, MR, WA, AY, FH, JM, JK, SC. Final approval of the manuscript: all authors.

Supplementary Materials

Supplementary materials are available from https://doi.org/10.31616/asj.2024.0301.
Supplement 1. List of questions, guideline recommendations, and ChatGPT recommendations.
asj-2024-0301-Supplement-1.pdf

Table 1
ChatGPT concordance scores stratified by topic
                Treatment (n=24)  Diagnostic (n=7)  Clinical assessment (n=5)  Total (n=36)
Concordant      17 (70.8)         4 (57.1)          1 (20.0)                   22 (61.1)
Nonconcordant   7 (29.2)          3 (42.9)          4 (80.0)                   14 (38.9)
 Insufficient   6 (85.7)          2 (66.7)          3 (75.0)                   11 (78.6)
 Contradictory  1 (14.3)          1 (33.3)          1 (25.0)                   3 (21.4)

Values are presented as number (%). Percentages for the insufficient and contradictory rows are relative to the nonconcordant total in each column.

Table 2
ChatGPT concordance scores stratified by level of evidence
                Contains level I evidence  Contains only level II/III evidence
Concordant      2 (20.0)                   21 (80.8)
Nonconcordant   8 (80.0)                   5 (19.2)
 Insufficient   5 (62.5)                   5 (100.0)
 Contradictory  3 (37.5)                   -

Values are presented as number (%). Percentages for the insufficient and contradictory rows are relative to the nonconcordant total in each column.

Table 3
Compiled ChatGPT concordance scores and commentary
Question Commentary Grade
Treatment
 What is the recommended protocol for prehospital cervical spinal immobilization after trauma? ChatGPT generated a response that appropriately emphasized the most critical aspects of the recommendation: spinal immobilization using a rigid cervical collar, head blocks, and a backboard with straps. ChatGPT provided additional information on the Kendrick extrication device, pelvic sling, and vacuum mattress that was supplemental and did not detract from the accuracy of its response. Concordant
 What is the recommended protocol for transporting patients with acute traumatic cervical spine injuries? ChatGPT quoted the guidelines directly, making clear the importance of “expeditious and careful transport… from the site of injury.” Concordant
 What is the recommended protocol for early closed reduction of cervical spinal fracture-dislocation injuries? ChatGPT quoted the guidelines directly, making note of the following critical points: early closed reduction with craniocervical traction, contraindication of closed reduction in patients with an additional rostral injury, and MRI for patients who cannot be examined during the attempted closed reduction for any reason. Concordant
 What is the recommended protocol for acute cardiopulmonary management of patients with cervical spinal cord injuries? ChatGPT quoted the guidelines directly when recommending monitoring in an intensive care unit using cardiac, hemodynamic, and respiratory devices. ChatGPT also appropriately suggested that mean arterial blood pressure should be maintained between 85 mm Hg and 90 mm Hg for the first 7 days following an acute cervical spinal cord injury. Concordant
 What is the recommended pharmacological therapy for acute spinal cord injury? ChatGPT recommended corticosteroid usage, while the guidelines support the contrary. Nonconcordant: contradictory
 What is the recommended protocol for treating occipital condyle fractures? ChatGPT accurately recommended external cervical immobilization using a cervical collar or halo vest device, while also including the caveat that patients with instability may require posterior fusion for occipitocervical stabilization. Concordant
 What is the recommended protocol for treating traumatic atlanto-occipital dislocation injuries? ChatGPT included internal fixation and fusion as mentioned in the guidelines, and also noted the 10% risk of neurological damage following traction. Concordant
 What is the recommended protocol for isolated fractures of the atlas in adults? As noted in the guidelines, ChatGPT stated that the integrity of the transverse atlantal ligament determines whether cervical immobilization or surgical fixation and fusion is recommended. Concordant
 What is the recommended protocol for managing odontoid fractures in adults? ChatGPT only discussed nonsurgical and surgical measures for type II odontoid fractures, but neglected type I and type III fractures additionally discussed in the guidelines. Nonconcordant: insufficient
 What is the recommended protocol for managing Hangman fractures in adults? ChatGPT appropriately recommended external immobilization in the acute management setting, followed by surgical intervention using either C2–C3 fusion or posterior C1–C3 fixation depending on the fracture severity. Concordant
 What is the recommended protocol for managing isolated fractures of the axis body in adults? ChatGPT cited a systematic review claiming there is insufficient evidence to support treatment guidelines for isolated fractures of the axis body, yet still accurately recommended external immobilization, conservative primary management, and potential surgical stabilization for joint instability. Concordant
 What is the recommended protocol for managing combination atlas-axis fractures in adults? ChatGPT recommended rigid external immobilization and cited the appropriate criteria for surgery: atlantoaxial interval of ≥5 mm or angulation of C2 on C3 of ≥11°. Concordant
 What is the recommended protocol for treating subaxial cervical spinal injuries? ChatGPT mentioned the utility of closed or open reduction with the goal of spinal cord decompression, and additionally recommended stable immobilization (via internal fixation or external immobilization) for early patient mobilization, as cited in the guidelines. Concordant
 What is the recommended protocol for treating subaxial ankylosing spondylitis following cervical spinal injury? ChatGPT recommended the routine use of CT and MRI for all trauma victims with ankylosing spondylitis, similar to the guidelines. However, ChatGPT failed to mention that patients who ultimately require surgery should undergo posterior long segment instrumentation and fusion or combined dorsal and anterior procedures, rather than standalone anterior procedures, as the latter have been associated with a failure rate of up to 50%. Nonconcordant: insufficient
 What is the recommended management of acute traumatic central cord syndrome? ChatGPT aligned with all major aspects of the guidelines when recommending medical management in an intensive care unit using cardiac, hemodynamic, and respiratory monitoring. ChatGPT accurately noted a target mean arterial pressure of 85–90 mm Hg during the first week post-injury. Concordant
 What is the recommended protocol for treating cervical spine and spinal cord injuries in children under 8 years old? As is stated in the guidelines, ChatGPT recommended thoracic elevation when restrained supine on an otherwise flat backboard to allow for better neutral alignment and immobilization of the cervical spine. Concordant
 What is the recommended protocol for treating cervical spine and spinal cord injuries in children under 7 years old? ChatGPT mirrored the guidelines when recommending closed reduction and halo immobilization for injuries of the C2 synchondrosis. Concordant
 What is the recommended protocol for treating acute AARF in children? ChatGPT agreed with the CNS guidelines in that it recommended halter traction followed by immobilization with a halo device. Additionally, it suggested fusion for recurrent or irreducible AARF. Concordant
 What is the recommended protocol for treating isolated ligamentous injuries and/or dislocation in cervical spine injuries in children? ChatGPT called for the consideration of primary operative therapy, as seen in the guidelines. Concordant
 What is the recommended protocol for treating pediatric cervical spine and spinal cord injuries that previously failed non-operative management? ChatGPT cited the guidelines directly when suggesting primary operative therapy for injuries that fail non-operative management. Concordant
 What is the recommended protocol for treating spinal cord injury without radiographic abnormality? ChatGPT managed to address two of the three key recommendations: external immobilization for up to 12 weeks and early discontinuation of external immobilization for asymptomatic patients. However, it failed to recommend avoidance of “high-risk” activities for up to 6 months. Nonconcordant: insufficient
 What is the recommended prophylactic treatment for deep venous thrombosis and thromboembolism in patients with cervical spinal cord injuries? While ChatGPT broadly generated the appropriate general recommendations—low molecular weight heparins, rotating beds, or a combination of modalities—it lacked detail regarding the temporality and duration of pharmacologic therapies. It also did not cite the appropriate contraindicated measures, such as vena cava filters and standalone low dose heparin therapy. Nonconcordant: insufficient
 What is the recommended nutritional support for patients following a spinal cord injury? ChatGPT failed to mention both indirect calorimetry and early enteral nutrition within 72 hours, as recommended in the CNS guidelines. Nonconcordant: insufficient
 What is the recommended treatment protocol for vertebral artery injuries following non-penetrating cervical trauma? While ChatGPT broadly recommended the appropriate anticoagulation and antiplatelet therapy, it neglected to specify that the use of these therapies should be tailored to the patient’s vertebral artery injury, the associated injuries, and the risk of bleeding. Nonconcordant: insufficient
Diagnostic recommendations
 What is the recommended protocol for diagnosing occipital condyle fractures? ChatGPT properly recommended CT imaging but neglected to mention the use of MRI for assessing craniocervical ligament integrity. Nonconcordant: insufficient
 What is the recommended protocol for diagnosing traumatic atlanto-occipital dislocation injuries? The guidelines state that CT imaging should be used to assess the craniocervical junction in all patients, as well as determine the condyle-C1 interval in pediatric patients. Conversely, ChatGPT recommended X-ray. Nonconcordant: contradictory
 What is the recommended protocol for diagnosing and evaluating os odontoideum? ChatGPT generated a recommendation identical to that seen in the guidelines, which recommends plain radiographs of the cervical spine (anterior-posterior, open mouth-odontoid, and lateral) and plain dynamic lateral radiographs performed in flexion and extension. These can be done with or without tomography or MRI of the craniocervical junction. Concordant
 What is the recommended protocol for diagnosing pediatric cervical spine and spinal cord injuries? ChatGPT neglected to recommend CT imaging for determining the condyle-C1 interval. Additionally, the AI did not include the many contraindications for cervical spine imaging in children under 3 years of age. Nonconcordant: insufficient
 What is the recommended protocol for diagnosing spinal cord injury without radiographic abnormality? ChatGPT accurately suggested MRI of the region of suspected neurological injury, as well as radiographic screening of the entire spinal column. ChatGPT also noted that neither spinal angiography nor myelography is recommended in evaluating patients with spinal cord injury without radiographic abnormality. Concordant
 What is the recommended diagnostic protocol for vertebral artery injuries following non-penetrating cervical trauma? ChatGPT mirrored the guidelines in recommending computed tomographic angiography as a screening tool after blunt cervical trauma for patients who meet the modified Denver Screening Criteria for suspected vertebral artery injury. It also properly suggested conventional angiography or magnetic resonance angiography for the diagnosis of vertebral artery injury or vertebral subluxation. Concordant
 What is the recommended diagnostic protocol for deep venous thrombosis and thromboembolism in patients with cervical spinal cord injuries? ChatGPT referenced all of the diagnostic tools listed in the guidelines: ultrasound, impedance plethysmography, venous occlusion plethysmography, venography, and clinical examination. Concordant
Clinical assessment recommendations
 What are the recommended clinical assessments for a patient with an acute cervical spinal cord injury? ChatGPT managed to recommend neurological and functional outcome assessments, but did not mention the International Spinal Cord Injury Basic Pain Data Set to assess patients presenting with pain associated with their spinal cord injury. It also recommended the modified Barthel index as a functional outcome assessment, rather than the Spinal Cord Independence Measure III as found in the guidelines. Nonconcordant: insufficient
 What are the recommended radiographic assessments for a patient who sustained a cervical spinal cord injury and is awake and asymptomatic? The guidelines suggest that awake, asymptomatic patients who are without neck pain or tenderness, have a normal neurological examination, and can complete a functional range of motion need not require radiographic evaluation. ChatGPT, conversely, recommended CT, X-ray, and MRI imaging depending on the available equipment and examination results. Nonconcordant: contradictory
 What are the recommended radiographic assessments for a patient who sustained a cervical spinal cord injury and is awake and symptomatic? ChatGPT recommended a three-view cervical spine series, but did not mention CT imaging. It also did not detail the protective de-escalation actions that should be taken if the patient has normal CT imaging or three-view cervical spine series. Nonconcordant: insufficient
 What are the recommended radiographic assessments for a patient who sustained a cervical spinal cord injury and is obtunded/unevaluable? Though ChatGPT’s answer was significantly briefer, it replicated the critical aspects of the guideline recommendations: CT imaging, during which the patient should be immobilized in a cervical spine collar. Concordant
 What are the different classification systems for subaxial cervical spine injury and under what circumstances should they be used? ChatGPT only recommended the Subaxial Injury Classification, the AO spine cervical spine injury classification system, and the Cervical Spine Injury Severity Score. It did not discuss the Harris or Allen classifications, as described in the CNS guidelines. Nonconcordant: insufficient

MRI, magnetic resonance imaging; CT, computed tomography; AARF, atlantoaxial rotatory fixation; CNS, Congress of Neurological Surgeons; AI, artificial intelligence.

Copyright © 2025 by Korean Society of Spine Surgery.
