Can generative artificial intelligence provide accurate medical advice?: a case of ChatGPT versus Congress of Neurological Surgeons management of acute cervical spine and spinal cord injuries clinical guidelines
Abstract
Study Design
An experimental study.
Purpose
To explore the concordance of ChatGPT responses with established national guidelines for the management of cervical spine and spinal cord injuries.
Overview of Literature
ChatGPT-4.0 is an artificial intelligence model that can synthesize large volumes of data and may provide surgeons with recommendations for the management of spinal cord injuries. However, no available literature has quantified ChatGPT’s capacity to provide accurate recommendations for the management of cervical spine and spinal cord injuries.
Methods
Referencing the “Management of acute cervical spine and spinal cord injuries” guidelines published by the Congress of Neurological Surgeons (CNS), a total of 36 questions were formulated. Questions were stratified into therapeutic, diagnostic, or clinical assessment categories as seen in the guidelines. Questions were secondarily grouped according to whether the corresponding recommendation contained level I evidence (highest quality) versus only level II/III evidence (moderate and low quality). ChatGPT-4.0 was prompted with each question, and its responses were assessed by two independent reviewers as “concordant” or “nonconcordant” with the CNS clinical guidelines. “Nonconcordant” responses were further classified into “insufficient” and “contradictory” categories.
Results
In this study, 22/36 (61.1%) of ChatGPT’s responses were concordant with the CNS guidelines. ChatGPT’s responses aligned with 17/24 (70.8%) therapeutic questions and 4/7 (57.1%) diagnostic questions but with only one of the five clinical assessment questions. Notably, the recommendations supported by level I evidence were the least likely to be replicated by ChatGPT, whereas ChatGPT’s responses agreed with 80.8% of the recommendations supported exclusively by level II/III evidence.
Conclusions
ChatGPT-4.0 was moderately accurate when generating recommendations that aligned with the clinical guidelines. The model most often aligned with therapeutic recommendations and with recommendations supported only by lower-quality evidence but performed worse on topics supported by high-quality evidence or pertaining to diagnostic and clinical assessment strategies. Medical practitioners should monitor its usage until future models can be rigorously trained on medical data.
Introduction
The cervical spine is a specialized region of the vertebral column that is highly susceptible to traumatic injury [1]. Cervical spine injuries involve a primary traumatic insult to the vertebral column and the spinal cord within, resulting in a secondary disruption of the local nervous tissue, blood vessels, and cell membranes [2]. Acute cervical spine injuries often require immediate intervention because delayed treatment may lead to detrimental outcomes, including complete loss of upper-limb function [3], respiratory failure [4], and autonomic dysreflexia [5]. These interventions are appreciably nuanced given the complex innervations found within the cervical region and the unique atlanto-occipital bone structure. Spine surgeons take on a critical role in the management of cervical spine injuries, taking care to correlate mechanism(s) of injury, patient anatomy, and symptoms with the appropriate clinical assessments and diagnostic tests [6].
To aid with the complex clinical decision-making required in the management of cervical spine injuries, the Congress of Neurological Surgeons (CNS) developed the comprehensive “Management of acute cervical spine and spinal cord injuries” in 2013 [7]. The guidelines present a series of evidence-based recommendations formulated from a critical evaluation of the medical literature by a working group of experts in spinal surgery and neurotrauma.
Artificial intelligence (AI) has garnered public interest as it continues to be successfully integrated into numerous industries. A large language model (LLM) is a type of AI model designed to understand and generate human language text. Chat Generative Pretrained Transformer, version 4.0 (ChatGPT-4.0; OpenAI, San Francisco, CA, USA) is the latest iteration of one such LLM and demonstrates a commendable capacity to synthesize large volumes of data into responses that are reproducible and easy to understand. Physicians have begun exploring LLMs in the medical context because recent evidence showed that ChatGPT can pass the United States Medical Licensing Examination (USMLE) [8–11]. As such, clinicians and patients may seek to explore the potential application of ChatGPT as a clinical recommendation and support tool.
Considering the complexity of acute cervical spine and spinal cord injury management, ChatGPT may be a valuable tool for clinicians and patients to consolidate information and understand the appropriate course of action. Thus, this study aimed to explore ChatGPT’s concordance with the “Management of acute cervical spine and spinal cord injuries” guidelines published by the CNS [7]. We hypothesized that ChatGPT-4.0 would provide recommendations that were technically accurate but often lacked specificity and would thus be generally nonconcordant with the guidelines.
Materials and Methods
Ethics statement
Institutional review board approval was not required because ChatGPT is a publicly available resource, and no clinical data or patient information was used in this study.
Study design
This is an original experimental study. The guidelines for the “Management of acute cervical spine and spinal cord injuries” developed by the CNS contain recommendations for 21 relevant topics [7]. The CNS working group also rated the quality of evidence for each recommendation using a modified scale based on the North American Spine Society schema: level I is the highest quality evidence, including randomized controlled trials (RCTs) and systematic reviews of RCTs; level II indicates lesser-quality evidence; and level III represents the lowest quality evidence, derived from case series and expert opinions.
All distinct CNS recommendations across the 21 topics were collected and stratified into “clinical assessment,” “diagnostic,” and “treatment” categories. Referencing these categories, a total of 36 questions were generated and validated by the senior author, a board-certified spine surgeon, to ensure clinical relevance. These questions were posed to ChatGPT-4.0 on April 7, 2023. A complete list of the questions, the associated CNS recommendations, and the subsequent responses is provided in Supplement 1. To prevent the model’s stored memory from biasing subsequent responses, a new chat window was created for each question. Each question was prompted only once, with no follow-up questions, simulating a “zero-shot” scenario that assesses the model’s baseline capability without prior training or learning biases.

After compiling ChatGPT’s responses, two reviewers (M.S. and W.A.) independently graded each response as “concordant” or “nonconcordant” with the guidelines. Both reviewers were medical students working as clinical research associates under the training and guidance of the senior author and had 1 year of research experience in validating AI models against clinical practice guidelines [12–20]. ChatGPT responses and the corresponding first-pass grades from the two reviewers were then presented to the senior author for further evaluation (S.C., a board-certified spine surgeon with >20 years of experience). If any grades remained contested or uncertain, the relevant ChatGPT responses were presented to select members of the research team, and an extensive discussion with the entire team was held until a unanimous agreement was achieved for each prompt. These members, tasked with evaluating and resolving ChatGPT responses with conflicting grades, consisted of five additional individuals: three medical students (M.R.M., A.D., and B.Z.), a sixth-year orthopedic surgery resident currently completing an accredited spine surgery fellowship (F.H.), and a second board-certified spine surgeon with 4 years of experience (J.K.).

ChatGPT responses were scored using the following grading method: a generated response that faithfully replicated all key aspects of the associated CNS recommendation was classified as “concordant.” If the response failed to sufficiently replicate the key points in the guidelines or directly contradicted the guidelines, it was graded as “nonconcordant.” To specify the underlying rationale, nonconcordant responses were further stratified as follows: (1) insufficient: ChatGPT failed to include one or more key aspects of the recommendation or did not provide adequate specificity; (2) contradictory: ChatGPT presented a recommendation contrary to those put forth in the guidelines.
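For readers interested in reproducing this kind of workflow, the sketch below illustrates how a comparable zero-shot query protocol could be automated with the OpenAI Python SDK. It is a minimal illustration only: the present study prompted ChatGPT-4.0 through its web interface with a fresh window per question, and the file names, column names, and model identifier shown here are assumptions rather than study materials.

```python
# Hypothetical sketch: automating the zero-shot querying protocol described above
# with the OpenAI Python SDK (openai>=1.0). The study itself used the ChatGPT web
# interface with a new window per question; file names, column names, and the
# model identifier below are placeholders, not study materials.
import csv

from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def ask_zero_shot(question: str) -> str:
    """Send a single question with no chat history, mirroring a fresh chat window."""
    response = client.chat.completions.create(
        model="gpt-4",      # placeholder model identifier
        temperature=0,      # reduce run-to-run variability
        messages=[{"role": "user", "content": question}],  # no system prompt, no follow-ups
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    with open("cns_questions.csv", newline="") as f_in, \
            open("chatgpt_responses.csv", "w", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(["category", "question", "response"])
        for row in csv.DictReader(f_in):  # assumed columns: category, question
            writer.writerow([row["category"], row["question"], ask_zero_shot(row["question"])])
```

Grading of each saved response against the corresponding CNS recommendation would still be performed manually by the reviewers, as described above.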
Finally, the association between the concordance of ChatGPT responses and the strength of the underlying recommendation was assessed by comparing recommendations supported by at least one level I study with those supported only by level II or III evidence.
Results
In this study, 22/36 (61.1%) of ChatGPT’s responses were concordant with the CNS guidelines (Table 1). Of the remaining 14 (38.9%) nonconcordant responses, 11 (78.6%) were insufficient, and 3 (21.4%) were contradictory. ChatGPT’s responses more frequently aligned with the guidelines for questions in the treatment and diagnostic categories, providing 17/24 (70.8%) and 4/7 (57.1%) concordant responses, respectively. The model demonstrated inferior performance on questions related to clinical assessments, aligning with the guidelines on only one (20%) of the five questions presented. Within the diagnostic category, a majority (66.7%) of the nonconcordant responses were attributed to insufficient detail. The three contradictory responses were dispersed evenly among the treatment, diagnostic, and clinical assessment categories.
Notably, the CNS recommendations supported by the highest quality of evidence (level I) were the least likely to be replicated by ChatGPT (20%) (Table 2). Conversely, ChatGPT’s responses were concordant with 80.8% of the recommendations supported exclusively by lower-quality (level II/III) evidence, such as evidence from case series and suboptimal RCTs. All nonconcordant responses in the level II/III evidence group were considered insufficient, whereas the nonconcordant responses in the level I evidence group were 62.5% insufficient and 37.5% contradictory. Table 3 displays the concordance grade that ChatGPT received for each question, along with a commentary explaining the rationale for the given grade. A complete list of the questions and their corresponding CNS and ChatGPT recommendations is provided in Supplement 1.
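The comparison above is reported descriptively, and the study does not state that a formal significance test was applied. Purely as an illustration of how the association between evidence level and concordance could be examined, the sketch below applies Fisher’s exact test to a 2×2 table; the cell counts are placeholders chosen to be consistent with the reported percentages, and the authoritative values appear in Table 2.

```python
# Illustrative only: a Fisher's exact test on a 2x2 table of concordance by
# evidence level. The counts are placeholders consistent with the reported
# percentages (20% and 80.8%); the study itself reports only descriptive rates.
from scipy.stats import fisher_exact

#                concordant, nonconcordant
level_i      = [2, 8]    # placeholder counts: recommendations with level I evidence
level_ii_iii = [21, 5]   # placeholder counts: recommendations with only level II/III evidence

odds_ratio, p_value = fisher_exact([level_i, level_ii_iii])

print(f"Level I concordance:      {level_i[0] / sum(level_i):.1%}")           # 20.0%
print(f"Level II/III concordance: {level_ii_iii[0] / sum(level_ii_iii):.1%}")  # 80.8%
print(f"Fisher's exact test: odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```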
Discussion
Acute cervical spine and spinal cord injuries necessitate a multifaceted approach, requiring a comprehensive understanding of the multiple assessment, diagnostic, and treatment modalities [21]. Prompt and effective prehospital care, along with thorough clinical assessment and decision-making, is essential to improve the outcomes of patients with cervical spine injuries [6]. Evidence-based guidelines, such as those published by the CNS, serve as vital references for surgeons to optimally counsel and treat their patients. However, considering that referencing these lengthy resources can be time-consuming and overwhelming, physicians may benefit from utilizing emerging AI technologies that can synthesize data and provide up-to-date management recommendations. This study revealed that ChatGPT was moderately effective in generating recommendations that aligned with the clinical guidelines; however, it frequently provided recommendations that lacked sufficient detail or specificity.
Treatment
ChatGPT performed well when prompted with questions relating to the prehospital and emergency management of cervical spine and spinal cord injuries. It accurately emphasized the need for immediate cervical spine immobilization using a rigid cervical collar, head blocks with straps or tape, and a long board with straps. ChatGPT cited evidence from the Royal College of Surgeons of Edinburgh [22], which presents recommendations that closely resemble the CNS guidelines. However, ChatGPT also vaguely alluded to “a number of organizations” that support the use of a long spinal board, Sked stretcher, or vacuum mattress, without explicitly citing these organizations. Apart from the long spinal board, these modalities were not included in the CNS guidelines and remain debatable. For example, a recent study reported that the use of a vacuum mattress leads to significantly higher angular motion in the axial plane, which may induce harmful secondary injuries [23]. Although ChatGPT’s responses were considered concordant with the guidelines because they properly synthesized and summarized the various recommendations currently available, they did not delineate which are considered the “gold standard” and which still lack a universal consensus. Physicians should remain vigilant when referring to ChatGPT and deliberately contextualize its recommendations on topics of ongoing academic contention.
Excluding one question, ChatGPT generated concordant recommendations with respect to the treatment of atlantoaxial injuries. Specifically, ChatGPT’s recommendations aligned with the CNS guidelines regarding the treatment of occipital condyle fractures, atlanto-occipital dislocation injuries, hangman’s fractures, isolated atlas fractures, and combined atlas–axis fractures, often citing the guidelines themselves. These concepts are inherently narrower in scope than the topics with which ChatGPT did not align, indicating that ChatGPT may generate more accurate responses when prompted with highly specific questions. In contrast, when prompted with a broader question, such as “What is the recommended protocol for treating subaxial ankylosing spondylitis following cervical spinal injury?”, ChatGPT’s response was graded insufficient because it failed to specify that patients requiring surgical stabilization should undergo posterior long-segment instrumentation and fusion or a combined dorsal and anterior procedure rather than standalone anterior instrumentation and fusion. This is a significant omission considering that standalone anterior instrumentation is associated with a failure rate of up to 50% in this patient population [24]. Given that patients with ankylosing subaxial cervical spine injuries are at high risk for potentially life-threatening recurrent fractures following even minor trauma [25–27], such a reductive recommendation from ChatGPT is dangerously nonspecific. This finding illustrates that ChatGPT is liable to miss critical details when prompted with more generally worded questions.
ChatGPT’s responses were considered nonconcordant for seven questions within the treatment category, six of which were graded insufficient.
For example, on the topic of thromboembolic prophylaxis, ChatGPT’s response lacked a detailed description of the timing (within 72 hours) and duration (3 months) of pharmacological therapy, as well as the measures the CNS guidelines recommend against, such as low-dose heparin therapy, standalone oral anticoagulants, or vena cava filters in select cases. Arnold et al. [28] revealed that expedient timing is of the utmost importance for effective prophylaxis, demonstrating a significant decrease in the risk of deep vein thrombosis when therapy was initiated within the recommended 72 hours. ChatGPT failed to include this recent high-impact study in its response, which may be because most of ChatGPT’s training data comes from an open web-based repository that does not include PubMed and instead prioritizes open-access publications.
A single nonconcordant response from ChatGPT in the treatment category was attributed to a contradictory recommendation. When asked about pharmacological therapy options, ChatGPT referenced FlintReha, a nonacademic neurology blog, when promoting the utility of corticosteroids in minimizing damage following spinal cord injury. This conflicted with the CNS guidelines, which determined that methylprednisolone (a corticosteroid) and GM-1 ganglioside are not recommended for the management of acute spinal cord injuries and can even be associated with harmful side effects [29–32]. Currently, no guidelines advocate for corticosteroids as a definitive treatment for acute spinal cord injuries because the most recent evidence fails to demonstrate a significant improvement in outcomes following methylprednisolone therapy [33]. Compared with insufficient recommendations, contradictory recommendations such as this one pose an even greater clinical threat because they advocate for a medical action that is invalid or directly opposes the evidence-based recommendation.
Diagnostic
ChatGPT struggled to generate concordant responses when prompted with questions pertaining to diagnostic recommendations compared with treatment recommendations: only 57.1% of the responses relating to diagnostic recommendations were graded as concordant, compared with 70.8% of those relating to treatment recommendations. However, the model’s response was concordant with the CNS guidelines when asked to determine the appropriate methods of diagnosing vertebral arterial injuries following nonsurgical spine trauma. This finding contrasts with its nonconcordant recommendation for the treatment of the same condition. Between the two prompts, only one word was changed (“treatment” to “diagnostic”), yet ChatGPT generated responses of dramatically different accuracy. This sensitivity to prompt wording is concerning and could prove challenging in the clinical setting.
When prompted to provide diagnostic recommendations, ChatGPT generated three nonconcordant responses: two were deemed insufficient, and one contradicted the CNS guidelines. ChatGPT omitted vital diagnostic recommendations, including the use of computed tomography (CT) to assess the condyle–C1 interval (CCI). In addition, ChatGPT did not mention the contraindications to cervical spine imaging in children aged <3 years. Limiting unnecessary imaging is highly recommended given the increased incidence of cancer in pediatric populations exposed to ionizing radiation [34,35]. This is one of multiple instances in which ChatGPT failed to discuss pertinent contraindications; thus, a more targeted query would be needed to assess contraindications separately. When providing diagnostic recommendations for atlanto-occipital dislocation injuries, ChatGPT starkly contradicted the CNS guidelines: instead of recommending CT for CCI determination, it proposed X-ray imaging. Although Shim et al. [36] showed that a U-Net can segment cervical X-ray images with an accuracy of 99%, a capability that could potentially aid in evaluating cervical injuries, this modality must be further evaluated before its use in atlanto-occipital dislocations can be validated. This demonstrates how ChatGPT may incorporate emerging research without validating the strength of the evidence.
Clinical assessments
ChatGPT demonstrated its worst performance when answering questions pertaining to clinical assessments. Of the five questions presented, ChatGPT generated only one concordant recommendation. The model generated a contradictory recommendation when asked about radiographic assessment of awake and asymptomatic patients. ChatGPT suggested that CT, X-ray imaging, and magnetic resonance imaging are indicated for these cases depending on the available equipment, whereas the guidelines posit that awake and asymptomatic patients with normal neurological examinations and functional ranges of motion do not need radiographic evaluation. This recommendation is based on the landmark National Emergency X-Radiography Utilization Study (NEXUS) [37], a decision-making protocol defined by five criteria: no midline cervical tenderness, no focal neurologic deficit, normal alertness, no intoxication, and no painful distracting injury. This tool boasts a sensitivity of 99.6% and is routinely used as a safe and effective means of avoiding unnecessary imaging in patients with cervical spine trauma [38,39].
Recommendation strength
Surprisingly, ChatGPT was far less concordant with level I recommendations (20%) than with level II/III recommendations (80.8%). This is contrary to what would be expected, as high-quality research is more likely to be published in high-impact journals and accrue more citations. However, the much larger volume of lower-quality evidence in the literature may bias the model toward concordance with recommendations supported only by level II/III evidence. This underscores a serious constraint of ChatGPT as a clinical decision-making tool: its inability to thoroughly evaluate the credibility of the scientific evidence in the literature that served as its training dataset. Thus, ChatGPT cannot properly weigh the salience of its reference literature or delineate low- versus high-quality evidence.
Implications
The use of AI models such as ChatGPT in medical contexts introduces important considerations for patients seeking health information. Although ChatGPT and similar tools can offer quick, accessible information, patients must critically evaluate AI-sourced health recommendations. Patients should be encouraged to treat AI responses as preliminary information rather than definitive and consider that AI-generated outputs may lack the context-specific details, up-to-date knowledge, or evidence quality required for sound clinical guidance. Key guidelines for evaluating AI-sourced health information include checking for reputable sources or references, identifying specific evidence behind any recommendations, and recognizing any lack of clarity or ambiguity in responses. Furthermore, patients must verify AI-generated information with healthcare professionals who can contextualize advice based on individual health needs and the latest evidence-based standards. By viewing AI tools as adjuncts to, rather than replacements for, professional medical advice given ChatGPT’s current state, patients can make more informed decisions and avoid potential risks associated with unverified recommendations.
Future directions
This study provides valuable insights into the performance of ChatGPT-4.0 in spinal trauma clinical guidance; however, several areas require further investigation to enhance the clinical utility of LLMs in healthcare. Future research should prioritize training LLMs on up-to-date medical literature; if AI models are not regularly updated with the latest evidence, they will fail to maintain their accuracy and precision. Integrating updated medical information could also involve adapting LLMs such as ChatGPT to align with specific preestablished guidelines, enabling these models to operate within a specific context and deliver complex, highly specialized recommendations. Future research should examine how tailored datasets, such as clinical guidelines for spinal trauma or pain management, might improve the ability of LLMs to generate precise, contextually appropriate advice. Tools such as OpenAI’s “Create a GPT” feature may provide promising avenues for customizing LLMs with specific prompts, uploaded resources, and additional functionalities that address particular clinical needs.
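As a concrete, hypothetical example of the guideline-alignment approach described above, the sketch below grounds a model in a specific guideline document through a system prompt via the OpenAI API, approximating what the “Create a GPT” interface offers. The file name and model identifier are assumptions, and a production system would need retrieval over the full guideline text rather than pasting an excerpt verbatim.

```python
# Hypothetical sketch: constraining an LLM to a specific guideline via a system
# prompt, as a programmatic approximation of OpenAI's "Create a GPT" feature.
# File name and model identifier are placeholders; this is not the study's method.
from openai import OpenAI

client = OpenAI()

# Guideline text that should anchor every answer (placeholder file).
with open("cns_cervical_spine_guidelines.txt") as f:
    guideline_excerpt = f.read()

SYSTEM_PROMPT = (
    "You are a clinical decision-support assistant. Answer ONLY from the guideline "
    "excerpt provided below. Quote the relevant recommendation, state its evidence "
    "level, and reply 'not addressed by the guideline' if the excerpt does not "
    "cover the question.\n\n=== GUIDELINE EXCERPT ===\n" + guideline_excerpt
)


def guideline_grounded_answer(question: str) -> str:
    """Return an answer constrained to the supplied guideline excerpt."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


print(guideline_grounded_answer(
    "What is the recommended treatment for an isolated fracture of the atlas?"
))
```

Evaluating such a guideline-grounded configuration against the same 36 questions would indicate whether grounding improves concordance relative to the zero-shot baseline reported here.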
Furthermore, to broaden the applicability of these findings, AI performance must be evaluated within more specialized medical subfields. Future studies could focus on assessing LLMs in areas such as neurosurgery, pediatric orthopedics, or trauma surgery, where clinical needs and decision-making criteria may differ significantly from those in general practice. Exploring the effectiveness of LLMs in these subdomains would not only validate the adaptability of AI in various medical landscapes but also provide insights into necessary model adjustments to enhance accuracy and relevance for distinct clinical subspecialties. These studies may identify unique datasets, guidelines, and case scenarios that better optimize AI performance for more targeted needs.
Lastly, the ethical implications of integrating AI models such as ChatGPT into the healthcare space must be considered. If AI systems are to inevitably become a facet of clinical decision-making, concerns surrounding patient safety, accountability, and data privacy must be addressed. AI models currently lack transparency in their decision-making processes, raising questions about the ability to trace and justify recommendations in high-stakes medical scenarios. In addition, AI may inadvertently perpetuate biases present in training data, which could propagate preexisting disparities in care. Ensuring that AI is used responsibly requires establishing clear protocols for clinician oversight, limitations on autonomous decision-making, and continuous updates to the AI training datasets that reflect current standards of care. Promoting ethical awareness, transparency, and regulatory frameworks will enable healthcare networks to integrate AI responsibly, enhancing patient care while mitigating the associated risks.
Limitations
This study has several limitations. First, the CNS clinical guidelines used as a reference were published in 2013. Although this makes them liable to be outdated, it also served as an opportunity to assess ChatGPT’s capability to generate updated recommendations. ChatGPT-4.0 was trained on publicly available data, the vast majority of which predates September 2021. Thus, the model may have excluded more recent literature in its responses. Moreover, the lack of follow-up or clarifying questions for nonconcordant responses may not accurately reflect real-world AI interactions, where users typically prompt additional follow-up queries to obtain more complete information.
Second, this study focuses exclusively on ChatGPT, which restricts the generalizability of our findings within the broader landscape of AI-assisted medical decision-making. Evaluating the responses of other prominent language models, such as Claude, Google Gemini, and Perplexity, would provide a more comprehensive understanding of the capabilities and limitations of various AI platforms in interpreting and applying medical guidelines. Such a comparative analysis could help clarify whether the “nonconcordant” responses observed in this study are specific to ChatGPT-4.0 or reflect a common challenge among multiple models. Future studies incorporating additional AI models are warranted to assess the consistency of guideline adherence across different platforms.
Third, this study exclusively relied on CNS clinical guidelines as a reference. Although these guidelines were chosen as a proxy for gold-standard clinical decision-making, they do not incorporate real-world data or patient outcomes to validate the applicability of ChatGPT’s recommendations in an evolving clinical setting. Assessing the practical relevance and potential effect of AI-generated responses on patient care is challenging if they are not evaluated using actual clinical data. Thus, future studies should prioritize integrating clinical scenarios and patient-reported outcomes as more authentic standards to comprehensively evaluate the utility and reliability of AI-based decision-making tools.
Finally, the procedure used to score ChatGPT’s concordance with the guideline recommendations was subjective and does not constitute a precise quantitative assessment of the model’s accuracy. Nevertheless, this study offers an insightful analysis of ChatGPT-4.0 and its ability to construct evidence-based recommendations for the management of acute cervical spine and spinal cord injuries.
Conclusions
This study demonstrates that ChatGPT-4.0 is moderately capable of generating recommendations for the assessment, diagnosis, and treatment of acute cervical spine and spinal cord injuries, although its concordance with the evidence-based guidelines is highly inconsistent and unpredictable. It frequently missed critical details, omitted pertinent contraindications, and even offered contradictory recommendations, illustrating its limitations in comprehensively processing the complexities of surgical cervical spine care. To optimize the performance of ChatGPT and other AI models in the clinical setting, they must be further refined and rigorously trained on extensive medical datasets. To utilize AI models such as ChatGPT as clinical decision-making tools, physicians must stay engaged with the evolution of such technologies and should exercise caution when evaluating the responses of AI tools.
Key Points
ChatGPT-4.0 demonstrated a moderate level of accuracy, with 61.1% of its responses aligning with the Congress of Neurological Surgeons (CNS) guidelines for managing acute cervical spine and spinal cord injuries.
ChatGPT’s responses were more frequently concordant with treatment-related questions (70.8%) than with diagnostic (57.1%) or clinical assessment questions (20%).
The artificial intelligence (AI) model was significantly less likely to align with recommendations supported by high-quality (level I) evidence, achieving only 20% concordance, while it aligned with 80.8% of recommendations based on lower-quality (level II/III) evidence.
The AI model often provided responses that lacked specificity, omitted key details, or contradicted established guidelines, particularly regarding diagnostic imaging and pharmacological recommendations.
While ChatGPT may be a useful supplementary tool, it is not reliable as a standalone clinical decision-making resource. Physicians should exercise caution and verify AI-generated recommendations with evidence-based guidelines.
Notes
Conflict of Interest
Samuel Cho has disclosures including roles as a board or committee member for the American Academy of Orthopaedic Surgeons, the American Orthopaedic Association, AOSpine North America, the Cervical Spine Research Society, the North American Spine Society, and the Scoliosis Research Society; fellowship support from Cerapedics and Globus Medical; IP royalties from Globus Medical; and a paid consultant position with SI-Bone. Jun Kim has disclosures as a paid consultant for Stryker and ATEC. No other potential conflict of interest relevant to this article was reported.
Author Contributions
Conceptualization: JK, SC. Methodology: MS, AD, BZ. Formal analysis: MS. Writing: MS, MR, WA, AY. Editing: MS, MR, WA, AY, FH, JM, JK, SC. Final approval of the manuscript: all authors.
Supplementary Materials
Supplementary materials are available from https://doi.org/10.31616/asj.2024.0301.
Supplement 1. List of questions, guideline recommendations, and ChatGPT recommendations.
asj-2024-0301-Supplement-1.pdf