Pitfalls of Using ChatGPT and GPT-4 for Data Science

If you are reading any scientific journal right now, there’s no question that artificial intelligence (AI) is a hot topic once more. Just about every issue you pick up will be peppered with updates on how AI is changing any given subfield of science. Clinical Infectious Diseases (CID) recently published a Viewpoint on how large language models will change microbiology,1 which caught my eye.

In general, I think these articles are important and helpful! As my colleague Sanjat Kanjilal pointed out in another CID editorial,2 there is no question these technologies are fundamentally shifting the boundary of the tasks we imagine delegating to machines. And given that anybody can access large language models (LLMs) for very little cost,3 they will absolutely have far-reaching impact on both medicine and science. It’s helpful to discuss what those impacts could be, even if this train of thought occasionally veers into the hypothetical and fantastic. I am certainly no stranger to making optimistic claims about how certain technologies, e.g. next-generation sequencing, will revolutionize various aspects of infectious diseases,4 and then finding that some of these claims may be slightly delayed in the event of a global pandemic.

On the other hand, I’m a programmer and data scientist, so I will be a stickler over claims that AI is changing specific aspects of those disciplines, particularly as a means toward some larger goal like the improvement of clinical microbiology. On this front, I often find that the truth is much, much more nuanced than the way it’s presented in these articles.

For instance, the following examples were included in the aforementioned Viewpoint1 to support the assertion that “GPT-4 is used to analyze data by providing SQL and R scripts,” which I take to mean that GPT-4 can help write SQL and R code for such tasks. But truthfully, the examples show quite the opposite! Can you spot the problems? Here’s the first one:

Human: Please provide a SQL code to access a microbiology database. The code should look for Staphylococcus spp. (but exclude Staphylococcus aureus) and include strain information on oxacillin resistance and age. An additional criteria is that the original sample material must only include biopsies and implants from the University Hospital Zurich. Please comment the code and explain it.

GPT-4: Sure, here is an example of SQL code that can be used to access a microbiology database and retrieve information about strains that meet the specified criteria:

-- This code selects specific columns from the table
-- 'microbiology_samples' where the bacteria is "Staphylococcus spp."
-- (excluding "Staphylococcus aureus"), the resistance to "oxacillin" is
-- true, the age of patient is greater than 65 years old and the original
-- sample material only includes biopsies from the University Hospital
-- Zurich.

SELECT patient_id, patient_name, age, hospital, bacteria, resistance,
       sample_material
FROM microbiology_samples
WHERE age > 65 AND
      hospital = 'University Hospital Zurich' AND
      sample_material = 'biopsy' AND
      bacteria LIKE 'Staphylococcus %' AND
      bacteria NOT LIKE 'Staphylococcus aureus' AND
      resistance = 'oxacillin'

GPT-4: This code selects specific columns (patient_id, patient_name, age, hospital, bacteria, resistance, sample_material) from the table microbiology_samples where the age of patient is greater than 65, the original sample material only includes biopsies from the University Hospital Zurich, the bacteria is like ‘Staphylococcus %’ (all Staphylococcus species) and not like ‘Staphylococcus aureus’ and the resistance is equal to ‘oxacillin’.

It is important to note that this is a sample code and it may not run on your specific database without modification. The table and column names, as well as the database structure, may be different depending on the system you are using. Additionally, you may need to adjust the logic and syntax to match the specific requirements of your database management system.

See anything unusual? Set aside the vagueness of the human’s prompt, which omits pertinent details like the structure of the database and the dialect of SQL (a bit like asking “How do I upgrade my car’s brakes?” without specifying the make and model of the car). The response has major problems, even if GPT-4 cautions that it is pseudocode and not runnable as-is.

  1. GPT-4 interprets “age” as the patient’s age, but the prompt suggested it’s a property of the strain.
  2. OK, even if it’s the patient’s age: why filter to >65? That’s not what the user asked.
  3. Why filter to biopsies but omit implants? Also not what the user asked.
  4. The user asked for information on oxacillin resistance, not to filter to oxacillin-resistant isolates.
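For comparison, here is a minimal sketch of a query closer to what the prompt asked for, run from R against a throwaway in-memory SQLite database. The schema, the column names (isolate_id, oxacillin_resistant), the toy rows, and the use of the RSQLite package are all assumptions made up for illustration, since the real database was never described. The sketch keeps both biopsies and implants, reports oxacillin resistance as a column instead of filtering on it, and leaves age unfiltered, because whether it belongs to the patient or the strain is still ambiguous.

```r
library(DBI)

# Hypothetical table and toy rows, invented purely for illustration
con <- dbConnect(RSQLite::SQLite(), ":memory:")
samples <- data.frame(
  isolate_id = 1:5,
  bacteria = c("Staphylococcus epidermidis", "Staphylococcus aureus",
               "Staphylococcus hominis", "Escherichia coli",
               "Staphylococcus capitis"),
  oxacillin_resistant = c(1L, 1L, 0L, 0L, 1L),
  age = c(71, 54, 33, 80, 62),
  sample_material = c("biopsy", "biopsy", "implant", "biopsy", "blood"),
  hospital = rep("University Hospital Zurich", 5),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "microbiology_samples", samples)

# Non-aureus Staphylococcus from biopsies OR implants at one hospital.
# Resistance and age are returned as columns, not used as filters.
query <- "
  SELECT isolate_id, bacteria, oxacillin_resistant, age, sample_material
  FROM microbiology_samples
  WHERE hospital = 'University Hospital Zurich'
    AND sample_material IN ('biopsy', 'implant')
    AND bacteria LIKE 'Staphylococcus %'
    AND bacteria <> 'Staphylococcus aureus'
  ORDER BY isolate_id
"
result <- dbGetQuery(con, query)
dbDisconnect(con)

print(result)
```

With these toy rows, only the S. epidermidis biopsy and the S. hominis implant should come back; the S. aureus, the E. coli, and the blood isolate are excluded.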

And I could go on. But even for pseudocode this is … pretty off base. Let’s look at the example given for writing R code, for which GPT-4 sounds more confident and doesn’t include any disclaimers about its output.

Human: Now please write a code in R based on this SQL output. The code should allow to visualize the output data with a boxplot (with median and IQR and min/max) of age distribution of people with a oxacillin resistant and sensitive strain. The graph with resistance should be red, the one for sensitive strains should be green. Please comment and explain the code.

GPT-4: Sure, here is an example of R code that can be used to visualize the output data from the SQL query in the form of a boxplot:

library(dplyr)
library(ggplot2)

# Connect to the database and retrieve the data using the SQL query
db_con <- dbConnect(...) # specify the connection details to your
                         # database
data <- dbGetQuery(db_con, "SELECT patient_id, patient_name, age,
                            resistance FROM microbiology_samples WHERE...")
dbDisconnect(db_con)

# Use dplyr to group the data by resistance and calculate the age
# statistics
data_grouped <- data %>%
  group_by(resistance) %>%
  summarize(median = median(age),
            IQR = IQR(age),
            min = min(age),
            max = max(age))

# Plot the data using ggplot
ggplot(data, aes(x=resistance, y=age, color=resistance)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data=data_grouped, aes(x=resistance, y=median), size=3) +
  geom_errorbar(data=data_grouped, aes(x=resistance, ymin=median-IQR,
                ymax=median+IQR), width=0.2) +
  scale_color_manual(values = c("red","green")) +
  ggtitle("Age distribution of oxacillin resistant and sensitive
          strains") +
  xlab("Resistance") +
  y

GPT-4: Would you like me to explain or break down this code?

There are plenty of problems here as well.

  1. Assuming you fill in the dbConnect and dbGetQuery arguments correctly … this code won’t run, because those functions come from the DBI package, which was never loaded.

  2. Let’s say you knew to add library(DBI) yourself, or created fake data, like I’ve done in this rewritten version. It still won’t run. The code GPT-4 wrote was cut off prematurely after the final “y”!

library(dplyr)
library(ggplot2)

data <- data.frame(
  age = rnorm(10, mean = 65, sd = 15),
  resistance = rep(c("oxacillin", "none"), 5)
)

data_grouped <- data %>% group_by(resistance) %>%
  summarize(median = median(age),
            IQR = IQR(age),
            min = min(age),
            max = max(age))

# Plot the data using ggplot
ggplot(data, aes(x=resistance, y=age, color=resistance)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data=data_grouped, aes(x=resistance, y=median), size=3) +
  geom_errorbar(data=data_grouped, aes(x=resistance, ymin=median-IQR, ymax=median+IQR),
                width=0.2) +
  scale_color_manual(values = c("red", "green")) +
  ggtitle("Age distribution of oxacillin resistant and sensitive strains") +
  xlab("Resistance") +
  y
Error in eval(expr, envir, enclos): object 'y' not found.

  3. OK, maybe GPT-4 meant to add a ylab() but ran out of tokens, or something. Even after fixing that, still won’t run!

library(dplyr)
library(ggplot2)

data <- data.frame(
  age = rnorm(10, mean = 65, sd = 15),
  resistance = rep(c("oxacillin", "none"), 5)
)

data_grouped <- data %>% group_by(resistance) %>%
  summarize(median = median(age),
            IQR = IQR(age),
            min = min(age),
            max = max(age))

# Plot the data using ggplot
ggplot(data, aes(x=resistance, y=age, color=resistance)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(data=data_grouped, aes(x=resistance, y=median), size=3) +
  geom_errorbar(data=data_grouped, aes(x=resistance, ymin=median-IQR, ymax=median+IQR),
                width=0.2) +
  scale_color_manual(values = c("red", "green")) +
  ggtitle("Age distribution of oxacillin resistant and sensitive strains") +
  xlab("Resistance") +
  ylab("Age")
ERROR while rich displaying an object: Error in `geom_errorbar()`:
! Problem while computing aesthetics.
i Error occurred in the 3rd layer.
Caused by error in `FUN()`:
! object 'age' not found

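The aesthetics are mapped wrong: the plot-level aes(y=age) is inherited by geom_errorbar(), but its data frame, data_grouped, has no age column. And GPT-4 defined every part of that data frame itself, by writing the code to create data_grouped. So, no excuses GPT-4, that bug is all on you.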

The fix is quite subtle, actually. You have to move the aes(y=age) mapping from the entire ggplot() to the geom_boxplot(), so it doesn’t get applied to every geom.5 Once we do this, we’ve finally fixed this up enough to get a plot as output! 🎉

library(dplyr)
library(ggplot2)

data <- data.frame(
  age = rnorm(10, mean = 65, sd = 15),
  resistance = rep(c("oxacillin", "none"), 5)
)

data_grouped <- data %>% group_by(resistance) %>%
  summarize(median = median(age),
            IQR = IQR(age),
            min = min(age),
            max = max(age))

# Plot the data using ggplot
ggplot(data, aes(x=resistance, color=resistance)) +
  geom_boxplot(aes(y=age), outlier.shape = NA) +
  geom_point(data=data_grouped, aes(x=resistance, y=median), size=3) +
  geom_errorbar(data=data_grouped, aes(x=resistance, ymin=median-IQR, ymax=median+IQR),
                width=0.2) +
  scale_color_manual(values = c("red", "green")) +
  ggtitle("Age distribution of oxacillin resistant and sensitive strains") +
  xlab("Resistance") +
  ylab("Age")

Plot output

But there’s yet more weirdness to fix. Why are the error bars drawn as 2 × IQR, when the human originally asked for max and min? Why do we have a redundant geom_point()? How do we fix the colors, which are backwards from what was asked for? And it goes on. Could we goad GPT-4 into getting the right answer eventually? Maybe 🥴
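For reference, here is one way to get closer to what the human asked for: a sketch, not the only answer. It assumes the same simulated data as above (with a seed added so it is reproducible), treats "red" and "green" as fill colors, and draws the whiskers at the true min and max by handing precomputed statistics to geom_boxplot(stat = "identity"). The name box_stats is my own.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # simulated data, as in the examples above
data <- data.frame(
  age = rnorm(10, mean = 65, sd = 15),
  resistance = rep(c("oxacillin", "none"), 5)
)

# Five-number summary per group: min, Q1, median, Q3, max
box_stats <- data %>%
  group_by(resistance) %>%
  summarize(ymin = min(age),
            lower = unname(quantile(age, 0.25)),
            middle = median(age),
            upper = unname(quantile(age, 0.75)),
            ymax = max(age))

# stat = "identity" draws the box and whiskers exactly from these columns,
# so the whiskers end at min/max instead of the default 1.5 x IQR rule
p <- ggplot(box_stats, aes(x = resistance, fill = resistance)) +
  geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                   upper = upper, ymax = ymax),
               stat = "identity") +
  # Named values pin each color to a group regardless of level order
  scale_fill_manual(values = c(oxacillin = "red", none = "green")) +
  ggtitle("Age distribution of oxacillin resistant and sensitive strains") +
  xlab("Resistance") +
  ylab("Age")
print(p)
```

Naming the colors by group is what avoids the backwards mapping: an unnamed c("red", "green") is assigned in alphabetical level order, which puts "none" first.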

To be fair, the CID Viewpoint does later mention the limitation that LLMs sometimes produce “coherent but incorrect” output.1 These examples portray that limitation much better than the assertion that GPT-4 can write code to do data analysis.

Some have gone as far as to call GPTs bullshit generators. I think that’s a bit over the top. A better framing is that they produce output that is probabilistically correct, and for some tasks that could be useful, while for others it’s wholly inappropriate. Just like all machine learning methods, that uncertainty (is it correct 95% of the time? 23%? etc.) should be estimated, ideally using large samples, and then considered before using any of the output further.
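As a rough sketch of what estimating that uncertainty could look like: suppose an expert reviewed a sample of generated snippets and tallied how many were correct. The counts below are hypothetical, not measurements, and the exact binomial interval from base R’s binom.test() is just one reasonable choice.

```r
# Hypothetical tally: 37 of 50 generated snippets passed expert review
n_reviewed <- 50
n_correct <- 37

# Exact (Clopper-Pearson) 95% confidence interval for the true pass rate
ci <- binom.test(n_correct, n_reviewed)$conf.int
estimate <- n_correct / n_reviewed

cat(sprintf("Estimated pass rate: %.0f%% (95%% CI %.0f%% to %.0f%%)\n",
            100 * estimate, 100 * ci[1], 100 * ci[2]))
```

The width of the interval is the point: with a small review sample, the plausible range for the true correctness rate stays wide, which is why large samples matter.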

By contrast, are you planning to say “Hey GPT-4, write me some R code” and plug the output blindly into your analytical pipeline for bioinformatics, diagnostics, etc.? Since that requires a 100% correct answer, it would be a bad idea without some reliable way of validating the code (e.g., a human expert or a test suite) and then fixing it, just as we had to do above.
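A test suite does not have to be elaborate. As a minimal sketch, imagine an LLM drafted the small helper below (the function name and its behavior are hypothetical, chosen to echo the SQL example). A handful of expert-written expectations, checked with base R’s stopifnot(), will halt the script the moment the draft gets one wrong.

```r
# Hypothetical LLM-drafted helper: TRUE for Staphylococcus species
# other than S. aureus, FALSE for everything else
is_non_aureus_staph <- function(organism) {
  grepl("^Staphylococcus ", organism) & organism != "Staphylococcus aureus"
}

# Expert-written expectations; any failure stops with an error
stopifnot(
  is_non_aureus_staph("Staphylococcus epidermidis"),
  !is_non_aureus_staph("Staphylococcus aureus"),
  !is_non_aureus_staph("Escherichia coli"),
  !is_non_aureus_staph("Staphylococcus")  # genus only, no species given
)
```

The tests encode what the human actually wanted, so they keep working as a safety net even if the helper is later regenerated or rewritten.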

So my final verdict? No, I would not yet say GPT-4 can replace a data analyst, or that it can write any code that I’d trust blindly. But can it speed up certain sub-tasks within data science? Absolutely! Information extraction from unstructured text, drafting small bits of code like regular expressions, first-pass conversion from one format/language to another… LLMs shine when you know how to use them. There are so many bits of tedium in programming that coders would happily delegate, as long as the process remains supervised.

Copilot is a great example of an LLM product that already helps many coders write documentation, generate unit tests, and review pull requests faster—with a human still very much in the loop. This pattern can be imitated in future AI products for clinicians and microbiology laboratories. The AI tool is invited in by the user as an augmentative assistant with access to current working data, but its output is reviewed and scrutinized before it is implemented or added to the record. This pattern is already central to successful medical voice transcription products like Dragon, and given the dozens of AI-powered scribes entering the market now,6 it’s clearly seen as a pattern that works.

That’s my hope for the general framework in which AI tools enter medicine, clinical research, microbiology, and infectious diseases. It is incumbent on domain experts in these fields to actively identify and adapt AI technologies to the use cases in which they make sense, while preserving our profession’s core values, including equitable outcomes, patient privacy, and the patient-provider relationship. Otherwise, much like how electronic medical records often feel like they were designed with front-line clinicians as an afterthought, we risk becoming the passengers—rather than the pilots—of the next generation of healthcare technology.

  1. Egli A. ChatGPT, GPT-4, and Other Large Language Models: The Next Revolution for Clinical Microbiology? Clin Infect Dis. 2023 Nov 11;77(9):1322-1328. doi:10.1093/cid/ciad407

  2. Kanjilal S. Flying Into the Future With Large Language Models. Clin Infect Dis. 2024 Apr 10;78(4):867-869. doi:10.1093/cid/ciad635

  3. ChatGPT, Claude, Bard, and Gemini, among a growing field of competitors, all have options for free access. There’s also the option of free downloadable models (LLaMA, Gemma, Mixtral, etc.) that you run on your own computer. 

  4. Pak TR, Kasarskis A. How next-generation sequencing and multiscale data analysis will transform infectious disease management. Clin Infect Dis. 2015 Dec 1;61(11):1695-702. doi:10.1093/cid/civ670

  5. I actually tried to guide GPT-4 into fixing this bug itself, but it repeatedly failed, even when I gave it the full error message.

  6. These are products that passively listen to a doctor-patient conversation and try to draft a clinical note, similar to a human medical scribe. Some of the many entrants include: Turboscribe.ai, CarePatron, Amazon, Heidi, Otter, Tali, Ambience… 

Nuances of Observational Studies of Time-to-Antibiotics and Sepsis Mortality

I’m happy to announce a research article I published in Clinical Infectious Diseases with my mentors Mike Klompas and Chanu Rhee, entitled “Risk of Misleading Conclusions in Observational Studies of Time-to-Antibiotics and Mortality in Suspected Sepsis.” The full text is freely available using this link.

We were motivated to perform this study after we reviewed the state of evidence used to support clinical guidelines on timing of antibiotics for sepsis. The typical maxim given on rounds is that the risk of death increases by 8% per hour of delay in treatment, and this has been sloganized into phrases like It’s About TIME.

Essentially all evidence for this derives from observational cohorts. A close reading of these studies reveals four potential areas of concern:

  1. limited adjustment for potential confounders,
  2. inclusion of outlier patients with very long delays until antibiotics,
  3. failure to differentiate between sepsis with and without shock, and
  4. use of linear models that assume that each additional hour until antibiotics has an equal effect.

We created our own cohort that compares favorably to prior studies (largest N, five hospitals, detailed patient-level clinical data) and tested each of these analytical concerns in turn. (In essence, this was a series of four sensitivity analyses.) We used the same basic method as the most prominent studies in this field (multivariable logistic regression), varied the assumption corresponding with each analytical concern, and checked if the results changed. For all four, they did.
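To make the fourth concern concrete, here is a toy simulation in R. It is not the study’s data or code: every number is invented, and the simulated "truth" (no added risk until 3 hours, rising risk afterward) is an assumption chosen only to show how a linear term can hide non-linearity that an hour-by-hour categorical term reveals.

```r
set.seed(42)
n <- 5000

# Simulated time-to-antibiotics in hours (invented, uniform over 0 to 6 h)
hours <- runif(n, 0, 6)

# Assumed truth for this toy example: flat risk until 3 h, then rising
p_death <- plogis(-2 + 0.4 * pmax(hours - 3, 0))
death <- rbinom(n, 1, p_death)

# Model 1: linear in time, so every extra hour gets the same odds ratio
linear_fit <- glm(death ~ hours, family = binomial)

# Model 2: one category per hour, so each interval gets its own estimate
hour_bin <- cut(hours, breaks = 0:6, include.lowest = TRUE)
binned_fit <- glm(death ~ hour_bin, family = binomial)

# Single per-hour odds ratio implied by the linear model
print(exp(coef(linear_fit)["hours"]))

# Per-interval odds ratios relative to the first hour
print(exp(coef(binned_fit)[-1]))

# Compare model fit
print(AIC(linear_fit, binned_fit))
```

The linear model reports a single odds ratio and applies it to every hour, while the binned model lets the early hours look different from the late ones. Swapping one specification for the other while holding everything else fixed is the basic shape of a sensitivity analysis.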

Graphical abstract table

We conclude that this type of study is highly sensitive to analytic choices regarding confounding, outliers, inclusion criteria, and non-linearity, and it’s likely that most of the prior studies were afflicted by at least one of these problems.1 If we keep only the most conservative assumptions (shorten the study time to 6h and adjust for all possible confounders) we are able to find a significant hourly relationship in patients with shock, but not in the cohorts without shock.

Many thanks to the entire Sepsis Time Zero team, TIDE, and the Harvard ID T32 for supporting this work, as well as the American Thoracic Society for inviting me to speak about it at the 2023 meeting.

  1. The many, many pitfalls of time-to-intervention studies are summarized in this evergreen EMCrit post by Josh Farkas

Asymptomatic COVID-19 Screening At Hospitals

Many US hospitals have already or will soon stop testing all admitted patients for COVID-19. To support this decision, policymakers will likely cite new recommendations released by the Society for Healthcare Epidemiology of America (SHEA), the professional society that encompasses most hospital epidemiologists in the US.

Interestingly, SHEA’s position paper in ICHE was guarded in its conclusions, stating “the use of asymptomatic screening is a unique yet resource-intensive tool that arguably has been overused” and that “support and funding are needed for high-quality studies to examine the use of asymptomatic screening”—in essence, that we don’t have enough data to fully know the costs and benefits of the practice. On the other hand, the corresponding press release took it a step farther, leading off with, “Healthcare facilities should no longer routinely screen symptom-free patients for COVID-19 upon admission or before procedures.” So, this is likely to be the takeaway for hospitals: that they can and should stop all asymptomatic testing for COVID-19.

In a countering Viewpoint that my research mentors Mike Klompas and Chanu Rhee, Julia Köhler at Boston Children’s, and I wrote for Clinical Infectious Diseases, we offer seven reasons why testing should continue,1 particularly during times of increased community transmission.

CID viewpoint

Here are the top-line points:

  1. Hospital-acquired COVID-19 remains common but under-reported and under-appreciated. Earlier studies found that ~15% of admitted patients with COVID-19 acquired it in hospitals. Omicron was associated with another uptrend in hospital-onset cases, as could be seen in our healthcare system in Massachusetts.

  2. Presymptomatic and asymptomatic patients pose the highest risk. Viral loads are highest just before symptoms (if any) start. So, most transmissions are from people without symptoms. If you don’t screen all admissions, you’ll never catch it in time.

  3. Transmission risk in shared patient rooms is high. The transmission rate from a patient with undiagnosed COVID-19 to an uninfected patient in the same room is 20-40%; in one Boston academic center early in the pandemic, with both universal testing and masking in place, it was 39%. Most US hospitals have lots of shared rooms. And patients don’t get to pick their roommates.

  4. Hospital-acquired COVID-19 still causes substantial morbidity and mortality. In vulnerable inpatients, Omicron is far from “mild” (the infamous epithet it was given by talking heads in the media shortly before it killed more Americans than any prior variant). Crude mortality rates for hospital-onset Omicron are as high as 13% in other countries’ national health system data.2 We already undertake enormously expensive measures to stop nosocomial infections like MRSA, VRE, and Clostridium difficile, all of which have arguably lower risk to the average hospital inpatient than COVID-19. For example, some hospitals (including MGH and BWH) perform asymptomatic screening of all ICU patients for MRSA and VRE using periodic swab testing.

  5. More testing is associated with fewer nosocomial cases and better outcomes. When England and Scotland stopped universal admission testing in Aug-Sep 2022, hospital-onset cases rose by 26-41% compared to community-onset infections.3

  6. Potential downsides of asymptomatic testing are real, but can be mitigated. The cost ($55/patient in 1 study) is minimal compared to 1 inpatient-day ($3-10k). False positives occur, but retesting and discontinuing precautions can be automated, as many US healthcare systems figured out during the pandemic. Mass General Brigham automated this process using an Epic SmartForm called CORAL.

  7. Hospitals have an ethical responsibility to protect patients from COVID-19. Forcing all patients to accept increased risk of COVID-19 when they enter hospitals in exchange for minimal benefit (saving $55, avoiding unnecessary airborne isolation, small differences in care timeliness) violates non-maleficence and beneficence. As COVID-19 has disproportionately harmed Black, Latinx, and Native American populations in the US, and continues to pose the highest risk to elderly, immunocompromised, and disabled people, the removal of various layers of COVID-19 precautions in hospitals will worsen systematic healthcare inequity in the US.

As an aside: Universal masking in hospitals is also under fire, so when “existing layers of protection” are offered as the compensatory mechanism that justifies ending asymptomatic respiratory viral testing, it is hard to accept such arguments in good faith.

In conclusion, we believe that more evidence and a serious plan for monitoring potential harms is needed before hospitals end universal admission testing. For more, read our full editorial in CID.

  1. First released as a preprint and also as a Twitter thread

  2. We have our own study under way that quantifies the mortality risk attributable to a hospital-onset Omicron infection across several US hospitals using a propensity-matched analysis. 

  3. When we published these results, the most common concern raised was that this increase was biased by misclassification of asymptomatic shedding cases as hospital-onset cases after the policy change. We’ve now replicated an even larger increase across a regional US healthcare system that ended both universal admission testing and masking simultaneously in May 2023, and we used chart reviews to find that there is a negligible proportion of misclassified hospital-onset cases after the policy change.