Until now, even AI companies have struggled to build tools that can reliably detect when a piece of writing was generated by a large language model. Now, a team of researchers has introduced a novel method for estimating LLM usage across a large body of scientific literature by measuring which “excess words” started appearing much more frequently during the LLM era (specifically, 2023 and 2024). According to the researchers, the findings “indicate that at least 10 percent of 2024 abstracts were generated with LLMs.”

In a preprint paper posted earlier this month, four researchers from the University of Tübingen in Germany and Northwestern University said their work was inspired by studies that measured the impact of the Covid-19 pandemic by comparing excess deaths with the recent past. Applying a similar analysis to “surplus word usage” after LLM writing tools became widely available in late 2022, the researchers observed that “the emergence of LLMs caused a sudden surge in the prevalence of specific stylistic terms” that was “unparalleled in terms of both quality and quantity.”

Exploring Deeper

To measure these vocabulary changes, the researchers analyzed 14 million abstracts of papers published on PubMed between 2010 and 2024, tracking the relative frequency of each word year by year. They then compared the expected frequency of each word (based on the trend line before 2023) with its actual frequency in abstracts from 2023 and 2024, when LLMs were in widespread use.
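The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual procedure: it assumes a simple least-squares line fitted to a word's pre-2023 yearly frequencies, and the function names and frequency values are hypothetical.

```python
def fit_linear_trend(years, freqs):
    """Fit an ordinary least-squares line through (year, frequency) points.

    Returns the slope plus the mean year and mean frequency, which is
    enough to evaluate the line at any other year.
    """
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(freqs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, freqs))
    return sxy / sxx, mean_x, mean_y


def expected_frequency(years, freqs, target_year):
    """Extrapolate the pre-LLM trend line to target_year."""
    slope, mean_x, mean_y = fit_linear_trend(years, freqs)
    # A frequency cannot be negative, so clamp the extrapolation at zero.
    return max(0.0, mean_y + slope * (target_year - mean_x))


# Hypothetical share of abstracts containing one word, per year (made-up values).
pre_llm_years = [2018, 2019, 2020, 2021, 2022]
pre_llm_freqs = [0.010, 0.011, 0.012, 0.013, 0.014]

expected_2024 = expected_frequency(pre_llm_years, pre_llm_freqs, 2024)
print(f"Expected 2024 frequency: {expected_2024:.4f}")
```

The extrapolated value is the counterfactual: what the word's frequency would have been in 2024 had the earlier trend simply continued. The observed 2024 frequency is then compared against it.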

The results revealed a number of words that were extremely rare in these scientific abstracts before 2023 but suddenly surged in popularity after LLMs were introduced. For instance, the term “explores” showed up in 25 times as many 2024 papers as the pre-LLM trend line would predict; words like “displaying” and “emphasizes” saw a ninefold increase in usage. Words that were already common also became noticeably more frequent in post-LLM abstracts: The frequency of “potential” rose by 4.1 percentage points, “discoveries” by 2.7 percentage points, and “critical” by 2.6 percentage points, to name a few.
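The two kinds of figures quoted above measure excess usage differently: a ratio suits rare words, while a gap in percentage points suits words that were already common. A minimal sketch, assuming frequencies are expressed as the share of abstracts containing a word; the function names and base rates are hypothetical, chosen only so the results echo the 25-fold and 4.1-point figures.

```python
def excess_ratio(observed, expected):
    """How many times more frequent a word is than its trend line predicts."""
    if expected <= 0:
        raise ValueError("expected frequency must be positive")
    return observed / expected


def excess_gap_points(observed, expected):
    """Difference between observed and expected frequency, in percentage points."""
    return (observed - expected) * 100


# Hypothetical rare word: expected in 0.02% of abstracts, observed in 0.5%.
rare_ratio = excess_ratio(observed=0.005, expected=0.0002)

# Hypothetical common word: expected in 40% of abstracts, observed in 44.1%.
common_gap = excess_gap_points(observed=0.441, expected=0.40)

print(f"Rare word: {rare_ratio:.0f}x the expected frequency")
print(f"Common word: +{common_gap:.1f} percentage points")
```

The same common word would show a ratio of only about 1.1, and the same rare word a gap of under half a percentage point, which is why neither measure alone captures both cases.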

These kinds of shifts in word usage could happen independently of LLM adoption, of course; the natural evolution of language means that words sometimes gain or lose favor. However, the researchers found that in the pre-LLM period, such large and abrupt year-over-year increases appeared only for words tied to major global health events: “epidemic” in 2015; “virus” in 2017; and words like “flu,” “quarantine,” and “outbreak” in the 2020 to 2022 period.

In the post-LLM era, by contrast, the researchers found numerous words with sudden, pronounced increases in scientific usage that had no link to world events. And whereas the excess words during the Covid crisis were overwhelmingly nouns, the words that surged after LLMs arrived were mostly “stylistic words” such as verbs, adjectives, and adverbs (a small sampling: “throughout, additionally, comprehensive, crucial, enhancing, showcased, perceptions, notably, particularly, within”).

This is not an entirely new finding; the increased prevalence of “explore” in scientific papers has been widely noted in recent times, for instance. But prior studies generally relied on comparisons with “authentic” human-written samples or on lists of predefined LLM indicators obtained from outside sources. Here, the set of pre-2023 abstracts acts as its own effective control group, showing how vocabulary choice has changed overall in the post-LLM era.

A Complex Interaction

Once the numerous so-called “indicator terms” that became significantly more common in the post-LLM era are highlighted, the telltale signs of LLM use can sometimes be easy to spot. Take this abstract snippet identified by the researchers, with the indicator words emphasized: “A detailed understanding of the sophisticated interaction between […] and […] is crucial for efficient therapeutic approaches.”

After running statistical measures of indicator-word appearance across individual papers, the researchers estimate that at least 10 percent of the post-2022 papers in the PubMed dataset were written with at least some LLM assistance. According to the researchers, the actual figure could be higher, because their estimate could miss LLM-assisted abstracts that contain none of the identified indicator terms.
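One way to see why such an estimate is a lower bound: if the share of abstracts containing at least one indicator word rises above its pre-LLM baseline, at least that many abstracts must have changed, while LLM-assisted abstracts with no indicator words go uncounted. The sketch below is a simplified illustration under that assumption, not the researchers' exact statistical procedure; the function names are hypothetical, the toy abstracts are invented, and the baseline share is used directly rather than extrapolated from a trend.

```python
import re


def contains_marker(abstract, markers):
    """True if the abstract uses at least one indicator word (case-insensitive)."""
    words = set(re.findall(r"[a-z]+", abstract.lower()))
    return not words.isdisjoint(markers)


def marker_share(abstracts, markers):
    """Share of abstracts that contain at least one indicator word."""
    if not abstracts:
        raise ValueError("need at least one abstract")
    hits = sum(contains_marker(a, markers) for a in abstracts)
    return hits / len(abstracts)


def lower_bound_llm_share(baseline_abstracts, recent_abstracts, markers):
    """Excess share of marker-containing abstracts, floored at zero.

    Assumption: the pre-LLM share is what the recent share would have been
    without LLMs, so the excess is attributed entirely to LLM use.
    """
    gap = marker_share(recent_abstracts, markers) - marker_share(baseline_abstracts, markers)
    return max(0.0, gap)


# Indicator words taken from the sampling quoted in the article.
markers = {"crucial", "notably", "comprehensive"}

# Invented toy abstracts, for illustration only.
baseline = [
    "We measured enzyme activity in mouse liver tissue.",
    "A crucial step in the assay was sample preparation.",
    "Patients were followed for twelve months.",
    "Results were compared with earlier cohorts.",
]
recent = [
    "This comprehensive study examines enzyme activity.",
    "Notably, patients improved after twelve months.",
    "We measured enzyme activity in mouse liver tissue.",
    "Sample preparation is crucial for reliable assays.",
]

bound = lower_bound_llm_share(baseline, recent, markers)
print(f"Lower bound on LLM-assisted share: {bound:.0%}")
```

Note that the second baseline abstract uses “crucial” without any LLM involvement, which is why the baseline share is subtracted rather than treating every marker hit as evidence of LLM use.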