Commercial LLMs challenged with tests of originality and creativity generate results that are more similar to one another than people’s responses.
LLMs Are Different. LLMs Are the Same.
There are already hundreds of thousands of large language models (LLMs) in existence, with a few dozen commercial systems dominating the market. Among options such as GPT-4, Claude and Gemini, many people have a favorite, especially when it comes to creative tasks such as writing.
Those preferences, however, are likely entirely in the eye of the beholder. According to new research from Duke University, the creative outputs of commercial LLMs are more similar to each other than users might hope. When challenged with three standard tasks assessing creativity, commercial LLMs give answers that are much more alike than the answers people give.
“People might wonder if different LLMs will take them in different directions with the same prompts for creative projects,” said Emily Wenger, the Cue Family Assistant Professor of Electrical and Computer Engineering at Duke. “This paper basically says no. LLMs are less creative as a population than humans.”
According to a 2024 survey by Adobe, over half of Americans have already used LLMs as creative partners for brainstorming, writing, creating images or writing code. Because an overwhelming majority of users trust them for help with being more creative, researchers have been trying to find out if that trust is misplaced.
One seminal paper in this emerging field, by Anil Doshi and Oliver Hauser, found that writers who used GPT-4 produced more creative stories than writers working alone. However, the same study showed that those LLM-aided stories were more similar to each other than were stories from human writers working solo.
This research, and other papers like it, only looked at people using one specific LLM. Wenger, who studies how data gets into AI models, was curious how these types of results would translate between different LLMs.
“Commercial LLMs have all been trained on the same dataset—the entirety of the internet—and they all have the same goal,” Wenger said. “It seemed likely to me that this would limit the amount of diversity we’d see in their creativity, so I decided to find out.”
To explore her hunch, Wenger turned to Yoed Kenett, a cognitive neuroscientist and associate professor of data and decision sciences at the Technion – Israel Institute of Technology. Together, they settled on three standard tasks used to assess creativity levels and put 22 LLMs to the test against over 100 people.
One test, called the Alternative Uses Test (AUT), challenges participants to name ways an object could be used other than for its intended purpose, such as using a book as a doorstop, fly swatter or kindling for a fire. The second test, called the Divergent Association Task (DAT), asks participants to name 10 words, each as different as possible from the others in every sense. Lastly, the Forward Flow (FF) test provides a starting prompt word and asks participants to write down the next word that comes to mind after the previous word, for up to 20 words. For example: fire, candle, wax, hair, comb, honey, bee, stripes, zebra, etc.
Together, these tests seek to measure the divergent and dissociative thinking abilities that facilitate creativity.
“Significant empirical research over the past few decades highlights how much human creativity depends on variability,” said Yoed Kenett. “The problem, as we and others are increasingly showing, is that while LLMs appear to generate extremely original outputs, they are overly homogenized and not variable in their responses. This could have detrimental long-term impact on human creative thinking and thus must be addressed.”
The study, which set out to measure the variability and originality of responses from LLMs and from people, produced clear results. While individual LLMs might outperform individual people in levels of creativity, as a whole, the algorithms’ responses were much more similar to each other than the people’s. Importantly, altering the LLM system prompt to encourage higher creativity only slightly increased their variability, and human responses still won out.
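To make the idea of population-level similarity concrete, here is a minimal sketch of one way such a comparison could be scored. It is not the researchers' method: the word-overlap (Jaccard) measure, the function names and the toy AUT-style answers are all assumptions made up for illustration.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Fraction of distinct items two responses share (0 = none, 1 = identical sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def mean_pairwise_similarity(responses):
    """Average similarity over every pair of responses in a population."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Invented toy data (not from the study): uses for a book.
llm_responses = [
    ["doorstop", "paperweight", "coaster"],
    ["doorstop", "paperweight", "fan"],
    ["doorstop", "coaster", "paperweight"],
]
human_responses = [
    ["doorstop", "fly swatter", "kindling"],
    ["hat", "pillow", "ramp"],
    ["kindling", "shield", "step stool"],
]

llm_score = mean_pairwise_similarity(llm_responses)
human_score = mean_pairwise_similarity(human_responses)
print(f"LLM population similarity:   {llm_score:.3f}")
print(f"Human population similarity: {human_score:.3f}")
```

In this toy setup, a higher score means the population's answers overlap more. Each individual answer can look perfectly reasonable while the group as a whole still covers far less ground, which is the distinction the study draws between individual and population-level creativity.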
“This work has broad implications as people continue adopting and integrating LLMs into their daily life,” Wenger said. “Overreliance on these tools will smooth the world’s work toward the same underlying set of words or grammar, tending to make writing all look the same.”
“If you’re trying to come up with an original concept or product to stand out from the crowd,” Wenger continued, “this work highly suggests you should bring together a diverse group of people to brainstorm rather than relying on AI.”