OpenAI isn’t doing enough to make ChatGPT’s limitations clear

“May occasionally generate incorrect information.”

This is the warning OpenAI pins to the homepage of its AI chatbot ChatGPT — one point among nine that detail the system’s capabilities and limitations.

“May occasionally generate incorrect information.”

It’s a warning you could tack on to just about any information source, from Wikipedia to Google to the front page of The New York Times, and it would be more or less correct.

“May occasionally generate incorrect information.”

Because when it comes to preparing people to use technology as powerful, as hyped, and as misunderstood as ChatGPT, it’s clear OpenAI isn’t doing enough.

The misunderstood nature of ChatGPT was made clear for the umpteenth time this weekend when news broke that US lawyer Steven A. Schwartz had turned to the chatbot to find supporting cases in a lawsuit he was pursuing against Colombian airline Avianca. The problem, of course, was that none of the cases ChatGPT suggested exist.

Schwartz claims he was “unaware of the possibility that [ChatGPT’s] content could be false,” though transcripts of his conversation with the bot show he was suspicious enough to check his research. Unfortunately, he did so by asking ChatGPT, and again, the system misled him, reassuring him that its fictitious case history was legitimate:

Image: SDNY

Schwartz deserves plenty of blame in this scenario, but the frequency with which cases like this are occurring — when users of ChatGPT treat the system as a reliable source of information — suggests there also needs to be a wider reckoning.

Over the past few months, there have been numerous reports of people being fooled by ChatGPT’s falsehoods. Most cases are trivial and have had little or no negative impact. Usually, the system makes up a news story or an academic paper or a book, then someone tries to find this source and either wastes their time or looks like a fool (or both). But it’s easy to see how ChatGPT’s misinformation could lead to more serious consequences.

In May, for example, one Texas A&M professor used the chatbot to check whether or not students’ essays had been written with the help of AI. Ever obliging, ChatGPT said, yes, all the students’ essays were AI-generated, even though it has no reliable capability to make this assessment. The professor threatened to flunk the class and withhold their diplomas until his error was pointed out. And in April, a law professor recounted how the system generated false news stories accusing him of sexual misconduct. He only found out when a colleague, who was doing research, alerted him to the fact. “It was quite chilling,” the professor told The Washington Post. “An allegation of this kind is incredibly harmful.”

I don’t think cases like these invalidate the potential of ChatGPT and other chatbots. In the right scenario and with the right safeguards, it’s clear these tools can be fantastically useful. I also think this potential includes tasks like retrieving information. There is a lot of interesting research being done that shows how these systems can and will be made more factually grounded in the future. The point is, right now, it’s not enough.

This is partly the fault of the media. Lots of reporting on ChatGPT and similar bots portrays these systems as human-like intelligences with emotions and desires. Often, journalists fail to emphasize the unreliability of these systems and to make clear the contingent nature of the information they offer.

People use ChatGPT as a search engine. OpenAI needs to recognize that and warn them in advance

But, as I hope the beginning of this piece made clear, OpenAI could certainly help matters, too. Although chatbots are being presented as a new type of technology, it’s clear people use them as search engines. (And many are explicitly launched as search engines, so of course people get confused.) This isn’t surprising: a generation of internet users have been trained to type questions into a box and receive answers. But while sources like Google and DuckDuckGo provide links that invite scrutiny, chatbots muddle their information in regenerated text and speak in the chipper tone of an all-knowing digital assistant. A sentence or two as a disclaimer is not enough to override this sort of priming.

Interestingly, I find that Bing’s chatbot (which is powered by the same tech as ChatGPT) does slightly better on these sorts of fact-finding tasks; mostly, it tends to search the web in response to factual queries and supplies users with links as sources. ChatGPT can search the web, but only if you’re paying for the Plus version and using the beta plug-ins. Its self-contained nature makes it more likely to mislead.

Interventions don’t need to be complex, but they need to be there. Why, for example, can’t ChatGPT simply recognize when it’s being asked to generate factual citations and caution the user to “check my sources”? Why can’t it respond to someone asking “is this text AI-generated?” with a clear “I’m sorry, I’m not capable of making that judgment”? (We reached out to OpenAI for comment and will update this story if we hear back.)

OpenAI has definitely improved in this area. Since ChatGPT’s launch, it has, in my experience, become much more up-front about its limitations, often prefacing answers with that AI shibboleth: “As an AI language model…” But it’s also inconsistent. This morning, when I asked the bot “can you detect AI-generated text?” it cautioned that it was “not foolproof,” but then when I fed it a chunk of this story and asked the same question, it simply replied: “Yes, this text was AI-generated.” Next, I asked it to give me a list of book recommendations on the topic of measurement (something I know a little about). “Certainly!” it said before offering 10 suggestions. It was a good list, hitting many of the classics, but two titles were just entirely made up, and if I hadn’t known to check, I wouldn’t have noticed. Try similar tests yourself and you’ll quickly find errors.

With this sort of performance, a disclaimer like “May occasionally generate incorrect information” just doesn’t seem accurate.
