We graded how five free AI chatbots performed journalism tasks. Here’s what we learned.
There’s a lot of conversation in journalism about benchmarking AI. Large language models (LLMs) are trained in different ways with different parameters, and while there is some effort to create standards by which to rank them, those standards don’t always translate directly to journalism.
As an industry we need to test LLMs and evaluate them on newsroom tasks; as Ethan Mollick suggests, we need to give AI a job interview. That’s what we set out to do in my AI class at the Missouri School of Journalism.
Students spent October and November 2025 testing the free versions of popular AI chatbots to see how well the underlying LLMs would perform common journalism tasks: researching a topic, summarizing a report and optimizing a story.
Not surprisingly, every test made apparent how important a “human in the loop” is. What surprised the students was how dramatically quality varied between chatbots from task to task.
The students’ No. 1 takeaway: AI won’t be replacing reporters anytime soon, but chatbots can give journalists a head start on tasks (like eager minions who work quickly and don’t tire easily).
A close second takeaway: Not all chatbots are created equal.
The Process
We included ChatGPT (GPT-5), Claude (Sonnet 4), Gemini (Flash 2.5), Meta (Llama 4) and DeepSeek (version undisclosed) in our tests. Each chatbot received the same detailed prompt for each test; all the prompts included specific requests about the desired output and where to look for resources. The output was graded on a simple rubric covering whether the chatbot fulfilled the assignment, how often it hallucinated and how useful the output was.
- Test 1: Provide a background research report on tornadoes to assist a reporter doing a follow-up story on a tornado event in Missouri.
- Test 2: Provide a summary and data from a 300+ page housing study from the city of Columbia.
- Test 3: Create headlines and social media content for a student-written longform story about pesticides and cancer.
Results and takeaways
Models are ranked from 1 (best of the five) to 5 (worst).
| Tool / Model | “Research” Rank | “Summarize” Rank | “Optimize” Rank |
|---|---|---|---|
| ChatGPT / GPT-5 | 1 | 1 | 2 |
| Claude / Sonnet 4 | 5 | 2 | 3 |
| DeepSeek / [undisclosed] | 3 | 3 | 1 |
| Gemini / Flash 2.5 | 2 | 4 | 4 |
| Meta / Llama 4 | 4 | 5 | 5 |
Each test had a slightly different rubric, but in general the students graded the chatbots on whether they fulfilled the entire prompt and gave accurate, useful content. Each student group worked with a single chatbot across all the tests; the whole class then discussed the ratings together and at times made tweaks. Keep in mind that the ratings are subjective, though as a whole we tried to be consistent.
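To get a rough sense of the overall ordering, the per-test ranks in the table can simply be averaged. This quick sketch is our own illustration of that arithmetic, not part of the class’s grading methodology:

```python
# Average the per-test ranks from the table above (1 = best, 5 = worst).
# Columns are in order: research, summarize, optimize.
ranks = {
    "ChatGPT / GPT-5":        [1, 1, 2],
    "Claude / Sonnet 4":      [5, 2, 3],
    "DeepSeek / undisclosed": [3, 3, 1],
    "Gemini / Flash 2.5":     [2, 4, 4],
    "Meta / Llama 4":         [4, 5, 5],
}

averages = {model: sum(r) / len(r) for model, r in ranks.items()}

# Print models from best to worst average rank.
for model, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"{model}: {avg:.2f}")
```

By this crude measure ChatGPT comes out on top and Meta at the bottom, which matches the takeaways below, though a plain average hides how uneven each model was across tasks (Claude and Gemini tie on average despite opposite strengths).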
The chatbots are not interchangeable:
- ChatGPT consistently did well and ranked first in output on two out of three tests.
- Meta consistently did poorly and the students using it were always frustrated.
- Claude was bad at web research but good at document summarization; Gemini was the reverse.
- DeepSeek was consistently mediocre until the final test, where it wrote the best headlines and social media posts.
Everyone hallucinates:
- Every model hallucinated at some point, though ChatGPT had the fewest and Meta had the most.
- When doing research, the models sometimes gave factual information but cited hallucinated links, which suggests they can synthesize information well but struggle to cite their sources; that makes it tough for journalists to fact-check.
- Wholesale hallucinations were present. For example, in the first test DeepSeek hallucinated an entire investigative series by the Missouri Independent and The Kansas City Star called “A Tattered Safety Net.”
There are notable strengths:
- Generally speaking, most of the chatbots (especially Claude, and to a lesser degree DeepSeek) were better at sourcing information from a particular document than the internet, and provided accurate citations back to it.
- ChatGPT was the only chatbot that was able to pull data tables successfully from the report in test 2.
- The models all fulfilled their tasks (rarely do AI chatbots say “no”), and every model’s output had something worthwhile for the student reporters.
Broadly, we concluded that these free tools are useful if you acknowledge that they require a lot of oversight. The question is, how much digging are you willing to do to find the useful information? And how easy is it to fact check it?
Through our experiments we developed a game plan for how to make the most of AI chatbots and how to avoid their worst tendencies.
Tips and best practices
- Fact-check everything you might want to use, and make that task easier for yourself by requiring citations that will help you verify information. If an LLM can’t point you to a reliable source for where it got the information, it’s making your life harder, not easier.
- For the research report, this meant websites and papers.
- For the housing study summary, this meant page numbers.
- Give the chatbot instructions about tone and good examples to follow. For headlines and social media posts, most of the LLMs unnecessarily sensationalized the content, but DeepSeek and ChatGPT did a good job of accurately characterizing the story. We also found it helpful to instruct the chatbots to ask follow-up questions about good sources or ideas, to get a better sense of what we wanted.
- For broad internet research, consider alternatives. These chatbots generally didn’t do well pulling resources consistently from the internet with good sourcing; ChatGPT did the best. “Reasoning” or “deep research” features (which are sometimes behind paywalls) might be better at this.
- Consider tools designed for specific tasks. General chatbots might not be the best option for every task. For example, students also worked with NotebookLM and found it much more effective at summarizing a report.
- Do your own experiments. Ultimately some of our findings will be obsolete in the near future as these tools change rapidly. Don’t trust AI without oversight; always vet the output and measure it against your journalistic standards.
Explore materials from the experiment
Credits
Professor: Elizabeth Lucas, Houston Harte Chair in Journalism
Students: Grace Ainger, Tyler Batliner, Ashley Flewellen, Jae Green, Dominique Hodge, Faith Jacoby, Aiden Kauffman, Cecelia Koparanyan, Lilly Marshall, Yasha Mikolajczak, Peter Pynadath, Nate Salsman, Brady Shanahan, Michael Stamps, Hannah Taylor, Mary Wolter