We graded how five free AI chatbots performed journalism tasks. Here’s what we learned.
There’s a lot of conversation in journalism about benchmarking AI. Large language models (LLMs) are trained in different ways with different parameters, and while there is some effort to create standards by which to rank them, those standards don’t always translate directly to journalism.
As an industry we need to test LLMs and evaluate them on newsroom tasks; as Ethan Mollick suggests, we need to give AI a job interview. That’s what we set out to do in my AI class at the Missouri School of Journalism.
Students spent October and November 2025 testing the free versions of popular AI chatbots to see how well the underlying LLMs would perform common journalism tasks: researching a topic, summarizing a report and optimizing a story.
Not surprisingly, every test made apparent how important a “human in the loop” is. What surprised the students was how dramatically quality varied between chatbots from task to task.
The students’ No. 1 takeaway: AI won’t be replacing reporters anytime soon, but chatbots can give journalists a head start on tasks (like eager minions who work quickly and don’t tire easily).
A close second takeaway: Not all chatbots are created equal.
The Process
We included ChatGPT (GPT-5), Claude (Sonnet 4), Gemini (Flash 2.5), Meta (Llama 4) and DeepSeek (version undisclosed) in our tests. Each chatbot received the same detailed prompt for each test; all the prompts included specific requests about the desired output and where to look for resources. The output was graded on a simple rubric covering whether the chatbot fulfilled the assignment, how often it hallucinated and how useful the output was.
- Test 1: Provide a background research report on tornadoes to assist a reporter doing a follow-up story on a tornado event in Missouri.
- Test 2: Provide a summary and data from a 300+ page housing study from the city of Columbia.
- Test 3: Create headlines and social media content for a student-written longform story about pesticides and cancer.
Results and takeaways
Models are ranked from 1 (best of the five) to 5 (worst).
| Tool / Model | “Research” Rank | “Summarize” Rank | “Optimize” Rank |
|---|---|---|---|
| ChatGPT / GPT-5 | 1 | 1 | 2 |
| Claude / Sonnet 4 | 5 | 2 | 3 |
| DeepSeek / [undisclosed] | 3 | 3 | 1 |
| Gemini / Flash 2.5 | 2 | 4 | 4 |
| Meta / Llama 4 | 4 | 5 | 5 |
Each test had a slightly different rubric, but in general the students graded the chatbots on whether they fulfilled the entire prompt and gave accurate, useful content. Each student group worked with a single chatbot across all the tests; the whole class then discussed the ratings together and at times made tweaks. Keep in mind that the ratings are subjective, though as a whole we tried to be consistent.
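To get a rough sense of the overall ordering, the per-test ranks in the table can simply be averaged. This quick sketch is our own illustration of that arithmetic, not part of the class’s grading methodology:

```python
# Average the per-test ranks from the table above (1 = best, 5 = worst).
# Columns are in order: research, summarize, optimize.
ranks = {
    "ChatGPT / GPT-5":        [1, 1, 2],
    "Claude / Sonnet 4":      [5, 2, 3],
    "DeepSeek / undisclosed": [3, 3, 1],
    "Gemini / Flash 2.5":     [2, 4, 4],
    "Meta / Llama 4":         [4, 5, 5],
}

averages = {model: sum(r) / len(r) for model, r in ranks.items()}

# Print models from best to worst average rank.
for model, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"{model}: {avg:.2f}")
```

By this crude measure ChatGPT comes out on top and Meta at the bottom, which matches the takeaways below, though a plain average hides how uneven each model was across tasks (Claude and Gemini tie on average despite opposite strengths).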
The chatbots are not interchangeable:
- ChatGPT consistently did well and ranked first in output on two out of three tests.
- Meta consistently did poorly and the students using it were always frustrated.
- Claude was bad at web research but good at document summarization; Gemini was the reverse.
- DeepSeek was consistently mediocre until the final test, where it wrote the best headlines and social media posts.
Everyone hallucinates:
- Every model hallucinated at some point, though ChatGPT had the fewest and Meta had the most.
- When doing research, the models sometimes gave factual information but cited hallucinated links, which suggests they can synthesize information well but struggle to cite their sources; that makes it tough for journalists to fact-check.
- Wholesale hallucinations were present. For example, in the first test DeepSeek hallucinated an entire investigative series by the Missouri Independent and The Kansas City Star called “A Tattered Safety Net.”
There are notable strengths:
- Generally speaking, most of the chatbots (especially Claude, and to a lesser degree DeepSeek) were better at sourcing information from a particular document than the internet, and provided accurate citations back to it.
- ChatGPT was the only chatbot that was able to pull data tables successfully from the report in test 2.
- The models all fulfilled their tasks (rarely do AI chatbots say “no”), and every model’s output had something worthwhile for the student reporters.
Broadly, we concluded that these free tools are useful if you acknowledge that they require a lot of oversight. The question is, how much digging are you willing to do to find the useful information? And how easy is it to fact check it?
Through our experiments we developed a game plan for how to make the most of AI chatbots and how to avoid their worst tendencies.
Tips and best practices
- Fact-check everything you might want to use, and make that task easier for yourself by requiring citations that will help you verify information. If an LLM can’t point you to a reliable source for where it got the information, it’s making your life harder, not easier.
- For the research report, this meant websites and papers.
- For the housing study summary, this meant page numbers.
- Give the chatbot instructions about tone and good examples to follow. For headlines and social media posts, most of the LLMs unnecessarily sensationalized the content, but DeepSeek and ChatGPT did a good job of accurately characterizing the story. We also found it helpful to instruct the chatbots to ask follow-up questions about good sources or ideas, to get a better sense of what we wanted.
- For broad internet research, consider alternatives. These chatbots generally didn’t do well pulling resources consistently from the internet with good sourcing; ChatGPT did the best. “Reasoning” or “deep research” features (which are sometimes behind paywalls) might be better at this.
- Consider tools designed for specific tasks. General chatbots might not be the best option for every task. For example, students also worked with NotebookLM and found it much more effective at summarizing a report.
- Do your own experiments. Ultimately some of our findings will be obsolete in the near future as these tools change rapidly. Don’t trust AI without oversight; always vet the output and measure it against your journalistic standards.
Explore materials from the experiment
Credits
Professor: Elizabeth Lucas, Houston Harte Chair in Journalism
Students: Grace Ainger, Tyler Batliner, Ashley Flewellen, Jae Green, Dominique Hodge, Faith Jacoby, Aiden Kauffman, Cecelia Koparanyan, Lilly Marshall, Yasha Mikolajczak, Peter Pynadath, Nate Salsman, Brady Shanahan, Michael Stamps, Hannah Taylor, Mary Wolter