Comparing available LLMs for non-technical users
How do ChatGPT, Mistral, Gemini, and Llama 3 stack up for common tasks like generating sales emails?
Last updated: March 3, 2025
If you’re using Large Language Models at work to help automate your job, or even just for personal use, you now have a choice of several different models instead of just ChatGPT. Between Gemini (formerly Bard), Mistral’s Mixtral, Llama 3, and ol’ reliable ChatGPT, which is the best for the kinds of tasks you need it to do?
Existing benchmarks tend to focus on the technical aspects of these models – how fast they run, how much context they can keep in mind, and so on. These are interesting, but not very useful to the typical reader of Technically. Instead, this post focuses on how well these models perform on real-world tasks that functional teams like marketing, product, and operations would actually use them for.
The TL;DR, for impatient readers:

- Most models are roughly at parity with each other for common chat-oriented tasks
- ChatGPT performed significantly worse than I thought it would
- Gemini, to the surprise of everyone, was the best performing model by a decent margin
- Overall, model responses were usable, but would need a lot of cleanup and work to use practically
The ringer we shall put these models through
I designed three use cases to test each model against, each meant to mimic a real-world task you might have an LLM do in the course of your job. They’re all centered around generating text, even though some of these models are multimodal (they can handle images as well).
1) Generating social posts from an existing piece of content
This one is for marketing teams. A common (and frankly tedious) task is taking an existing piece of content – say, a recently published blog post – and breaking it down into smaller bits to post on social media like X or LinkedIn. For this test, I’ll ask each model to turn this Technically post about microservices into social posts. Here’s how I’ll evaluate the results:
- Does the model faithfully reproduce the important points of the original piece of content?
- Does the generated content *flow* and make sense? Is it easy to read?
- Does the model follow the given formatting constraints for the post? E.g. tweets under 280 characters, line breaks in LinkedIn posts (the sketch after this list shows how you might check these automatically)
2) Synthesizing customer interview notes into an internal update
This one is for product and design teams. For this test, I’ll provide the model with some bullet points from a call I did with a customer (or potential customer) about how they use Technically. I’ll ask the model...