If you’re using Large Language Models at work to help automate your job, or even just for personal use, you’ve now got the choice of several different models instead of just ChatGPT. Between Gemini (formerly Bard), Mixtral, Llama2, and ol’ reliable (ChatGPT), which is the best for the kinds of tasks you need it to do?
Existing benchmarks tend to focus on the technical aspects of these models – how fast do they run, how much context can they keep in mind, etc. These are interesting, but not very useful to typical readers of Technically. Instead, the focus of this post is how well they perform on the real-world tasks that functional teams like marketing, product, and operations would actually use them for.
The TL;DR, for impatient readers:
- Most models are roughly at parity with each other for common chat-oriented tasks
- ChatGPT performed significantly worse than I thought it would
- Gemini, to the surprise of everyone, was the best performing model by a decent margin
- Overall, model responses were usable, but would need a lot of cleanup and work to use practically
The wringer we shall put these models through
I designed 3 use cases to test each model against, each meant to mimic a real-world task that you might have an LLM do for you in the course of your job. They’re all centered around generating text, even though some of these models are multimodal (can do images as well).