Why does AI love the em-dash so much?
AI's love affair with the em-dash seems to have a simple explanation: the data used to train large language models was full of em-dashes. The AI is simply mimicking the writers it learned from.
In fact, there's some evidence to suggest that the content AI was trained on included significantly more em-dashes than you might expect. Weirdly enough, their prevalence seems to have become a deep bias embedded in how LLMs understand the flow and structure of writing.
AI-training material may have contained an overabundance of em-dashes
One theory behind AI's love of the em-dash is that later-generation AI models, which rely on it much more heavily than earlier iterations, were trained on older books containing more em-dashes than most modern writers would use.
Early on, most AI models were trained on a mix of public data from the internet and content from pirated books. As the tools evolved, however, the quest for better-quality training data led AI labs to scan older texts. Curating the massive data trove that is the internet has been a major focus for AI labs in more recent model generations, and finding quality text from books was certainly part of that.
The exact timeline for when this happened is something of a mystery, but court documents show that Anthropic started scanning books in 2024, and other AI labs likely made a similar move somewhere between 2022 and 2024.
If AI labs digitized mostly older books, a common belief largely because those books' copyrights have expired, their models may have been fed writing with significantly more em-dashes, especially since studies show that use of the em-dash peaked in the 1860s.
It may not have been the books alone, either.