
Transformers

ai-architecture · intermediate

Transformers are the neural network architecture that powers modern AI like ChatGPT, revolutionizing how models process language and other sequential data.

  • Breakthrough "attention mechanism" lets models focus on relevant parts of input
  • Replaced older architectures that processed information sequentially
  • Powers all major language models: GPT, Claude, Gemini, and others
  • Think of a spotlight that can illuminate multiple things at once, versus a flashlight pointing at one thing

Transformers turned AI from reading word-by-word to understanding entire contexts instantly.

What is a transformer in AI?


A transformer is a neural network architecture—basically a blueprint for how to build an AI model. It's the design that powers virtually every major language AI you've heard of: ChatGPT, Claude, Gemini, and others.

Architectures are the blueprints for AI models: they dictate how models are designed and built. Most AI today is made up of computing units called neurons linked together in complex networks. There are a million ways to build these networks: different algorithms, structures, and sizes.

Transformers represent a specific way of connecting these neurons that turned out to be incredibly effective for language tasks.

How do transformer models work?


The key innovation in transformers is called the "attention mechanism"—it lets the model look at all parts of the input simultaneously and decide which parts are most relevant for what it's trying to do.

Before transformers (older architectures):

  • Processed text word-by-word in sequence
  • Had trouble connecting distant parts of sentences
  • Like reading with a flashlight, only seeing one word at a time

With transformers:

  • Can look at entire sentences or paragraphs at once
  • Connects related concepts regardless of distance
  • Like having a spotlight that illuminates everything and focuses on what matters

This parallel processing is why transformers pair so well with GPUs: instead of waiting for each word to be processed in turn, the model can analyze everything simultaneously.
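The spotlight idea can be sketched in a few lines of plain Python. This is a toy illustration of attention, not production code: one made-up "word" vector is compared against every word in a made-up sentence, and the comparison scores become weights that sum to 1.

```python
import math
import random

def attention(query, keys, values):
    """One position attending over all positions at once (scaled dot-product)."""
    d = len(query)
    # Compare the query word against every word in the input simultaneously
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Softmax turns raw scores into weights that sum to 1 (the "spotlight")
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # The output is a weighted blend of every position's value vector
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]
    return out, weights

random.seed(0)
sentence = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]  # 4 toy "words"
out, weights = attention(sentence[0], sentence, sentence)
print(len(weights), round(sum(weights), 6))  # 4 weights over 4 words, summing to 1.0
```

The key point: nothing in this computation depends on processing words one at a time, which is exactly what lets GPUs run it in parallel.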

What's special about the transformer architecture?


Three major breakthroughs made transformers revolutionary:

1. Attention Mechanism

The model can focus on relevant parts of the input while processing any given word. When processing "The bank was steep," it knows whether "bank" refers to a financial institution or a riverbank based on context.

2. Parallel Processing

Unlike older models that had to process sequentially, transformers can analyze entire sequences simultaneously, making them much faster to train and run.

3. Scalability

Transformers get better as you make them bigger and feed them more data, which wasn't true for previous architectures that hit performance ceilings.

Why are transformers better than previous models?


Previous architectures had fundamental limitations:

  • RNNs (Recurrent Neural Networks): Had to process text sequentially and forgot earlier parts of long sequences. Like trying to remember the beginning of a conversation while focusing on the current sentence.
  • CNNs (Convolutional Neural Networks): Great for images but struggled with variable-length text and long-range dependencies.
  • Transformers solved these problems: They can process any length of text, maintain context across entire documents, and train much faster due to parallel processing.
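The sequential bottleneck is easy to see in code. In this toy sketch (not a real RNN or real attention), the recurrent version can't start step two until step one finishes, while the attention-style version's comparisons are all independent of one another:

```python
def rnn_style(tokens):
    """Each step depends on the previous hidden state: inherently sequential."""
    hidden = 0.0
    for t in tokens:               # must run in order, one token at a time
        hidden = 0.5 * hidden + t  # toy "recurrence", not a real RNN cell
    return hidden

def transformer_style(tokens):
    """Every pairwise comparison is independent: all can run at once on a GPU."""
    return [[a * b for b in tokens] for a in tokens]  # n x n grid of scores

tokens = [1.0, 2.0, 3.0]
print(rnn_style(tokens))               # prints 4.25: result of the ordered chain
print(len(transformer_style(tokens))) # prints 3: a 3 x 3 grid, no ordering needed
```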

What models use transformers?

Pretty much every major language AI:

  • Text Generation: GPT-5, Claude, Gemini, LaMDA
  • Code Generation: GitHub Copilot, CodeT5
  • Translation: Google Translate (modern versions)
  • Search: Newer search algorithms use transformer-based understanding

Even multimodal models (text + images) often use transformer architectures adapted for different data types.

When were transformers invented?

In 2017, in a research paper called "Attention Is All You Need" by Google researchers. The title was a cheeky reference to the fact that attention was the key innovation that made everything else possible.

The architecture became dominant surprisingly quickly: within five years, virtually all state-of-the-art language models were transformer-based.

Frequently Asked Questions About Transformers

What does "attention" mean in transformers?

Attention is how the model decides which parts of the input to focus on when processing any given element. When writing "The cat sat on the...", the model pays attention to "cat" and "sat" to predict "mat" is more likely than "stove."

Are transformers only for text?

Nope. While they became famous for language tasks, transformers work well for any sequential data: images (Vision Transformer), audio, time series data, even protein sequences in biology.

What's the difference between GPT and transformers?

GPT (Generative Pre-trained Transformer) is a specific implementation of the transformer architecture. It's like asking the difference between "car" and "Toyota" - GPT is a type of transformer.

Why do transformers need so much computational power?

The attention mechanism requires comparing every word to every other word in the input, which scales quadratically. Processing a 1,000-word document requires roughly 1 million attention calculations.
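You can see the quadratic blow-up with simple arithmetic. This is a rough count of pairwise score computations only, ignoring the constant factors real implementations add:

```python
# Attention compares every token to every other token: n * n score computations.
for n in [100, 1_000, 10_000]:
    print(f"{n:>6} tokens -> {n * n:>12,} attention scores")
```

Multiplying the input length by 10 multiplies the attention work by 100, which is why long documents get expensive fast.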

What comes after transformers?

Researchers are working on architectures that maintain transformers' capabilities while being more efficient (like Mamba and other "state space models"), but transformers remain dominant for now.

Written with 💔 by Justin in Brooklyn