Can bigger-is-better ‘scaling laws’ keep AI improving forever? History says we can’t be too sure
OpenAI chief executive Sam Altman – perhaps the most
prominent face of the artificial intelligence (AI) boom that accelerated with
the launch of ChatGPT in 2022 – loves scaling laws.
These widely admired rules of thumb, which link the size
of an AI model to its capabilities, inform much of the AI industry's headlong
rush to buy up powerful computer chips, build unimaginably large data
centres, and reopen shuttered nuclear plants.
As Altman argued in a blog post earlier this year,
the thinking is that the “intelligence” of an AI model “roughly equals the log
of the resources used to train and run it” – meaning you can steadily produce
better performance by exponentially increasing the scale of data and computing
power involved.
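In rough symbols – a loose paraphrase of Altman's claim, not a formula he states in this exact form – the idea is

\[
\text{capability} \;\approx\; k \,\log(\text{resources}),
\]

which means each equal step up in capability requires multiplying the resources by some constant factor: ten times the compute for one increment, a hundred times for two, and so on.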
First observed in 2020 and further refined in 2022,
the scaling laws for large language models (LLMs) come from drawing lines on
charts of experimental data. For engineers, they give a simple formula that
tells you how big to build the next model and what performance increase to
expect.
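One widely cited version of that formula – the form fitted in the 2022 work often called "Chinchilla", quoted here as an illustration – writes a model's training loss L as a function of its parameter count N and the number of training tokens D:

\[
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\]

where E is an irreducible floor and A, B, α and β are constants fitted to experimental runs. Nothing in it is derived from first principles; it is a curve that happens to match the data collected so far.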
Will the scaling laws keep on scaling as AI models
get bigger and bigger? AI companies are betting hundreds of billions of dollars
that they will – but history suggests it is not always so simple.
Scaling laws aren’t just for AI
Scaling laws can be wonderful. Modern aerodynamics is
built on them, for example.
Using an elegant piece of mathematics called the
Buckingham π theorem, engineers discovered how to compare small models in wind
tunnels or test basins with full-scale planes and ships by making sure some key
numbers matched up.
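The best known of those key numbers is the Reynolds number, a dimensionless ratio of inertial to viscous forces in a fluid (a simplified illustration – real tests match several such groups at once):

\[
\mathrm{Re} \;=\; \frac{\rho v L}{\mu},
\]

where ρ is the fluid's density, v the flow speed, L a characteristic length and μ the viscosity. Keep Re the same and the flow around a small model behaves like the flow around the full-size craft – which is why a one-tenth-scale model may be tested at higher speeds or in pressurised air.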
Those scaling ideas inform the design of almost
everything that flies or floats, as well as industrial fans and pumps.
Another famous scaling idea underpinned the boom
decades of the silicon chip revolution. Moore’s law – the idea that the number
of tiny switches called transistors on a microchip would double every two
years or so – helped designers create the small, powerful computing technology
we have today.
But there’s a catch: not all “scaling laws” are
laws of nature. Some are purely mathematical and can hold indefinitely. Others
are just lines fitted to data that work beautifully until you stray too far
from the circumstances where they were measured or designed.
When scaling laws break down
History is littered with painful reminders of scaling
laws that broke. A classic example is the collapse of the Tacoma Narrows Bridge
in 1940.
The bridge was designed by scaling up what had worked
for smaller bridges to something longer and slimmer. Engineers assumed the same
scaling arguments would hold: if a certain ratio of stiffness to bridge length
worked before, it should work again.
Instead, moderate winds set off an unexpected
instability called aeroelastic flutter. The bridge deck tore itself apart,
collapsing just four months after opening.
Likewise, even the “laws” of microchip manufacturing
had an expiry date. For decades, Moore’s law (transistor counts doubling every
couple of years) and Dennard scaling (a larger number of smaller transistors
running faster while using the same amount of power) were astonishingly
reliable guides for chip design and industry roadmaps.
As transistors became small enough to be measured in
nanometres, however, those neat scaling rules began to collide with hard
physical limits.
When transistor gates shrank to just a few
atoms thick, they started leaking current and behaving unpredictably. The
operating voltages could also no longer be reduced without signals being
drowned out by background noise.
Eventually, shrinking was no longer the way
forward. Chips have still grown more powerful, but now through new designs
rather than just scaling down.
Laws of nature or rules of thumb?
The language-model scaling curves that Altman
celebrates are real, and so far they’ve been extraordinarily useful.
They told researchers that models would keep getting
better if you fed them enough data and computing power. They also showed
earlier systems were not fundamentally limited – they just hadn’t had enough
resources thrown at them.
But these are undoubtedly curves that have been fitted
to data. They are less like the derived mathematical scaling laws used in
aerodynamics and more like the useful rules of thumb used in microchip design –
and that means they likely won’t work forever.
The language model scaling rules don’t necessarily
encode real-world problems such as limits to the availability of high-quality
data for training, or the difficulty of getting AI to deal with novel tasks –
let alone safety constraints or the economic difficulties of building data
centres and power grids. There is no law of nature or theorem guaranteeing that
“intelligence scales” forever.
Investing in the curves
So far, the scaling curves for AI look pretty smooth
– but the financial curves are a different story.
Deutsche Bank recently warned of an AI “funding gap”
based on Bain Capital estimates of a US$800 billion mismatch between projected
AI revenues and the investment in chips, data centres and power that would be
needed to keep current growth going.
JP Morgan, for its part, has estimated that the
broader AI sector might need around US$650 billion in annual revenue just to
earn a modest 10% return on the planned build-out of AI infrastructure.
We’re still finding out which kind of law governs
frontier LLMs. Reality may keep playing along with the current scaling
rules, or new bottlenecks – data, energy, users' willingness to pay – may bend
the curve.
Altman’s bet is that the LLM scaling laws will
continue. If that’s so, it may be worth building enormous amounts of computing
power because the gains are predictable. On the other hand, the banks’ growing
unease is a reminder that some scaling stories can turn out to be Tacoma
Narrows: beautiful curves in one context, hiding a nasty surprise in the next.