A Revolutionary Leap in AI: The Dawn of MatMul-Free Large Language Models

The field of artificial intelligence (AI) has been racing to outdo itself, with each new generation of Large Language Models (LLMs) striving to be noticeably more efficient and powerful than the last. That race keeps pushing deep learning toward new paradigms, but it has so far been stymied by a computational bottleneck. To understand why, let's quickly recap how modern AI works. At the core of any language model is a statistical record of how words tend to occur together: 'house' and 'has' follow each other far more often than 'house' and 'obey'. This may seem intuitive to us, but the number of ways words can pair up is astronomically large, and most combinations are unlikely or impossible in practice. Every model must weigh these combinations, and that poses a dilemma. On one hand, we want the AI to consider as many plausible combinations as possible (think 'I drank the mint tea' and 'the mint tea made me sick'). On the other, we also want it to recognise combinations that are rare but perfectly natural to us (think 'the tea was minty'). To strike this balance at scale, models rely on a mathematical workhorse called matrix multiplication (or MatMul, for short), an operation borrowed from linear algebra that sits at the heart of virtually every modern neural network.

Understanding the Role of Matrix Multiplication in AI

Matrix multiplication (henceforth, MatMul) is a fundamental operation in deep learning: it is how a neural network combines input data with its learned weights. Every time a network transforms information, say, to identify a person in an image or a video, MatMul does the heavy lifting, and it must be performed at enormous scale. As AI models have grown, so has the cost of these operations, which demand huge amounts of memory and computational power, typically supplied by clusters of GPUs.
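To see why the cost balloons, it helps to count the work: multiplying an m-by-n matrix by an n-by-p matrix takes m × n × p multiply-accumulate operations. The sketch below uses hypothetical layer sizes (4096-wide, typical of mid-sized LLMs) purely for illustration.

```python
# Counting the multiply-accumulate operations in one matrix product.
# The layer sizes below are illustrative, not from any specific model.

def matmul_macs(m, n, p):
    """Multiply-accumulate count for an (m x n) @ (n x p) product."""
    return m * n * p

# One token passing through a single 4096 x 4096 weight matrix:
print(matmul_macs(1, 4096, 4096))      # → 16777216
# A 2048-token sequence through the same layer:
print(matmul_macs(2048, 4096, 4096))   # → 34359738368 (about 34 billion)
```

And that is a single layer of a single forward pass; a full model stacks dozens of such layers, which is why removing the multiplications entirely is such an attractive idea.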

The Quest for MatMul Alternatives

MatMul-free language models, an architecture that eliminates matrix multiplications from language models entirely, promise lower energy use and carbon emissions than conventional counterparts. Realised by a group at the University of California, Santa Cruz, together with collaborators at Soochow University and the University of California, Davis, they mark a significant milestone in AI research: they show it is possible to achieve comparable performance in LLMs with a substantially smaller memory footprint during inference.

The Breakthrough of Ternary Operations

The crux of the method is replacing the usual 16-bit floating-point weights with ternary weights that take only three values, -1, 0 and +1, so that multiplications can be replaced by additions, subtractions and sign flips. This reduces computational cost substantially while preserving accuracy. It is made possible by BitLinear layers built on ternary weights, and it demonstrates that we don't necessarily have to sacrifice the quality of the language models that serve us well.
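The trick is easy to see in miniature: when every weight is -1, 0 or +1, each term of a weighted sum becomes an add, a subtract, or nothing at all. The sketch below illustrates that idea in plain Python; it is not the authors' GPU implementation, just the arithmetic principle.

```python
# With ternary weights (-1, 0, +1), the multiply in a weighted sum
# degenerates into add, subtract, or skip: no multiplier needed.
# Illustrative sketch only, not the paper's optimised kernel.

def ternary_dot(x, w):
    """Dot product where every weight is -1, 0, or +1."""
    acc = 0.0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi      # +1: add the input
        elif wi == -1:
            acc -= xi      # -1: subtract the input
        # 0: skip the input entirely
    return acc

x = [0.5, -2.0, 3.0, 1.5]
w = [1, 0, -1, 1]
print(ternary_dot(x, w))   # → -1.0  (0.5 - 3.0 + 1.5)
```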

Rethinking Transformer Architecture

A major innovation of the new architecture is its revamped Transformer blocks, the foundation of standard language model design. The researchers replaced conventional self-attention, the MatMul-heavy token-mixing core of the Transformer, with a MatMul-free Linear Gated Recurrent Unit (MLGRU), and replaced channel mixing with a reformulated Gated Linear Unit (GLU) that likewise avoids MatMul. The resulting architecture can be implemented without relying on any MatMul operations, making it one of the most efficient language model designs yet devised, driven entirely by additions and element-wise products, with not a shred of multiplication between matrices to be found.
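The flavour of that token mixing can be sketched with a gated, element-wise recurrence: a per-channel forget gate decides how much of the previous hidden state to keep versus how much of a new candidate to let in. The example below is a loose illustration of that idea, not the paper's exact MLGRU formulation; in the real model the gate and candidate come from ternary-weight BitLinear projections of the input, whereas here they are supplied directly.

```python
import math

# Element-wise gated recurrence, in the spirit of the MLGRU idea:
# only additions and element-wise products, no matrix multiplications.
# Gate logits and candidates are given directly for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_step(h_prev, forget_logits, candidate):
    """One step: h_t = f * h_prev + (1 - f) * c, all element-wise."""
    out = []
    for h, fl, c in zip(h_prev, forget_logits, candidate):
        f = sigmoid(fl)                  # forget gate in (0, 1)
        out.append(f * h + (1.0 - f) * c)
    return out

h = gated_step([0.0, 1.0], forget_logits=[100.0, -100.0], candidate=[5.0, 5.0])
print([round(v, 3) for v in h])  # → [0.0, 5.0]: channel 1 kept, channel 2 replaced
```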

Performance Insights of MatMul-Free Models

These MatMul-free language models have been evaluated on several benchmarks and shown to use computational resources more efficiently than the Transformer++ architecture. At matched parameter counts they remain competitive with Transformer++, and the reported scaling trends suggest the gap closes, and may eventually reverse, as models grow. This points to a more efficient path to high-quality language models, with the potential to transform how they are built.

The Advantages of Going MatMul-Free

Still, one of the strongest selling points of MatMul-free models is that they use less memory and have lower latency, an advantage that grows as model sizes increase. At scale, MatMul-free models are far more efficient than traditional approaches, which bodes well for the sustainability and accessibility of LLMs. As more researchers and institutions gain access to powerful AI tools, this work could pave the way for a new era of AI-assisted science.

The Future Is MatMul-Free

The work is not just an academic exercise; it is a big step toward making AI more scalable, energy-efficient and economical. The researchers are releasing their algorithm and pretrained models to the research community, which should spur further experimentation with MatMul-free architectures and, ideally, reduce reliance on specialised and often expensive hardware as powerful AI technologies reach more people.

Decoding the Matrix

At its core, a 'matrix' in deep learning is a grid of numbers, like a spreadsheet, holding both the data being transformed and the weights through which neural networks learn to make predictions. Matrix multiplication, or MatMul, is the operation that merges these grids (matrices) into new ones through a fixed series of arithmetic steps, none of which can be skipped, altered or ignored. MatMul underpins the Transformer architecture, performing the heavy number-crunching that makes natural language processing (NLP) possible, and it has become the workhorse of contemporary AI computation. But, as we've seen, the seemingly foundational status of MatMul is being called into question by the explosive rise of MatMul-free models, which appear to be radically reinventing what's possible in the field of artificial intelligence.
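The operation itself is simple to state: each entry of the output grid is the sum of pairwise products of one row of the first matrix with one column of the second. A minimal sketch, with made-up numbers:

```python
# Plain matrix multiplication: every output entry is a row-by-column
# sum of products, and every one of those products must be computed.

def matmul(A, B):
    """Multiply matrix A (m x n) by matrix B (n x p)."""
    n, p = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

# A 1 x 3 input row combined with a 3 x 2 weight grid:
x = [[1.0, 2.0, 3.0]]
W = [[0.5, -1.0],
     [0.0,  2.0],
     [1.0,  0.5]]
print(matmul(x, W))   # → [[3.5, 4.5]]
```

It is exactly these row-by-column products, repeated billions of times, that the MatMul-free approach replaces with additions and element-wise operations.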

Jun 14, 2024