The unprecedented rise of large language models (LLMs) has driven rapid adoption of the technology across many spheres. While language models and transformers were already in use for Google Search, translation, and sentiment recognition, ChatGPT can be credited with bringing the transformer architecture and LLMs to the forefront of public attention and usage, reaching one million users within days of launch and 100 million within two months.
Due to their self-attention architecture, ease of training, and high parallelization capabilities, transformer models have taken the field of deep learning by storm. Their applications are prominent in Natural Language Processing, Computer Vision, Speech Recognition, Protein Folding, Reinforcement Learning, and other intersections of deep learning and sub-fields within artificial intelligence and beyond.
However, these models' widespread adoption and eventual ubiquity raise critical questions about privacy and security. As these models interact with vast amounts of personal and sensitive data, safeguarding privacy without hindering functionality becomes paramount. Blockchains, a technology synonymous with security and decentralization, present a promising avenue to address these privacy challenges. By integrating cryptographic methods with the help of blockchains into AI inference and training, it could be possible to create a secure and transparent framework where data ownership and privacy are preserved.
This article explores one such method of convergence of privacy and AI, i.e., Multi-party computation in transformers. While implementations for MPC without the need for blockchains are available, Blockchains offer some guarantees to the framework, which aren't possible in other such implementations. This convergence of blockchains and AI helps unlock a new privacy layer that balances the technological advancement of transformers with the ethical imperatives of modern society.
The most common way of interacting with LLMs today is ChatGPT, Claude, or one of several similar offerings. The reason for their popularity? The ability to draw on the trillions of words they are trained on, answer questions, give analogies that make difficult concepts understandable, and even do creative work.
While the following poem and sonnet are admittedly awkward renditions, they offer a glimpse into these abilities: a template, or even a draft, of anything you want at your fingertips.
We are still in the early days of what is possible. Very long conversations remain impossible, and every conversation essentially starts anew; there is no notion of memory or of remembering the user. At some point coherence also takes a hit, when the model can no longer recall earlier parts of the conversation or refer back to them.
Programming is one of the areas where LLMs have seen the widest acceptance and influence.
Copilot, introduced by GitHub, offers coding assistance directly on the GitHub website and through extensions for IDEs such as Visual Studio Code. Several other tools have appeared that pair proprietary prompts tailored for programming-specific tasks with GPT-4, and these seem to give better results than the original Copilot.
Research has progressed in parallel, posting gains and improving performance as we learn more about the capabilities of transformers. For example, the following work improved on prompting and showed a model's ability to correct its own wrong outputs through three methods:
Simple Feedback: Where you ask the model if it thinks a generated solution is correct or not
Unit Test: Where the feedback from unit tests is given to the model so it can improve upon its code
Code explanation: Where the model is asked to explain the code it has just written
All three approaches show that a model with these simple prompting upgrades outperforms prior work.
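A minimal sketch of such a self-correction loop, here for the unit-test feedback variant. The `ask_llm` interface is hypothetical and stubbed out for illustration, not a real API; a real deployment would call an actual LLM:

```python
# Sketch of the unit-test feedback strategy described above.
# `ask_llm` is a hypothetical stand-in: it first returns a buggy draft,
# then a corrected one once test feedback appears in the prompt.

def ask_llm(prompt: str) -> str:
    if "failing tests" in prompt:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"  # first draft contains a bug

def run_unit_tests(code: str) -> list[str]:
    """Execute the candidate code and return failure messages."""
    namespace = {}
    exec(code, namespace)
    failures = []
    if namespace["add"](2, 3) != 5:
        failures.append("add(2, 3) should be 5")
    return failures

def self_refine(task: str, max_rounds: int = 3) -> str:
    code = ask_llm(task)
    for _ in range(max_rounds):
        failures = run_unit_tests(code)
        if not failures:
            break  # all tests pass, stop refining
        # Feed the failing tests back so the model can improve its code.
        code = ask_llm(f"{task}\nYour code has failing tests: {failures}\nFix it.")
    return code

print(self_refine("Write add(a, b) that returns the sum."))
```

The simple-feedback and code-explanation variants follow the same loop shape, only the feedback string changes.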
Where does this fail, though? The inability to assimilate large bodies of code is a current drawback of these models. Promising use cases in that direction include writing higher-level tests after unit tests for individual functions exist, or helping debug large codebases, where substantial effort is spent today.
Recently, one thing that has become popular when evaluating LLMs is reporting their ability to solve exams.
A quick look at GPT-4's launch shows results on popular exams featured early in the blog post to establish improvement over previous iterations of the model, covering law, medical, coding, aptitude, college entrance, and AP exams. While the importance of these results, and whether they are good signals, can be questioned, a natural question to ask is: "Where are we going?"
Using the following prompt, GPT-4 was shown to be able to explain legal concepts to a very good degree.
Moreover, another work showed GPT-4's ability to pass the bar exam, scoring around 75%, which is enough to put it in the 90th percentile.
LLMs have also been shown to help with medical challenges, performing well on questions from the USMLE (United States Medical Licensing Examination).
These abilities establish that LLMs are quickly getting good at a broad range of tasks, and while they aren't perfect, they might save a lot of time. Will they reach a point where they completely replace humans? Probably not until two key concerns are addressed:
In law, statutes and precedents change over time, while large language models are trained on snapshots of data up to some cutoff date. Unless models can update their internal knowledge, using them ubiquitously for legal applications is extremely difficult.
In medicine, LLMs, and neural networks at large, have been shown to be prone to biases and hallucinations when trained on real-world datasets. Medicine is a safety-critical application where reducing false positives is key; no model should prescribe the wrong medicine to a patient.
Beyond direct user-facing applications like medicine and law, LLMs have also shown promise in their ability to reason and solve math problems.
While there's a long way to go, and LLMs still lag human performance at reasoning by quite a margin, we can expect them to slowly get better at these tasks as well. Terence Tao, in his Microsoft blog post "Embracing change and resetting expectations," wrote:
With effort, human experts can modify ideas that do not work as presented into a correct and original argument. The 2023-level AI can already generate suggestive hints and promising leads to a working mathematician and participate actively in the decision-making process. When integrated with tools such as formal proof verifiers, internet search, and symbolic math packages, I expect, say, 2026-level AI, when used properly, will be a trustworthy co-author in mathematical research, and in many other fields as well.
LLMs today are far from being deployed at peak efficiency and have some way to go before they are fine-tuned for the specific downstream tasks that help humans most. Regardless of model performance, the way forward seems to be collaboration: using these models with human feedback.
While the estimates for AI performance at superhuman levels might be off by a few years or even decades, LLMs have been slowly improving on every task broadly and have gotten much better since just a few years ago. These performance improvements suggest that LLM deployment in various places in the next few years might be inevitable.
In most companies, AI models served by corporations cannot readily be used for internal purposes: model serving occurs through APIs, and there is no guarantee that data sent today will not be added to the training data of tomorrow's LLMs. At the same time, the most compelling use of LLMs for many companies is to embed them alongside their codebases and documentation to help teams and increase productivity. At a more personal level, one possible application of LLMs is in medicine. Even if not prescribing treatment, they could gather medical history or patient data ahead of a doctor's visit; this interaction would expose the model to sensitive patient data, which might not be desired. Writing legal documents or researching cases within a firm could likewise reveal internal files to a model, where data privacy is paramount. And when used in research alongside researchers, even in collaborative settings, researchers might want to silo their data while using the model for their specific interests.
Most personal uses of AI, like chatbots, may reveal the user's data and preferences over time, even unintentionally. As chatbot APIs become stateful, they will remember conversations and the users themselves, which may bring quality-of-life (QOL) benefits but leaks user data to the model and the company behind it.
As LLM offerings proliferate and become ubiquitous either in the background of the apps you use or the interface with which you surf the internet or get things done online, private inference and private training of these models on your data becomes a significant problem to solve. This is not only limited to individuals but to those offering these solutions to entities who want to use it in-house on their dataset as well.
Another angle: as AI vendors grow and offer solutions to entities worldwide, an entity such as a hospital might want to test how good a model is on its own data, but that data is sensitive, private, or simply IP that the entity intends to keep confidential from the AI vendor. This is another use case for private AI inference.
One final use case: two different entities, say two hospitals, want to train a model together on private patient data. How do you keep the data private from each other while still training the model efficiently?
While blockchains today are transparent ledgers of individual transactions made by apps on different chains, they also offer ways to coordinate decentralized networks and entities, and, in tandem with other cryptographic methods like multi-party computation and zero-knowledge proofs, can help provide provable privacy guarantees for various use cases.
Secure Multi-Party Computation (MPC) is a cryptography area concerned with creating methods that allow several parties to jointly compute a function using their private inputs without revealing them to each other. It enables collaboration on sensitive data, such as calculating a shared average without exposing individual numbers, thereby maintaining privacy for all participants.
Imagine three friends who want to know the average of their salaries without revealing their salaries to each other. They could use Secure MPC to accomplish this.
Assume the friends are Alice, Bob, and Eve:
Alice adds a secret random number to her salary and tells the result to Bob.
Bob adds his salary to the number he received from Alice, then tells the result to Eve.
Eve adds her salary to the number from Bob and tells the total to Alice, who subtracts her random number and divides by three to find the average, which she announces to everyone.
The final number is the average salary, yet no one has learned anyone else's salary. One nuance to note: although nobody learns an exact salary, if the average is lower than Eve's salary, Eve can infer that at least one of the other two earns less than she does.
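A minimal sketch of one common variant of this trick, in which a single random mask travels around the ring and is removed by the party who added it. For brevity, one function plays all three parties; in reality each step runs on a different machine:

```python
import random

def secure_average(salaries):
    """Toy three-party average: each party only ever sees a masked
    running total, never another party's raw salary."""
    r = random.randrange(10**9)          # the first party's secret mask
    running = salaries[0] + r            # party 1 -> party 2
    running += salaries[1]               # party 2 -> party 3
    running += salaries[2]               # party 3 -> back to party 1
    total = running - r                  # party 1 removes the mask
    return total / len(salaries)

print(secure_average([70_000, 50_000, 90_000]))  # 70000.0
```

Real MPC protocols replace this single mask with proper secret sharing so that no single party ever holds the full masked sum, but the intuition is the same.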
The above is a very simplified account, and we have glossed over the technicalities of making this practical. Now that we have established some understanding of MPC, where do blockchains come in?
Blockchains, as we are familiar with them, are decentralized digital ledgers that have to ensure continuous operation (liveness) and resilience to failures or malicious behavior (fault tolerance). Liveness ensures that valid transactions are consistently added to the chain without delay. At the same time, fault tolerance allows the system to function correctly even if some network nodes fail or act dishonestly. Together, these properties create a trustworthy and robust system capable of handling various challenges without compromising the integrity of the recorded data.
Real-world applications of blockchains have slowly been growing and achieving escape velocity because they offer access to censorship-resistant, live ledgers. In the work "Credible, Optimal Auctions via Blockchains," the authors explore the auction trilemma and show that when cryptographic commitments are used with the help of a censorship-resistant ledger, auctions are credible.
In "HoneyBadgerMPC and AsynchroMix: Practical Asynchronous MPC and its Application to Anonymous Communication," the authors note that while MPC implementations exist in the wild, they do not offer robustness or fairness guarantees; although the problem had been discussed and explored before, none of those proposals were practical implementations. In secure multi-party computation (SMPC), fault tolerance and liveness are typically assumed rather than explicitly addressed. Conversely, most blockchains today have no way of securing confidential data, no confidentiality layer, and naturally offer zero privacy. Blockchains combined with MPC provide a robust, secure method for computing over private data with liveness guarantees, even in the presence of adversarial or malicious entities. This complementary nature makes blockchain infrastructure synergize with what MPC offers.
Note: You can skip to the next section if you are not interested in learning the mathematical technicalities of the transformer and how it is deployed using MPC.
Now that we have recognized that MPC can be used with blockchains to provide privacy, fault tolerance, and robustness guarantees, let us look at some methods for private inference on transformers with MPC.
While covering every nuance of transformers would take a blog post of its own, I'd like to briefly walk through the architecture so we understand what makes transformers difficult to use in MPC settings.
Given the input "The detective investigated," the text is first tokenized, i.e., split into subwords that are then substituted with numbers from a dictionary. This dictionary is the model's vocabulary, built before the model is trained, and can be thought of as roughly analogous to a human's vocabulary. Model vocabularies typically range from roughly 30,000 to 250,000 tokens, while an adult human has a functioning vocabulary of about 30,000 words, so the largest models already operate at around 8x that.
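As a toy illustration of "split into subwords, then map to ids," here is a greedy longest-match tokenizer over a made-up vocabulary. Real tokenizers (BPE, WordPiece, etc.) learn their subword vocabularies from data; this dictionary is invented for the example:

```python
# Hypothetical toy vocabulary for the example sentence.
VOCAB = {"the": 0, "detect": 1, "ive": 2, "investig": 3, "ated": 4, " ": 5}

def tokenize(text: str) -> list[int]:
    ids, i = [], 0
    text = text.lower()
    while i < len(text):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i:]!r}")
    return ids

print(tokenize("The detective investigated"))  # [0, 5, 1, 2, 5, 3, 4]
```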
Each token is then represented as a vector, whose size (the embedding dimension) varies with choices made during architecture design. Once we have the input embeddings, as seen below, we add positional encodings, because the transformer architecture has no inherent notion of sequence or word order: in a very simple way, positional encodings add a notion of order so the model recognizes that "this is a ball" and "this ball is a" are different sentences and should be treated somewhat differently.
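One classic way to inject word order is the sinusoidal positional encoding from the original Transformer paper; a small sketch:

```python
import math

def positional_encoding(position: int, d_model: int) -> list[float]:
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Each position gets a distinct vector that is simply added to the
# token embedding, giving the model a notion of word order.
print(positional_encoding(0, 8))
print(positional_encoding(1, 8))
```

Many modern models learn positional embeddings instead, but the additive idea is the same.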
Once the embeddings are obtained, they are sent through an attention layer, which can be thought of as a way to understand the language and determine the relations between words. The result is then sent through an MLP feed-forward network, which lets the model process each position on its own. While this describes just one layer, the original diagram above shows an "Nx" on each side of the architecture, meaning each of these modules is repeated N times, which determines the size and parameter count of the model. GPT-3, for instance, has 96 layers.
One intuitive way to think of what happens at higher layers is thinking that transformers learn more abstract representations at the later layers in the model. In comparison, the earlier layers make sense of the language itself.
How are these actually calculated?
Given an input vector $x$, the query, key, and value vectors fed to the attention module are linear projections:

$$q = W_Q x, \quad k = W_K x, \quad v = W_V x$$

Here we assume only a single input vector $x$; during actual computation, these are stacked together as matrices $Q$, $K$, $V$. A dot product is then computed and put through a softmax function:

$$s = \mathrm{softmax}\!\left(\frac{q K^T}{\sqrt{d_k}}\right)$$

This calculates the "attention" weights over the input, and

$$\mathrm{output} = s\,V$$

calculates the final output. The usual way of representing this in most popular literature is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

where the softmax function is given by

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
The softmax is a nonlinear function that normalizes its inputs so they sum to 1.
The MLP layer is calculated by the following formula:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)\,W_2 + b_2$$

These days, GELU is used instead of the ReLU above. Its form is:

$$\mathrm{GELU}(x) = x \cdot \Phi(x) = 0.5x\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)$$
While we have mostly gone over the separate modules of the architecture, observe that at the very end there is a linear layer followed by a softmax, ultimately yielding probabilities over the vocabulary, from which the highest-probability token is chosen.
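The modules above can be stitched into a tiny single-head forward pass. This is a bare-bones sketch using plain Python lists on a couple of vectors, not an efficient or complete implementation:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V,
    written out for lists of small vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is a convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # one row, blending the two rows of V
```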
Why did we go over the transformer architecture? MPC has its own issues when applied to machine learning models. Most machine learning and deep learning computation is done in floating-point arithmetic, while MPC computes over integers, and emulating floating point, even where possible, is very costly. Another reason is that most MPC protocols natively support only basic arithmetic operations like addition and multiplication; every other operation must be composed from or approximated with these, which makes implementing complex functions costly. Finally, the specific issue with transformers is the presence of nonlinear functions like GELU and softmax, which cannot be approximated easily or cheaply.
Recent work like "MPCFormer: fast, performant and private Transformer inference with MPC" and "PUMA: Secure Inference of LLaMA-7B in Five Minutes" have offered some solutions to these problems. Both works provide approximations of these functions and address other issues that might arise because of these choices.
In MPCFormer: fast, performant, and private Transformer inference with MPC, the authors approximate the GELU function with a quadratic, similar to how ReLU is usually approximated:

$$\mathrm{GELU}(x) \approx 0.125x^2 + 0.25x + 0.5$$
And the softmax approximation is made with a quadratic as well, replacing the exponential with a square:

$$\mathrm{softmax}(x)_i \approx \frac{(x_i + c)^2}{\sum_j (x_j + c)^2}$$

where $c$ is a constant.
The GELU approximation is called Quad, and the softmax approximation 2Quad. The experiments find that 2Quad numerically diverges from the original softmax (they are not the same function and have different graphs), but it remains a computationally efficient and workable approximation.
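To see the divergence concretely, here is a small comparison of the exact functions against the two approximations. The constant `c` in 2Quad is set arbitrarily here for illustration:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), via the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_quad(x):
    # Quad approximation: a single quadratic, cheap to evaluate
    # with MPC's additions and multiplications.
    return 0.125 * x * x + 0.25 * x + 0.5

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    return [e / sum(exps) for e in exps]

def softmax_2quad(xs, c=5.0):  # c is an arbitrary illustrative constant
    sq = [(x + c) ** 2 for x in xs]
    return [s / sum(sq) for s in sq]

for x in [-2.0, 0.0, 2.0]:
    print(f"GELU({x}) = {gelu(x):.3f}  Quad = {gelu_quad(x):.3f}")
print(softmax([1.0, 2.0, 3.0]))
print(softmax_2quad([1.0, 2.0, 3.0]))
```

Both softmax variants output distributions summing to 1, but the values differ, which is exactly the divergence the authors observe.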
The authors handle this divergence with a technique called knowledge distillation, originally introduced in deep learning to let larger models act as teachers to smaller ones, so the smaller models learn the same behavior and reach similar performance at a fraction of the size. Here, the technique is used to make the approximated model's representations match those of the original model.
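A minimal sketch of the distillation idea, using the classic softened-logits cross-entropy loss. MPCFormer's actual objective also distills intermediate representations; this shows only the core mechanism:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s - max(scaled)) for s in scaled]
    return [e / sum(exps) for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's: the student is trained to mimic the teacher."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [4.0, 1.0, 0.5]
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # low: distributions match
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # higher: they disagree
```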
The model used in this work is BERT, an early and widely adopted pre-trained language model. The results look promising: they achieve a 5x inference speedup on IMDB, a dataset used for sentiment classification, and on GLUE, a natural language understanding benchmark used to estimate model performance on language tasks. While on IMDB there is no actual loss in accuracy in exchange for MPC privacy guarantees, in the Quad+2Quad case there is a 5-point reduction and just a 2x speedup compared to the original BERT implementation.
The PUMA: Secure Inference of LLaMA-7B in Five Minutes work improves the approximations for GELU and softmax over MPCFormer. The MPCFormer work also made some assumptions in its underlying MPC engine that hurt real-world performance, such as not implementing some functions critical for transformer inference. PUMA rectifies these, leading to better performance.
PUMA's GELU approximation improves on MPCFormer's quadratic by studying the function's graph, as we saw earlier in the article: below about -4 the function tends to 0, and above about 3 it resolves to x itself. The segment between -4 and 3 is approximated with polynomials fitted using NumPy.
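The piecewise structure can be sketched as follows; the middle polynomial below is an illustrative stand-in, not PUMA's fitted coefficients (the paper fits higher-degree polynomials with NumPy):

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_piecewise(x):
    """Piecewise GELU in the spirit of PUMA: exact-valued tails,
    a polynomial in the middle (stand-in coefficients here)."""
    if x < -4.0:
        return 0.0          # GELU is essentially 0 below -4
    if x > 3.0:
        return x            # and essentially x above 3
    return 0.125 * x * x + 0.25 * x + 0.5  # illustrative middle polynomial

print(gelu_piecewise(-5.0), gelu(-5.0))  # both ~0
print(gelu_piecewise(5.0), gelu(5.0))    # both ~5
```

Only the middle segment costs anything in MPC; the tails are a comparison plus a copy, which is where the savings come from.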
A better approximation is similarly found for softmax. For BERT models, PUMA gives almost a 2x inference boost, and communication cost is reduced by about 2.5x. Communication cost is the cost borne by the MPC system to coordinate the computation it performs over the function.
As for future work: PUMA also experiments with LLaMA, a recent architecture trained by Meta that shows much promise for the future of LLMs. In the MPC setting, however, it takes about 500 seconds to generate one token, whereas in real-world conditions without privacy LLaMA can generate about 20 tokens per second, roughly a 10,000x slowdown. Better methods, hardware acceleration techniques, and quantization are needed to speed up MPC-based transformer inference.
One natural question is, "Are blockchains only useful for private inference of transformer models?" While the primary use case for AI x blockchains is a privacy layer, there are also some other aspects.
Will MPC be helpful for private model inference en masse? Blockchains make MPC protocols more secure and fair, but there are still limitations when implementing transformer inference protocols (training has not been explored yet). Ultimately, we approximate nonlinear functions to make inference possible within MPC, but training a neural network this way is extremely difficult: without its nonlinearities, a neural network collapses into a plain linear function. These approximations enable decentralized inference, not training (yet). And even then, in the paper covered above, a 7B-parameter LLaMA model takes minutes to generate one token, whereas locally on a CPU you can generate on the order of 20 tokens per second. The privacy-vs-efficiency tradeoff for deep learning is heavily skewed, and we have a long way to go.
As large language models become more capable, current methods of evaluating their abilities fall short. Evaluation datasets have short shelf lives; just a few model iterations can make a benchmark obsolete. The best way to keep evaluation benchmarks changing and growing is expert evaluation, but coordinating a body of experts and building an unbiased evaluation protocol is complicated. Blockchains offer a way to align incentives, bootstrap expert networks in various fields, and design protocols that could make expert evaluation of LLMs and growing AI capabilities scalable and transparent.
One last application worth mentioning is the recent work "Tools for Verifying Neural Models' Training Data," which introduces "Proof-of-Training-Data": any protocol that allows a model trainer to convince a verifier of the training data that produced a set of model weights. As model training consumes trillions of words, some of that data may be private or copyrighted, and there is a need to establish whether a model used a specific trove of data. Adopted within the broader framework of zero knowledge, protocols like these offer a way of trustlessly establishing data provenance.
This blog post is a soft introduction to how blockchains could act as a privacy layer for AI models and provide private inference for data-sensitive applications. In subsequent posts, we could explore MPC protocol implementations through blockchains in technical detail, the tradeoffs and design decisions involved, and whether zero-knowledge proofs can be combined with MPC to provide other guarantees such a system might need.
It offers an inside view into the world of large language models and applications that are slowly making their way into the hands of millions of people. With this widespread access, issues like data privacy become evident, so we discussed the various privacy pitfalls of LLMs. MPC with blockchains offers a way to compute while keeping data private, with robustness and fairness guarantees. Finally, we looked at recent works applying MPC to transformer inference and discussed future work.
I will cover ZK-ML and FHE-based deep learning protocols in future articles. Thanks for reading!