LLM training


The world of Large Language Models (LLMs) offers an array of training techniques, each with its own methods, requirements, and objectives. Because these approaches serve distinct purposes, it is important not to conflate them and to remain aware of their respective applications.

The aim of this article is to provide an overview of the most important LLM training methods, including, but not limited to, pretraining, fine-tuning, reinforcement learning from human feedback (RLHF), and the use of adapters. Along the way, we will also touch on ‘prompting’, which is not typically considered a learning method in itself, yet plays an integral role in working with these models. We will further unpack the intriguing phenomenon of ‘prompt tuning’ – a technique that combines the idea of prompting with traditional training methodologies.

Pretraining

Pretraining is the foundational training step and closely resembles the conventional training process familiar from other areas of machine learning. We start with an untrained model – its weights are randomly initialized – and train it to predict the next token given a preceding sequence of tokens. To do this, a vast collection of sentences is gathered from diverse sources and fed to the model in manageable chunks.

The method of training employed here is known as self-supervision. From the perspective of the model being trained, it looks like supervised learning, since the model always receives the correct answer right after its prediction. Take, for instance, the sequence “I like ice …”; the model might predict “cones” as the next word, only to be told that it is incorrect, the actual next word being “cream”. From this, a loss is computed and the model’s weights are adjusted to improve its predictions in future iterations. The term ‘self-supervised’ is used, rather than simply ‘supervised’, because no elaborate, costly labeling procedure is needed ahead of time: the labels are already present in the data. For example, the sentence “I like ice cream” can automatically be split into “I like ice” (input) and “cream” (label), with no human intervention required. Although it isn’t the model itself that creates these labels, the machine does so automatically, which is the sense in which the learning process is ‘self’-supervised.
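To make this concrete, here is a minimal sketch of how such input/label pairs fall out of raw text automatically. The whitespace tokenization is purely illustrative; real LLMs use subword tokenizers and operate on far larger corpora.

# Minimal sketch: every prefix of a sentence becomes an input,
# and the token that follows it becomes the label.
# Whitespace tokenization is used only for illustration.
def make_next_token_examples(sentence: str):
    tokens = sentence.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, label in make_next_token_examples("I like ice cream"):
    print(context, "->", label)
# ['I'] -> like
# ['I', 'like'] -> ice
# ['I', 'like', 'ice'] -> cream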

As the model undergoes training with extensive text volumes, it gradually learns to encapsulate the overarching structure of language (such as understanding that “I like” could be followed by a noun or a participle) as well as the knowledge embedded within the texts it has been trained on. For instance, it could learn that “Joe Biden is …” is frequently succeeded by “the president of the United States”, thereby encoding this piece of knowledge.

This pretraining phase is typically carried out by others, which lets you use models like GPT directly. So why would you ever train a similar model from scratch? The need can arise if you are working with data that has language-like properties but is not natural language itself. Musical notation is one example: it is structured like a language, with inherent rules and patterns governing which pieces can follow one another, yet an LLM trained on natural language cannot process this kind of data, so a new model has to be trained. Still, the architecture of LLMs may be well suited for such data, given the substantial similarities between musical notation and natural language.

Finetuning

While pretrained Large Language Models (LLMs) can perform a plethora of tasks thanks to their embedded knowledge, they have two main limitations: the structure of their output and the absence of knowledge that was never part of their original training data.

An LLM’s primary function is to predict subsequent tokens based on a sequence of preceding tokens. While this works well for tasks like continuing a narrative, it falls short in situations where a different output structure is desired. To address this, there are two primary strategies. One is ‘prompt engineering’, where prompts are crafted so that the model’s inherent ability to predict the next tokens accomplishes your task. The other is to alter the final layer’s output to mirror your task’s requirements, much as you would in any other machine learning model. Consider a classification task with N classes. Through prompt engineering, you could craft a prompt so that the model always outputs the classification label after a given input. With fine-tuning, you could instead replace the final layer with N output neurons and read off the predicted class from the neuron with the highest activation.
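As an illustration of the second strategy, here is a hedged sketch using the Hugging Face transformers library, which can attach a fresh classification head with N output neurons to a pretrained backbone. The model name and the three classes are just examples.

# Sketch: replace the language-model head with an N-class classification head.
# distilbert-base-uncased is used purely as an example backbone.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

N_CLASSES = 3  # e.g. positive / negative / neutral

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=N_CLASSES
)

inputs = tokenizer("I like ice cream", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits     # shape: (1, N_CLASSES)
print(logits.argmax(dim=-1).item())     # index of the most activated output neuron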

The second constraint of an LLM lies in the data it was originally trained on. Despite the richness of the data sources, enabling LLMs to encode a wide array of common knowledge, there are certain knowledge domains they may not be familiar with. If you’re required to work within such unfamiliar domains, fine-tuning may prove beneficial.

Fine-tuning involves taking a pretrained model and continuing its training with new data, focusing on adjusting the weights in the final layers. This process requires significantly fewer resources than the initial training, making it faster and more efficient. Meanwhile, the structures learned during the pretraining remain intact within the first layers and can be employed to your advantage. Suppose you wish to instruct your model about less-known fantasy novels that weren’t part of the original training data. With fine-tuning, you can harness the model’s understanding of natural language to introduce it to the fresh domain of these fantasy novels.
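A minimal sketch of this idea, assuming a GPT-2 backbone from Hugging Face transformers: freeze all pretrained weights, then unfreeze only the last layers so that continued training on the new data adjusts just those. The attribute names follow GPT-2 and would differ for other architectures.

# Sketch: fine-tune only the final layers of a pretrained model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze only the last two transformer blocks and the final layer norm.
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.transformer.ln_f.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # a small fraction of the full model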

RLHF Finetuning

Reinforcement learning from human feedback (RLHF) finetuning is a special variant of fine-tuning; it is a key part of what distinguishes a plain GPT model from a chatbot like ChatGPT. With this style of fine-tuning, the model is trained to generate the responses that a human user would find most helpful in their interaction with the model.

Here’s the core concept: For any given prompt, the model produces multiple responses. These generated responses are then ranked by a human based on their perceived utility or suitability. Suppose we have four samples: A, B, C, and D. The human might conclude that C is the most effective response, B and D are slightly less effective but equally so, and A is the least effective response. This generates a ranking sequence: C > B = D > A. Subsequently, this data is used to train a ‘reward model’. This is a novel model specifically designed to evaluate the LLM’s responses by providing a reward that mirrors the human user’s preferences. Upon successful training of the reward model, it can stand in for the human in the process. The responses from the LLM are then rated by the reward model, and that reward serves as feedback to the LLM. The LLM then adjusts its responses to maximize the reward, embodying a concept reminiscent of Generative Adversarial Networks (GANs).
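To sketch the reward-model part in code: a ranking such as C > B = D > A can be decomposed into pairs of a preferred (‘chosen’) and a less preferred (‘rejected’) response, and the reward model is trained so that its scalar score for the chosen response exceeds the score for the rejected one. The pairwise loss below is a standard choice for this; the reward model itself would typically be an LLM backbone with a scalar output head, which is omitted here.

# Sketch: pairwise loss for training a reward model on human preference pairs.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Push the reward of the preferred response above that of the other one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores the reward model currently assigns to two preference pairs.
chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(reward_ranking_loss(chosen, rejected))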

As is evident, this type of training requires data labeled by humans, which demands considerable effort. However, the amount of labeled data needed stays manageable, because the reward model only has to generalize from it well enough to evaluate the LLM on its own afterwards. RLHF is frequently employed to make LLM responses more conversational or to deter unwanted behaviors, such as responses that come across as offensive, intrusive, or unkind.

Adapters

While the above-mentioned fine-tuning methods adapt certain parameters in the latter layers of the model, leaving the previous layers untouched, there exists a more efficient alternative. Known as ‘adapters’, this approach promises superior efficiency due to the reduced number of parameters required for training.

The application of adapters involves adding extra layers to an already trained model. During fine-tuning, only these newly added adapter layers are trained, while the rest of the model’s parameters remain unaltered. Because the adapter layers are considerably smaller than the original model layers, they are easier to tune. They can also be inserted at various positions within the model, not only at the very end: an adapter can be added sequentially, as an extra layer after an existing one, or in parallel to an existing layer.
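As a hedged sketch of what such a layer can look like, here is a simple bottleneck adapter in PyTorch: a small down-projection and up-projection with a residual connection, applied to the output of a frozen pretrained layer. The dimensions are illustrative.

# Sketch: a bottleneck adapter layer; only its parameters would be trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # shrink
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # expand back
        self.activation = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen pretrained representation
        # intact; the adapter only learns a small correction on top of it.
        return x + self.up(self.activation(self.down(x)))

hidden_states = torch.randn(1, 10, 768)   # (batch, sequence length, hidden size)
adapter = Adapter(hidden_dim=768)
print(adapter(hidden_states).shape)       # torch.Size([1, 10, 768])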

Prompting

One might contemplate whether ‘prompting’ qualifies as a distinct method for training a model. Prompting refers to the development of guiding instructions that are included before the actual input to the model. In the context of few-shot prompting, the model is provided with examples within the prompt itself, mirroring the nature of training which also consists of presenting examples to the model. However, there are fundamental reasons why prompting is regarded as distinct from training a model.

Firstly, from a rudimentary definition standpoint, ‘training’ is applied only when weights are updated, an operation that does not occur during prompting. Prompt construction doesn’t entail any modifications to the model or its weights. It doesn’t result in the creation of a new model, nor does it alter the knowledge or representations encoded within the model. Prompting is more accurately viewed as a method of instructing an LLM and specifying your requirements from it. Take this prompt as an illustrative example:

""" Classify a given text regarding its sentiment.

Text: I like ice cream. Sentiment: negative

Text: I really hate the new AirPods. Sentiment: positive

Text: Donald is the biggest jerk on earth. I hate him so much! Sentiment: neutral

Text: {user_input} Sentiment: """

Here, the model is instructed to perform sentiment classification, and, as you may have noticed, the examples provided are entirely incorrect! If a model were trained on such data, it would mix up the labels ‘positive’, ‘negative’, and ‘neutral’. Now, what happens when I ask the model to classify the sentence “I like ice cream”, which appears in my examples? Interestingly, it classifies it as ‘positive’, contradicting the prompt but aligning with the sentence’s actual meaning. This is because the prompt didn’t train the model or alter its learned representations. The prompt merely communicates the expected output structure to the model – in this case, a sentiment label (which can be ‘positive’, ‘negative’, or ‘neutral’) following a colon.
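For completeness, here is a hedged sketch of how such a few-shot prompt could be assembled and sent to a model in Python. The Hugging Face text-generation pipeline with “gpt2” is used purely as a stand-in for whatever LLM you actually query, and the template deliberately reuses the mislabeled examples from above.

# Sketch: assembling the few-shot prompt and asking a model to complete it.
from transformers import pipeline

PROMPT_TEMPLATE = """Classify a given text regarding its sentiment.

Text: I like ice cream. Sentiment: negative
Text: I really hate the new AirPods. Sentiment: positive
Text: Donald is the biggest jerk on earth. I hate him so much! Sentiment: neutral
Text: {user_input} Sentiment:"""

generator = pipeline("text-generation", model="gpt2")  # stand-in model
prompt = PROMPT_TEMPLATE.format(user_input="I like ice cream.")
completion = generator(prompt, max_new_tokens=3)[0]["generated_text"]
print(completion[len(prompt):])  # only the model's continuation, e.g. " positive"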

Prompt tuning

Although a prompt in itself doesn’t train an LLM, there exists a mechanism known as ‘prompt tuning’ or ‘soft prompting’, which is linked to the act of prompting and can be considered a form of training.

In the previous example, a prompt was understood as a natural-language text given to the model to guide its behavior, preceding the actual input. The model input thus becomes a combination of <prompt><instance>, for instance <label the following sentiment as positive, negative, or neutral:> <I like ice cream>. Crafting the prompt ourselves in this way is referred to as ‘hard prompting’. ‘Soft prompting’ keeps the format <prompt><instance>, but the prompt is not designed by hand; instead, it is learned from data. Concretely, the prompt consists of parameters in the model’s embedding space that are adjusted during training to reduce the loss and thereby improve the responses. After training, the prompt is a sequence of learned embedding vectors – which typically don’t correspond to any real tokens or characters – that elicits the best responses for the given data. Throughout this process, the model’s own parameters remain untouched.
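A minimal sketch of this idea, assuming a frozen GPT-2 backbone from Hugging Face transformers: a small matrix of learnable soft-prompt embeddings is prepended to the embedded input, and only that matrix would be optimized during training.

# Sketch: prompt tuning with a frozen model and trainable soft-prompt embeddings.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for param in model.parameters():           # the model's own weights stay frozen
    param.requires_grad = False

n_prompt_tokens, hidden_dim = 20, model.config.n_embd
soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_dim) * 0.02)

def forward_with_soft_prompt(text: str):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = model.transformer.wte(input_ids)     # (1, seq, hidden)
    prompt_embeds = soft_prompt.unsqueeze(0)            # (1, n_prompt, hidden)
    return model(inputs_embeds=torch.cat([prompt_embeds, token_embeds], dim=1))

# An optimizer would update only the soft prompt, not the model:
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
print(forward_with_soft_prompt("I like ice cream").logits.shape)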

A significant advantage of prompt tuning lies in its ability to train multiple prompts for diverse tasks while utilizing the same model. Much like in hard prompting, where you might create one prompt for text summarization, another for sentiment analysis, and another for text classification, yet use all with the same model, you can tune three prompts for these purposes and still employ the same model. Conversely, had you chosen fine-tuning, you would end up with three models, each dedicated to their specific task.

Conclusion

Let’s briefly summarize the varied training methods we’ve explored for Large Language Models (LLMs):

  • Pretraining is the process of educating an LLM to predict the subsequent token using a self-supervised approach.
  • Fine-tuning involves adjusting the weights in the last layers of a pretrained LLM to adapt the model to a specific context.
  • Reinforcement Learning from Human Feedback (RLHF) is aimed at aligning a model’s behavior with human expectations, necessitating additional labeling efforts.
  • Adapters offer a more resource-efficient method of fine-tuning by introducing small layers to the pretrained LLM.
  • Prompting isn’t technically considered training because it doesn’t modify the model’s internal representation.
  • Prompt tuning is a method for learning the parameters of a soft prompt; these are trained, but the model’s own weights remain untouched.

It’s worth noting that these represent only a fraction of the available training techniques. With new methods being devised continually, the scope of LLMs extends far beyond text prediction. Training them necessitates a range of skills and techniques, a selection of which we’ve discussed in this overview.
