
Instruction Tuning: What is fine-tuning?

- Reading Time: 6 minutes

Instruction tuning is an innovative method of fine-tuning Large Language Models by adding specific instructions to example data. Find out why this approach has the potential to revolutionize AI!

Over the past few years, Machine Learning and Natural Language Processing (NLP) have evolved considerably. In particular, the way in which models are trained has changed.

With the advent of pre-trained models such as BERT or GPT, fine-tuning pre-trained models for downstream tasks became the new norm.

Subsequently, the increasing capabilities of ever-larger language models enabled in-context learning via prompting. And more recently, a new method has emerged for making LLMs useful in practice: instruction tuning.

By combining example data with instructions, this innovative approach makes language models much more versatile. Before exploring this technique in more detail, let’s start by looking back at the concept of fine-tuning.

What is fine-tuning?

Pre-trained language models offer tremendous potential, but are not natively expert in a specific domain.

To become specialized in tasks such as sentiment analysis, language translation or answering questions on specific topics, they need to go through a process called “fine-tuning”.

This process puts the finishing touches to a model, making it more specialized. It usually involves training the model on a smaller, task-specific dataset.

The dataset is labeled with examples relevant to the targeted task. Exposed to these examples, the model adjusts its internal parameters and representations accordingly.

The knowledge acquired during pre-training is exploited, saving time and resources. Once refined in this way, the language model performs better on the tasks for which it has been tuned.
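To make this concrete, here is a minimal sketch of supervised fine-tuning with the Hugging Face transformers library, assuming a BERT model and the IMDB sentiment dataset; the model, dataset and hyperparameters are illustrative choices, not prescriptions.

```python
# Minimal supervised fine-tuning sketch (illustrative choices throughout).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # two classes: negative / positive

# A smaller, task-specific labeled dataset (here: movie-review sentiment).
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),
    tokenizer=tokenizer,  # enables dynamic padding of each batch
)
trainer.train()  # adjusts the pre-trained weights for the target task
```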

The challenge, however, is to generalize the expertise acquired during training to other tasks. This is where instruction tuning comes in.

Instruction tuning vs fine-tuning: what are the differences?

The main difference between instruction tuning and standard supervised fine-tuning lies in the data on which the model is trained.

Whereas supervised fine-tuning consists of training models on example inputs and the outputs derived from them, instruction tuning fleshes out these input-output examples with a third component: instructions.
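Concretely, the difference shows up in the shape of each training record. The records below are invented for illustration:

```python
# Standard supervised fine-tuning: plain input-output pairs.
sft_example = {
    "input": "The film was a complete waste of time.",
    "output": "negative",
}

# Instruction tuning: the same pair, fleshed out with an explicit instruction.
instruction_example = {
    "instruction": "Classify the sentiment of the following movie review "
                   "as positive or negative.",
    "input": "The film was a complete waste of time.",
    "output": "negative",
}
```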

This is precisely what enables instruction-tuned models to generalize more easily to new tasks. As a result, LLMs tuned in this way become much more versatile and useful.

With this method, LLMs become better both at following the instructions they receive and at generalizing across tasks. Performance on novel tasks is thus improved.

In addition, sample efficiency improves: only a minimal volume of training data is needed to match the performance of the best supervised models.

However, this approach requires the construction of instruction tuning datasets. Fortunately, there are several excellent datasets available, and we’ll now take a look at some of the most popular!

The best instruction tuning datasets

There are two main categories of instruction tuning datasets.

  1. In the first case, instructions are added to existing NLP tasks (illustrated in the sketch after this list).
  2. In the second, the data is used to condition a model to generate new instruction-input-output “tuples” (ordered sequences).
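As a rough sketch of the first category, existing task data can be wrapped in instruction templates; the templates and data below are invented for illustration:

```python
# Wrapping an existing NLP task (translation) in instruction templates.
templates = [
    "Translate the following sentence into German: {text}",
    "How would you say '{text}' in German?",
]

raw_pairs = [("The weather is nice today.", "Das Wetter ist heute schön.")]

# Each raw input-output pair yields one instruction example per template.
instruction_data = [
    {"instruction": template.format(text=source), "output": target}
    for source, target in raw_pairs
    for template in templates
]
```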

Natural Instructions (Swaroop Mishra, 2022)

This dataset gathers 193,000 instruction-output examples sourced from 61 existing English NLP tasks. The crowd-sourced instructions in each dataset are aligned with a common schema.

These instructions are more structured than in other datasets. However, the outputs are relatively short, making the data less useful for generating long-form content.

Natural Instructions v2 / Super-Natural Instructions (Yizhong Wang, 2022)

This crowd-sourced collection of instruction data is based on NLP tasks and simple synthetic tasks. It contains more than 5 million examples covering 76 task types in 55 languages.

Compared with the first version of the Natural Instructions dataset, the instructions have been greatly simplified. They consist of a task definition accompanied by positive and negative examples and explanations.
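A record following this schema might look like the following (a hypothetical reconstruction, not an actual entry from the dataset):

```python
# Hypothetical task in the Super-Natural Instructions style: a definition,
# positive/negative demonstrations with explanations, and task instances.
task = {
    "definition": "Given a sentence, answer 'yes' if it mentions a number, "
                  "otherwise answer 'no'.",
    "positive_examples": [
        {"input": "She bought 3 apples.", "output": "yes",
         "explanation": "The sentence contains the number 3."},
    ],
    "negative_examples": [
        {"input": "She bought apples.", "output": "yes",
         "explanation": "Wrong output: the sentence contains no number."},
    ],
    "instances": [
        {"input": "The train leaves at 9 am.", "output": "yes"},
    ],
}
```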

Unnatural Instructions (Or Honovich, 2023)

This automatically collected dataset contains 240,000 examples obtained by prompting InstructGPT (text-davinci-002) with three examples from Super-Natural Instructions.

Each example comprises an instruction, an input, and possible output constraints. For each trio of examples, the InstructGPT model is prompted to generate a new, fourth one.

The output is then generated separately, conditioned on the generated instruction, input and constraints. Finally, the generated instructions are paraphrased by prompting the model again.
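The pipeline can be sketched as follows; `complete()` is a hypothetical stand-in for a call to a completion model such as text-davinci-002, and the prompt wording is an assumption rather than the paper’s exact prompts:

```python
# Sketch of the Unnatural Instructions generation loop (simplified).
def complete(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM provider here

def render(example: dict) -> str:
    return (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Constraints: {example['constraints']}")

def generate_example(seed_examples: list) -> dict:
    # Step 1: show three demonstrations and let the model produce a fourth
    # instruction/input/constraints trio.
    demos = "\n\n".join(render(ex) for ex in seed_examples[:3])
    trio = complete(demos + "\n\nInstruction:")

    # Step 2: generate the output separately, conditioned on the trio.
    output = complete(trio + "\nOutput:")

    # Step 3: paraphrase the instruction to diversify the phrasing.
    paraphrase = complete("Rephrase the following instruction:\n" + trio)
    return {"trio": trio, "output": output, "paraphrase": paraphrase}
```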

Compared with Super-Natural Instructions, the Unnatural Instructions dataset covers a much wider range of tasks. While many of the examples reflect classic NLP tasks, other interesting examples are also included.

P3: Public Pool of Prompts (Victor Sanh, 2022)

This collection of prompts is crowd-sourced from 177 English NLP tasks. Around 11 different prompts are available per dataset on average.

This makes it possible to study the impact of different prompt formulations. Compared with the instructions in the datasets mentioned above, P3 prompts are often shorter and less elaborate.

Flan 2021 / Muffin (Jason Wei, 2022)

A set of prompts taken from 62 datasets of texts written in English, with 10 prompt templates for each task: this is what Flan 2021 offers.

For classification tasks, an OPTIONS suffix is appended to the input to indicate output constraints, as in the sketch below.
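For instance, a sentiment-classification input might be rendered like this (an approximate reconstruction; the exact wording varies by task):

```python
# Approximate Flan-style classification input with an OPTIONS suffix.
review = "An instant classic."
prompt = (
    "Is the sentiment of the following review positive or negative?\n"
    f"{review}\n"
    "OPTIONS:\n"
    "- positive\n"
    "- negative"
)
# The suffix constrains the model to choose among the listed labels.
```

The 2022 version, however, is far more comprehensive.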

Flan 2022 (Hyung Won Chung, 2022)

This dataset is a combination of Flan 2021, P3, Super-Natural Instructions, and other reasoning, dialog and program synthesis datasets.

The nine additional reasoning datasets are annotated with chains of thought (CoT). This makes Flan 2022 one of the most comprehensive instruction tuning datasets to date.
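A CoT-annotated record differs from a plain one in that the target includes the intermediate reasoning, not just the final answer; the example below is invented for illustration:

```python
# Hypothetical chain-of-thought (CoT) annotated example.
cot_example = {
    "instruction": "Answer the question, reasoning step by step.",
    "input": "Tom has 3 boxes of 12 pencils and gives 10 pencils away. "
             "How many pencils does he have left?",
    "output": "Tom starts with 3 * 12 = 36 pencils. After giving 10 away, "
              "he has 36 - 10 = 26 pencils. The answer is 26.",
}
```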

A new generation of datasets closer to the real world

Aside from the first-generation datasets mentioned earlier, mainly based on existing NLP tasks, a new wave of datasets has emerged to get closer to real-world use cases. Here are just a few examples.

The Alpaca Data dataset, launched by Rohan Taori and his associates in March 2023, contains 52,000 examples of English instructions. It was generated using OpenAI’s text-davinci-003 with self-instruct. Its creators modified the data generation pipeline to simplify it and reduce costs to under $500!
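Each Alpaca example is serialized with a fixed prompt template before training; the sketch below reproduces it approximately, with an invented example:

```python
# Approximate Alpaca prompt template (for examples that include an input).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Summarize the following paragraph in one sentence.",
    input="Instruction tuning adds explicit instructions to training data "
          "so that models generalize better to unseen tasks.",
)
```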

With Evol-Instruct, launched in April 2023, Can Xu and colleagues rewrote 250,000 instruction-response pairs based on Alpaca Data. Instructions were rewritten with ChatGPT to make them more complex, or to create new, more specialized instructions.

In a second step, ChatGPT was used to generate the corresponding responses. Low-quality instruction-response pairs were then filtered out using heuristics, and the whole process was repeated three times.
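Put together, the loop looks roughly like this; `chat()` is a hypothetical wrapper around a chat model such as ChatGPT, and the evolution prompt and quality filter are simplified assumptions:

```python
# Simplified sketch of the Evol-Instruct procedure described above.
def chat(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM provider here

def looks_low_quality(instruction: str, response: str) -> bool:
    # Placeholder heuristic: drop empty or degenerate pairs.
    return not response.strip() or instruction.strip() == response.strip()

def evolve(dataset: list, rounds: int = 3) -> list:
    for _ in range(rounds):  # the process is repeated three times
        evolved = []
        for pair in dataset:
            # Rewrite the instruction to make it more complex.
            new_instruction = chat(
                "Rewrite this instruction to make it more complex:\n"
                + pair["instruction"])
            # Generate the corresponding response.
            new_response = chat(new_instruction)
            if not looks_low_quality(new_instruction, new_response):
                evolved.append({"instruction": new_instruction,
                                "response": new_response})
        dataset = evolved
    return dataset
```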

Another example is Vicuna ShareGPT, dated March 2023. It brings together over 70,000 English conversations shared by users and scraped from the sharegpt.com website. Pre-processing involved converting HTML to markdown, filtering out low-quality samples, and splitting long conversations into shorter segments.
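A rough sketch of such a preprocessing pipeline is shown below; the `markdownify` library, the length thresholds and the splitting rule are assumptions, not the actual Vicuna code:

```python
# Sketch of ShareGPT-style preprocessing (assumed thresholds and tooling).
from markdownify import markdownify  # HTML -> Markdown conversion

MAX_TURNS = 10   # assumed maximum turns per training segment
MIN_CHARS = 100  # assumed minimum total length for a usable conversation

def preprocess(conversations):
    cleaned = []
    for conversation in conversations:  # each one is a list of message dicts
        # 1. Convert each HTML message to Markdown.
        conversation = [{**msg, "text": markdownify(msg["text"])}
                        for msg in conversation]
        # 2. Filter out low-quality (here: trivially short) samples.
        if sum(len(msg["text"]) for msg in conversation) < MIN_CHARS:
            continue
        # 3. Split long conversations into shorter segments.
        for i in range(0, len(conversation), MAX_TURNS):
            cleaned.append(conversation[i:i + MAX_TURNS])
    return cleaned
```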

Compared to the other datasets mentioned above, ShareGPT conversations are made up of multiple turns, and are therefore more useful for training a model to build on the context of a discussion.

Another example of a multi-turn dataset is Baize Data, launched in April 2023. It contains 54k and 57k English dialogues, with an average of 3.4 turns, generated with ChatGPT using questions from Quora and StackOverflow respectively.

In addition, 47k dialogues in the medical domain were generated using questions from the MedQuAD dataset, making Baize particularly useful in that field.

The databricks-dolly-15k dataset, from April 2023, contains 15k instruction-response examples written by Databricks employees. Both instructions and responses are human-generated, which contrasts with the use of ChatGPT or InstructGPT in the other datasets mentioned.

The examples cover 7 different use cases, such as open or closed Q&A, Wikipedia data extraction and synthesis, brainstorming, classification and creative writing.

While most datasets focus on the English language, OpenAssistant Conversations offers conversations in several languages generated by human annotators. Over 30% are in Spanish or other languages.

Finally, LIMA data, launched in May 2023, offers question-answer pairs from StackExchange, wikiHow and the Pushshift Reddit dataset. Training on this small, carefully curated dataset proves more effective than training on a much larger one such as Alpaca Data.

Key features of instruction data

In a study published in early 2023, Shayne Longpre and colleagues highlight several important aspects of training data.

Firstly, training with few-shot prompts mixed with zero-shot prompts massively improves performance in both configurations.

In addition, large language models benefit from the ever-increasing number and diversity of tasks. Another beneficial approach is data augmentation, notably through the inversion of inputs and outputs.

This may involve, for example, transforming a question-answering task into a question-generating task, as in the sketch below. Similarly, when combining several instruction tuning datasets, it is important to balance their mixture weights appropriately.
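The inversion idea can be illustrated with a pair of invented records:

```python
# Original question-answering example.
qa_example = {
    "instruction": "Answer the question using the passage.",
    "input": "Passage: Paris is the capital of France. "
             "Question: What is the capital of France?",
    "output": "Paris",
}

# Inverted version: the answer becomes part of the input, and the model
# must generate the question instead.
inverted_example = {
    "instruction": "Write a question about the passage whose answer is the "
                   "given text.",
    "input": "Passage: Paris is the capital of France. Answer: Paris",
    "output": "What is the capital of France?",
}
```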

Conclusion: instruction tuning, the key to making LLMs more versatile and generalist?

Thanks to the instructions added to the datasets, instruction tuning helps generalize the knowledge acquired by LLMs to new tasks. This could be the key to the emergence of artificial general intelligence (AGI), considered to be the ultimate goal of artificial intelligence…

According to a study by researcher Khai Loong Aw and colleagues, instruction tuning brings LLMs closer to the way the human brain processes language. Compared with fine-tuning, alignment with the brain increases by 6%.

To master this innovative fine-tuning method, turn to DataScientest! We offer several online training courses to help you become an expert in artificial intelligence.

With the Machine Learning Engineer course, you can learn to design, develop and deploy artificial intelligence solutions. This program combines skills in data science and machine learning.

You’ll learn about Python programming, DataViz tools, Data Engineering, DataOps techniques, as well as Machine Learning and Deep Learning.

At the end of this hands-on course, you’ll be awarded the “Project Manager in Artificial Intelligence” certification issued by the Collège de Paris, and receive a certificate from Mines ParisTech PSL Executive Education.

What’s more, you’ll be able to take the AWS Certified Cloud Practitioner certification exam, attesting to your mastery of the AWS cloud. In other words, this course gives you triple recognition!

The MLOps course teaches you how to put AI models into production and deploy them automatically. It covers programming on Linux, CI/CD, containerization with Docker and Kubernetes, and monitoring with Prometheus and Grafana.

Finally, our Prompt Engineering and Generative AI course will make you a master in the art of formulating prompts for ChatGPT, Canva and DALL-E.

In just two days, you’ll be able to harness generative AI to produce content that perfectly matches your expectations.

All our training courses are eligible for financing options, and can be completed entirely remotely on a full-time, part-time or intensive BootCamp basis. Find out more about DataScientest and its AI training courses!
