The Power Of Transfer Learning

Introduction

If you are familiar with machine learning, particularly deep learning, you might have heard the term transfer learning. What is transfer learning? In this blog, we are going to discuss

  1. Need for transfer learning
  2. What is transfer learning
  3. How to use transfer learning
  4. Benefits and limitations of transfer learning

Need for transfer learning

As the name suggests, transfer learning is the process of transferring the learning or knowledge acquired on one task to another task. How is this applicable in deep learning?

In machine learning, we need to train a model for every task, right? For training the model, a machine learning engineer will require

  1. The data for training and testing
  2. A model that can perform the task

For the model to perform the task, we train the model on our training dataset, evaluate it on a cross-validation dataset, and finally test its performance on the test data (a short code sketch of this split follows the list below). There are mainly two types of machine learning algorithms

  1. Classical machine learning algorithms like linear regression, decision trees, ensemble models, etc.
  2. Deep learning algorithms, which are built using neural networks
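
As a quick illustration of the train/validation/test workflow mentioned above, here is a minimal sketch using scikit-learn and synthetic data (the 60/20/20 split ratios are just one common, illustrative choice):

```python
# A minimal sketch of the train / validation / test split described above.
# The data here is synthetic and the 60/20/20 ratios are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)       # 1000 hypothetical data points, 20 features each
y = np.random.randint(0, 2, 1000)  # hypothetical binary labels

# First carve out the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```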

Classical machine learning models can work well with a few thousand training examples, because the models are not very complex and contain only a few tunable parameters. But if you consider a deep neural network, that is not the case. Consider a neural network as shown below

We can see that there are a lot of weights that need to be optimized. Since a deep neural network is often very deep and complex and requires optimizing a large number of weights, training such a network requires a very large amount of data, in some cases even millions of data points. This creates three main problems

  1. Data availability – with the help of various search engines, it is possible to obtain data nowadays. But imagine a case where millions of data points are required, along with the manual labour needed to inspect and clean them. Oftentimes it is difficult to obtain such a massive quantity of data.
  2. Time – training a deep learning model on millions of data points is very time consuming.
  3. Resources – training deep learning models often requires powerful hardware, frequently large numbers of GPUs.

Transfer learning can be considered a solution to all of the above problems. You can develop a very good deep learning model for your task using far less data, time, and resources.

What is transfer learning

We saw why we need transfer learning. But what exactly is transfer learning? How does it work? Let’s see.

Transfer learning is an approach where knowledge is transferred from one model to another. Consider an example of a simple image classification problem where we need to classify images into cats and dogs. For a task involving image data, our primary consideration will be a convolutional neural network (CNN) model, right? But as discussed above, these models require a lot of training data and resources to produce good results.

Now suppose we have another model, Model A, which is already trained on a huge amount of image data and can classify many categories of images. The weights of Model A are already optimized, so it is able to find patterns and shapes in any given image and classify it. What if we could utilize a part of this model for our task, i.e., to classify images of cats and dogs?

This process, where a model trained on one task is reused for another task, is known as transfer learning. In our example above, Model A is called a pre-trained model, because it has already been trained on a huge amount of data and can perform its task with fewer mistakes or errors. Now let us see how to use this method.

How to use transfer learning

  1. Selection of the source model

The first step in transfer learning is to select a pre-trained model. Many organizations and research institutions release models trained on large and challenging datasets. The model, or the model weights, can be downloaded directly from the source. But it is also important to decide which model to use. For example, for image classification tasks, there are various models available which are already trained on large datasets like ImageNet, CIFAR-10, etc. Similarly, if the task is related to text classification, models like BERT, Word2Vec, etc. can be used.
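
For instance, here is a minimal sketch of downloading one such pretrained model (VGG16 with ImageNet weights, via Keras; the choice of model is purely an example):

```python
# A minimal sketch: downloading a pretrained source model.
# VGG16 trained on ImageNet is used purely as an example;
# Keras downloads the weights automatically on first use.
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")  # full model, including the 1000-class ImageNet classifier
model.summary()                    # inspect the architecture and parameter counts
```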

  2. Re-using the model

The next step is to use the pretrained model for your task. The selection of this pretrained model is based on domain expertise or experience. Once a suitable source model is selected, there are multiple ways one can use it for their own task.

2.a) Direct use of pretrained model

You can directly use a pretrained model as it is, if your data matches the training data of the pretrained model. Consider the above image classification example. Suppose there is a pretrained model which was already trained on the same classification task. Then you can directly use that model on your data. Consider another example: if your task is to obtain embeddings of sentences, you can directly use a language model like BERT. BERT is trained on the whole of the English Wikipedia and the BookCorpus, and can be fine-tuned on downstream natural language processing tasks like question answering and sentence-pair classification. So if your data is similar, you can directly use the BERT model.
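
As a minimal sketch of this direct use, here is one way to obtain sentence embeddings from a pretrained BERT model with the Hugging Face transformers library (the model name and the mean-pooling strategy are illustrative choices, not the only ones):

```python
# A minimal sketch: using a pretrained BERT model as-is to embed a sentence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transfer learning saves data, time and resources.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings into a single sentence vector
# (one simple pooling strategy among several).
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```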

2.b) Using the bottleneck features of the pretrained model

Instead of using the pretrained model as it is, we can use it to extract features. For example, consider the same image classification problem. There are various CNN models which are pretrained on large amounts of data for a similar classification task. These models are pretrained on datasets like ImageNet, CIFAR-10, etc., and are trained to classify categories like humans, airplanes, fish, birds, and many more. We can use the trained layers of these models by removing the final softmax or classification layer. This approach lets us utilize all the trained weights, which generate meaningful features from an image, and also lets us set the dimensions of the last classification layer according to our task (in the above example, 2). For instance, if the pretrained network outputs 7 x 7 x 512 from the layer prior to the last fully-connected layer, we can flatten this output, which results in an N x 25,088 feature matrix (N is the number of data points; 7 × 7 × 512 = 25,088). On top of this we can add a simple classification model of our choice.
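
Here is a minimal sketch of this feature-extraction approach, again using VGG16 as an illustrative pretrained model (include_top=False removes the final classification layer, and a 224 x 224 input yields exactly the 7 x 7 x 512 output mentioned above):

```python
# A minimal sketch: a pretrained CNN as a frozen feature extractor,
# with a small classification head for the 2-class cats vs. dogs task.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all pretrained weights

model = models.Sequential([
    base,
    layers.Flatten(),                       # 7 * 7 * 512 = 25,088 features per image
    layers.Dense(256, activation="relu"),   # a simple classifier head of our choice
    layers.Dense(2, activation="softmax"),  # 2 output classes: cats and dogs
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```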

2.c) Fine-tuning the last few layers of the pretrained model

In the previous approach we trained only the final classification layer. We took all the layers up to the final layer and used them to extract the features of a given image. In other words, we froze all but the last classification layer of the pretrained model and trained the final classifier alone. There is a third option, where you freeze only some earlier layers of the pretrained model and train the last few layers. This is known as fine-tuning.

We always modify the later layers since it has been observed that the earlier layers in a network capture more generic features while later ones are very dataset-specific.

For example, a model that is trained to recognize cats can be fine-tuned and used to recognize dogs. We can always select the number of layers to train; that is considered a hyperparameter. The more data you have, the more layers you can fine-tune.
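
Here is a minimal sketch of fine-tuning, continuing the VGG16 example (unfreezing only the last convolutional block, whose layer names start with "block5" in Keras's VGG16, is an illustrative choice; a small learning rate helps avoid destroying the pretrained weights):

```python
# A minimal sketch: fine-tuning only the last few layers of a pretrained CNN.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the earlier layers, unfreeze only the final convolutional block.
# How many layers to unfreeze is a hyperparameter.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),  # cats vs. dogs
])
model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),  # small learning rate
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```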

Benefits and limitations of transfer learning

Although transfer learning may sound simple, it is a very powerful tool. Transfer learning helps engineers perform deep learning tasks with very little data and few resources. Using a suitable pre-trained model helps produce more accurate results and speeds up the training process. The figure below shows how transfer learning can help produce more accurate results

You can see that the performance of the model with transfer learning is higher and saturates more quickly, compared to the model trained from scratch.

Like any technology, transfer learning also has its own challenges and limitations. One of the biggest limitations of transfer learning is the problem of negative transfer, where knowledge carried over from the source task actually hurts performance on the target task.

Transfer learning only works if the initial and target problems are similar enough for the first round of training to be relevant. Developers can draw reasonable conclusions about what type of training counts as “similar enough” to the target, but the algorithm doesn’t have to agree.

Conclusions

In this blog I tried to cover a powerful technique called transfer learning. I covered the basic concepts: what transfer learning is, how to use this method to develop powerful models, and its benefits and limitations.

For any queries or suggestions, please feel free to contact me through LinkedIn. You can also check out my projects on GitHub.