Using Transfer Learning for NLP with Small Data

Manu Suryavansh · Published in Insight · 7 min read · May 7, 2019


Text classification has numerous applications, from tweet sentiment analysis to product reviews to toxic comment detection. It’s a popular project topic among Insight Fellows; however, a lot of time is spent collecting labeled datasets, cleaning data, and deciding which classification method to use. Services like Clarifai and Google AutoML have made it very easy to create image classification models with less labeled data, but it’s not as easy to create such models for text classification.

Transfer learning has simplified image classification tasks. This project applies the same techniques to text.

For image classification tasks, transfer learning has proven very effective at delivering good accuracy with smaller labeled datasets. Transfer learning is a technique that enables knowledge learned from one dataset to be transferred to another. I wanted to make transfer learning just as easy to use for text classification. Through this project, I was able to achieve 83% classification accuracy on the IMDB movie reviews dataset with only 500 labeled samples, whereas fastText requires 22,500 labeled samples to achieve similar accuracy. To learn more about how I achieved this, read on!

Current Methods

There are various methods available for creating text classifiers using libraries like NLTK and spaCy, and techniques like BOW (bag of words) or word embeddings. Below, I compare three methods — fastText, word embeddings, and language models — in terms of training time, ease of use, and performance with limited labeled data. For my project, I focused on improving the ease of use of language models and achieving high accuracy with small datasets.

fastText — The fastText library from Facebook provides very easy-to-use scripts for creating a text classification model, and it’s also very fast to train. However, its accuracy is low with small datasets.
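
fastText’s Python API makes this baseline only a few lines of code. Here is a minimal sketch (the file name and hyperparameters are illustrative, and the training file must use fastText’s `__label__` format):

```python
import fasttext

# Train a supervised classifier; train.txt holds one example per line,
# e.g. "__label__positive Great movie, loved the acting."
model = fasttext.train_supervised(input="train.txt", epoch=10,
                                  lr=0.5, wordNgrams=2)

# Predict the label (and its probability) for a new review
labels, probs = model.predict("The plot was predictable and the pacing slow.")
print(labels[0], probs[0])
```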

Word embeddings — There are a lot of examples of people using GloVe or Word2Vec embeddings for their dataset and then using an LSTM (long short-term memory) network to create a text classifier. However, one can often run into issues like out-of-vocabulary (OOV) words, and this approach is not as accurate with less labeled data.
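
A typical version of this approach looks something like the following Keras sketch (a minimal illustration with placeholder data; the GloVe file path and hyperparameters are assumptions, not this project’s exact setup):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 200

train_texts = ["great movie, loved it", "terrible plot, boring"]  # placeholder data
train_labels = np.array([1, 0])

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(train_texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)

# Build the embedding matrix from pre-trained GloVe vectors; any word
# missing from the GloVe file keeps a zero vector -- the OOV problem.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word, vector = values[0], np.asarray(values[1:], dtype="float32")
        index = tokenizer.word_index.get(word)
        if index is not None and index < VOCAB_SIZE:
            embedding_matrix[index] = vector

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, weights=[embedding_matrix],
              input_length=MAX_LEN, trainable=False),
    LSTM(128),
    Dense(1, activation="sigmoid"),  # binary positive/negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, train_labels, epochs=3, batch_size=32)
```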

Language models — Language models like BERT (Bidirectional Encoder Representations from Transformers), ULMFiT (Universal Language Model Fine-tuning), GPT, and GPT-2 have shown that information learned from one dataset can be transferred to other datasets for specific tasks.

Transfer Learning Platform

My goal was to create an easy-to-use API (application programming interface) for creating text classification models with less labeled data.

The BERT language model is fine-tuned for a specific dataset

Model

For this project, I used the BERT language model released by Google. At the time of its release, BERT had state-of-the-art results on various natural language processing (NLP) tasks on the GLUE benchmark. I used the BERT-base uncased model weights because the BERT-large weights are too big to fine-tune on a single GPU and currently require a TPU (tensor processing unit). Based on the example provided in the BERT GitHub repository, a binary classifier is created for any dataset using the train API. Here is another great blog post on BERT by a former Insight Fellow. BERT is a deep bidirectional model built on the Transformer architecture. More details about the model can be found in the official GitHub repo and the arXiv paper.
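
Concretely, fine-tuning follows the pattern of the repo’s run_classifier.py script. Here is a sketch of how such a run might be launched from Python (the paths are illustrative; the flags are the standard ones from the BERT README, reusing the single-sentence CoLA task format for binary classification):

```python
import subprocess

BERT_DIR = "uncased_L-12_H-768_A-12"  # BERT-base uncased checkpoint directory

# Fine-tune BERT-base as a binary classifier; the flag names come from
# run_classifier.py in the google-research/bert repository.
subprocess.run([
    "python", "bert/run_classifier.py",
    "--task_name=cola",              # single-sentence binary task format
    "--do_train=true", "--do_eval=true",
    "--data_dir=./data",             # train.tsv / dev.tsv in the expected format
    f"--vocab_file={BERT_DIR}/vocab.txt",
    f"--bert_config_file={BERT_DIR}/bert_config.json",
    f"--init_checkpoint={BERT_DIR}/bert_model.ckpt",
    "--max_seq_length=128",
    "--train_batch_size=32",
    "--learning_rate=2e-5",
    "--num_train_epochs=3.0",
    "--output_dir=./output",
], check=True)
```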

API

This platform provides easy-to-use APIs for creating a text classification model and then doing batch inference with the trained model. The Flask micro web framework was used to create the APIs. Flask is an easy-to-use web framework written in Python and is very popular for building simple web applications and APIs.

Train — The train API takes an S3 bucket name as input and downloads the training data from that bucket to the local server on which training is done. The training job is run in the background using Celery. Celery is a Python-based framework for running multiple tasks in parallel in the background and can be thought of as a job scheduler.
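
A minimal sketch of what the train endpoint could look like (the endpoint path, task body, and S3 key are illustrative assumptions; a Celery worker must also be running against the same broker):

```python
import subprocess

import boto3
from celery import Celery
from flask import Flask, jsonify, request

app = Flask(__name__)
celery_app = Celery("tasks", broker="redis://redis:6379/0")

@celery_app.task
def run_training(bucket_name):
    # Download the training data from S3 to local disk, then kick off
    # BERT fine-tuning (see the run_classifier sketch above).
    s3 = boto3.client("s3")
    s3.download_file(bucket_name, "train.tsv", "./data/train.tsv")
    subprocess.run(["python", "bert/run_classifier.py",
                    "--do_train=true", "--data_dir=./data",
                    "--output_dir=./output"], check=True)  # plus the flags shown earlier

@app.route("/train", methods=["POST"])
def train():
    bucket_name = request.json["bucket"]
    task = run_training.delay(bucket_name)  # runs in the background via Celery
    return jsonify({"task_id": task.id})
```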

Batch Inference — This API also takes an S3 bucket name as input and then performs inference on all inputs in the test file.
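
Calling the APIs then looks something like this (the endpoint names and JSON payloads follow the sketch above, with the batch-inference endpoint assumed to mirror the train endpoint):

```python
import requests

# Kick off training on data stored in an S3 bucket
resp = requests.post("http://localhost:5000/train",
                     json={"bucket": "my-training-bucket"})
print(resp.json())  # e.g. {"task_id": "..."}

# Run batch inference on the test file in another bucket
resp = requests.post("http://localhost:5000/batch_inference",
                     json={"bucket": "my-test-bucket"})
```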

Example of a training API call using Postman

Architecture

Four Docker containers were used to run four different services: the REST API, the Redis message broker, the Celery task manager, and an inference server for text classification. The user interacts with the platform through the APIs, which start training or batch inference in the background using Celery. Docker Compose was used to start and configure the four containers. To support GPU training, nvidia-docker2 was used. The platform was tested on an Amazon p2.xlarge instance with an NVIDIA K80 GPU with 12 GB of memory.
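
A sketch of what the docker-compose.yml could look like (service names and build paths are illustrative; the 2.3 file-format version is one that supports the runtime key, matching the note in the Challenges section below):

```yaml
version: '2.3'

services:
  api:                # Flask REST API
    build: ./api
    ports:
      - "5000:5000"
    depends_on:
      - redis
  redis:              # message broker for Celery
    image: redis:alpine
  worker:             # Celery worker that runs training jobs
    build: ./worker
    runtime: nvidia   # requires nvidia-docker2 on the host
    depends_on:
      - redis
  inference:          # serves the trained classifier
    build: ./inference
    runtime: nvidia
```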

Platform Architecture

Results

I used the IMDB reviews dataset to evaluate the performance of this platform compared to fastText. The dataset contains 25,000 reviews for training and 25,000 for testing. The reviews are labeled as positive or negative. The figure below shows that, by using the transfer learning platform, classification accuracy of 83% can be achieved with only 500 samples.

Initially, I started with 22,500 labeled samples and used them to create classifiers with both fastText and this platform. As shown below, the transfer learning model provided a 6% improvement in accuracy. I then kept decreasing the number of labeled samples, trained different models, and evaluated their performance on the large held-out test set to get the results shown below. At only 5,000 labeled samples, the transfer learning model provided an over 34% improvement in accuracy over fastText while maintaining 86% accuracy. Because the API is easy to use, users can train models with different amounts of data and quickly iterate on experiments. The GitHub repo for the project can be found here.

Comparison of accuracy results

Challenges

Celery and GPU Memory — I encountered an issue where GPU memory was not being released after training finished. This happens because TensorFlow only releases GPU memory when the process exits and, since I was starting training from Celery, the worker process stayed alive and held the memory. To solve this, I limited the maximum number of tasks per worker child process, so the memory was released after each training run completed (more details here).
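
In Celery this corresponds to the worker_max_tasks_per_child setting (or the --max-tasks-per-child worker flag); setting it to 1 gives each training task a fresh process, so TensorFlow’s GPU memory is freed when that process exits:

```python
from celery import Celery

celery_app = Celery("tasks", broker="redis://redis:6379/0")
# Recycle each worker child process after a single task so that
# TensorFlow releases its GPU memory when the process exits.
celery_app.conf.worker_max_tasks_per_child = 1
```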

TensorFlow and CUDA Versions — One of the biggest challenges I faced was getting TensorFlow, Docker, and the GPU to work together. To use the GPU, you need TensorFlow-GPU and the NVIDIA CUDA libraries, and different versions of TensorFlow require specific versions of CUDA (TensorFlow 1.12 needs CUDA 9). Additionally, to access the GPU from inside a Docker container, the NVIDIA Docker runtime is required. I also found that only specific versions of the Docker Compose file format (in my case, v2.3) support the NVIDIA runtime.

Future Improvements

For this project, I used the BERT-base model, but in the future I would like to use the BERT-large model, which will hopefully perform even better. I would also like to try some of the training speed improvements shown by NVIDIA using mixed-precision training and see how they affect accuracy. The project could also be integrated with Kubeflow to make it easy to scale and deploy.

Want to learn applied Artificial Intelligence from top professionals in Silicon Valley, New York, or Toronto? Learn more about the Artificial Intelligence program at Insight or apply today!

Are you a company working in AI that would like to get involved in the Insight AI Fellows Program? Feel free to get in touch.

Acknowledgments

This work was done as a project for the Insight AI Fellows Program over three weeks. I want to thank my program directors Matt Rubashkin and Amber Roberts for their helpful guidance, and my technical advisor Emmanuel Ameisen for his feedback and suggestions on the project. Special thanks to Matt Rubashkin and Holly Szafarek for their help with this blog post.

Manu Suryavansh has nine years of data engineering and systems engineering experience at Cypress Semiconductor and Texas Instruments. As an Insight Fellow, he built Sherlock-NLP, an easy-to-use platform and API for training new models on top of the BERT pre-trained language model, enabling the creation of custom classification models with scarce labeled data. He is passionate about building production machine learning systems to solve challenging real-world problems. He is actively looking for ML/AI engineer positions in the tech industry and can be reached here.
