
Multi-label Text Classification with BERT using Pytorch

Joe Khaung
6 min read · Mar 31, 2021


Photo by Benjamin Ashton on Unsplash

Introduction

Natural Language Processing (NLP) is one of the most popular areas of AI for turning unstructured text into meaningful knowledge for business use cases. NLP addresses business problems such as classification, topic modelling, text generation, question answering, and recommendation. While TF/IDF vectorization and more advanced word embeddings such as GloVe and Word2Vec have shown good performance on many of these problems, they share a limitation: each word is encoded with a single vector, regardless of how its meaning changes with context. As a result, these models may not perform well on problems that require understanding a user's intent, for example an automated chatbot that must interpret the intent behind a user's query and respond accurately.

Another such case in NLP is decoding the contextual meaning of a word from the two sentences below.

1. A thief robbed a bank.

2. He went to the river bank.

A human can easily tell that “bank” has two different meanings in these two sentences; however, a machine using the word embeddings mentioned above cannot differentiate them, because those embeddings assign the same vector to “bank” irrespective of its contextual meaning. To overcome this challenge, Google developed the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model.
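To make the limitation concrete, here is a minimal sketch (my own illustration, not code from this article) using the Hugging Face transformers library: it pulls the vector that bert-base-uncased produces for “bank” in each sentence and shows that the two representations differ, which a static embedding like Word2Vec or GloVe could never do.

```python
# Sketch: compare BERT's contextual vectors for "bank" in the two sentences.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["A thief robbed a bank.", "He went to the river bank."]

bank_vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)  # last_hidden_state: (1, seq_len, 768)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        bank_idx = tokens.index("bank")  # position of "bank" in this sentence
        bank_vectors.append(outputs.last_hidden_state[0, bank_idx])

# A static embedding would give cosine similarity 1.0; BERT's contextual
# vectors for "bank" come out noticeably different.
similarity = torch.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```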

What is BERT?

BERT is a pre-trained model trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words). In BERT, “bank” receives a different vector representation in each of the two sentences above, reflecting the difference in context. Building a model on top of BERT does not add significant training time while maintaining high performance on NLP tasks, and you can extract new language features from BERT to use in your model’s predictions. It also allows much quicker development compared to training deep learning models such as RNNs, LSTMs and CNNs from scratch. Refer to this great article on how BERT works in detail. At a high level, BERT comes in two variants: BERT base and BERT large. The first has 12 transformer blocks with 12 attention heads and 110 million parameters; the latter has 24 transformer blocks, 16 attention heads and 340 million parameters. It does two tasks…
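As a quick sanity check on those numbers, the short sketch below (my own addition, not part of the original article) loads both variants through the Hugging Face transformers library and prints their layer count, attention heads, hidden size and approximate parameter count.

```python
# Sketch: inspect BERT base vs BERT large configurations and sizes.
from transformers import BertConfig, BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{name}: {config.num_hidden_layers} transformer blocks, "
        f"{config.num_attention_heads} attention heads, "
        f"hidden size {config.hidden_size}, ~{n_params / 1e6:.0f}M parameters"
    )
```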
