The Beginner’s Guide to BERT: Google’s Robust NLP Algorithm

Imagine this…

There you are, happily working away on a seriously cool data science project designed to recognize regional dialects, for instance. You’ve been plugging away, working on some advanced methods, making progress.

Then suddenly, almost out of nowhere comes along a brand new framework that’s going to revolutionize your field and really improve your model.

This is the reality of working in AI these days.

The world of AI progresses rapidly. 

In fact, the global AI market is expected to reach $190 billion by 2025 according to market research.

Recent years have seen AI begin to play a greater role in our everyday lives, mostly behind the scenes. One visible area of AI that has benefited from progress in the field of Deep Learning is NLP (Natural Language Processing). 

  • NLP is a field of AI that today relies heavily on Deep Learning.
  • Deep Learning is a subset of Machine Learning.
  • Machine Learning is a branch of AI.

An example of NLP at work is predictive typing, which suggests phrases based on language patterns that have been learned by the AI. Users of Google’s Gmail will be familiar with this feature.


On the subject of Google, researchers at Google AI Language developed a game-changing deep learning NLP algorithm called BERT, released in 2018. More on that later on.

This guide is an in-depth exploration of NLP, Deep Learning Algorithms and BERT for beginners. First, we’ll cover what is meant by NLP, the practical applications of it, and recent developments. We’ll then explore the revolutionary language model BERT, how it has developed, and finally, what the future holds for NLP and Deep Learning.


  1. What is NLP?
  2. The Challenging Aspects of NLP for Deep Learning
  3. Recent NLP Developments
  4. What is BERT?
  5. What is Transformer?
  6. 2019 – The Year of BERT
  7. What’s Next? Post-BERT.

What is NLP and How Does it Work?

NLP stands for Natural Language Processing, and the clue is in the title. 

“Natural language” refers to the kind of typical conversational or informal language that we use every day, verbally or written. Natural language conveys a lot of information, not just in the words we use, but also the tone, context, chosen topic and phrasing. 

We use our innate human intelligence to process the information being communicated, and we can infer meaning from it and often even predict what people are saying, or trying to say, before they’ve said it. 

NLP began in the 1950s with a rule-based or heuristic approach that set out a system of grammatical and language rules. This was a limited approach, as it didn’t allow for any nuance of language, such as the evolution of new words and phrases or the use of informal phrasing and words. 

Everything changed in the 1980s, when a statistical approach was developed for NLP. The aim of the statistical approach is to mimic human-like processing of natural language. This is achieved by analyzing large chunks of conversational data and applying machine learning to create flexible language models. That is how machine learning entered natural language processing. 

Another breakthrough for NLP happened in 2006, when it was shown that a multi-layered neural network could be pre-trained a layer at a time. This was a game-changer that opened the door to NLP deep learning algorithms. 

Over the past decade, the development of deep learning algorithms has enabled NLP systems to organize and analyze large amounts of unstructured data such as conversational snippets, internet posts, tweets, etc., and apply a cognitive approach to interpreting it all. This allows for a greater AI-understanding of conversational nuance such as irony, sarcasm and sentiment.

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that uses deep learning algorithms to read, process and interpret cognitive meaning from human languages.

Interest is high in NLP, as there are dozens of applications and areas for potential development. Here are just a few applications of NLP:

  • Sentiment Analysis – For example, social media comments about a certain product or brand can be analyzed using NLP to determine how customers feel, and what influences their choices and decisions.
  • Cognitive Assistance – Virtual assistants, advanced chatbots, etc. can be enhanced by predicting your search intention or interpreting queries more accurately. 
  • Voice-driven Interfaces – Amazon’s Alexa and Apple’s Siri are examples of AI systems that use NLP to interpret voice prompts and return more relevant responses.
  • Filters – Email spam filters have long used NLP to improve their accuracy. The NLP group at MIT is developing fake news filters to spot politically biased reporting. 

The Challenging Aspects of NLP for Deep Learning

The main challenge of NLP for deep learning is the level of complexity. Deep learning techniques are designed to deal with complex systems and data sets, but natural language sits at the outer reaches of complexity. Human speech is often imprecise and ambiguous, and contains many variables such as dialect, slang and colloquialisms. In other words, it is made up of large amounts of unstructured data. 

Deep learning uses neural networks to process and analyze data. A basic neural network is known as an Artificial Neural Network (ANN) and is configured for a specific use, such as recognizing patterns or classifying data, through a learning process. 

Fig 1: Basic Structure of an ANN  
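
To make the idea of a basic ANN a little more concrete, here is a minimal sketch of a forward pass through a tiny network in plain Python. The layer sizes and the weights are made-up values chosen purely for illustration; in a real network they would be learned from data.

```python
import math

def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases):
    """One fully connected layer: each neuron computes sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs):
    """Two inputs -> two hidden neurons -> one output neuron."""
    # Hypothetical hand-picked weights; training would normally set these.
    hidden = dense_layer(inputs, [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])
    output = dense_layer(hidden, [[1.0, -1.0]], [0.0])
    return output[0]

print(forward([1.0, 0.0]))
```

The output is always a single number between 0 and 1, which could be read as a score for one class in a simple classification task.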

For the purpose of building NLP systems, basic ANNs are too simplistic and inflexible. They cannot cope with the high complexity of the task or with the sheer amount of incoming, often conflicting, data.

Convolutional Neural Networks (CNNs), originally developed for image recognition, have been successfully adapted to NLP. They are similar to ANNs in some respects, as they have neurons that learn through weights and biases. The difference is that CNNs pass the input through layers of small sliding filters, an operation known as convolution. Each filter scans the input for a particular pattern, and the results are then condensed by “pooling” layers.

Each filter picks out specific features. In the case of NLP deep learning, this could be certain words, phrases, context, tone, etc. Pooling the data in this way allows only the most relevant information to pass through to the output, in effect simplifying the complex data to the same output dimension as an ANN. 
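
As a rough illustration of filtering and pooling, the sketch below slides two small filters over a sentence and keeps only the strongest response of each. The per-word scores and the filter values are invented for this example; a real CNN works on learned word vectors and learns its filters during training.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding) and return the dot products."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]

def max_pool(feature_map):
    """Keep only the strongest response of a filter."""
    return max(feature_map)

# Hypothetical per-word scores for "the film was not good".
word_scores = [0.0, 0.1, 0.0, -1.0, 1.0]

# Hypothetical filters, each looking at two neighboring words.
filters = {
    "negative_then_positive": [-1.0, 1.0],
    "two_positive_words": [1.0, 1.0],
}

pooled = {
    name: max_pool(conv1d(word_scores, kernel))
    for name, kernel in filters.items()
}
print(pooled)
```

However long the sentence is, pooling leaves exactly one number per filter, which is what allows the result to be passed on to a fixed-size output layer.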

CNNs can be combined with RNNs (Recurrent Neural Networks), which are designed to process sequential information, and with bi-directional RNNs, to successfully capture and analyze NLP data.

Recent NLP Developments

Applying deep learning principles and techniques to NLP has been a game-changer. In recent years there have been several breakthroughs.

Here is a brief breakdown of the developments in chronological order:

  • Word embedding – Also known as distributional vectors, these represent words as vectors so that words appearing in similar contexts, with similar meanings, end up close together. Shallow neural networks are used to predict a word based on the context. In 2013, the Word2vec model was created to compute the conditional probability of a word being used, given the context.
  • Convolutional Neural Networks (CNN) – A major breakthrough in NLP (described in the previous section).
  • Recurrent Neural Networks (RNN) – Described in the previous section.
  • Recursive Neural Networks – Networks that apply the same weights recursively over a hierarchical structure, such as the parse tree of a sentence.
  • Reinforcement Learning – Algorithmic learning method that uses rewards to train agents to perform actions.
  • Unsupervised Learning – Involves mapping sentences to vectors without supervision.
  • Deep Generative Models – Models such as Variational Autoencoders (VAEs) that generate natural sentences from a latent code.
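
To give a feel for the word embeddings at the top of this list, here is a small sketch that compares word vectors using cosine similarity. The three-dimensional vectors are invented for illustration; real Word2vec vectors are learned from text and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With these made-up vectors, "king" comes out much closer to "queen" than to "apple", which is the kind of relationship trained embeddings are meant to capture.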

The amazing thing is that all of these developments (and more) have occurred within the last 7 years, and most of them within the last 3 years. This really is the golden age of NLP and everything so far has been leading up to the revolutionary birth of BERT.

What is BERT?

The BERT algorithm is arguably the most significant breakthrough in NLP since the field’s inception. 

But what is it? And why is it such a big deal?

Let’s start at the beginning. BERT stands for Bidirectional Encoder Representations from Transformers. Still none the wiser?

Let’s simplify it.

BERT is a deep learning framework, developed by Google, that can be applied to NLP. 

Bidirectional (B)

This means that the NLP BERT framework learns information from both the right and left side of a word (or token in NLP parlance). This makes it more efficient at understanding context. 

For example, consider these two sentences:

Jimmy sat down in an armchair to read his favorite magazine.

Jimmy took a magazine and loaded it into his assault rifle. 

Same word – two meanings, also known as a homonym. As BERT is bidirectional it will interpret both the left-hand and right-hand context of these two sentences. This allows the framework to more accurately predict the token given the context or vice-versa.
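
The toy sketch below shows why the right-hand context matters for the second sentence. It is not how BERT works internally: the cue words and the counting rule are assumptions made up for this example, whereas BERT learns its own notion of context during pre-training.

```python
def split_context(tokens, index):
    """Return the tokens to the left and to the right of position `index`."""
    return tokens[:index], tokens[index + 1:]

# Hypothetical cue words for each sense of "magazine".
SENSE_CUES = {
    "publication": {"read", "armchair", "favorite"},
    "ammunition": {"loaded", "rifle", "assault"},
}

def guess_sense(tokens, index, use_right_context=True):
    """Pick the sense whose cue words overlap most with the visible context."""
    left, right = split_context(tokens, index)
    context = set(left) | (set(right) if use_right_context else set())
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

sentence = "jimmy took a magazine and loaded it into his assault rifle".split()
position = sentence.index("magazine")

print(guess_sense(sentence, position, use_right_context=False))
print(guess_sense(sentence, position, use_right_context=True))
```

Reading only the words to the left ("jimmy took a"), there is nothing to go on. Once the words to the right are included, the meaning becomes clear.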

Encoder Representations (ER)

This refers to an encoder, which is a program or algorithm used to learn a representation from a set of data. In BERT’s case, the set of data is vast, drawing from both English Wikipedia (2,500 million words) and the BooksCorpus (800 million words). 

The vast number of words used in the pretraining phase means that BERT has developed an intricate understanding of how language works, making it a highly useful tool in NLP.

Transformer (T)

This means that BERT is based on the Transformer architecture. We’ll discuss this in more detail in the next section.

Why is BERT so revolutionary?

Not only is it a framework that has been pre-trained on one of the largest text data sets used at the time of its release, it is also remarkably easy to adapt to different NLP applications by adding additional output layers. This allows users to create sophisticated and precise models to carry out a wide variety of NLP tasks.

BERT continues the work started by word embedding models such as Word2vec and generative models, but takes a different approach.

There are 2 main steps involved in the BERT approach:

1. Create a language model by pre-training it on a very large text data set.

2. Fine-tune this large pre-trained model on a much smaller data set for a specific NLP application. This allows users to benefit from the vast knowledge the model has accumulated, without needing the excessive computing power that pre-training from scratch requires.
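
The two steps can be sketched in miniature as follows. This is only a toy under loud assumptions: `pretrained_encoder` is a hypothetical stand-in for a pre-trained model (it simply counts hand-picked words), and only the new output layer is trained here, whereas fine-tuning BERT usually updates the pre-trained weights as well.

```python
import math

def pretrained_encoder(text):
    """Hypothetical stand-in for a frozen pre-trained model: text -> feature vector."""
    words = text.lower().split()
    positive = {"great", "love", "excellent"}
    negative = {"bad", "hate", "awful"}
    return [
        sum(w in positive for w in words),
        sum(w in negative for w in words),
    ]

def train_output_layer(examples, epochs=200, lr=0.5):
    """Learn only the weights of a new logistic output layer on labeled examples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = pretrained_encoder(text)
            z = sum(w * f for w, f in zip(weights, features)) + bias
            prediction = 1.0 / (1.0 + math.exp(-z))
            error = prediction - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(text, weights, bias):
    """Return 1 (positive) or 0 (negative) for a piece of text."""
    features = pretrained_encoder(text)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if z > 0 else 0

# A tiny, made-up labeled data set for the specific task (sentiment).
training_examples = [
    ("i love this great phone", 1),
    ("excellent camera", 1),
    ("awful battery", 0),
    ("i hate this bad screen", 0),
]

weights, bias = train_output_layer(training_examples)
print(predict("great service", weights, bias))
print(predict("awful service", weights, bias))
```

The point of the design is that the expensive part (the encoder) is reused as-is, while the task-specific part is small enough to train on a handful of labeled examples.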

We’ve only scratched the surface of what BERT is and what it does. If you really want to master the BERT framework for creating NLP models check out our course Learn BERT – most powerful NLP algorithm by Google.

What is Transformer?

To put it simply, Transformer is a deep learning model, introduced in 2017, that was designed for NLP.

Transformer performs a similar job to an RNN, i.e. it processes ordered sequences of data, applies an algorithm, and returns a series of outputs. Unlike RNNs, the Transformer model doesn’t have to analyze the sequence in order. Therefore, when it comes to natural language, the Transformer model can begin by processing any part of a sentence, not necessarily reading it from beginning to end.

The unordered nature of Transformer’s processing means it is more suited to parallelization (performing multiple processes simultaneously). For this reason, since the introduction of the Transformer model, the amount of data that can be used during the training of NLP systems has rocketed. 
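
At the heart of the Transformer is an operation called self-attention, in which every token looks at every other token in the sentence at once. Below is a minimal sketch of scaled dot-product self-attention in plain Python. It is heavily simplified: a real Transformer first applies learned projections to obtain separate queries, keys and values, uses several attention heads, and adds positional information, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    # Each pass of this loop is independent of the others,
    # so all positions could be computed in parallel.
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return outputs

# Three made-up two-dimensional token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(tokens):
    print(row)
```

Because no position has to wait for the result of the previous one, the work can be spread across many processors, which is not possible with the step-by-step loop of an RNN.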

Now that large amounts of data can be used in the training of NLP, a new type of NLP system has arisen, known as pretrained systems. BERT is an example of a pretrained system, in which the text of English Wikipedia and the BooksCorpus has been processed and analyzed.

2019 – The Year of BERT Algorithm

2019 was arguably the year that BERT really came of age. We witnessed BERT being applied to many different NLP tasks. The power of a pre-trained NLP system that can be fine-tuned to perform almost any NLP task has increased the development speed of new applications. 

Here are some of the highlights:

  • Transfer-learning in NLP – BERT has made it possible to get high quality results on tasks ranging from word-level (token-level) to sentence-level, with little modification needed; the original paper reported state-of-the-art results on 11 NLP tasks. Not only is this great news for people working on projects involving NLP tasks, it is also changing the way we present language for computers to process. We now understand how to represent language in such a way that allows models to solve challenging and advanced problems.
  • Breaking new ground in AI and data science – In 2019, more than 150 new academic papers were published related to BERT, and over 3000 cited the original BERT paper. 
  • New applications for BERT – Research and development has commenced into using BERT for sentiment analysis, recommendation systems, text summary, and document retrieval.
  • Compressed BERT models – In the second half of 2019 some compressed versions arrived, such as DistilBERT, TinyBERT and ALBERT. DistilBERT, for example, has about 40% fewer parameters than BERT but, according to its authors, retains about 97% of its language understanding performance, making it ideal for those with limited computational power.

What’s Next? Post-BERT.

There’s no doubt that the BERT algorithm has been revolutionary in terms of progressing the science of NLP, but it is by no means the last word. 

In fact, less than a year after BERT was released, researchers from Carnegie Mellon University and Google Brain published the XLNet paper, describing a model that outperformed BERT on a range of NLP benchmarks. XLNet achieved this by using “permutation language modeling”, which predicts a token given some of the context, but rather than predicting the tokens in a fixed left-to-right sequence, it predicts them in a randomly chosen order. This means that each token can be predicted from context on both sides, as that context is built up by the other tokens around it. 
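
The idea behind permutation language modeling can be sketched as follows: pick a random prediction order, then predict each token using only the tokens that came earlier in that order. This is a simplification for illustration only. The function name and the seed parameter are made up for this sketch, and the real model keeps the sentence in its original order and implements the idea through attention masks.

```python
import random

def permutation_contexts(tokens, seed=0):
    """Sample one prediction order and list the context visible at each step."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)

    steps = []
    seen = []
    for position in order:
        # Context: tokens already predicted, keyed by their original position.
        context = {p: tokens[p] for p in sorted(seen)}
        steps.append((position, tokens[position], context))
        seen.append(position)
    return order, steps

tokens = "new york is a city".split()
order, steps = permutation_contexts(tokens, seed=1)
for position, target, context in steps:
    print(f"predict position {position} ({target!r}) given {context}")
```

Depending on the order that is sampled, a token may be predicted from words to its left, to its right, or both, which is how the model gets to see context from both directions over the course of training.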

ERNIE, also released in 2019, continued the Sesame Street naming theme: ELMo (Embeddings from Language Models), BERT, ERNIE (Enhanced Representation through kNowledge IntEgration). ERNIE draws on more information from the web to pretrain the model, including encyclopedias, social media, news outlets, forums, etc. This allows it to find even more context when predicting tokens, which speeds the process up further still.

The recent XLNet model, the optimized RoBERTa, and compact models such as ALBERT are among the few that beat the original BERT in terms of performance. In a recent machine performance test of SAT-like reading comprehension, ALBERT scored 89.4%, ahead of BERT at 72%. 


BERT still remains the NLP algorithm of choice, simply because it is so powerful, has such a large library, and can be easily fine-tuned to almost any NLP task. Also, as it is the first of its kind, there is much more support available for BERT compared to the newer algorithms. 

While the NLP space is progressing rapidly and recently released models and algorithms demonstrate computing-efficiency improvements, BERT is still your best bet. 

BERT has a wide range of applications, and while we’ve covered quite a lot of information in this guide, we haven’t even gone into the practical side of using BERT and NLP algorithms!

To discover all the potential and power of BERT and get hands-on experience in building NLP applications, head over to our comprehensive BERT and NLP algorithm course.

