Demystifying Named Entity Recognition - Part II
As a continuation of Demystifying Named Entity Recognition - Part I, in this post I’ll discuss popular models in the field and try to cover:
- popular traditional models
- deep learning models
- python libraries
Over the history of NER, there have been three major approaches: grammar-based, dictionary-based and machine-learning-based. The grammar-based approach relies on a set of empirical rules hand-crafted by experienced computational linguists, which usually takes months of work. The dictionary-based approach basically organizes all the known entities into a lookup table, which can be used to detect whether a candidate belongs to a defined category or not; by design it doesn’t work well with newly invented entities. The machine-learning-based approach typically needs annotated data, but it doesn’t rely on domain experts to hand-craft rules and it doesn’t necessarily fail on unseen entities.
This post focuses only on machine-learning based models.
1. Popular traditional models
The traditional models we’ll discuss here are MEMM and CRF. They were widely used before deep learning models entered the scene.
1.1 MEMM
We’ve covered the details of MEMM in the previous post. The key idea of the MEMM approach is to model the conditional probability of the tag sequence $y = (y_1, \dots, y_n)$ for a given sentence $x = (x_1, \dots, x_n)$ with the Markov assumption:

$$P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}, x)$$

We then model each local factor in a log-linear form:

$$P(y_i \mid y_{i-1}, x) = \frac{\exp\big(w \cdot f(y_i, y_{i-1}, x, i)\big)}{\sum_{y'} \exp\big(w \cdot f(y', y_{i-1}, x, i)\big)}$$

where $f$ is a feature function and $w$ is the weight vector to be learned.
In inference, we use the Viterbi algorithm to get the best-fitting tag sequence for a given sentence. Details can be found in section 2.2.1 of the previous post.
In training, we use maximum likelihood estimation to get the optimal weights $w^*$:

$$w^* = \arg\max_{w} \sum_{(x, y) \in D} \log P(y \mid x; w)$$

where $D$ is the set of annotated training sentences.
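To make the local factor concrete, here is a minimal sketch in plain Python/NumPy, assuming a user-supplied feature function `f` and weight vector `w` (both hypothetical placeholders, not from any particular library):

```python
import numpy as np

TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]  # hypothetical tag set

def memm_local_prob(w, f, prev_tag, sentence, i):
    """P(y_i | y_{i-1}, x): a softmax over candidate tags at position i.

    w        -- weight vector (np.ndarray)
    f        -- feature function f(tag, prev_tag, sentence, i) -> np.ndarray
    prev_tag -- the tag at position i-1
    """
    scores = np.array([w @ f(tag, prev_tag, sentence, i) for tag in TAGS])
    scores -= scores.max()                     # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return dict(zip(TAGS, probs))
```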
1.2 CRF
Instead of normalizing each local factor separately as MEMM does, CRF models the conditional probability of the whole tag sequence directly, with a single global normalization:

$$P(y \mid x) = \frac{\exp\big(\text{score}(x, y)\big)}{\sum_{y'} \exp\big(\text{score}(x, y')\big)}$$

The main challenge in direct modeling is that the denominator is a sum of $K^n$ terms ($K$ possible tags per position, sentence length $n$), which is far too many to enumerate.
1.2.1 Inference
During inference, we are only interested in the most likely tag sequence

$$y^* = \arg\max_{y} P(y \mid x) = \arg\max_{y} \text{score}(x, y)$$

so the troublesome denominator can be ignored entirely: it does not depend on $y$.

If using brute force, we have to evaluate $\text{score}(x, y)$ for all $K^n$ candidate sequences, which is again intractable.

Fortunately, if we add a little structure into $\text{score}(x, y)$, the maximization becomes tractable. The structure added in CRF is:

$$\text{score}(x, y) = \sum_{i=1}^{n} s(y_{i-1}, y_i, x, i)$$

i.e., the total score decomposes into per-position contributions, each depending only on the current tag, the previous tag and the sentence.
To maximize the total score efficiently, define the partial score up to position $i$ ending with tag $k$:

$$\alpha_i(k) = \max_{y_1, \dots, y_{i-1}} \sum_{j=1}^{i} s(y_{j-1}, y_j, x, j), \quad \text{with } y_i = k$$

If we can maximize any partial score (which turns out not that difficult), then the score we want to actually maximize is simply $\max_{k} \alpha_n(k)$.

So how to maximize any partial score? Let’s start with $\alpha_1(k)$.

This is easy because it is just a single-variable problem: $y_1$ can only have $K$ choices, and $\alpha_1(k) = s(y_0, k, x, 1)$ takes one evaluation per choice. We also store all the evaluated $\alpha_1(k)$.

How about $\alpha_2(k)$?

We can first fix $y_2 = k$ and optimize over the $y_1$ dimension. Remember we’ve already evaluated $\alpha_1(k')$ in the previous question, so it takes $K$ computations to find the optimal $y_1$ for each $y_2 = k$: we pick the $k'$ which has maximum $\alpha_1(k') + s(k', k, x, 2)$. In total, we need to perform $K^2$ evaluations. We also store all the $\alpha_2(k)$ for future use.

How about a general $\alpha_i(k)$?

Similar to the previous question, we evaluate

$$\alpha_i(k) = \max_{k'} \big[\alpha_{i-1}(k') + s(k', k, x, i)\big]$$

for each $k$ using the stored $\alpha_{i-1}(k')$. We again carry out $K$ evaluations per $k$, thus $K^2$ evaluations in total for all possible $k$. We store $\alpha_i(k)$ for future use (e.g., when computing $\alpha_{i+1}$).

By doing this all the way to $i = n$, we obtain $\max_k \alpha_n(k)$, the maximum total score, with only $nK^2$ evaluations instead of $K^n$. Recording which $k'$ achieved each maximum (parent pointers) lets us also recover the best tag sequence itself. This is exactly the Viterbi algorithm.
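A minimal sketch of this dynamic program in Python/NumPy, assuming a user-supplied per-position scoring function `s(prev_tag, tag, sentence, i)` (a hypothetical stand-in for the feature-based score above):

```python
import numpy as np

def viterbi_max_score(s, sentence, K, start_tag=0):
    """Compute max_y score(x, y) via alpha_i(k) = max_k' [alpha_{i-1}(k') + s(k', k, x, i)].

    s         -- scoring function s(prev_tag, tag, sentence, i) -> float
    sentence  -- list of tokens
    K         -- number of possible tags
    start_tag -- index of the dummy tag y_0
    """
    n = len(sentence)
    # alpha_1(k): partial score of the first position with tag k
    alpha = np.array([s(start_tag, k, sentence, 0) for k in range(K)])
    for i in range(1, n):
        alpha = np.array([
            max(alpha[kp] + s(kp, k, sentence, i) for kp in range(K))
            for k in range(K)
        ])
    return alpha.max()  # parent pointers would additionally recover y*
```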
1.2.2 Training
Similar to MEMM, we can also use maximum likelihood estimation to get the optimal weights $w^*$:

$$w^* = \arg\max_{w} \sum_{(x, y) \in D} \Big[\text{score}(x, y; w) - \log \sum_{y'} \exp\big(\text{score}(x, y'; w)\big)\Big]$$

where $D$ is the annotated training set. The log-partition term can be computed efficiently with a forward recursion very similar to the one above, with $\max$ replaced by a sum; we will meet it again in the BiLSTM-CRF training section.
2. Deep learning models
The deep learning models we’ll discuss here are LSTM, BiLSTM, BiLSTM-CRF and BERT.
2.1 LSTM
2.1.1 Architecture
In the setting of LSTM, each token $x_i$ is mapped to an embedding vector and fed into the LSTM; the hidden state $h_i$ at position $i$ is then passed through a linear layer and a softmax to produce a probability distribution over the $K$ tags:

$$P(y_i \mid x_1, \dots, x_i) = \text{softmax}(W h_i + b)$$
2.1.2 Inference
The inference in LSTM is very simple:

$$\hat{y}_i = \arg\max_{k} P(y_i = k \mid x_1, \dots, x_i)$$

which indicates the prediction at the i-th position only utilizes the sentence information up to the i-th token: only the left side of the sentence is used for tag prediction at the i-th position. BiLSTM is designed to provide context information from both sides, as we will see in the next section.
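A minimal PyTorch sketch of such a tagger (vocabulary size, tag set size and layer widths are hypothetical; this illustrates the idea rather than the exact implementation of any library mentioned later):

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Unidirectional LSTM tagger: embedding -> LSTM -> linear -> tag scores."""

    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))   # h: (batch, seq_len, hidden_dim)
        return self.fc(h)                         # logits: (batch, seq_len, tagset_size)

# inference: pick the highest-scoring tag at each position
model = LSTMTagger(vocab_size=10000, tagset_size=9)
logits = model(torch.randint(0, 10000, (1, 6)))   # one fake sentence of 6 token ids
pred_tags = logits.argmax(dim=-1)
```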
2.1.3 Training
Like most neural network training, LSTM training uses the stochastic gradient descent algorithm, and the loss function adopts the negative log likelihood. For a data point $(x, y)$, the individual loss is

$$\ell(x, y) = -\sum_{i=1}^{n} \log P(y_i \mid x_1, \dots, x_i)$$

where $P(y_i \mid x_1, \dots, x_i)$ is the softmax output of the network at position $i$ and $n$ is the sentence length. The total loss is the mean of all the individual losses:

$$L = \frac{1}{|D|} \sum_{(x, y) \in D} \ell(x, y)$$

where $D$ is the training set (or the current mini-batch).
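In PyTorch this per-token negative log likelihood is what `nn.CrossEntropyLoss` computes on the logits; a hedged sketch continuing the hypothetical `LSTMTagger` above:

```python
criterion = nn.CrossEntropyLoss()                 # softmax + negative log likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

token_ids = torch.randint(0, 10000, (1, 6))       # fake batch: 1 sentence, 6 tokens
gold_tags = torch.randint(0, 9, (1, 6))           # fake gold tag ids

logits = model(token_ids)                         # (1, 6, 9)
loss = criterion(logits.view(-1, 9), gold_tags.view(-1))
loss.backward()
optimizer.step()
```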
2.2 BiLSTM
BiLSTM stands for bi-directional LSTM, which provides sequence information from both directions. Because of that, BiLSTM is more powerful than LSTM. Apart from the bi-directional component, the meaning of the network output, the inference procedure, and the training loss are the same as for LSTM.
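In the PyTorch sketch above the change is small: enable `bidirectional=True` and double the input size of the final linear layer, since the forward and backward hidden states are concatenated (again a sketch built on the hypothetical `LSTMTagger`):

```python
class BiLSTMTagger(LSTMTagger):
    """Same as LSTMTagger, but the LSTM reads the sentence in both directions."""

    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__(vocab_size, tagset_size, embed_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # forward and backward states are concatenated, hence 2 * hidden_dim
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)
```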
2.3 BiLSTM-CRF
BiLSTM captures contextual information around the i-th position, but at each position it predicts the tag basically in an independent fashion. There are cases where adjacent positions are predicted with tags that do not usually appear together in reality; for example, an I-PER tag should not follow B-ORG. To account for this kind of interaction between adjacent tags, a Conditional Random Field (CRF) layer is added on top of the BiLSTM.
2.3.1 Architecture
The BiLSTM part is unchanged: at the i-th position it produces a vector of emission scores

$$E_i = W h_i + b$$

where $h_i$ is the concatenated BiLSTM hidden state and $E_{i,k}$ is the score of assigning tag $k$ to the i-th token. The CRF layer also learns a transition matrix $T \in \mathbb{R}^{K \times K}$, where $T_{k', k}$ is the score of moving from tag $k'$ to tag $k$.
2.3.2 Inference
Same as the inference in the CRF section, given a trained network and a sentence $x$, we look for the tag sequence with the highest score:

$$y^* = \arg\max_{y} \text{score}(x, y)$$

The score is a sum of contributions from the token level; the i-th position contributes

$$E_{i, y_i} + T_{y_{i-1}, y_i}$$

i.e., the emission score of its own tag plus the transition score from the previous tag, so that $\text{score}(x, y) = \sum_{i=1}^{n} \big(E_{i, y_i} + T_{y_{i-1}, y_i}\big)$.
To find the tag sequence with the maximum score efficiently, we reuse the dynamic programming idea from the CRF section. Define the subproblem $\alpha_i(k)$: the best score over the first $i$ tokens among all partial tag sequences ending with tag $k$.

The recursion would be:

$$\alpha_i(k) = E_{i,k} + \max_{k'} \big[\alpha_{i-1}(k') + T_{k', k}\big]$$

The original problem is then

$$\max_{y} \text{score}(x, y) = \max_{k} \alpha_n(k)$$

We can always use parent pointers (recording which $k'$ achieved each maximum) to retrieve the corresponding best sequence $y^*$.
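A minimal NumPy sketch of this decoding step, assuming `emissions` is the $n \times K$ matrix produced by the BiLSTM and `transitions` the learned $K \times K$ matrix (hypothetical arrays, not a specific library's API):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return (best_score, best_tag_sequence) for one sentence.

    emissions   -- (n, K) array, emissions[i, k] = E_{i,k}
    transitions -- (K, K) array, transitions[k_prev, k] = T_{k_prev, k}
    """
    n, K = emissions.shape
    alpha = emissions[0].copy()                    # alpha_1(k); no special start tag assumed
    backptr = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        # scores[k_prev, k] = alpha_{i-1}(k_prev) + T_{k_prev, k} + E_{i, k}
        scores = alpha[:, None] + transitions + emissions[i][None, :]
        backptr[i] = scores.argmax(axis=0)         # parent pointer for each tag k
        alpha = scores.max(axis=0)                 # alpha_i(k)
    best_last = int(alpha.argmax())                # best final tag
    tags = [best_last]
    for i in range(n - 1, 0, -1):                  # follow parent pointers backwards
        tags.append(int(backptr[i, tags[-1]]))
    return float(alpha.max()), tags[::-1]
```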
2.3.3 Training
The loss function for BiLSTM-CRF also adopts the negative log likelihood. For a data point $(x, y)$,

$$\ell(x, y) = -\log P(y \mid x) = -\text{score}(x, y) + \log \sum_{y'} \exp\big(\text{score}(x, y')\big)$$

where the first term is easy to calculate via a forward pass of the network and the second term needs more care. Let’s define that term (without the log) as

$$Z(x) = \sum_{y'} \exp\big(\text{score}(x, y')\big)$$

To calculate $Z(x)$ without enumerating all $K^n$ sequences, we apply the same dynamic-programming trick as in inference, with $\max$ replaced by a sum: define $\beta_i(k)$ as the sum of exponentiated scores of all partial sequences over the first $i$ tokens ending with tag $k$. The recursion would be:

$$\beta_i(k) = \exp(E_{i,k}) \sum_{k'} \beta_{i-1}(k') \exp(T_{k', k})$$

The original problem is then

$$Z(x) = \sum_{k} \beta_n(k)$$

(in practice the recursion is carried out in log space with log-sum-exp for numerical stability). Via this route, the individual loss $\ell(x, y)$ can be computed in $O(nK^2)$ time, and training proceeds with stochastic gradient descent as usual.
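A minimal sketch of the log-partition computation in NumPy, carried out in log space as mentioned above (array names mirror the decoding sketch and are assumptions, not a specific library's API):

```python
import numpy as np
from scipy.special import logsumexp   # numerically stable log(sum(exp(...)))

def log_partition(emissions, transitions):
    """Compute log Z(x) = log sum_y exp(score(x, y)) with the forward recursion."""
    n, K = emissions.shape
    log_beta = emissions[0].copy()                        # log beta_1(k)
    for i in range(1, n):
        # log beta_i(k) = E_{i,k} + logsumexp_{k'} [log beta_{i-1}(k') + T_{k',k}]
        log_beta = emissions[i] + logsumexp(log_beta[:, None] + transitions, axis=0)
    return float(logsumexp(log_beta))                     # log Z(x)

def nll(emissions, transitions, tags):
    """Negative log likelihood of a gold tag id sequence `tags`."""
    gold_score = emissions[0, tags[0]] + sum(
        transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
        for i in range(1, len(tags))
    )
    return log_partition(emissions, transitions) - gold_score
```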
2.4 Bert
Recent research on BERT provides another option for NER modeling. Despite the complexity of the BERT model architecture, in the context of NER it can be regarded as an advanced version of our BiLSTM model: the LSTM is replaced with a stack of Transformer encoder layers.

Thus, at each position the encoder output $h_i$ is again fed through a linear layer and a softmax to produce tag probabilities. Inference and the training loss are the same as in the LSTM section.
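For completeness, a hedged sketch of BERT-based token classification with the Hugging Face transformers library (checkpoint name and label count are placeholders, and this is not the setup used by the repositories listed below):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# hypothetical setup: 9 BIO labels, a generic English BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

tokens = ["John", "works", "at", "Google", "."]
encoding = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits        # (1, num_subword_tokens, 9)
pred_label_ids = logits.argmax(dim=-1)       # still needs mapping from subwords back to words
```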
3. Python libraries
There are several machine-learning-based NER repositories on GitHub. I picked some of them here with brief comments.
- KEHANG/ner: for English texts, based on PyTorch, has LSTM, BiLSTM, BiLSTM+CRF and Bert models, has a released conda package
- shiyybua/NER: for Chinese texts, based on Tensorflow, only a BiLSTM+CRF model, no packages released
- Franck-Dernoncourt/NeuroNER: for English texts, based on Tensorflow, has an LSTM model, no package released