Demystifying Named Entity Recognition - Part II

As a continuation of Demystifying Named Entity Recognition - Part I, in this post I’ll discuss popular models in the field and try to cover:

  • popular traditional models

  • deep learning models

  • Python libraries

Over the history of NER, there have been three major approaches: grammar-based, dictionary-based, and machine-learning-based. The grammar-based approach produces a set of empirical rules hand-crafted by experienced computational linguists, which usually takes months of work. The dictionary-based approach basically organizes all the known entities into a lookup table, which can be used to check whether a candidate belongs to a defined category or not; by design it doesn’t work well with newly invented entities. The machine-learning-based approach typically needs annotated data, but it doesn’t rely on domain experts to come up with rules, nor does it necessarily fail on unseen entities.

This post focuses only on machine-learning-based models.

1. Traditional models

The traditional models we’ll discuss here are MEMM and CRF. They were widely used before deep learning models entered the scene.

1.1 MEMM

We’ve covered the details of MEMM in the previous post. The key idea of the MEMM approach is to model the conditional probability of the tag sequence for a given sentence with the Markov assumption:

$$p(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_{i=1}^{n} p(y_i \mid y_{i-1}, x_1 \ldots x_n)$$

We then model $p(y_i \mid y_{i-1}, x_1 \ldots x_n)$ using the local environment:

$$p(y_i \mid y_{i-1}, x_1 \ldots x_n) = \frac{\exp\left(\theta \cdot f(y_{i-1}, y_i, x_1 \ldots x_n)\right)}{\sum_{y'} \exp\left(\theta \cdot f(y_{i-1}, y', x_1 \ldots x_n)\right)}$$

At inference time, we use the Viterbi algorithm to get the best-fitting tag sequence for a given sentence. Details can be found in section 2.2.1 of the previous post.

In training, we use maximum likelihood estimation to find the optimal $\theta$ that achieves

$$\max_\theta \prod_{j=1}^{N} p(y^j \mid x^j)$$

where $x^j, y^j$ are the $j$-th sentence and its corresponding tag sequence (the whole training dataset has $N$ examples).
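To make the factorization and the training objective concrete, here is a minimal NumPy sketch; the feature extractor `feat_fn`, the tag list `tags`, and the `<s>` start tag are hypothetical stand-ins rather than any particular implementation:

```python
import numpy as np

def local_log_prob(theta, feat_fn, x, y_prev, tags):
    """Log p(y_i | y_{i-1}, x) for every candidate tag: a softmax over
    theta . f(y_prev, y, x). `feat_fn` is a hypothetical feature extractor
    returning a vector with the same dimension as `theta`."""
    scores = np.array([theta @ feat_fn(y_prev, y, x) for y in tags])
    return scores - np.logaddexp.reduce(scores)   # log-softmax over the K tags

def sentence_log_likelihood(theta, feat_fn, x, y_seq, tags, start_tag="<s>"):
    """log p(y_1...y_n | x_1...x_n) under the Markov factorization above."""
    ll, y_prev = 0.0, start_tag
    for y in y_seq:
        log_probs = local_log_prob(theta, feat_fn, x, y_prev, tags)
        ll += log_probs[tags.index(y)]
        y_prev = y
    return ll
```

Maximum likelihood training then amounts to maximizing the sum of `sentence_log_likelihood` over the training set.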

1.2 CRF

Instead of $p(y_i \mid y_{i-1}, x)$, the Conditional Random Field (CRF) approach chooses to directly model $p(y \mid x)$:

$$p(y \mid x) = \frac{\exp\left(\Theta \cdot F(x, y)\right)}{\sum_{y'} \exp\left(\Theta \cdot F(x, y')\right)}$$

The main challenge in direct modeling is that the denominator is a sum of $K^n$ terms, where $K$ is the number of tag label types and $n$ is the length of the sentence to tag (e.g., with $K = 10$ tag types and a 20-token sentence, that is $10^{20}$ terms). This is a much larger number than in MEMM, where $p(y_i \mid y_{i-1}, x_1 \ldots x_n)$ has just $K$ terms in its denominator.

1.2.1 Inference

During inference, we are only interested in the y that gives the highest probability rather than the highest probability itself:

$$y^* = \arg\max_y p(y \mid x) = \arg\max_y \exp\left(\Theta \cdot F(x, y)\right) = \arg\max_y \Theta \cdot F(x, y)$$

If using brute force, we would have to evaluate $\exp(\Theta \cdot F(x, y))$ $K^n$ times.

Fortunately, if we add a little structure to $F(x, y)$, which I’m going to talk about next, we can bring the exponential complexity $O(K^n)$ down to linear complexity $O(K^2 n)$.

The structure added in CRF is:

$$F(x, y) = \sum_{i=1}^{n} f(x, y_{i-1}, y_i)$$

To maximize $\Theta \cdot F(x, y)$, we define a partial score as we did in section 2.2.1 of the previous post:

$$s_{\text{partial},k}(y_{1 \ldots k}) = \Theta \cdot \sum_{i=1}^{k} f(x, y_{i-1}, y_i)$$

If we can maximize any partial score (which turns out not to be that difficult), then the score we actually want to maximize, $\Theta \cdot F(x, y)$, is just the special case of $s_{\text{partial},k}$ with $k = n$.

So how do we maximize any partial score? Let’s start with $k = 1$, namely $s_{\text{partial},1}(y_1) = \Theta \cdot f(x, y_0, y_1)$, where $y_0$ is a fixed start tag.

This is easy because it’s just a single-variable optimization and $y_1$ only has $K$ choices. We also store all the evaluated $s_{\text{partial},1}(y_1)$.

How about $k = 2$, namely maximizing $s_{\text{partial},2}(y_1, y_2) = s_{\text{partial},1}(y_1) + \Theta \cdot f(x, y_1, y_2)$?

We can first fix $y_2$ and optimize over the $y_1$ dimension. Remember that we already stored the evaluated $s_{\text{partial},1}(y_1)$ from the previous step, so it takes $K$ computations to find the optimal $y_1$ for each $y_2$, giving $\max_{y_1} s_{\text{partial},2}(y_1, y_2)$. Then we pick the $y_2$ that has the maximum $s_{\text{partial},2}(y_1, y_2)$. In total, we need to perform $K^2$ evaluations. We also store $\max_{y_1} s_{\text{partial},2}(y_1, y_2)$ for each $y_2$ for future use.

How about $k = 3$, namely maximizing $s_{\text{partial},3}(y_1, y_2, y_3) = s_{\text{partial},2}(y_1, y_2) + \Theta \cdot f(x, y_2, y_3)$?

Similar to the previous step, for each $y_3$ we compute $\max_{y_1, y_2} s_{\text{partial},3}(y_1, y_2, y_3) = \max_{y_2} \left[ \max_{y_1} s_{\text{partial},2}(y_1, y_2) + \Theta \cdot f(x, y_2, y_3) \right]$. This again takes $K$ evaluations per $y_3$, thus $K^2$ evaluations in total for all possible $y_3$. We store these values for future use (e.g., when $k = 4$).

By doing this all the way to $k = n$, we get $\max_y \Theta \cdot F(x, y)$ with roughly $K^2 n$ evaluations.
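Putting this together, here is a minimal NumPy sketch of the max-score recursion. It assumes the local scores $\Theta \cdot f(x, y_{i-1}, y_i)$ are exposed through a hypothetical `local_score(i, t_prev, t)` callable, and it returns only the maximum score; the best sequence itself can be recovered with parent pointers, as in the Viterbi algorithm.

```python
import numpy as np

def crf_best_score(local_score, n, K, start=-1):
    """Max over all tag sequences of Theta . F(x, y), following the DP above.
    `local_score(i, t_prev, t)` is a hypothetical callable that returns
    Theta . f(x, y_{i-1}=t_prev, y_i=t); `start` is a dummy start tag id."""
    # best[t] = max over all prefixes y_1..y_{i-1} of s_partial,i(y_1..y_{i-1}, t)
    best = np.array([local_score(1, start, t) for t in range(K)])
    for i in range(2, n + 1):
        best = np.array([
            max(best[t_prev] + local_score(i, t_prev, t) for t_prev in range(K))
            for t in range(K)
        ])  # K evaluations per tag t, so K^2 per position
    return best.max()  # equals max_y Theta . F(x, y)
```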

1.2.2 Training

Similar to MEMM, we can also use maximum likelihood estimation to find the optimal $\Theta$ that achieves

$$\max_\Theta \prod_{j=1}^{N} p(y^j \mid x^j)$$

where $x^j, y^j$ are the $j$-th sentence and its corresponding tag sequence (the whole training dataset has $N$ examples). More details on the training algorithm can be found on page 10 of Michael Collins’s CRF notes.
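Taking logs and using the definition of $p(y \mid x)$ above, the per-example objective splits into the score of the gold sequence minus the log of a partition sum over all $K^n$ sequences, the same quantity $Z$ that reappears in the BiLSTM-CRF training section below:

$$\log p(y^j \mid x^j) = \Theta \cdot F(x^j, y^j) - \log \sum_{y'} \exp\left(\Theta \cdot F(x^j, y')\right)$$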

2. Deep learning models

The deep learning models we’ll discuss here are LSTM, BiLSTM, BiLSTM-CRF, and BERT.

2.1 LSTM

2.1.1 Architecture

[Figure: LSTM architecture for NER tagging]

In the LSTM setting, each token $x_i$ is fed into an LSTM unit, which outputs $o_i$. $o_i$ models the log probabilities of all possible tags at the $i$-th position, so it has dimension $K$.

$$o_i = \begin{bmatrix} \log P(y_i = \text{PER} \mid x) \\ \log P(y_i = \text{ORG} \mid x) \\ \vdots \\ \log P(y_i = \text{MISC} \mid x) \end{bmatrix}$$

2.1.2 Inference

The inference in LSTM is very simple: $y_i$ is the tag with the highest log probability at the $i$-th position.

$$y_i = \arg\max_k o_{i,k}$$

which indicates that the prediction at the $i$-th position only utilizes the sentence information up to the $i$-th token: only the left side of the sentence is used for tag prediction at the $i$-th position. BiLSTM is designed to provide context information from both sides, as we will see in the next section.
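Below is a minimal PyTorch sketch of this setup; the layer sizes, class name, and tag count are illustrative, not taken from any of the repositories listed later.

```python
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    """Token embeddings -> LSTM -> per-position log-probabilities over K tags."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, num_tags)

    def forward(self, token_ids):                        # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))          # (batch, seq_len, hidden_dim)
        return torch.log_softmax(self.proj(h), dim=-1)   # o_i: log P(y_i | x)

model = LSTMTagger(vocab_size=10000, emb_dim=100, hidden_dim=128, num_tags=9)
o = model(torch.randint(0, 10000, (1, 6)))   # log-probs, shape (1, 6, 9)
y_hat = o.argmax(dim=-1)                     # y_i = argmax_k o_{i,k}
```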

2.1.3 Training

Like most other neural network training, LSTM training uses stochastic gradient descent. The loss function adopts negative log likelihood. For a data point $(x^j, y^j)$, its loss is calculated as:

$$L_j = -\sum_{i=1}^{n_j} o_i^j[y_i^j]$$

where $n_j$ is the length of sentence $x^j$, $o_i^j$ is the LSTM output at the $i$-th position, and $y_i^j$ is the ground-truth tag at the $i$-th position.

Total loss is the mean of all the individual losses.

$$L = \frac{1}{N} \sum_{j=1}^{N} L_j$$

where N is the total number of training examples.
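Given log-probability outputs like those in the sketch above, the per-sentence loss is just the negative sum of the gold-tag log probabilities; in PyTorch this matches `F.nll_loss` with `reduction="sum"`. A small illustrative check with random tensors:

```python
import torch
import torch.nn.functional as F

# o: (seq_len, K) log-probabilities for one sentence; y: (seq_len,) gold tag ids
o = torch.log_softmax(torch.randn(6, 9), dim=-1)
y = torch.randint(0, 9, (6,))

loss_manual = -o[torch.arange(o.size(0)), y].sum()   # L_j = -sum_i o_i[y_i]
loss_nll = F.nll_loss(o, y, reduction="sum")         # same value
assert torch.allclose(loss_manual, loss_nll)
```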

2.2 BiLSTM

[Figure: BiLSTM architecture for NER tagging]

BiLSTM stands for bi-directional LSTM, which provides sequence information from both directions. Because of that, BiLSTM is more powerful than LSTM. Apart from the bi-directional component, the meaning of the network output, the inference, and the training loss are the same as for LSTM.
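In PyTorch terms, the only differences from the LSTM sketch earlier are `bidirectional=True` and a projection from the doubled hidden size; again a sketch rather than any repository's exact code.

```python
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True concatenates forward and backward hidden states
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)  # 2x for the two directions

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return nn.functional.log_softmax(self.proj(h), dim=-1)
```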

2.3 BiLSTM-CRF

BiLSTM captures contextual information around the $i$-th position, but at each position it predicts the tag essentially independently. There are cases where adjacent positions are predicted with tags that do not usually appear together in reality; for example, I-PER should not follow B-ORG. To account for this kind of interaction between adjacent tags, a Conditional Random Field (CRF) layer is added on top of BiLSTM.

2.3.1 Architecture

[Figure: BiLSTM-CRF architecture for NER tagging]

where $o_i$ models the emission scores of all possible tags at the $i$-th position and $y_i$ is the best tag for the $i$-th position, i.e., the one in the sequence that collectively achieves the highest sequence score.

$$o_i = \begin{bmatrix} \text{score}_{\text{emission}}(y_i = \text{PER} \mid x) \\ \text{score}_{\text{emission}}(y_i = \text{ORG} \mid x) \\ \vdots \\ \text{score}_{\text{emission}}(y_i = \text{MISC} \mid x) \end{bmatrix}$$

The CRF layer also learns a transition matrix $A$, which stores transition scores between every possible pair of tag types.

2.3.2 Inference

Same as the inference in the CRF section: given a trained network and a sentence $x$, any tag sequence $s$ has a score.

$$\text{score}(x, s) = \sum_{i=1}^{n} \left( o_i[s_i] + A[s_{i-1}][s_i] \right) = \sum_{i=1}^{n} \phi(x, s_{i-1}, s_i)$$

The score is a sum of token-level contributions. The $i$-th position contributes $\phi(x, s_{i-1}, s_i) = o_i[s_i] + A[s_{i-1}][s_i]$, where the first term is the emission score and the second term is the transition score.

To find the tag sequence $y$ achieving the highest score, we use dynamic programming.

Define the sub-problem $DP(k, t)$ to be the maximum score accumulated from the 1st position to the $k$-th position with the $k$-th position's tag being $t$:

$$DP(k, t) = \max_{s \in S^k : s_k = t} \sum_{i=1}^{k} \phi(x, s_{i-1}, s_i)$$

The recursion would be:

$$DP(k+1, t') = \max_t \left[ DP(k, t) + \phi(x, t, t') \right]$$

The original problem is then

$$\text{score}(x, y) = \max_t DP(n, t)$$

We can always use parent pointers to retrieve the corresponding best sequence y.
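Here is a minimal PyTorch sketch of this Viterbi decoding with parent pointers; for brevity it assumes the transition from the start tag has already been folded into the first row of emission scores.

```python
import torch

def viterbi_decode(emissions, transitions):
    """Best tag sequence under score(x, s) = sum_i o_i[s_i] + A[s_{i-1}][s_i].
    emissions: (n, K) tensor of o_i; transitions: (K, K) tensor A.
    The start transition is assumed folded into emissions[0]."""
    n, K = emissions.shape
    dp = emissions[0]                       # DP(1, t)
    parents = []
    for i in range(1, n):
        # DP(k+1, t') = max_t [DP(k, t) + A[t, t']] + o_{k+1}[t']
        scores = dp.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        dp, back = scores.max(dim=0)        # best previous tag for each current tag
        parents.append(back)
    # Follow parent pointers back from the best final tag
    best_tag = int(dp.argmax())
    path = [best_tag]
    for back in reversed(parents):
        best_tag = int(back[best_tag])
        path.append(best_tag)
    return list(reversed(path)), float(dp.max())
```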

2.3.3 Training

The loss function for BiLSTM-CRF also adopts negative log likelihood. For a data point $(x^j, y^j)$, its loss is calculated as:

$$L_j = -\log P(y^j \mid x^j) = -\log \frac{\exp\left(\text{score}(x^j, y^j)\right)}{\sum_{y'} \exp\left(\text{score}(x^j, y')\right)} = -\text{score}(x^j, y^j) + \log \sum_{y'} \exp\left(\text{score}(x^j, y')\right)$$

where the first term is easy to calculate via a forward pass of the network, but the second term needs more care. Let’s define that term (without the log) as $Z$, the exponential sum of the scores of all possible sequences $s$ of length $n$:

$$Z = \sum_{s \in S^n} \exp\left(\text{score}(x, s)\right) = \sum_{s \in S^n} \exp\left(\sum_{i=1}^{n} \phi(x, s_{i-1}, s_i)\right) = \sum_{s \in S^n} \prod_{i=1}^{n} \exp\left(\phi(x, s_{i-1}, s_i)\right) = \sum_{s \in S^n} \prod_{i=1}^{n} \psi(x, s_{i-1}, s_i)$$

To calculate $Z$, we need to use dynamic programming again. This time the sub-problem $DP(k, t)$ is the exponential sum of the scores of all possible sequences of length $k$ with the last tag $s_k = t$:

$$DP(k, t) = \sum_{s \in S^k : s_k = t} \prod_{i=1}^{k} \psi(x, s_{i-1}, s_i)$$

The recursion would be:

$$DP(k+1, t') = \sum_t DP(k, t) \, \psi(x, t, t')$$

The original problem is then

$$Z = \sum_t DP(n, t)$$

In this way, the individual loss $L_j$ is calculated, and the batch loss is obtained by averaging the individual losses in the batch.
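In practice the recursion for $DP(k, t)$ is run in log space with log-sum-exp to avoid overflow. A minimal PyTorch sketch of $\log Z$ and the resulting per-sentence loss, again assuming the start transition is folded into the first emission scores:

```python
import torch

def crf_log_partition(emissions, transitions):
    """log Z via the forward recursion above, computed in log space for stability.
    emissions: (n, K) tensor of o_i; transitions: (K, K) tensor A."""
    n, K = emissions.shape
    log_dp = emissions[0]                               # log DP(1, t)
    for i in range(1, n):
        # log DP(k+1, t') = logsumexp_t [log DP(k, t) + A[t, t']] + o_{k+1}[t']
        log_dp = torch.logsumexp(log_dp.unsqueeze(1) + transitions, dim=0) + emissions[i]
    return torch.logsumexp(log_dp, dim=0)               # log Z

def crf_nll(emissions, transitions, tags):
    """L_j = -score(x, y) + log Z for one sentence; tags: (n,) LongTensor of gold ids."""
    n = emissions.size(0)
    gold = emissions[torch.arange(n), tags].sum()       # emission scores of gold tags
    gold += transitions[tags[:-1], tags[1:]].sum()      # transition scores of gold tags
    return crf_log_partition(emissions, transitions) - gold
```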

2.4 BERT

Recent research on BERT provides another option for NER modeling. Despite the complexity of the BERT model architecture, in the context of NER it can be regarded as an advanced version of our BiLSTM model: the LSTM is replaced with multiple Transformer encoder layers.

[Figure: BERT-based architecture for NER tagging]

Thus, $o_i$ still models the log probabilities of all possible tags at the $i$-th position.

$$o_i = \begin{bmatrix} \log P(y_i = \text{PER} \mid x) \\ \log P(y_i = \text{ORG} \mid x) \\ \vdots \\ \log P(y_i = \text{MISC} \mid x) \end{bmatrix}$$

The inference and training loss are the same as in the LSTM section.
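One common way to set this up in practice (not necessarily what the repositories below do) is to put a token-classification head on top of a pretrained BERT checkpoint, e.g. via the Hugging Face transformers library; the checkpoint name and label count below are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# A freshly initialized token-classification head on top of pretrained BERT;
# the label set size (9) and checkpoint name are illustrative choices.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)

inputs = tokenizer("Hugging Face is based in New York", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, num_wordpieces, 9), plays the role of o_i
pred_tags = logits.argmax(dim=-1)            # argmax per position, as in the LSTM case
# Note: predictions are per word-piece, so they still need to be aligned
# back to whole words before comparing with word-level gold tags.
```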

3. Python libraries

There are several machine-learning-based NER repositories on GitHub. I picked some of them here with short comments.

  • KEHANG/ner: for English texts, based on PyTorch; has LSTM, BiLSTM, BiLSTM+CRF, and BERT models; released as a conda package

  • shiyybua/NER: for Chinese texts, based on TensorFlow; only a BiLSTM+CRF model; no package released

  • Franck-Dernoncourt/NeuroNER: for English texts, based on TensorFlow; has an LSTM model; no package released



