Using OpenNLP with Apache UIMA project - Part 3
Natural language processing, more commonly known as 'NLP', is a buzzword that has appeared in almost every software-related development in the recent past. So I decided to go with the flow, make use of the Apache OpenNLP library, and adapt it to Apache UIMA as a continuation of my previous article.
According to Apache,
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. OpenNLP also includes maximum entropy and perceptron based machine learning.
Complete use of OpenNLP is still alien to most, and it is the same for me. Hence I'll only look at how to use the OpenNLP library along with pre-trained models to create an annotator that extracts the noun phrases of a given sample text. In order to do so, we should first identify several terms used in the OpenNLP library and tools, and get to know what they mean.
Sentence Detector - detects whether or not a punctuation character marks the end of a sentence. In this sense a sentence is defined as the longest white space trimmed character sequence between two punctuation marks.
Tokenizer - segments an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
Part-of-Speech Tagger - marks tokens with their corresponding word type based on the token itself and the context of the token. Parts of speech include nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions and interjections.
Chunker - text chunking consists of dividing a text into syntactically correlated groups of words, like noun groups and verb groups, but does not specify their internal structure, nor their role in the main sentence.
These four functionalities of the OpenNLP library will be used in the creation of the annotator, along with several pre-trained models.
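To make the chunking step a little more concrete, here is a minimal sketch of how per-token chunk tags can be grouped into phrase-level chunks. It assumes the chunker emits tags in the usual B-/I-/O style (for example B-NP, I-NP, O). The class BioChunkSketch and its Chunk type are hypothetical names made up for this illustration and are not part of OpenNLP; the real library does this grouping for you, and its handling of malformed tag sequences may differ from this simplified version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not part of OpenNLP or UIMA.
public class BioChunkSketch {

    /** A chunk covering token indexes [start, end) with a phrase type such as "NP". */
    public static final class Chunk {
        public final int start;
        public final int end;
        public final String type;

        public Chunk(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Groups B-/I-/O tags into chunks; end indexes are exclusive. */
    public static List<Chunk> toChunks(String[] bioTags) {
        List<Chunk> chunks = new ArrayList<>();
        int start = -1;
        String type = null; // type of the chunk currently open, or null
        for (int i = 0; i < bioTags.length; i++) {
            String tag = bioTags[i];
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    chunks.add(new Chunk(start, i, type)); // close the open chunk
                    type = null;
                }
                if (tag.startsWith("B-")) { // open a new chunk
                    start = i;
                    type = tag.substring(2);
                }
                // "O" or a stray "I-" tag opens nothing (a simplification)
            }
        }
        if (type != null) {
            chunks.add(new Chunk(start, bioTags.length, type));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hand-written tokens and tags; a real chunker would produce the tags.
        String[] tokens = {"The", "quick", "fox", "jumped", "over", "the", "fence", "."};
        String[] tags = {"B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "O"};
        for (Chunk c : toChunks(tags)) {
            if ("NP".equals(c.type)) { // keep only noun phrases
                System.out.println(String.join(" ",
                        Arrays.copyOfRange(tokens, c.start, c.end)));
            }
        }
    }
}
```

Running main on the hand-written tags prints the two noun phrases, "The quick fox" and "the fence". The annotator in this article keeps only chunks of type NP in the same way.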
The sample code and the descriptor files will be explained using the example provided in this article, and the steps taken to create the descriptor using Eclipse will be described in detail.
Annotator Class
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
import org.apache.commons.io.IOUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
/**
 * Annotate noun phrases in sentences from within blocks of
 * text (marked up with TextAnnotation) from either HTML or
 * plain text documents. Using the OpenNLP library and models,
 * the incoming text is tokenized into sentences, then each
 * sentence is tokenized to words and POS tagged, and finally
 * tokens are grouped together into chunks. Of these chunks,
 * only the noun phrases (i.e. tagged as NP) are annotated.
 */
public class NounPhraseAnnotator extends JCasAnnotator_ImplBase {

  private SentenceDetectorME sentenceDetector;
  private TokenizerME tokenizer;
  private POSTaggerME posTagger;
  private ChunkerME chunker;

  @Override
  public void initialize(UimaContext ctx)
      throws ResourceInitializationException {
    super.initialize(ctx);
    InputStream smis = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    try {
      // Each key must match a resource dependency key in the descriptor.
      smis = getContext().getResourceAsStream("SentenceModel");
      SentenceModel smodel = new SentenceModel(smis);
      sentenceDetector = new SentenceDetectorME(smodel);
      smis.close();
      tmis = getContext().getResourceAsStream("TokenizerModel");
      TokenizerModel tmodel = new TokenizerModel(tmis);
      tokenizer = new TokenizerME(tmodel);
      tmis.close();
      pmis = getContext().getResourceAsStream("POSModel");
      POSModel pmodel = new POSModel(pmis);
      posTagger = new POSTaggerME(pmodel);
      pmis.close();
      cmis = getContext().getResourceAsStream("ChunkerModel");
      ChunkerModel cmodel = new ChunkerModel(cmis);
      chunker = new ChunkerME(cmodel);
      cmis.close();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    } finally {
      // Make sure every stream is closed even if loading a model failed.
      IOUtils.closeQuietly(cmis);
      IOUtils.closeQuietly(pmis);
      IOUtils.closeQuietly(tmis);
      IOUtils.closeQuietly(smis);
    }
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Span[] sentSpans = sentenceDetector.sentPosDetect(text);
    for (Span sentSpan : sentSpans) {
      String sentence = sentSpan.getCoveredText(text).toString();
      // Offset of this sentence within the whole document.
      int start = sentSpan.getStart();
      // Token spans are character offsets relative to the sentence.
      Span[] tokSpans = tokenizer.tokenizePos(sentence);
      String[] tokens = new String[tokSpans.length];
      for (int i = 0; i < tokens.length; i++) {
        tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
      }
      String[] tags = posTagger.tag(tokens);
      // Chunk spans are token indexes; the end index is exclusive.
      Span[] chunks = chunker.chunkAsSpans(tokens, tags);
      for (Span chunk : chunks) {
        if ("NP".equals(chunk.getType())) {
          // NounPhraseAnnotation is the JCas class for the type
          // declared in the descriptor's type system.
          NounPhraseAnnotation annotation = new NounPhraseAnnotation(jcas);
          // Convert token indexes back to document character offsets.
          annotation.setBegin(start +
              tokSpans[chunk.getStart()].getStart());
          annotation.setEnd(
              start + tokSpans[chunk.getEnd() - 1].getEnd());
          annotation.addToIndexes(jcas);
        }
      }
    }
  }
}
As the code itself shows, this annotator works through the following steps in order:
The process() method splits the text into sentences.
Then the sentences are split into tokens.
Tokens are then POS tagged.
Then the tokens and tags are used to chunk each sentence.
Finally, only the noun-phrase chunks are annotated.
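The least obvious part of these steps is the offset arithmetic in the last one: the chunker reports chunks as token indexes, the tokenizer reports tokens as character offsets within the sentence, and UIMA wants character offsets within the whole document. The sketch below walks through that conversion without OpenNLP. The class OffsetMappingSketch, its Range type and the whitespace tokenizer are hypothetical stand-ins written for this illustration only, and the chunk range is hard-coded where a real chunker would supply it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of OpenNLP or UIMA.
public class OffsetMappingSketch {

    /** Half-open range [start, end), used for both characters and token indexes. */
    public static final class Range {
        public final int start;
        public final int end;

        public Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Naive whitespace tokenizer returning sentence-relative character ranges. */
    public static List<Range> tokenize(String sentence) {
        List<Range> tokens = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            while (i < sentence.length() && Character.isWhitespace(sentence.charAt(i))) {
                i++; // skip white space between tokens
            }
            int start = i;
            while (i < sentence.length() && !Character.isWhitespace(sentence.charAt(i))) {
                i++; // consume one token
            }
            if (i > start) {
                tokens.add(new Range(start, i));
            }
        }
        return tokens;
    }

    /**
     * Converts a chunk given as token indexes [chunk.start, chunk.end) into
     * document character offsets, mirroring the arithmetic in process().
     */
    public static Range toDocumentOffsets(int sentenceStart, List<Range> tokens, Range chunk) {
        int begin = sentenceStart + tokens.get(chunk.start).start;
        int end = sentenceStart + tokens.get(chunk.end - 1).end; // last token in the chunk
        return new Range(begin, end);
    }

    public static void main(String[] args) {
        String text = "It rained. The old dog slept.";
        int sentenceStart = text.indexOf("The"); // start of the second sentence
        String sentence = text.substring(sentenceStart);
        List<Range> tokens = tokenize(sentence);
        Range chunk = new Range(0, 3); // assumed noun phrase: first three tokens
        Range doc = toDocumentOffsets(sentenceStart, tokens, chunk);
        System.out.println("(" + doc.start + "," + doc.end + "): "
                + text.substring(doc.start, doc.end));
    }
}
```

For the sample text this prints (11,22): The old dog. Note the chunk.end - 1: because the chunk's end index is exclusive, the last token of the chunk sits one position before it.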
Apache OpenNLP, which includes the required OpenNLP tools, can be downloaded from here.
Now that we have created the annotator class, let's create the Analysis Engine Descriptor using the Eclipse Component Descriptor Editor.
Go to the Type System tab and add a type (discussed in Getting Started: Writing My First UIMA Annotator, which was mentioned in part 1 of this article).
The Add Type window would look like this. After adding a type as shown above, you have to add the resources required by the annotator (the OpenNLP model files).
To add resources, go to the Resources tab and add the resource dependencies as shown above.
Make sure that you add all the dependencies shown above. The key is a mandatory field; the description is optional. After adding the dependencies, you have to add the resources (model files) by clicking Add in the Resource Needs, Definitions and Binding section.
The required files are as follows. (These can be found in the Models directory of the downloaded OpenNLP package. If they are not available, you can download them here.)
SentenceModelFile - en-sent.bin
TokenizerModelFile - en-token.bin
POSModelFile - en-pos-maxent.bin
ChunkerModelFile - en-chunker.bin
Once a resource is added to the descriptor, select the corresponding dependency from the Resource Dependencies section and click Bind in the left side section.
Once all the bindings are done, it will look like this.
These dependency keys, which have been bound to the resources (model files), are used in the annotator as follows (taken from the annotator class):
smis = getContext().getResourceAsStream("SentenceModel");
tmis = getContext().getResourceAsStream("TokenizerModel");
pmis = getContext().getResourceAsStream("POSModel");
cmis = getContext().getResourceAsStream("ChunkerModel");
After the binding is done, you have to connect the Annotator class to the descriptor in the Overview tab (in the 'Name of the Java class file' field).
Now the process is complete, and the final XML would look like this.
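<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>org.apache.uima.annotators.NounPhraseAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>NounPhraseAnnotatorDescriptor.xml</name>
    <version>1.0</version>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>org.apache.uima.tutorial.NounPhraseAnnotation</name>
          <description>Annotation to represent Noun Phrase sequences in a body of text</description>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>SentenceModel</key>
      <description>OpenNLP Sentence Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>TokenizerModel</key>
      <description>OpenNLP Tokenizer Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>POSModel</key>
      <description>OpenNLP POS Tagging Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>ChunkerModel</key>
      <description>OpenNLP Chunker Model</description>
      <optional>false</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>SentenceModelFile</name>
        <fileResourceSpecifier>
          <fileUrl>/home/achintha/Software/apache-opennlp-1.5.3/Models/en-sent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>TokenizerModelFile</name>
        <fileResourceSpecifier>
          <fileUrl>/home/achintha/Software/apache-opennlp-1.5.3/Models/en-token.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>POSModelFile</name>
        <fileResourceSpecifier>
          <fileUrl>/home/achintha/Software/apache-opennlp-1.5.3/Models/en-pos-maxent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>ChunkerModelFile</name>
        <fileResourceSpecifier>
          <fileUrl>/home/achintha/Software/apache-opennlp-1.5.3/Models/en-chunker.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>SentenceModel</key>
        <resourceName>SentenceModelFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>TokenizerModel</key>
        <resourceName>TokenizerModelFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>POSModel</key>
        <resourceName>POSModelFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>ChunkerModel</key>
        <resourceName>ChunkerModelFile</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>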
You can test the annotator using the following sample code, which is obtained from here.
import java.util.Iterator;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.Test;
import com.mycompany.tgni.uima.utils.UimaUtils;
public class NounPhraseAnnotatorTest {

  private static final String[] INPUTS = new String[] { ... };

  @Test
  public void testNounPhraseAnnotation() throws Exception {
    AnalysisEngine ae = UimaUtils.getAE(
        "conf/descriptors/NounPhraseAE.xml", null);
    for (String input : INPUTS) {
      System.out.println("text: " + input);
      JCas jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT);
      FSIndex index = jcas.getAnnotationIndex(NounPhraseAnnotation.type);
      for (Iterator<NounPhraseAnnotation> it = index.iterator(); it.hasNext();) {
        NounPhraseAnnotation annotation = it.next();
        System.out.println("...(" + annotation.getBegin() + "," +
            annotation.getEnd() + "): " +
            annotation.getCoveredText());
      }
    }
  }
}
I hope you can extend this to suit your own needs. Enjoy.