Named-Entity Recognition

Named-entity recognition, commonly abbreviated as "NER", is a technique widely used in natural language processing. It is a subtask of information extraction that aims to locate the named entities mentioned in a text and classify them into predefined categories such as person names, organizations, places, medical codes, time expressions, quantities, monetary values, and percentages.

The task was defined at the Message Understanding Conference in 1995. Entities are grouped into three basic categories: ENAMEX, TIMEX and NUMEX.

• ENAMEX: Expressions such as person, place, and organization names

• NUMEX: Monetary and percentage expressions

• TIMEX: Temporal expressions such as days and dates
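In MUC-style annotation, these categories appear as inline SGML-like tags wrapped around the matched text. The sketch below is only an illustration of that markup, not a real recognizer: it assumes two simple regular expressions for NUMEX money and percent expressions, and the name `tag_numex`, the patterns, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical patterns for illustration only; real NUMEX coverage is much broader.
NUMEX_PATTERNS = [
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?\s?%")),
    ("MONEY", re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?")),
]


def tag_numex(text: str) -> str:
    """Wrap money and percent expressions in MUC-style NUMEX tags."""
    for numex_type, pattern in NUMEX_PATTERNS:
        # Bind the current type as a default argument so each lambda keeps its own value.
        text = pattern.sub(
            lambda m, t=numex_type: f'<NUMEX TYPE="{t}">{m.group(0)}</NUMEX>',
            text,
        )
    return text


print(tag_numex("Sales rose 12% to $3,500."))
```

ENAMEX and TIMEX expressions are marked up the same way, but finding them reliably usually needs more than regular expressions, which is where trained models come in.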

Named-entity recognition (NER), sometimes also referred to as entity chunking, extraction, or identification, is the task of identifying and categorizing key information (entities) in text. An entity can be any word or sequence of words that consistently refers to the same thing. Each detected entity is classified into a predetermined category. For example, an NER machine learning (ML) model might detect the word "Sisasoft" in a text and classify it as a "Company".
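To make the input and output of the task concrete, here is a minimal sketch that swaps the ML model for a plain dictionary lookup (a gazetteer). The "Ankara" entry, the sample sentence, and the function name `find_entities` are assumptions made for illustration. A trained model can generalize to names it has never seen, which this lookup cannot.

```python
import re

# Toy gazetteer; the "Ankara" entry is an illustrative assumption.
GAZETTEER = {
    "Sisasoft": "Company",
    "Ankara": "Location",
}


def find_entities(text: str) -> list[tuple[str, str, int, int]]:
    """Return (surface form, category, start, end) for each gazetteer hit."""
    entities = []
    for match in re.finditer(r"\w+", text):
        category = GAZETTEER.get(match.group(0))
        if category is not None:
            entities.append((match.group(0), category, match.start(), match.end()))
    return entities


# Made-up sentence, used only to exercise the lookup.
print(find_entities("Sisasoft opened an office in Ankara."))
```

The output has the same shape a real NER system typically returns: the matched text, its category, and its character offsets in the input.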

NER is a form of natural language processing (NLP), a subfield of artificial intelligence. NLP is about computers processing and analyzing natural language, i.e. any language that has evolved naturally, as opposed to artificial languages such as programming languages.

In Python, spaCy is an NLP library that includes an NER component. Its pretrained English models come with a set of default categories, such as PERSON, ORG, GPE, DATE, MONEY and PERCENT.

These default categories cover basic proper nouns, along with simple numerical expressions (percentages, times, dates). Extending the categories is only possible by adding newly labeled words and sentences and retraining the model.

The data labeling format differs slightly from conventional labeling. Every token of every sentence is labeled, which distinguishes named entities from everything else. In addition, the tags indicate whether a token is the beginning, the continuation, or the last word of a multi-word name, or does not belong to any category.

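There are different data labeling formats, such as Raw, IOB, IOB2 and BILOU. The symbols mean: B (beginning of an entity), I (inside an entity), L (last token of an entity), O (outside any entity) and U (unit, a single-token entity).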

Let's take a look at the example tagged with these tag types.

In the example tagged sentence, the person name, date, and location each receive their own tags, while tokens that do not belong to any category are labeled "O".
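The difference between the IOB2 and BILOU schemes is easiest to see on the same sentence tagged both ways. The sketch below is a minimal illustration under stated assumptions: the sentence is made up, the label names `PER`, `LOC` and `DATE` are arbitrary, and entities are given as hypothetical (start, end, label) token spans with an exclusive end index.

```python
def encode_tags(tokens, spans, scheme="IOB2"):
    """Turn (start, end_exclusive, label) token spans into per-token tags."""
    tags = ["O"] * len(tokens)  # tokens outside any entity stay "O"
    for start, end, label in spans:
        if scheme == "BILOU" and end - start == 1:
            tags[start] = f"U-{label}"  # single-token entity
            continue
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens
        if scheme == "BILOU":
            tags[end - 1] = f"L-{label}"  # last token of a multi-token entity
    return tags


# Made-up example sentence, already split into tokens.
tokens = ["Toprak", "Yilmaz", "went", "to", "Ankara", "on", "Monday", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC"), (6, 7, "DATE")]

for scheme in ("IOB2", "BILOU"):
    print(scheme, list(zip(tokens, encode_tags(tokens, spans, scheme))))
```

BILOU spends two extra symbols to mark entity ends and single-token entities explicitly, while IOB2 keeps the tag set smaller.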

After training on large datasets labeled in this way, a model can extract the entities belonging to each category from a given sentence.
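A trained tagger usually predicts one tag per token, so a final step has to group those tags back into entities. The following sketch shows one way to do that for IOB2 tags. The function name `decode_iob2`, the made-up sentence, and the label names are assumptions for illustration, and the handling of inconsistent tags is a design choice rather than a standard.

```python
def decode_iob2(tokens, tags):
    """Collect (entity text, label) pairs from a sequence of IOB2 tags."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" or an inconsistent I- tag closes any open entity;
            # a stray I- token itself is ignored.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Made-up sentence with hypothetical predicted tags.
tokens = ["Toprak", "Yilmaz", "went", "to", "Ankara", "on", "Monday", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O", "B-DATE", "O"]
print(decode_iob2(tokens, tags))
```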

The number of categories can be increased by adding new tags. The labels may differ according to the subject or field of study, for example recognizing series names, plant names, or the products of a particular company.

If you think your business or project could benefit from NER, it is fairly easy to get started. There are a number of excellent open source libraries that can get you going, including NLTK, spaCy, and Stanford NER. Each has its own pros and cons, and we will examine them in more detail soon. However, before you can use one of these libraries to build a model, you will need to create a dataset with corresponding labels to train the model.

State-of-the-Art Techniques for Named-Entity Recognition

When the best solutions for named-entity recognition are examined, we see that they no longer use classical word embeddings such as skip-gram and GloVe, but newer contextual embedding techniques such as Flair, BERT, and ELMo. BLSTM+CRF-based techniques are beginning to be replaced by other architectures, but those newer solutions require a lot of computational power. BLSTM+CRF-based solutions therefore still seem the more appropriate choice in terms of the trade-off between computational cost and accuracy.

So where can NER be used?

Classification of news content

Expressions such as person, place, and organization names are automatically extracted from news articles, making it easier to find the news about a certain region or person. Knowing the relevant tags for each article supports automatic categorization into defined hierarchies as well as content discovery.
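Once entities have been extracted, finding the news about a region or person reduces to a lookup. The sketch below assumes the NER step has already run and builds a simple inverted index from entities to article ids. The article ids, entity lists, labels, and the name `build_entity_index` are all hypothetical.

```python
from collections import defaultdict


def build_entity_index(articles):
    """Map each (label, entity text) pair to the sorted ids of articles mentioning it."""
    index = defaultdict(set)
    for article_id, entities in articles.items():
        for text, label in entities:
            index[(label, text)].add(article_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hypothetical NER output: article id -> list of (entity text, label).
articles = {
    "a1": [("Ankara", "LOC"), ("Sisasoft", "ORG")],
    "a2": [("Ankara", "LOC")],
    "a3": [("Izmir", "LOC")],
}

index = build_entity_index(articles)
print(index[("LOC", "Ankara")])  # ids of the articles that mention Ankara
```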

Customer service

For example, the Samsung Note 7 was a product that became known for its battery problems. If the information that the Note 7 is a phone and that the battery is a phone part is extracted automatically, complaints can be forwarded directly to the relevant unit of the company that makes the product. This considerably shortens the time it takes for a problem to reach that unit. If the analysis is run on social media such as Twitter, it can automatically determine which kind of problem affects which product in which location. The company can then use this data to decide where to invest.

Machine translation

One of the issues to consider when translating between natural languages is that items such as proper names should remain unchanged by the translation system.

For example, in the sentence "Toprak did not come to school today", Toprak is a proper noun. If the translation system treats it as a common noun (toprak means "soil" in Turkish), the translation will be incorrect. For this reason, extracting ENAMEX entities is important for translation systems.
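One common way to keep names intact is to replace them with placeholders before translation and restore them afterwards. The sketch below illustrates the idea on the Turkish source sentence. Everything in it is an assumption for illustration: `toy_translate` is a word-by-word stand-in for a real translation system (so the output word order is not fluent English), the placeholder format is arbitrary, and the entity positions are assumed to come from an NER step.

```python
def mask_entities(tokens, entity_indices):
    """Replace tokens at the given indices with placeholders; return tokens and a mapping."""
    masked, mapping = [], {}
    for i, token in enumerate(tokens):
        if i in entity_indices:
            placeholder = f"__ENT{len(mapping)}__"
            mapping[placeholder] = token
            masked.append(placeholder)
        else:
            masked.append(token)
    return masked, mapping


def unmask_entities(tokens, mapping):
    """Put the original entity text back in place of each placeholder."""
    return [mapping.get(token, token) for token in tokens]


def toy_translate(tokens):
    """Hypothetical stand-in for an MT system: a tiny word-by-word lexicon."""
    lexicon = {"toprak": "soil", "bugün": "today", "okula": "to school", "gelmedi": "did not come"}
    return [lexicon.get(token.lower(), token) for token in tokens]


source = ["Toprak", "bugün", "okula", "gelmedi"]

# Without NER, the name is translated as a common noun.
print(toy_translate(source))

# With the name masked (index 0, assumed to come from NER), it survives translation.
masked, mapping = mask_entities(source, {0})
print(unmask_entities(toy_translate(masked), mapping))
```

A real translation system may alter or drop placeholders, so production pipelines need a more robust mechanism than this sketch shows.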

Sentiment analysis

Expressions such as person, place, and organization names generally do not affect whether a review is positive or negative. Therefore, removing entity names before classification can slightly improve performance.

Search and recommendation engines

NER improves the speed and relevance of search results and recommendations by summarizing descriptive text, reviews, and discussions. Booking.com is a notable success story in this field.

Content classification

By identifying the topics and themes of blog posts and news articles, NER makes it easier to surface content and to gain insight into trends.

Health care

NER helps improve patient care standards and reduce workloads by extracting key information from laboratory reports. Roche does this with pathology and radiology reports.

Academia

NER helps students and researchers find relevant material faster by summarizing articles and archival materials and highlighting key terms, issues, and themes. Europeana, the EU's digital platform for cultural heritage, uses NER to make historical newspapers searchable.

In this article, we have covered what named-entity recognition is, how it works, its main purpose, and where it can be used. Because NER can detect and separate named expressions of different kinds in sentences, it can serve different purposes in different projects. Sometimes, for example when examining Twitter data, simply using it to check whether certain special words occur is enough to solve the problem.

