Have you ever wondered how AI personal assistants like Siri or Cortana work? How your spell checker detects mistakes that you yourself would not have spotted? How your search engine manages to guess the words you are about to type from just the first few letters?
What is NLP?
NLP, short for Natural Language Processing, is a discipline that focuses on the understanding, manipulation and generation of natural language by machines. NLP thus sits right at the interface between computer science and linguistics: it is about giving machines the ability to interact directly with humans through natural language.
What problems does NLP address?
NLP is a fairly generic term that covers a very wide range of applications. Here are the most popular applications:
Machine translation
The development of machine translation algorithms has truly revolutionized the way texts are translated today. Applications, such as Google Translate, are able to translate entire texts without any human intervention.
Because natural language is inherently ambiguous and variable, these applications do not rely on word-for-word replacement, but require true text analysis and modeling, known as Statistical Machine Translation.
Sentiment analysis
Also known as “Opinion Mining”, sentiment analysis involves identifying subjective information in a text to extract the author’s opinion.
For example, when a brand launches a new product, it can use the comments collected on social networks to identify the overall positive or negative sentiment shared by customers.
In general, sentiment analysis is a way to measure the level of customer satisfaction with the products or services provided by a company or organization. It can even be much more effective than traditional methods such as surveys.
Indeed, while people are often reluctant to spend time answering long questionnaires, a growing share of consumers now freely share their opinions on social networks. Thus, searching for negative texts and identifying the main complaints makes it possible to improve products, adapt advertising and reduce customer dissatisfaction.
Marketing
Marketers also use NLP to find people who are likely to make a purchase.
They rely on the behavior of Internet users on websites and social networks, and on their search engine queries. This type of analysis allows Google to generate significant profit by showing the right advertisement to the right people: each time a visitor clicks on an ad, the advertiser can pay up to 50 dollars!
More generally, NLP methods can be used to build a rich and comprehensive picture of a company’s existing market, customers, issues, competition, and growth potential for new products and services.
Raw data sources for this analysis include sales logs, surveys and social media…
Chatbots
NLP methods are at the heart of how today’s chatbots work. While these systems are not perfect, they can now easily handle standard tasks such as informing customers about products or services, answering their questions, etc. They are used across multiple channels, including the Internet, applications and messaging platforms. The opening of the Facebook Messenger platform to chatbots in 2016 contributed to their development.
Other application areas:
- Text classification: this consists of assigning a set of predefined categories to a given text. Text classifiers can be used to organize, structure and categorize a set of texts.
- Character recognition: optical character recognition makes it possible to extract key information from receipts, invoices, checks, legal billing documents, etc.
- Automatic correction: Most text editors today have a spell checker that checks the text for spelling errors.
- Automatic summarization: NLP methods are also used to produce short, precise and fluid summaries of a longer text document.
What are the main methods used in NLP?
Broadly speaking, we can distinguish two aspects that are essential to any NLP problem:
- The “linguistic” part, which consists of preprocessing and transforming the input text into a usable dataset.
- The “machine learning” or “data science” part, which consists of applying Machine Learning or Deep Learning models to this dataset.
In the following, we will discuss these two aspects, briefly describing the main methods and highlighting the main challenges. We will use a classical example: spam detection.
The pre-processing phase: from text to data
Let’s assume that you want to be able to determine whether an e-mail is spam or not, based on its content alone. To do this, it is essential to transform the raw data (the text of the email) into usable data.
The main steps, illustrated by the sketch that follows this list, include:
- Cleaning: depending on the source of the data, this phase consists of tasks such as removing URLs, emoji, etc.
- Data normalization:
- Tokenization, or breaking up the text into multiple pieces called tokens. Example: “You will find attached the document in question” becomes “You”, “will find”, “attached”, “the document”, “in question”.
- Stemming: the same word can appear in different forms depending on gender (masculine, feminine), number (singular, plural), person (me, you, them…), etc. Stemming refers to the crude heuristic process of cutting off the ends of words in order to keep only the root.
- Lemmatization: this consists of performing the same task, but using a vocabulary and a fine-grained analysis of word construction. Lemmatization removes only the inflectional endings and thus isolates the canonical form of the word, known as the lemma.
- Other operations: deleting numbers, punctuation, symbols and stopwords, and converting the text to lowercase.
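As an illustration, here is a minimal preprocessing sketch using the NLTK library (assumptions: nltk is installed and the ‘punkt’, ‘stopwords’ and ‘wordnet’ resources have been downloaded; the example email is invented):

```python
# Minimal preprocessing sketch with NLTK (assumes nltk is installed and the
# 'punkt', 'stopwords' and 'wordnet' resources were fetched via nltk.download).
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)       # cleaning: remove URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # lowercase; drop digits, punctuation, symbols
    tokens = nltk.word_tokenize(text)               # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    # lemmatization (swap in stemmer.stem(t) for crude stemming instead)
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("You will find attached the documents in question: http://example.com"))
# e.g. ['find', 'attached', 'document', 'question']
```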
In order to apply Machine Learning methods to natural language problems, it is necessary to transform textual data into numerical data.
There are several approaches, but the main one remains Term Frequency. This method consists of counting, for each text, the number of occurrences of the tokens present in the corpus. Each text is then represented by a vector of occurrences, a representation generally referred to as a Bag-of-Words.
Nevertheless, this approach has a major drawback: some words are naturally used far more often than others, which can lead the model to erroneous results.
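Keeping that caveat in mind, here is a minimal Bag-of-Words sketch using scikit-learn’s CountVectorizer (an assumption: scikit-learn is installed; the two toy texts are invented):

```python
# Bag-of-Words sketch: each text becomes a vector of token occurrence counts.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "congratulations you won a free prize",
    "please find attached the document in question",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)       # one occurrence vector per text
print(vectorizer.get_feature_names_out())  # the vocabulary of the corpus
print(X.toarray())                         # the occurrence vectors
```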
Term Frequency-Inverse Document Frequency (TF-IDF) addresses this drawback: this method also counts, for each text, the occurrences of the tokens present in the corpus, but each count is weighted down according to how many documents of the corpus contain that token.
For the term x present in the document y, we can define its weight by the following relation:

w(x, y) = tf(x, y) × log(N / df(x))

where:
- tf(x, y) is the frequency of term x in document y;
- df(x) is the number of documents containing x;
- N is the total number of documents.
Thus, this approach represents each text by a vector of weights rather than of raw occurrence counts.
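As a sketch, the same toy corpus can be re-weighted with scikit-learn’s TfidfVectorizer (assuming, as above, that scikit-learn is installed; note that scikit-learn’s exact formula adds smoothing and normalization on top of the relation given above):

```python
# TF-IDF sketch: occurrence counts are down-weighted for tokens that
# appear in many documents of the corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "congratulations you won a free prize",
    "please find attached the document in question",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # one weight vector per text
print(X.toarray().round(2))
```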
The efficiency of these methods differs according to the application case. However, they have two main limitations:
- The richer the vocabulary of the corpus, the larger the vectors, which can be a problem for the learning models used in the next step.
- Counting word occurrences ignores the order of words, and thus the meaning of the sentences.
Another approach remedies both problems: Word Embedding. It consists of building fixed-size vectors that take into account the context in which words appear.
Thus, two words appearing in similar contexts will have closer vectors (in terms of vector distance). This makes it possible to capture semantic, syntactic and thematic similarities between words.
A more detailed description of this method will be given in the next section.
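In the meantime, here is a minimal Word Embedding sketch using gensim’s Word2Vec (assumptions: gensim is installed; the tiny corpus below is invented, so the learned similarities are only illustrative):

```python
# Word2Vec sketch: words that occur in similar contexts get nearby vectors.
from gensim.models import Word2Vec

sentences = [
    ["the", "offer", "is", "limited"],
    ["the", "promotion", "is", "limited"],
    ["please", "find", "the", "document", "attached"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)
# on this toy corpus, "promotion" should tend to rank near the top
print(model.wv.most_similar("offer", topn=2))
```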
The learning phase: from data to model
Overall, we can distinguish three main NLP approaches: rule-based methods, classical Machine Learning models and Deep Learning models.
- Rule-based methods: Rule-based methods are largely based on the development of domain-specific rules (e.g. regular expressions). They can be used to solve simple problems such as extracting structured data from unstructured data (e.g. web pages).
In the case of spam detection, this could consist of treating as spam any email that contains buzzwords such as “promotion”, “limited offer”, etc. (see the sketch after this list).
However, these simple methods are quickly overwhelmed by the complexity of natural language and often prove ineffective.
- Classical Machine Learning Models: Classical machine learning approaches can be used to solve more difficult problems. Unlike methods based on predefined rules, they learn from the language data itself. They exploit data obtained from raw texts preprocessed via one of the methods described above, and can also use features such as sentence length, the occurrence of specific words, etc. They generally rely on a statistical model such as Naive Bayes or Logistic Regression.
- Deep Learning Models: The use of deep learning models for NLP problems is currently the subject of much research.
These models generalize even better than classical learning approaches because they require a less sophisticated text preprocessing phase: neural layers can be seen as automatic feature extractors.
This makes it possible to build end-to-end models with little data preprocessing. Beyond simplifying feature engineering, the learning capacity of Deep Learning algorithms is generally greater than that of classical Machine Learning, which yields better scores on complex NLP tasks such as translation.
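To make the contrast concrete, here is a sketch of the spam-detection example in two of these styles: a rule-based keyword filter and a classical Machine Learning pipeline (assumptions: scikit-learn is installed; the buzzword list and the four-email toy dataset are invented for illustration):

```python
# Spam detection two ways: hand-written rules vs. a learned model.
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1) Rule-based: flag emails containing buzzwords such as "promotion".
SPAM_PATTERN = re.compile(r"\b(promotion|limited offer|free prize)\b", re.IGNORECASE)

def rule_based_is_spam(email):
    return SPAM_PATTERN.search(email) is not None

# 2) Classical Machine Learning: TF-IDF features + Naive Bayes.
emails = [
    "limited offer claim your free prize now",       # spam
    "exclusive promotion win money today",           # spam
    "meeting moved to tomorrow agenda attached",     # not spam
    "please find attached the document in question", # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(rule_based_is_spam("Don't miss this limited offer!"))  # True
print(model.predict(["win a free trip today"]))  # likely [1] on this toy data
```

Note how the learned model can flag an email that matches none of the hand-written rules, simply because its vocabulary resembles the spam it was trained on.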
What are the possibilities and challenges of NLP?
The rules that govern the transformation of natural language text into information are not easy for computers to grasp. Extracting the intended message requires understanding both the words themselves and how the concepts they express are related.
Ambiguity
In natural language, words are unique but can have different meanings depending on the context, resulting in lexical, syntactic and semantic ambiguity. To solve this problem, NLP proposes several methods, such as context evaluation. However, understanding the semantic meaning of words in a sentence is still a work in progress.
Synonymy
Another key phenomenon in natural language is that we can express the same idea with different terms that also depend on the specific context.
For example, the terms “big” and “wide” may be synonymous when describing an object or a building, but they are not interchangeable in all contexts: “big” can also mean older, as in “big sister”.
Coreference
Coreference tasks involve finding all expressions that refer to the same entity. This is an important step for many high-level NLP tasks that involve whole-text understanding, such as document summarization, question answering, and information extraction. This problem has seen a revival with the introduction of state-of-the-art Deep Learning techniques.
Writing style
Depending on the author’s personality, intentions and emotions, the same idea can be expressed in different ways. Some authors do not hesitate to use irony or sarcasm, and thus convey a meaning opposite to the literal one.
So while humans can master a language with relative ease, it is the ambiguity and imprecision of natural language that make NLP so difficult for machines to implement.