Steps of NLP
NLP enables computers to understand natural language much as humans do. To do so, however, it must go through a series of steps that take raw text or voice data and break it down into a form that applications dealing with natural language can use.
Just as humans break sentences down into phrases or words to understand what they are reading, computers also need a process for breaking raw data down so they can make sense of it.
In this section, we'll go through the steps taken in NLP that break down text or voice data into the desired format. This is then used as input for different programs and applications. (Dr Monica, 2020; Lutkevich, B., 2021)
Sentence Segmentation
Sentence segmentation determines how a text is divided into sentences for further processing. (Palmer, D.D., 2000)
For example, the original text could be a paragraph, "Hello, world. This page is about sentence segmentation. It is not always easy to determine the end of a sentence."
After the implementation of sentence segmentation, the computer will split the paragraph into sentences.
"Hello, world."
"This page is about sentence segmentation."
"It is not always easy to determine the end of a sentence."
Tokenisation
Tokenisation is a way of separating text into smaller units called tokens.
It is very similar to sentence segmentation, but this process further breaks down sentences into words and punctuation. (Dr Monica, 2020)
For example, the original text could be a sentence like, "How are you?"
After implementing tokenisation, the sentence splits into individual words and punctuation.
"How"
"are"
"you"
"?"
Part-of-Speech (POS) Tagging
POS tagging labels each word in a sentence with its appropriate part of speech.
Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories. POS tagging is a useful step as it helps to simplify many different problems in NLP. (Dr Monica, 2020)
For example, the original text could be a sentence like, "You just gave me a scare."
After implementing POS tagging, each word in the sentence is labelled with its part of speech and, where relevant, its sub-category.
"You": personal pronoun
"just": adverb
"gave": verb, past tense
"me": personal pronoun
"a": determiner
"scare": noun, singular
Identifying Stopwords
These are words that add little value to the meaning of the document.
Stopwords are common words within sentences that add little value and so are eliminated before analysis. Removing stopwords means less data for the computer to process, which can translate into faster processing times and improved accuracy. Stopword removal is carried out after tokenisation. (Dr Monica, 2020; Roldós, I., 2021)
For example, the original sentence could be, "Hey, Amazon, my package never arrived."
After the implementation of tokenisation and the removal of stopwords, the sentence will look like this, "Amazon package never arrived". You can see that removing the words 'Hey' and 'my' makes the sentence simpler.
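The Python sketch below shows stopword removal after tokenisation, using NLTK's built-in English stopword list. Which words are removed depends entirely on the list used: NLTK's default list removes 'my' but not 'Hey', so reproducing the exact example above would require a custom list.

```python
# A minimal stopword-removal sketch using NLTK (assumes: pip install nltk).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)  # built-in stopword lists

stop_words = set(nltk.corpus.stopwords.words("english"))

sentence = "Hey, Amazon, my package never arrived."
tokens = nltk.word_tokenize(sentence)

# Keep alphabetic tokens that are not in the stopword list.
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)  # ['Hey', 'Amazon', 'package', 'never', 'arrived']
```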
Lemmatisation
Lemmatisation is the process of reducing words to their base or dictionary form, known as the lemma, so that different inflected forms of the same word, such as present, past and progressive forms of a verb, can be grouped and analysed together.
For example, under lemmatisation,
jump = jump
jumps = jump
jumping = jump
jumped = jump
(Roldós, I., 2021)
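A minimal Python sketch of lemmatisation, assuming NLTK's WordNet lemmatizer, is shown below. Passing pos="v" tells the lemmatizer to treat each word as a verb so that all the inflected forms map back to the lemma 'jump'.

```python
# A minimal lemmatisation sketch using NLTK's WordNet lemmatizer
# (assumes: pip install nltk).
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["jump", "jumps", "jumping", "jumped"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))

# Output:
# jump -> jump
# jumps -> jump
# jumping -> jump
# jumped -> jump
```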
Future of NLP
Interest in NLP is growing rapidly. The reasons behind this growth are the rise of chatbots, the drive to uncover customer insights and the shift of messaging technology from manual to automated. Many other tasks that involve language or speech at some point also need to be automated.
The future of NLP is mostly concerned with evolving from human-computer interaction to human-computer conversation. Interaction alone can rely on a single medium of communication, either verbal or nonverbal. Conversation, however, requires both mediums, verbal and nonverbal, used together. (Kaur, J., 2022)
Biometrics
Seeing and recognising expressions is key to distinguishing between different sentiments and emotions.
Nonverbal communication includes body language, touch, gestures, and facial expressions. To bring nonverbal communication into play, biometrics such as facial recognition, fingerprint scanners, and retina scanners are needed. Just as different words make up a whole sentence, different microexpressions reveal the feelings in a conversation. If these signals can be coupled with natural language processing units, the integration could unlock a whole new level of interaction, resulting in machines being able to converse with humans.
(Kaur, J., 2022)
Robotics
Every soul needs a body to express itself.
"In the same manner, there is a need for a physical unit to convey the NLP advancement in a proper and commercial environment. As the growth of NLP and biometrics is gaining pace and accuracy, these technologies can give a whole new level to the research of humanoid robots. This means they can express themselves through movement, postures, and expressions."
(Kaur, J., 2022)
Historical timeline of NLP:
1940s
Machine Translation
NLP originated from the idea of machine translation (MT), which came into existence during the Second World War.
The aim was to convert one human language into another, for example translating English to Russian and vice versa, as these were the primary languages of interest at the time. (Kaur, J., 2022)
1960s
The Decline of NLP
In 1957, Noam Chomsky revolutionised previous linguistic concepts, concluding that for a computer to understand a language, the sentence structure would have to be changed. In 1958, John McCarthy released the programming language LISP (List Processing), a computer language still in use today.
In 1964, ELIZA, a "typewritten" comment and response process, was designed to imitate a psychiatrist using reflection techniques. It did this by rearranging sentences and following relatively simple grammar rules, but without any real understanding on the computer's part.
Also, in 1964, the US National Research Council (NRC) created the Automated Language Processing Advisory Committee (ALPAC) to evaluate the progress of NLP research.
In 1966, the NRC and ALPAC initiated the first AI and NLP stoppage because, after 12 years of research and $20 million, machine translation was still more expensive than manual human translation. In addition, no computer had come anywhere near being able to carry out a basic conversation. (Foote, K. D., 2019)
1980s
Re-emergence with AI
The stoppage of AI research between the 1960s and the 1980s "initiated a new phase of fresh ideas, with earlier concepts of machine translation abandoned and new ideas promoting new research, including expert systems. The 1980s initiated a fundamental reorientation, with simple approximations replacing deep analysis, and the evaluation process became more rigorous." It was a time when computational grammar became a very active field of research linked with the science of reasoning for meaning and considering the user's beliefs and intentions.
LUNAR, developed in 1978 by W.A. Woods, could analyse, compare and evaluate the chemical data on lunar rock and soil composition that was accumulating as a result of the Apollo moon missions. (Kaur, J., 2022; Foote, K. D., 2019)
1990s
Increase in NLP Uptake
In the 1990s, the popularity of statistical models for NLP analyses rose dramatically. Pure statistical NLP methods "have become remarkably valuable in keeping pace with the tremendous flow of online text." In addition, grammars, tools and practical resources related to NLP and MT became available, along with parsers.
(Kaur, J., 2022; Foote, K. D., 2019)
Present
Latest NLP Trends
Nowadays, everybody wants a machine to talk, and the only way a computer can speak is through NLP. Take the example of Alexa, a conversational product by Amazon.
A query is passed to it by voice, and it replies through the same medium. A query can ask for information, run a search, play songs or even book a cab.
Alexa is only one example. NLP can be applied to the healthcare sector, among many others. Furthermore, it is used for cognitive analytics, sentiment analysis, spam detection, and recruitment.