In our previous post, we explained how NLP tools work and why syntactic and semantic analysis are involved throughout the whole process. Here we go into the details of the main NLP tasks and techniques used in syntactic and semantic analysis.
Machines cannot decipher human language without the help of syntactic and semantic analysis, whose tasks include breaking language down into something a machine can read.
Syntactic analysis, or parsing for short, works out the grammatical relationships between words, often drawn as a diagram called a parse tree, while semantic analysis identifies the meaning behind those words. Below are some of the most common tasks in both syntactic and semantic analysis.
1. Tokenization
Tokenization is the process of simplifying a text by breaking it down into tokens: units that are considered semantically useful. Depending on the scale, it can split a whole text into sentences (sentence tokenization) or a sentence into words (word tokenization).
Sample: “Saya merasa sangat puas dengan pelayanan yang diberikan oleh hotel ini.” (“I am very satisfied with the service provided by this hotel.”)
Tokens: “Saya” – “merasa” – “sangat” – “puas” – “dengan” – “pelayanan” – “yang” – “diberikan” – “oleh” – “hotel” – “ini”
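The two kinds of tokenization described above can be sketched with regular expressions. This is a minimal illustration, not a production tokenizer: the function names `sentence_tokenize` and `word_tokenize` are our own, and the splitting rules are simplifying assumptions (abbreviations such as "dr." would confuse the sentence splitter, for instance).

```python
import re

def sentence_tokenize(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Keep runs of word characters, allowing internal hyphens
    # (e.g. "baik-baik"), and drop punctuation.
    return re.findall(r"\w+(?:-\w+)*", sentence)

sample = "Saya merasa sangat puas dengan pelayanan yang diberikan oleh hotel ini."
tokens = word_tokenize(sample)
print(" - ".join(tokens))
```

Real-world tokenizers handle many more cases (URLs, emoji, abbreviations, slang), which is why an established library is usually preferable to hand-written rules.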
2. Part-of-speech tagging (PoS tagging)
Part-of-speech tagging (PoS tagging) determines the part-of-speech category of each token within a text, tagging it with a label such as verb, adverb, noun, pronoun, or preposition. Knowing the grammatical role of each word helps a machine work out how the words relate to one another and, in turn, what the sentence means.
Sample: “Saya merasa sangat puas dengan pelayanan yang diberikan oleh hotel ini.”
Tags: Saya [pronoun] merasa [verb] sangat [adverb] puas [adjective] dengan [preposition] pelayanan [noun] yang [conjunction] diberikan [verb] oleh [preposition] hotel [noun] ini [pronoun]
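The simplest possible tagger just looks each token up in a dictionary. The sketch below assumes a tiny hand-written lexicon (`LEXICON`, a hypothetical name) that covers only the sample sentence. Real taggers are trained on annotated corpora and use the surrounding context to resolve ambiguous words, which a lookup table cannot do.

```python
# Hypothetical hand-written lexicon covering only the sample sentence.
# "yang" introduces a relative clause; it is labeled "conjunction" here.
LEXICON = {
    "saya": "pronoun",
    "merasa": "verb",
    "sangat": "adverb",
    "puas": "adjective",
    "dengan": "preposition",
    "pelayanan": "noun",
    "yang": "conjunction",
    "diberikan": "verb",
    "oleh": "preposition",
    "hotel": "noun",
    "ini": "pronoun",
}

def pos_tag(tokens, lexicon=LEXICON, default="unknown"):
    # Pair each token with its tag; words missing from the lexicon
    # fall back to the default label.
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]

tokens = "Saya merasa sangat puas dengan pelayanan yang diberikan oleh hotel ini".split()
tagged = pos_tag(tokens)
print(" ".join(f"{word} [{tag}]" for word, tag in tagged))
```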
3. Lemmatization and stemming
For machines to process our complex language, words first need to be reduced from the forms we speak or write to a normalized form. NLP tools use lemmatization to transform words back to their lemma: the form in which they appear in the dictionary.
Sample: “memberikan” = beri, “pencarian” = cari, “pepohonan” = pohon
Stemming, on the other hand, trims affixes off words to reach a root form. The result is less accurate and may not always be semantically correct, but stemming is often preferred over lemmatization when faster results and lower complexity matter more.
Sample: “kebersamaan, bersama, menyamai, disamakan” = sama
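To make the contrast concrete, here is a rough sketch of both approaches. The lemmatizer is a plain dictionary lookup over the three sample words, and the stemmer strips a handful of Indonesian affixes by rule. The affix lists, the `MIN_STEM` threshold, and the function names are illustrative assumptions chosen to fit the samples, not a complete description of Indonesian morphology.

```python
# Lemmatization as a dictionary lookup (toy dictionary, sample words only).
LEMMAS = {"memberikan": "beri", "pencarian": "cari", "pepohonan": "pohon"}

def lemmatize(word):
    return LEMMAS.get(word.lower(), word.lower())

# Stemming as rule-based affix stripping (deliberately incomplete rules).
SUFFIXES = ("kan", "an", "i")
# (prefix, replacement): "meny-" swallows an initial "s", so restore it.
PREFIXES = (("meny", "s"), ("di", ""), ("ke", ""), ("ber", ""))
MIN_STEM = 3  # never cut a word below three characters

def naive_stem(word):
    word = word.lower()
    # Strip at most one suffix, trying the longest first.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[: -len(suffix)]
            break
    # Strip up to two stacked prefixes (e.g. "ke-" then "ber-").
    for _ in range(2):
        for prefix, replacement in PREFIXES:
            rest = replacement + word[len(prefix):]
            if word.startswith(prefix) and len(rest) >= MIN_STEM:
                word = rest
                break
    return word

for w in ["kebersamaan", "bersama", "menyamai", "disamakan"]:
    print(w, "->", naive_stem(w))   # each prints "sama"

print(naive_stem("makan"))  # prints "mak": the rules wrongly cut a root word
```

The last line shows the trade-off: the rules are fast and need no dictionary, but they happily strip "-an" from "makan" (to eat), which is already a root word.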
4. Stopword removal
Stop words are high-frequency words that add little to no semantic value to a sentence, such as which, for, to, is, at, and on. Removing them from the text before processing it is crucial if you want a noise-free result, especially when you are handling large sets of data, such as social media comments or customer feedback, that need to be categorized by topic.
Sample: “Selamat pagi. Saya mengalami kendala saat sedang melakukan pemesanan tiket.” (“Good morning. I am having trouble while booking a ticket.”)
Stopwords: selamat, pagi, saya, mengalami, saat, sedang, melakukan
Result: kendala pemesanan tiket = main topic
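In code, stop word removal is a simple filter over the tokens. The sketch below uses the seven words from the example as its stop word list. That list (`STOPWORDS`) is an assumption for illustration only; in practice you would load a much larger, curated list for Bahasa Indonesia.

```python
import re

# Illustrative stop word list taken from the example above.
STOPWORDS = {"selamat", "pagi", "saya", "mengalami", "saat", "sedang", "melakukan"}

def remove_stopwords(text, stopwords=STOPWORDS):
    # Lowercase, split into word tokens, then keep only non-stop-words.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]

sample = "Selamat pagi. Saya mengalami kendala saat sedang melakukan pemesanan tiket."
print(" ".join(remove_stopwords(sample)))  # kendala pemesanan tiket
```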
5. Text classification
Text classification is one of the most basic NLP tasks: it helps machines understand unstructured data by assigning appropriate categories or tags to a text based on its content. It is widely used in sentiment analysis, one of the services that Sonar offers.
Sample:
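“Pelayanan CS di sini buruk sekali!” (“The customer service here is really bad!”) = negative
“Kecepatan internetnya sepertinya baik-baik saja, sih.” (“The internet speed seems to be fine, I guess.”) = neutral
“Saya sangat menyukai parfum ini.” (“I really like this perfume.”) = positive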
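As a rough illustration of the idea, the sketch below classifies sentiment by counting words from two small hand-picked lists. This lexicon-based approach is a deliberately simple stand-in and is not how Sonar's sentiment analysis works. The word lists (`POSITIVE`, `NEGATIVE`) and the function name are assumptions, and production systems typically rely on models trained on labeled data instead.

```python
import re

# Hypothetical sentiment lexicons; real ones are far larger.
POSITIVE = {"menyukai", "puas", "bagus", "senang"}
NEGATIVE = {"buruk", "kecewa", "lambat", "jelek"}

def classify_sentiment(text):
    # Tokenize, then score +1 per positive word and -1 per negative word.
    tokens = re.findall(r"\w+(?:-\w+)*", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

samples = [
    "Pelayanan CS di sini buruk sekali!",
    "Kecepatan internetnya sepertinya baik-baik saja, sih.",
    "Saya sangat menyukai parfum ini.",
]
for text in samples:
    print(classify_sentiment(text), "<-", text)
```

A word-counting approach like this cannot handle negation ("tidak buruk", not bad) or sarcasm, which is one reason trained classifiers are preferred in practice.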
With NLP at its core, Sonar can perform a more comprehensive and accurate sentiment analysis in Bahasa Indonesia, with up to 83% accuracy, providing you with actionable insights that can help your company detect upcoming crises and make data-driven decisions.
Contact us for a personalized demo.