Sentiment Analysis in Five Steps using AutoML

Vijay Anandan
Published in Analytics Vidhya · 7 min read · Apr 11, 2021


Automated Machine Learning (AutoML)

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model. AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. The high degree of automation in AutoML allows non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. AutoML has been used to compare the relative importance of each factor in a prediction model.

Over the years, researchers have developed ways of automating these processes with tools like AutoKeras and auto-sklearn, and even no-code platforms like WEKA and H2O.

One such area of automation is the field of natural language processing. With the development of AutoNLP, it is now easy to build a model like sentiment analysis with a few basic lines of code and get good output. Automation like this lets everyone be a part of the machine learning community, rather than restricting machine learning to developers and engineers.

In this article, we will learn about classical NLP and AutoNLP, and then use AutoNLP to implement a sentiment analysis model on a Twitter dataset.

Please skip ahead to Part 2 if you are already familiar with the basics of NLP.

Part 1

What is Classical NLP?

Generally, before feeding text data into our model, we perform a set of preprocessing steps to clean the data and then convert it into a numerical format. Let’s discuss some of these steps before automating them.

Some common text preprocessing techniques:

  • Tokenization
  • Lemmatization
  • Removing Punctuations and Stopwords
  • Part of Speech Tagging
  • Entity Recognition

Analyzing, interpreting and building models out of unstructured textual data is a significant part of a Data Scientist’s job. Many deep learning applications, particularly in Natural Language Processing (NLP), revolve around the manipulation of textual data.

For example: you are a business firm that has launched a new website or mobile application-based service. You now have data containing customer reviews for your product, and you wish to run a consumer sentiment analysis on these reviews using machine learning algorithms.

However, to make this data structured and computationally viable for algorithms, we need to preprocess it.

So, here we are going to learn about various fundamental preprocessing techniques for our textual data. We are going to work with the spaCy library in Python, which is among the numerous libraries (like NLTK, Gensim, etc.) used for textual transformations.

Tokenization:

Tokenization is the process of chopping text down into pieces, called tokens. Whitespace is discarded, while punctuation marks (“,” , “.” , “!” , etc.) become tokens of their own. spaCy allows us to tokenize our text in two ways -

  • Word Tokenization
  • Sentence Tokenization

Below is a sample code for word tokenizing our text.
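A minimal sketch, assuming spaCy v3 and a sample sentence of my own:

```python
import spacy

# A blank English pipeline is enough for plain word tokenization
nlp = spacy.blank("en")

doc = nlp("We aren't going to the park today!")

# Each element of the Doc is a Token object
tokens = [token.text for token in doc]
print(tokens)
# ['We', 'are', "n't", 'going', 'to', 'the', 'park', 'today', '!']
```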


Notice how we get a list of tokens containing both words and punctuation. Also notice that the algorithm splits contractions like “aren’t” into two distinct tokens: “are” and “n’t”.

We can obtain sentence tokenization (splitting text into sentences) as well if we wish to. However, we have to add a sentence-boundary component to our “nlp” pipeline so that it can tell where one sentence ends and the next begins.

Below is a sample code for sentence tokenizing our text.
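A minimal sketch, again assuming spaCy v3, where the built-in “sentencizer” component supplies the sentence boundaries:

```python
import spacy

nlp = spacy.blank("en")
# A blank pipeline cannot find sentence boundaries on its own, so we add
# spaCy's rule-based "sentencizer" component
nlp.add_pipe("sentencizer")

doc = nlp("NLP is fascinating. spaCy makes it accessible. Let's tokenize!")

for sent in doc.sents:
    print(sent.text)
# NLP is fascinating.
# spaCy makes it accessible.
# Let's tokenize!
```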


Tokenization is a fundamental step in preprocessing, which helps in distinguishing the word or sentence boundaries and transforms our text for further preprocessing techniques like Lemmatization, etc.

Lemmatization:

Lemmatization is an essential step in text preprocessing for NLP. It deals with the structural or morphological analysis of words and the breakdown of words into their base forms, or “lemmas”.
For example, the words walk, walking, walks, and walked all point to a common activity: walk. Since they have different spellings, treating them as unrelated words would only confuse our algorithms, so all of them are mapped to a single lemma.

We can use spaCy’s built-in methods for lemmatizing our text.
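A sketch using the small English model (the sample sentence is my own):

```python
import spacy

# The small English model bundles the components lemmatization needs;
# install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("He was running while they walked.")

for token in doc:
    print(token.text, "->", token.lemma_)
# running -> run, was -> be, walked -> walk, ...
```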


As you can clearly see, words such as “running” are reduced to their lemma, “run”. Lemmatization shrinks the vocabulary our models have to deal with, which makes training simpler and faster.

Removing Stop Words

While working with textual data, we encounter many tokens which aren’t of much use for our analysis, as they do not add any meaning or relevance to our data. These can be pronouns (like I, you, etc.) or words like are, is, was, etc.

These words are called stop words. We can use the built-in STOP_WORDS list from spaCy for filtering our text.

spaCy’s built-in stop words list can be viewed as follows -
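For example:

```python
from spacy.lang.en.stop_words import STOP_WORDS

print(len(STOP_WORDS))          # a few hundred entries
print(sorted(STOP_WORDS)[:10])  # peek at the first few, alphabetically
```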


Now we can use the “is_stop” attribute of the token object for filtering out the stop words from our sample text.
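A short sketch, with a sample sentence of my own:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I walked down the road with my dog")

print([token.text for token in doc])
# ['I', 'walked', 'down', 'the', 'road', 'with', 'my', 'dog']

# Keep only tokens that are not in spaCy's stop word list
filtered = [token.text for token in doc if not token.is_stop]
print(filtered)
# ['walked', 'road', 'dog']
```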


You can compare the above two lists and notice that words such as down, the, with, and my have been removed. Similarly, we can also remove punctuation from our text using the “isalpha” method of string objects inside a list comprehension.
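For instance, continuing with a made-up sentence:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Wow! The movie, honestly, was great.")

# str.isalpha() is False for punctuation, so the comprehension drops it
no_punct = [token.text for token in doc if token.text.isalpha()]
print(no_punct)
# ['Wow', 'The', 'movie', 'honestly', 'was', 'great']
```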


You can observe the differences between the two lists. Indeed, spaCy makes our work pretty easy.

Part-of-Speech Tagging (POS)

A word’s part of speech defines the function of that word in the document. For example, in the text “Robin is an astute programmer”, “Robin” is a proper noun while “astute” is an adjective.

We will use spaCy’s en_core_web_sm model for POS tagging.
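A sketch using the example sentence from above:

```python
import spacy

# en_core_web_sm includes the tagger that assigns part-of-speech labels
nlp = spacy.load("en_core_web_sm")

doc = nlp("Robin is an astute programmer")

for token in doc:
    print(token.text, token.pos_, spacy.explain(token.pos_))
# Robin PROPN proper noun
# astute ADJ adjective
# ...
```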


Thanks for bearing with me for so long through the text processing.
Believe me, the next part is simpler and more powerful: all the previously discussed steps are taken care of automatically, and we just need to pass in the raw text data. Let’s deep dive into AutoNLP :)

Part 2: What is AutoNLP?

Using the concepts of AutoML, AutoNLP automates text preprocessing steps such as stemming, tokenization, and lemmatization. It also handles text processing and picks the best model for the given dataset. AutoNLP was developed under AutoViML, which stands for Automatic Variant Interpretable Machine Learning. Some of the features of AutoNLP are:

  1. Data cleansing: The entire dataset can be sent to the model without performing any process like vectorization. It even fills in missing data and cleans the data automatically.
  2. Uses the Featuretools library for feature extraction: Featuretools is another great library that helps with feature engineering and extraction in an easy way.
  3. Model performance and graphs are produced automatically: Just by setting the verbose flag, the model graphs and performance metrics can be shown.
  4. Feature reduction is automatic: With huge datasets, it becomes tough to select the best features and perform EDA, but AutoNLP takes care of this.

Implementation of AutoNLP

Let us now implement a sentiment analysis model for a Twitter dataset using AutoNLP. Without AutoNLP, the data would first have to be vectorized, stemmed, and lemmatized before training. With AutoNLP, all we have to do is five simple steps.

Installing the AutoNLP:

To install it, we can use a simple pip command. Since AutoNLP belongs to the autoviml package, that is what we install.
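In a notebook environment such as Colab:

```python
# AutoNLP ships inside the autoviml package on PyPI
!pip install autoviml
```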

Data: https://raw.githubusercontent.com/Vijayvj1/twitter-sentiment-analysis-1/master/train.csv

After installing it, we can go ahead and download the dataset for the project. I will be using the Twitter dataset above, since we are doing sentiment analysis. Once done, let us mount the drive and look at our dataset.
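If you are working in Google Colab, the standard Drive mount looks like this (only needed if you keep the CSV on Drive; the raw GitHub URL above can also be read directly):

```python
from google.colab import drive

# Mount Google Drive so files under it appear at /content/drive
drive.mount('/content/drive')
```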

Now we are all set. Let’s activate automatic mode. Yes, it really is only five steps.. keep counting.

Step 1:

Data Import
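A sketch of the import, reading the CSV straight from the raw GitHub URL (the variable names are my own):

```python
import pandas as pd

url = "https://raw.githubusercontent.com/Vijayvj1/twitter-sentiment-analysis-1/master/train.csv"
train_df = pd.read_csv(url)

print(train_df.shape)
print(train_df.head())
```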

Step 2:

Import lib and define train-test split
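Roughly, assuming the train_df from Step 1 (the 80/20 ratio and random seed are my choices, not prescribed):

```python
from sklearn.model_selection import train_test_split
from autoviml.Auto_NLP import Auto_NLP  # AutoNLP lives inside autoviml

# Hold out 20% of the rows for testing
train, test = train_test_split(train_df, test_size=0.2, random_state=42)
print(train.shape, test.shape)
```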

Step 3:

Define Feature & Target
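Assuming the usual schema of this Twitter sentiment dataset, with the tweet text as the feature and the sentiment class as the target:

```python
# "tweet" holds the raw text (feature); "label" holds the sentiment (target)
nlp_column = "tweet"
target = "label"
```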

Step 4:

Auto NLP
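A hedged sketch of the call; the argument names and the four return values follow the autoviml README and may differ between versions, so check help(Auto_NLP) for the version you installed:

```python
train_nlp, test_nlp, nlp_pipeline, predictions = Auto_NLP(
    nlp_column,                      # name of the text column
    train, test,                     # DataFrames from Step 2
    target,                          # name of the label column
    score_type="balanced_accuracy",  # metric used to rank candidate models
    modeltype="Classification",
    top_num_features=50,             # cap on engineered text features
    verbose=2,                       # 2 = print the training graphs
    build_model=True,                # also fit and return a model pipeline
)
```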

Now you will see a series of graphs, and within a few minutes you will see the trained output.

These graphs visualize the data during the training process, showing the word count, word density, and character count. As the training progresses these graphs change, and here is the final output. All the punctuation and tags are removed automatically, and their density is also shown in the graph.

Auto NLP Results

Do we need to think about n-grams, hyperparameter optimization, feature size, algorithm selection, etc.? No, AutoNLP will take care of them.., have a cup of coffee until it finishes!

NLP Pipeline logs

Step 5:

Finally, you can make predictions.. and that’s it!
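Assuming build_model=True returned a fitted sklearn-style pipeline in Step 4 (an assumption based on the autoviml README), scoring the held-out tweets is one line:

```python
# Predict sentiment for the raw text column of the held-out set
preds = nlp_pipeline.predict(test[nlp_column])
print(preds[:10])
```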

Conclusion

We saw how AutoNLP makes model building for sentiment analysis very easy. Not only that, it also automatically preprocessed the data and produced visualizations for different aspects of the dataset. Automation like this makes it easy to build even complex models.

If you want to learn more about machine learning, continue reading my blogs:

  1. Audio Data Augmentation: https://vijay-anandan.medium.com/lets-augment-a-audio-data-part-1-5ab5f6a87bae
  2. Sentiment Analysis On Voice Data: https://vijay-anandan.medium.com/sentiment-analysis-of-voice-data-64533a952617
  3. Resample an Extremely Imbalanced Dataset: https://vijay-anandan.medium.com/how-to-resample-an-imbalanced-datasets-8e413dabbc21
  4. How Do Neural Networks Really Work in Deep Learning: https://medium.com/analytics-vidhya/how-do-neural-networks-really-work-in-the-deep-learning-72f0e8c4c419

LinkedIn: https://www.linkedin.com/in/vijay-anadan/
