Emotion Analysis using Machine Learning Part-1

June 16, 2020

Hi everyone 👋,
Sorry for the delay. During this time I completed a TCS ION internship, and I would like to share my insights from that project. The aim of the project was to identify the emotions Happy, Sad, Fear, Anger, Bad, Surprise, and Disgust in English text, from single sentences to long paragraphs. It should also handle cases involving:

    • sarcastic meaning
    • double negation
    • abbreviations

The proposed solution is a machine-learning-based system, i.e. a text classifier that learns from past observations.

Steps involved:

Data Collection: Gathering relevant datasets for the machine learning system.

Preprocessing: The text is cleaned before it is used to train the classification model: HTML tags, URLs, and special characters are removed, repeated letters within a word are collapsed, abbreviations are expanded, and stop words are removed.

Feature extraction: Each text is transformed into a numerical vector. For example, in a bag-of-words representation each element of the vector is the frequency of one word from a predefined dictionary.

Test-Train Split: Splitting the data into Testing and Training datasets.

Training: The algorithm is fed training data consisting of feature sets and their labels. From this data it learns the associations between pieces of text and the output label expected for each input.

Testing: Once trained on enough samples, the model can make predictions on unseen text with similar feature sets, and those predictions are checked against the known labels of the test data.
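
To make the bag-of-words idea from the feature extraction step concrete, here is a minimal sketch in plain Python. The helper names `build_vocabulary` and `bag_of_words` and the two sample sentences are made up for illustration; the project itself used sklearn's CountVectorizer for this.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Return how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Hypothetical sample sentences, not taken from the project datasets.
texts = ["i am happy happy today", "i am sad today"]
vocabulary = build_vocabulary(texts)
vectors = [bag_of_words(text, vocabulary) for text in texts]

print(list(vocabulary))  # ['am', 'happy', 'i', 'sad', 'today']
print(vectors)           # [[1, 2, 1, 0, 1], [1, 0, 1, 1, 1]]
```

Every text becomes a vector of the same length, which is what lets a classifier compare them.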

Text Classification Algorithms used:

Machine Learning Models:
    • SGDClassifier
    • LogisticRegression
    • MultinomialNB
    • Pipeline of CountVectorizer, TfidfTransformer, MultinomialNB
Deep Learning Models:
    • CNN
    • Bidirectional LSTM

Metrics and Evaluation Methods used:

Accuracy: the percentage of texts that were predicted with the correct tag.

Precision: the percentage of examples the classifier got right out of the total number of examples it predicted for a given tag.

Recall: the percentage of examples the classifier correctly predicted for a given tag out of the total number of examples that actually have that tag.

F1 Score: the harmonic mean of precision and recall.

Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
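
The definitions above can be worked through by hand on a tiny example. This is a sketch in plain Python with made-up labels and predictions, not results from the project; in practice sklearn.metrics offers ready-made functions for all of these.

```python
def accuracy(y_true, y_pred):
    """Fraction of texts predicted with the correct tag."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one tag."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true tags, columns are predicted tags."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Made-up true tags and predictions, for illustration only.
y_true = ["happy", "happy", "sad", "sad", "fear", "fear"]
y_pred = ["happy", "sad", "sad", "sad", "fear", "happy"]

print(accuracy(y_true, y_pred))                   # 4 of 6 correct
print(per_label_scores(y_true, y_pred, "happy"))  # (0.5, 0.5, 0.5)
print(confusion_matrix(y_true, y_pred, ["happy", "sad", "fear"]))
# [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Reading the matrix row by row shows which emotions get confused with each other, which a single accuracy number hides.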

Solution Approach:
As stated above, the proposed solution is a machine learning system for text classification based on past observations. The model must be trained on a dataset, and since a model does not understand words, the text must first be converted into a format that can be fed to it. The steps involved in preparing the dataset are:

Data Collection and Conversion: The datasets come in different formats and use different labels, so they need to be converted into a common format. The emotions are classified into Happy, Sad, Anger, Disgust, Surprise, Fear, and Bad. Each dataset is converted from its original file format into CSV for easy loading, and its labels are mapped to the seven listed above. For further preprocessing, the datasets are then merged into a single CSV file based on their relevance. The datasets used are:

    • https://www.kaggle.com/shrivastava/isears-dataset
    • https://data.world/crowdflower/sentiment-analysis-in-text
    • https://www.kaggle.com/kutayeen/emotion-datasets-twitter

Preprocessing: This mainly consists of text cleaning:

    • remove URLs from the text
    • strip the HTML tags left over from scraping, using Beautiful Soup
    • convert all text to lowercase
    • convert emoji to their textual form using demojize
    • collapse letters that are repeated within a word
    • expand contractions using a lookup map
    • remove all Twitter handles using a Twitter tokenizer
    • remove stop words using nltk
    • apply the PorterStemmer to keep only the stem of each word
    • remove all special characters from the text
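
Most of these cleaning steps can be sketched with the Python standard library alone. The version below is a simplified stand-in, not the project code: it uses regular expressions instead of Beautiful Soup and a Twitter tokenizer, the `CONTRACTIONS` and `STOP_WORDS` tables are tiny hypothetical examples, and emoji conversion and stemming are left out because they need third-party packages.

```python
import html
import re

# Hypothetical, deliberately tiny lookup tables for illustration.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and", "i"}


def clean_text(text):
    """Apply a simplified version of the cleaning steps to one text."""
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove Twitter handles
    text = text.lower()
    # Expand contractions word by word using the lookup map.
    text = " ".join(CONTRACTIONS.get(word, word) for word in text.split())
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # "sooooo" -> "soo"
    text = re.sub(r"[^a-z\s]", " ", text)               # drop special characters
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(clean_text("<p>I'm sooooo happy!!! @friend see https://example.com</p>"))
# soo happy see
```

The stop-word table here deliberately keeps "not", since dropping negations would work against the goal of handling double negation.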

Modifying Data: The data is prepared differently for the deep learning and machine learning models. A deep learning model needs its input as a numerical array, so the text is tokenized with the Keras Tokenizer. Because all inputs must have the same length, the tokenized data is padded with Keras pad_sequences, using the maximum text length as the input length. For the TensorFlow model the labels must be numerical as well, so the emotion labels are encoded with LabelBinarizer from sklearn before being fed to the model. For the machine learning models, CountVectorizer is used to vectorize the input, and the labels are not encoded.
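
What tokenizing, padding, and label encoding do to the data can be shown with a small plain-Python stand-in. This is only a sketch of the idea, not the Keras or sklearn APIs: the function names are invented here, and padding at the end of each sequence is an assumption made for readability.

```python
def build_word_index(texts):
    """Give each word a positive integer id; 0 is reserved for padding."""
    word_index = {}
    for text in texts:
        for word in text.split():
            word_index.setdefault(word, len(word_index) + 1)
    return word_index


def texts_to_padded(texts, word_index, max_length):
    """Turn texts into equal-length lists of word ids, padded with zeros."""
    rows = []
    for text in texts:
        ids = [word_index[word] for word in text.split() if word in word_index]
        ids = ids[:max_length]
        rows.append(ids + [0] * (max_length - len(ids)))
    return rows


def one_hot_labels(labels):
    """Encode each label as a 0/1 vector over the sorted class names."""
    classes = sorted(set(labels))
    return classes, [[int(label == c) for c in classes] for label in labels]


# Hypothetical, already cleaned sample texts and labels.
texts = ["feel great today", "so scared", "feel so sad today"]
labels = ["happy", "fear", "sad"]

word_index = build_word_index(texts)
max_length = max(len(text.split()) for text in texts)
padded = texts_to_padded(texts, word_index, max_length)
classes, encoded = one_hot_labels(labels)

print(padded)   # [[1, 2, 3, 0], [4, 5, 0, 0], [1, 4, 6, 3]]
print(classes)  # ['fear', 'happy', 'sad']
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

After this step both the inputs and the labels are fixed-size numerical arrays, which is the form a neural network expects.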

Test-Train Split: To train and test the model, the data is split into training and test sets. Here sklearn's train_test_split is used to split the data into 80% train and 20% test (test_size=0.20, random_state=42).

Model fitting: The machine learning models used for classification are SGDClassifier, LogisticRegression, MultinomialNB, and a Pipeline of CountVectorizer, TfidfTransformer, and MultinomialNB. The deep learning models considered are a Bidirectional LSTM and a CNN.
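
As a rough illustration of the split and the Pipeline variant, here is a minimal sketch using sklearn. The ten sentences and their labels are invented toy data, far too small to train a useful model, and the step names "counts", "tfidf", and "classifier" are arbitrary; only the split parameters and the three pipeline components come from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data, for illustration only.
texts = [
    "what a wonderful happy day",
    "i feel so happy and cheerful",
    "this makes me really happy",
    "i am sad and lonely",
    "such a sad gloomy evening",
    "feeling sad and tearful",
    "i am scared of the dark",
    "that noise scared me badly",
    "so angry about the delay",
    "he was angry and shouting",
]
labels = ["Happy", "Happy", "Happy", "Sad", "Sad", "Sad",
          "Fear", "Fear", "Anger", "Anger"]

# 80% train, 20% test, with the parameters mentioned in the post.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42
)

# Word counts -> TF-IDF weights -> Naive Bayes classifier.
model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(list(zip(X_test, predictions)))
```

Wrapping the vectorizer and classifier in one Pipeline means raw text can be passed straight to `fit` and `predict`, so the same vocabulary is reused at prediction time.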

Evaluation metrics: Accuracy, Precision, Recall, and F1 score are used to measure model performance, and to select the best model or to combine two models for a better prediction.

Predict: To predict the emotion of an arbitrary text, these steps are followed:

    • the input text is preprocessed
    • the text is tokenized or vectorized, depending on the model
    • the model predicts the output
    • the inverse of the label encoder is used to get the emotion of the text
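
The four prediction steps chain together as in the sketch below. `predict_emotion` is a hypothetical helper, and the preprocessing, vectorizing, and scoring functions passed to it are trivial stand-ins so that the example runs without a trained model; in the project these would be the real cleaning code, the fitted tokenizer or vectorizer, and the trained classifier.

```python
def predict_emotion(text, preprocess, vectorize, predict_scores, classes):
    """Run one text through the four prediction steps."""
    cleaned = preprocess(text)          # 1. preprocess the input text
    features = vectorize(cleaned)       # 2. tokenize or vectorize it
    scores = predict_scores(features)   # 3. let the model score each class
    # 4. Inverse encoding: map the highest-scoring position back to its label.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best]


# Stand-ins for illustration only; none of these is a trained model.
classes = ["fear", "happy", "sad"]
emotion = predict_emotion(
    "I am SO happy",
    preprocess=str.lower,
    vectorize=lambda text: [int(word in text) for word in ("scared", "happy", "sad")],
    predict_scores=lambda features: features,
    classes=classes,
)
print(emotion)  # happy
```

The order of `classes` has to match the order used when the labels were encoded for training, otherwise the decoded emotion will be wrong.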