Emotion Analysis using Machine Learning Part-2
June 22, 2020
Emotion Analysis using Machine Learning
2nd week of blogging
The Machine Learning models used are:
SGDClassifier
- loss='hinge' as the hyperparameter
- The model was used because it converges faster than other gradient descent techniques (a minimal sketch is shown below)
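A minimal sketch of how this model might be set up, assuming scikit-learn with TF-IDF features; the placeholder texts, labels, and the vectorizer choice are assumptions, not taken from the linked notebook:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder data; the real texts/labels come from the preprocessing notebook
train_texts = ["I am so happy today", "That movie was terrifying"]
train_labels = ["Happy", "Fear"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# loss='hinge' trains a linear SVM with stochastic gradient descent
sgd = SGDClassifier(loss='hinge', random_state=42)
sgd.fit(X_train, train_labels)

predictions = sgd.predict(vectorizer.transform(["what a lovely surprise"]))
```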
Results:
Model Performance metrics:
------------------------------
Accuracy: 0.8486
Precision: 0.8429
Recall: 0.8486
F1 Score: 0.8411
Model Classification report:
------------------------------
precision recall f1-score support
Anger 0.72 0.61 0.66 1459
Bad 0.38 0.18 0.24 1810
Disgust 0.58 0.37 0.45 206
Fear 0.92 0.88 0.90 12716
Happy 0.71 0.90 0.80 7672
Sad 0.92 0.92 0.92 16902
Surprise 0.62 0.39 0.48 844
accuracy 0.85 41609
macro avg 0.69 0.61 0.64 41609
weighted avg 0.84 0.85 0.84 41609
LogisticRegression
- penalty='l2', C=1 as hyperparameters
- The model was used because it is the simplest of the models (a minimal sketch is shown below)
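A minimal sketch under the same assumptions as before (TF-IDF features, placeholder data); note that penalty='l2' and C=1 happen to be scikit-learn's defaults:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data; the real texts/labels come from the preprocessing notebook
train_texts = ["I am so happy today", "That movie was terrifying"]
train_labels = ["Happy", "Fear"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# penalty='l2', C=1 as in the post; max_iter raised to help convergence
logreg = LogisticRegression(penalty='l2', C=1, max_iter=1000)
logreg.fit(X_train, train_labels)
```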
Model Performance metrics:
------------------------------
Accuracy: 0.8488
Precision: 0.8477
Recall: 0.8488
F1 Score: 0.8469
Model Classification report:
------------------------------
precision recall f1-score support
Anger 0.75 0.62 0.68 1459
Bad 0.39 0.40 0.40 1810
Disgust 0.57 0.35 0.44 206
Fear 0.91 0.88 0.89 12716
Happy 0.78 0.85 0.81 7672
Sad 0.90 0.93 0.91 16902
Surprise 0.60 0.38 0.46 844
accuracy 0.85 41609
macro avg 0.70 0.63 0.66 41609
weighted avg 0.85 0.85 0.85 41609
SGDClassifier and LogisticRegression have accuracies of 0.8486 and 0.8488, respectively.
MultinomialNB:
- The Naive Bayes family of algorithms is popular for building text classification models (a minimal sketch is shown below)
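A minimal sketch, assuming raw count features via CountVectorizer (an assumption; MultinomialNB expects non-negative counts):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder data; the real texts/labels come from the preprocessing notebook
train_texts = ["I am so happy today", "That movie was terrifying"]
train_labels = ["Happy", "Fear"]

# MultinomialNB models word counts, so plain counts are a natural input
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

nb = MultinomialNB()
nb.fit(X_train, train_labels)
```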
Model Performance metrics:
------------------------------
Accuracy: 0.7892
Precision: 0.7725
Recall: 0.7892
F1 Score: 0.7636
Model Classification report:
------------------------------
precision recall f1-score support
Anger 0.80 0.24 0.37 1459
Bad 0.38 0.11 0.17 1810
Disgust 0.00 0.00 0.00 206
Fear 0.85 0.85 0.85 12716
Happy 0.79 0.74 0.76 7672
Sad 0.77 0.94 0.84 16902
Surprise 0.67 0.07 0.12 844
accuracy 0.79 41609
macro avg 0.61 0.42 0.44 41609
weighted avg 0.77 0.79 0.76 41609
Pipeline with CountVectorizer, TfidfTransformer, MultinomialNB:
- A pipeline often gives promising results over the initial model itself (a minimal sketch is shown below)
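A minimal sketch of the pipeline named above; the step order follows the heading, while the placeholder data is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Placeholder data; the real texts/labels come from the preprocessing notebook
train_texts = ["I am so happy today", "That movie was terrifying"]
train_labels = ["Happy", "Fear"]

# The pipeline chains counting, TF-IDF weighting, and the classifier,
# so raw text goes in and predictions come out
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)
predictions = text_clf.predict(["what a lovely surprise"])
```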
Model Performance metrics:
------------------------------
Accuracy: 0.7225
Precision: 0.7134
Recall: 0.7225
F1 Score: 0.6791
Model Classification report:
------------------------------
precision recall f1-score support
Anger 0.92 0.01 0.02 1459
Bad 0.00 0.00 0.00 1810
Disgust 0.00 0.00 0.00 206
Fear 0.85 0.78 0.81 12716
Happy 0.88 0.51 0.64 7672
Sad 0.64 0.96 0.77 16902
Surprise 0.00 0.00 0.00 844
accuracy 0.72 41609
macro avg 0.47 0.32 0.32 41609
weighted avg 0.71 0.72 0.68 41609
MultinomialNB and the Pipeline with CountVectorizer, TfidfTransformer, and MultinomialNB have accuracies of 0.7892 and 0.7225, respectively.
The Deep Learning models used are:
CNN model:
- Deep learning with CNNs is expected to work well for text classification
- The CNN was compiled using loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
Model:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 157, 500)          54695500
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 157, 128)          192128
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 157, 64)           24640
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 157, 32)           4128
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 157, 16)           1040
_________________________________________________________________
flatten_1 (Flatten)          (None, 2512)              0
_________________________________________________________________
dropout_1 (Dropout)          (None, 2512)              0
_________________________________________________________________
dense_1 (Dense)              (None, 100)               251300
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0
_________________________________________________________________
dense_2 (Dense)              (None, 7)                 707
=================================================================
Total params: 55,169,443
Trainable params: 55,169,443
Non-trainable params: 0
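A Keras sketch that reproduces this summary; the vocabulary size (109,391), kernel sizes, activations, and dropout rates are inferred from the parameter counts and output shapes, so they are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, Flatten,
                                     Dropout, Dense)

MAX_LEN = 157        # padded sequence length, from the summary
VOCAB_SIZE = 109391  # inferred: 54,695,500 embedding params / 500 dims

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, 500),
    Conv1D(128, 3, padding="same", activation="relu"),
    Conv1D(64, 3, padding="same", activation="relu"),
    Conv1D(32, 2, padding="same", activation="relu"),
    Conv1D(16, 2, padding="same", activation="relu"),
    Flatten(),                        # 157 * 16 = 2512 features
    Dropout(0.5),                     # rate is an assumption
    Dense(100, activation="relu"),
    Dropout(0.5),                     # rate is an assumption
    Dense(7, activation="softmax"),   # one unit per emotion class
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```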
Training this model produced Accuracy: 84.063542, Loss: 45.061589, which is slightly lower than the ML models above.
Bidirectional LSTM model:
- Removed the spatial dropout layer, as it decreased the accuracy of the model (inferred from training)
- The model was compiled using loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']
Model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, 157)]        0
__________________________________________________________________________________________________
embedding (Embedding)           (None, 157, 500)     54695500    input_1[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 157, 256)     644096      embedding[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 155, 64)      49216       bidirectional[0][0]
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 64)           0           conv1d[0][0]
__________________________________________________________________________________________________
global_max_pooling1d (GlobalMax (None, 64)           0           conv1d[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 128)          0           global_average_pooling1d[0][0]
                                                                 global_max_pooling1d[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 7)            903         concatenate[0][0]
==================================================================================================
Total params: 55,389,715
Trainable params: 55,389,715
Non-trainable params: 0
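A functional-API sketch matching this summary; the LSTM width (128 units per direction), the Conv1D kernel size, and the activations are inferred from the parameter counts and are assumptions:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Conv1D, GlobalAveragePooling1D,
                                     GlobalMaxPooling1D, Concatenate, Dense)

MAX_LEN = 157
VOCAB_SIZE = 109391  # inferred from the embedding parameter count

inputs = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, 500)(inputs)
x = Bidirectional(LSTM(128, return_sequences=True))(x)  # 2 x 128 = 256
x = Conv1D(64, 3, activation="relu")(x)   # 'valid' padding: 157 -> 155 steps
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = Concatenate()([avg_pool, max_pool])   # 64 + 64 = 128 features
outputs = Dense(7, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
```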
The Bidirectional LSTM has the highest accuracy of all the individual models: Accuracy: 86.488497, Loss: 36.384022.
I also observed that the LSTM model works better on Sad, Happy, and Fear, while the SGDClassifier works better on Anger, Bad, and Surprise, so I joined the two into a stacked model (see the sketch after the metrics below), which achieved:
- Accuracy: 0.883606912
- Precision: 0.8874248006
- Recall: 0.883606912
- F1 Score: 0.8844271224
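The exact stacking logic lives in the linked notebook; one simple way such a per-class combination could work is a selection rule like the sketch below (this rule and the function names are assumptions, not the notebook's code):

```python
import numpy as np

# Classes where the LSTM was observed to perform better
LSTM_CLASSES = {"Sad", "Happy", "Fear"}

def stacked_predict(lstm_preds, sgd_preds):
    """Trust the LSTM when it predicts one of its stronger classes,
    otherwise fall back to the SGDClassifier's prediction."""
    return np.array([
        lstm if lstm in LSTM_CLASSES else sgd
        for lstm, sgd in zip(lstm_preds, sgd_preds)
    ])

# Hypothetical usage, given label predictions from both models:
# final = stacked_predict(lstm_labels, sgd.predict(X_test))
```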
Preprocess:
https://colab.research.google.com/drive/1NZBf4iDojvel3dHeU9AYzWWt5XM8Eggr
MachineLearning:
https://colab.research.google.com/drive/1UI4LPO9BfrkJYXEVudsii7XRXeVJXvJJ#scrollTo=HP6QeD8vzjKc
Deep Learning:
https://colab.research.google.com/drive/1gauXKeBIN9Piln_2_OxmLsk5g8U9f1f-#scrollTo=TKyIf6PhJRfE
Stacked Model:
https://colab.research.google.com/drive/191BIy6tGDRfIi6eKNFhLM2IN7quwFdyr#scrollTo=ROunJKQlz_Lr