How to handle imbalanced text data in Natural Language Processing

14 minute read

Hello everyone today we are going to discuss an interesting problem with imbalanced text data.

We are interested to classify messages that are spam.

We are going to test different classification models such as Multinomial Naive Bayes, Random Forest (RF), XGBoost, Support Vector Machines (SVM), and different sampling and under-sampling methods such as Random over-sampling, SMOTE, and derivations, Random under-sampling techniques.

This example is helpful to show how to deal with this type of problem for your custom NLP problems.

What is Imbalanced Data?

Imbalanced data typically refers to a problem with classification problems where the classes are not represented equally.

There different performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
  • Precision: A measure of a classifiers exactness.
  • Recall: A measure of a classifiers completeness
  • F1 Score (or F-score): A weighted average of precision and recall.

  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.

Step 1. Creation of the environment

I will create an environment called nlp, because I am interested in the Natural Language Processing (NLP) .

NLP enables computers to understand natural language as humans do. Whether the language is spoken or written, natural language processing uses artificial intelligence to take real-world input, process it, and make sense of it in a way a computer can understand.

First you need to install anaconda at this link

then after is installed type in your terminal

conda create -n nlp python==3.8

then

conda activate nlp

then in your terminal type the following commands:

conda install ipykernel

then we install

python -m ipykernel install --user --name nlp --display-name "Python (NLP)"

then we install the following libraries

pip install  matplotlib seaborn nltk sklearn plotly wordcloud xgboost imbalanced-learn notebook jupyter

then we type

jupyter notebook

image-20221022180032811

and we choose our Python (NLP) notebook.

We are interested to work in the problem of Spam Detection by using NLP.

In this project we are going to use a dataset contains messages that are spam and not.

You can download the dataset here.

Step 2. Load libraries

#importing necessary libraries
# General

import numpy as np
import pandas as pd
import re
import os
import pickle

# EDA

import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px
from wordcloud import WordCloud
from collections import Counter

# NLP 

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize 

# ML
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

Step 3. Data analysis

Here we need to analyze our dataset, first we read the dataset and then count how many messages are spam.

df=pd.read_csv('spam.csv',encoding = 'latin-1')
df.head()
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 ham Go until jurong point, crazy.. Available only ... NaN NaN NaN
1 ham Ok lar... Joking wif u oni... NaN NaN NaN
2 spam Free entry in 2 a wkly comp to win FA Cup fina... NaN NaN NaN
3 ham U dun say so early hor... U c already then say... NaN NaN NaN
4 ham Nah I don't think he goes to usf, he lives aro... NaN NaN NaN
df['v1'].value_counts(dropna=False).plot(kind='bar')

png

Moreover we can get more information of our dataset by typing

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB

or simply

df.describe()
v1 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
count 5572 5572 50 12 6
unique 2 5169 43 10 5
top ham Sorry, I'll call later bt not his girlfrnd... G o o d n i g h t . . .@" MK17 92H. 450Ppw 16" GNT:-)"
freq 4825 30 3 2 2
#checking for null values
df.isnull().sum()
v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64
#dropping the column with more null values
df.drop(['Unnamed: 2','Unnamed: 3','Unnamed: 4'],axis=1,inplace=True)
df.isnull().sum()
v1    0
v2    0
dtype: int64
#renaming the columns
df=df.rename({'v1':'label','v2':'text'},axis=1)
df.head()
label text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

Calculating the length of each data sample. We will create a new length column that will show the length of each data sample. This new column will help us with preprocessing the data samples.

df['length'] = df['text'].apply(lambda x: len(x))
df.head()
label text length
0 ham Go until jurong point, crazy.. Available only ... 111
1 ham Ok lar... Joking wif u oni... 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155
3 ham U dun say so early hor... U c already then say... 49
4 ham Nah I don't think he goes to usf, he lives aro... 61

Distribution based on length of words

#analyzing text
df['length']=df['text'].apply(lambda x: len(x))
sns.distplot(df['length'], kde=True)

png

sns.boxplot(y='length', x='label', data=df)

png

df.head()
label text length
0 ham Go until jurong point, crazy.. Available only ... 111
1 ham Ok lar... Joking wif u oni... 29
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155
3 ham U dun say so early hor... U c already then say... 49
4 ham Nah I don't think he goes to usf, he lives aro... 61

Step 4. Data Preprocessing

NLTK has smaller sub-libraries that perform specific text cleaning tasks. These smaller libraries also have methods for text cleaning.

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re
nltk.download('punkt')

Downloading stop words. We download the English stop words so that the model can identify the stop words in the texts and remove them.

nltk.download('stopwords')
nltk.download('omw-1.4')

We will use it to remove all the stop words in the dataset. We will then create custom functions for text cleaning and pass in the imported methods as parameters. To implement the custom functions, we will require Python regular expression (RegEx) module.

def convert_to_lower(text):
    return text.lower()
def remove_numbers(text):
    number_pattern = r'\d+'
    without_number = re.sub(pattern=number_pattern, repl=" ", string=text)
    return without_number

import string
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_stopwords(text):
    removed = []
    stop_words = list(stopwords.words("english"))
    tokens = word_tokenize(text)
    for i in range(len(tokens)):
        if tokens[i] not in stop_words:
            removed.append(tokens[i])
    return " ".join(removed)
def remove_extra_white_spaces(text):
    single_char_pattern = r'\s+[a-zA-Z]\s+'
    without_sc = re.sub(pattern=single_char_pattern, repl=" ", string=text)
    return without_sc
def lemmatizing(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    for i in range(len(tokens)):
        lemma_word = lemmatizer.lemmatize(tokens[i])
        tokens[i] = lemma_word
    return " ".join(tokens)


df['text_clean'] = df['text'].apply(lambda x: convert_to_lower(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_numbers(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_punctuation(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_stopwords(x))
df['text_clean'] = df['text_clean'].apply(lambda x: remove_extra_white_spaces(x))
df['text_clean'] = df['text_clean'].apply(lambda x: lemmatizing(x))
df['length_after_cleaning'] = df['text_clean'].apply(lambda x: len(x))

df.head()
label text length text_clean length_after_cleaning
0 ham Go until jurong point, crazy.. Available only ... 111 go jurong point crazy available bugis great wo... 78
1 ham Ok lar... Joking wif u oni... 29 ok lar joking wif oni 21
2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155 free entry wkly comp win fa cup final tkts st ... 101
3 ham U dun say so early hor... U c already then say... 49 u dun say early hor c already say 33
4 ham Nah I don't think he goes to usf, he lives aro... 61 nah dont think go usf life around though 40
#transform the values of the output variable into 0 and 1
#We can create the label map as follows:
label_map = {
    'ham': 0,
    'spam': 1,
}
df['label'] = df['label'].map(label_map)

Step 5. Implementing text vectorization

It converts the raw text into a format the NLP model can understand and use. Vectorization will create a numerical representation of the text strings called a sparse matrix or word vectors. The model works with numbers and not raw text. We will use TfidfVectorizer to create the sparse matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tf_wb= TfidfVectorizer()
X_tf = tf_wb.fit_transform(df['text_clean'])
#Converting the sparse matrix into an array
#We then apply the toarray function to convert the sparse matrix into an array.
X_tf = X_tf.toarray()
X_tf.shape
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(X_tf, df['label'].values, test_size=0.3)
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(X_train_tf, y_train_tf)
NB_pred= NB.predict(X_test_tf)
print(NB_pred)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test_tf, NB_pred))
[0 0 0 ... 0 0 0]
0.8767942583732058
from imblearn.over_sampling import RandomOverSampler

We will use the RandomOverSampler function to balance the classes.

RandomOverSampler will increase the data samples in the minority class (spam). It makes the minority class have the same data samples as the majority class (ham). The function synthesizes new dummy data samples in the minority class to enable class balancing.

X_train, X_test, y_train, y_test = train_test_split(df['text_clean'], df['label'].values, test_size=0.30)

After splitting the dataset, we will use the Counter module to check the number of data samples in the majority and minority classes. We import the module as follows:

from collections import Counter
Counter(y_train)
Counter({0: 3391, 1: 509})
#Vectorizing the X_train
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

The fit function will fit the initialized TfidfVectorizer function to the X_train. We then use the transform function to apply the vectorization method.

X_train_tf = vectorizer.transform(X_train)

We finally convert the transformed text (sparse matrix) to an array as follows:

X_train_tf = X_train_tf.toarray()
#Vectorizing the X_test
X_test_tf = vectorizer.transform(X_test)
X_test_tf = X_test_tf.toarray()

Let’s now apply the RandomOverSampler function.

Applying RandomOverSampler function We use the following code:

ROS = RandomOverSampler(sampling_strategy=1)

The function uses the sampling_strategy parameter to balance the class. We set the parameter’s value to 1 to ensure the dataset classes have 1:1 data samples. We then apply the function to the training set. It will generate the new data samples to ensure both classes are balance

X_train_ros, y_train_ros = ROS.fit_resample(X_train_tf, y_train)

Let’s recheck the number of data samples in the majority and minority classes:

Counter(y_train_ros)
Counter({0: 3391, 1: 3391})

Using the balanced dataset to build the same model

nb = GaussianNB()
nb.fit(X_train_ros, y_train_ros)
y_preds = nb.predict(X_test_tf)
print(y_preds)
[0 0 1 ... 0 0 0]
print(accuracy_score(y_test, y_preds))
0.8851674641148325

Actually NLP is one of the most common areas in which resampling of data is needed as there are many text classification tasks dealing with imbalanced problem (think of spam filtering, insulting comment detection, article classification, etc.

Step 6. Oversampling

Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The component works by generating new instances from existing minority cases that you supply as input.

X_tf.shape
(5572, 7858)
from imblearn.over_sampling import SMOTE, KMeansSMOTE , ADASYN,SVMSMOTE,KMeansSMOTE,BorderlineSMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.metrics import classification_report_imbalanced
def run_model(X,y,model):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
    print(Counter(y_train))
    vectorizer_smote = TfidfVectorizer()
    vectorizer_smote.fit(X_train)
    #Vectorizing the X_train
    X_train_vec = vectorizer_smote.transform(X_train)
    X_train_vec = X_train_vec.toarray()
    #Vectorizing the X_test
    X_test_vec = vectorizer_smote.transform(X_test)
    X_test_vec = X_test_vec.toarray()
    # transform the dataset
    oversample = model
    X_train_over, y_train_over = oversample.fit_resample(X_train_vec, y_train)
    print(Counter(y_train_over))
    nb = GaussianNB()
    nb.fit(X_train_over, y_train_over)
    y_preds = nb.predict(X_test_vec)
    score = accuracy_score(y_test,y_preds)
    print("Accuracy: ",score)
    #target_names = ['class 0', 'class 1'] 
    #print(classification_report_imbalanced(y_test, y_preds,     target_names=target_names))
    return score
X=df['text_clean']
y=df['label'].values
#Object to over-sample the minority class(es) by picking samples at random with replacement. 
oversample= RandomOverSampler(sampling_strategy=1)
ov1=run_model(X,y,model=oversample)
Counter({0: 3353, 1: 547})
Counter({0: 3353, 1: 3353})
Accuracy:  0.8839712918660287
# over-sampling using SMOTE.
oversample=SMOTE(sampling_strategy=0.2)
ov2=run_model(X,y,model=oversample)
Counter({0: 3389, 1: 511})
Counter({0: 3389, 1: 677})
Accuracy:  0.882177033492823
#Oversample using Adaptive Synthetic (ADASYN) algorithm.
oversample = ADASYN()
ov3=run_model(X,y,model=oversample)
Counter({0: 3375, 1: 525})
Counter({1: 3399, 0: 3375})
Accuracy:  0.8708133971291866
#KMeans clustering before to over-sample using SMOTE.
oversample=KMeansSMOTE()
ov4=run_model(X,y,model=oversample)
Counter({0: 3391, 1: 509})
Counter({1: 3391, 0: 3391})
Accuracy:  0.8929425837320574
#Over-sampling using Borderline SMOTE
oversample=BorderlineSMOTE()
ov5=run_model(X,y,model=oversample)
Counter({0: 3376, 1: 524})
Counter({0: 3376, 1: 3376})
Accuracy:  0.8803827751196173
#Over-sampling using SMOTE and cleaning using ENN.
oversample=SMOTEENN()
ov6=run_model(X,y,model=oversample)
Counter({0: 3376, 1: 524})
Counter({0: 3361, 1: 3352})
Accuracy:  0.8738038277511961
#Over-sampling using SVM-SMOTE.
oversample=SVMSMOTE()
ov7=run_model(X,y,model=oversample)
Counter({0: 3391, 1: 509})
Counter({0: 3391, 1: 3391})
Accuracy:  0.882177033492823
over_models = pd.DataFrame({
    'Model':['RandomOverSampler',
             'SMOTE',
             'ADASYN',
             'KMeans-SMOTE',
             'Borderline-SMOTE',
             'SMOTE-ENN'
            ],
    'Accuracy_score' :[ov1 ,ov2, ov3, ov4,ov5,ov6
                      ]
})
sns.barplot(x='Accuracy_score', y='Model', data=over_models)
over_models.sort_values(by='Accuracy_score', ascending=False)
Model Accuracy_score
3 KMeans-SMOTE 0.892943
0 RandomOverSampler 0.883971
1 SMOTE 0.882177
4 Borderline-SMOTE 0.880383
5 SMOTE-ENN 0.873804
2 ADASYN 0.870813

png

  • Some researchers have investigated whether SMOTE is effective on high-dimensional or sparse data, such as data used in text classification or genomics datasets. This paper has a good summary of the effects and of the theoretical validity of applying SMOTE in such cases: Blagus and Lusa: SMOTE for high-dimensional class-imbalanced data.
  • If SMOTE is not effective other approaches that you might consider include:
    • Methods for oversampling the minority cases or undersampling the majority cases.
    • Ensemble techniques that help the learner directly by using clustering, bagging, or adaptive boosting.

Step 7. Downsampling

from imblearn.under_sampling  import RandomUnderSampler,AllKNN,NearMiss, TomekLinks
# random under-sampling
undersample=RandomUnderSampler()
un1=run_model(X,y,model=undersample)
Counter({0: 3393, 1: 507})
Counter({0: 507, 1: 507})
Accuracy:  0.8331339712918661
# Undersample based on the AllKNN method.
undersample=AllKNN()
un2=run_model(X,y,model=undersample)
Counter({0: 3373, 1: 527})
Counter({0: 3370, 1: 527})
Accuracy:  0.8779904306220095
# under-sampling based on NearMiss methods.
# NearMiss-1: Majority class examples with minimum average distance to three closest minority class examples.
undersample=NearMiss(version=1)
un3=run_model(X,y,model=undersample)
Counter({0: 3395, 1: 505})
Counter({0: 505, 1: 505})
Accuracy:  0.8462918660287081
#Under-sampling by removing Tomek's links.
undersample=TomekLinks()
un4=run_model(X,y,model=undersample)
Counter({0: 3385, 1: 515})
Counter({0: 3383, 1: 515})
Accuracy:  0.8845693779904307
under_models = pd.DataFrame({
    'Model':['RandomUnderSampler',
             'AllKNN',
             'NearMiss',
             'Tomek'
            ],
    'Accuracy_score' :[un1 ,un2, un3, un4,
                      ]
})
sns.barplot(x='Accuracy_score', y='Model', data=under_models)
under_models.sort_values(by='Accuracy_score', ascending=False)
Model Accuracy_score
3 Tomek 0.884569
1 AllKNN 0.877990
2 NearMiss 0.846292
0 RandomUnderSampler 0.833134

png

Step 8. Oversampling Pipeline

TF-IDF - normalizing and weighting with diminishing importance tokens that occur in the majority of documents.

1.TF(Term frequency)-Term frequency works by looking at the frequency of a particular term you are concerned with relative to the document. There are multiple measures, or ways, of defining frequency

2.IDF (inverse document frequency)-Inverse document frequency looks at how common (or uncommon) a word is amongst the corpus.

The output above shows the label column has the assigned integer values (0 and 1). The next step is to implement text vectorization.

We will split the vectorized dataset into two portions/sets. The first portion will be for model training and the second portion for model testing. We will use the train_test_split method to split the vectorized dataset.

#split the data into train and test sets
X=df['text_clean']
y=df['label']
#y=df['label']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42)
def model(model_name,X_train,y_train,X_test,y_test):
    pipeline=Pipeline([
    ('tfidf', TfidfVectorizer()),#transform the texts into the vectorized input variables X
    ('model', model_name),
    ])
    pipeline.fit(X_train,y_train)

    preds=pipeline.predict(X_test)

    print (classification_report(y_test,preds))
    print (confusion_matrix(y_test,preds))
    print('Accuracy:', pipeline.score(X_test, y_test)*100)
    print("Training Score:",pipeline.score(X_train,y_train)*100)
    from sklearn.metrics import accuracy_score
    score = accuracy_score(y_test,preds)
    return score

MultinomialNB

mnb=model(MultinomialNB(),X_train,y_train,X_test,y_test)
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.85       150

    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.97      0.97      0.96      1115

[[965   0]
 [ 38 112]]
Accuracy: 96.59192825112108
Training Score: 97.55440879515369

RF

rf=model(RandomForestClassifier(),X_train,y_train,X_test,y_test)
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       965
           1       1.00      0.83      0.91       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115

[[965   0]
 [ 25 125]]
Accuracy: 97.75784753363229
Training Score: 100.0

XGBoost

xgb=model(XGBClassifier(),X_train,y_train,X_test,y_test)
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       965
           1       0.97      0.82      0.89       150

    accuracy                           0.97      1115
   macro avg       0.97      0.91      0.94      1115
weighted avg       0.97      0.97      0.97      1115

[[961   4]
 [ 27 123]]
Accuracy: 97.21973094170404
Training Score: 98.99035225487997

SVM

svc=model(SVC(),X_train,y_train,X_test,y_test)
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       965
           1       0.98      0.83      0.90       150

    accuracy                           0.98      1115
   macro avg       0.98      0.92      0.94      1115
weighted avg       0.98      0.98      0.97      1115

[[963   2]
 [ 25 125]]
Accuracy: 97.57847533632287
Training Score: 99.73076060130133

Model Comparison

models = pd.DataFrame({
    'Model':['MNB','RF','XGBoost', 'SVM'],
    'Accuracy_score' :[mnb ,rf, xgb, svc]
})
sns.barplot(x='Accuracy_score', y='Model', data=models)
models.sort_values(by='Accuracy_score', ascending=False)
Model Accuracy_score
1 RF 0.977578
3 SVM 0.975785
2 XGBoost 0.972197
0 MNB 0.965919

png

Step 9 Final model

def run_model_sampling(X,y,mlmodel,sampling=None):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
    print(Counter(y_train))
    vectorizer_smote = TfidfVectorizer()
    vectorizer_smote.fit(X_train)
    #Vectorizing the X_train
    X_train_vec = vectorizer_smote.transform(X_train)
    X_train_vec = X_train_vec.toarray()
    #Vectorizing the X_test
    X_test_vec = vectorizer_smote.transform(X_test)
    X_test_vec = X_test_vec.toarray()
    # transform the dataset ( Sampling model)
    oversample = sampling
    if oversample: 
        X_train_over, y_train_over = oversample.fit_resample(X_train_vec, y_train)
        X_train_vec= X_train_over 
        y_train = y_train_over        
    print(Counter(y_train))
    # Machine Learning Model
    model=mlmodel
    model.fit(X_train_vec, y_train)
    y_preds = model.predict(X_test_vec)
    score = accuracy_score(y_test,y_preds)
    print("Accuracy: ",score)
    target_names = ['class 0', 'class 1'] 
    print(classification_report_imbalanced(y_test, y_preds, target_names=target_names))
    return score
X=df['text_clean']
y=df['label'].values
mlmodel=RandomForestClassifier()
sampling=KMeansSMOTE()
run_model_sampling(X,y,mlmodel,sampling)
Counter({0: 3364, 1: 536})
Counter({0: 3364, 1: 3364})
Accuracy:  0.9736842105263158
                   pre       rec       spe        f1       geo       iba       sup

    class 0       0.97      1.00      0.79      0.99      0.89      0.81      1461
    class 1       1.00      0.79      1.00      0.88      0.89      0.77       211

avg / total       0.97      0.97      0.82      0.97      0.89      0.80      1672

0.9736842105263158
X=df['text_clean']
y=df['label'].values
mlmodel=RandomForestClassifier()
run_model_sampling(X,y,mlmodel)
Counter({0: 3390, 1: 510})
Counter({0: 3390, 1: 510})
Accuracy:  0.9766746411483254
                   pre       rec       spe        f1       geo       iba       sup

    class 0       0.97      1.00      0.84      0.99      0.91      0.85      1435
    class 1       1.00      0.84      1.00      0.91      0.91      0.82       237

avg / total       0.98      0.98      0.86      0.98      0.91      0.85      1672


0.9766746411483254

As we see the oversampling/ downsampling in general, does not provide better accuracy for this current set of data. What we have learned is that is better to choose a better machine learning model than apply sampling methods. And also the cleaning of the data provides a big improvement in the NLP accuracy

Actually NLP is one of the most common areas in which resampling of data is needed as there are many text classification tasks dealing with imbalanced problem but SMOTE seem to be problematic here for some reasons: SMOTE works in feature space. It means that the output of SMOTE is not a synthetic data which is a real representative of a text inside its feature space. On one side SMOTE works with KNN and on the other hand, feature spaces for NLP problem are dramatically huge. KNN will easily fail in those huge dimensions.

Do not care about the real text representation of new synthetic samples. You need to balance the distribution for your classifier not for a reader of text data. In principle we can use SMOTE as traditional with some Dimensionality Reduction step. 1) Lets assume you want to make your data samples from minor class double using K-NN. Ignore the major class(es) and keep only minor class samples. 2) Ignore the major class. Get a length distribution of all documents in minor class so that we generate new samples according the the true document length (number of words/phrases).

Well I hope was helpful the discussion, you can download this notebook here

Congratulations! You have practiced how to classify messages by using different NLP methods.

Posted:

Leave a comment