Logistic Regression with Python and Spark

6 minute read

Customer churn with Logistic Regression

A marketing agency is concerned about the number of customers who stop buying its service, and it needs to understand who is leaving. Imagine that you are an analyst at this company and you have to identify the customers most at risk of churning, so that an account manager can be assigned to them.

The installation of Python and PySpark, together with an introduction to Logistic Regression, is given here.

Logistic Regression with Python

Let's first import the required libraries:

import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline 
import matplotlib.pyplot as plt

Understanding the Data

churn_df = pd.read_csv("customer_churn.csv")
churn_df.head(1)
Names Age Total_Purchase Account_Manager Years Num_Sites Onboard_date Location Company Churn
0 Cameron Williams 42.0 11066.8 0 7.22 8.0 2013-08-30 07:00:40 10265 Elizabeth Mission Barkerburgh, AK 89518 Harvey LLC 1

The data is saved as customer_churn.csv. Here are the fields and their definitions:

Names: Name of the latest contact at the company
Age: Customer age
Total_Purchase: Total ads purchased
Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
Years: Total years as a customer
Num_Sites: Number of websites that use the service
Onboard_date: Date that the latest contact was onboarded
Location: Client HQ address
Company: Name of the client company
Churn: Target label, 1 = the customer churned, 0 = the customer stayed

Let's select some features for modeling. We also change the target data type to integer, as this is required by the scikit-learn algorithm:

churn_df.columns
Index(['Names', 'Age', 'Total_Purchase', 'Account_Manager', 'Years',
       'Num_Sites', 'Onboard_date', 'Location', 'Company', 'Churn'],
      dtype='object')
inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites','Churn']
churn_df = churn_df[inputCols]
churn_df['Churn'] = churn_df['Churn'].astype('int')
churn_df.head()
Age Total_Purchase Account_Manager Years Num_Sites Churn
0 42.0 11066.80 0 7.22 8.0 1
1 41.0 11916.22 0 6.50 11.0 1
2 38.0 12884.75 0 6.67 12.0 1
3 42.0 8010.76 0 6.71 10.0 1
4 37.0 9191.58 0 5.56 9.0 1

Let's define X and y for our dataset:

X = np.asarray(churn_df[['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']])
X[0:5]
array([[4.200000e+01, 1.106680e+04, 0.000000e+00, 7.220000e+00,
        8.000000e+00],
       [4.100000e+01, 1.191622e+04, 0.000000e+00, 6.500000e+00,
        1.100000e+01],
       [3.800000e+01, 1.288475e+04, 0.000000e+00, 6.670000e+00,
        1.200000e+01],
       [4.200000e+01, 8.010760e+03, 0.000000e+00, 6.710000e+00,
        1.000000e+01],
       [3.700000e+01, 9.191580e+03, 0.000000e+00, 5.560000e+00,
        9.000000e+00]])
y = np.asarray(churn_df['Churn'])
y[0:5]
array([1, 1, 1, 1, 1])

We also standardize the dataset with StandardScaler, so that each feature has zero mean and unit variance:

from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[ 0.0299361 ,  0.41705373, -0.96290958,  1.52844634, -0.33323478],
       [-0.13335172,  0.76990459, -0.96290958,  0.96318219,  1.36758544],
       [-0.6232152 ,  1.172234  , -0.96290958,  1.09664734,  1.93452551],
       [ 0.0299361 , -0.85243173, -0.96290958,  1.1280509 ,  0.80064537],
       [-0.78650303, -0.36191661, -0.96290958,  0.22519844,  0.2337053 ]])
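As a quick optional check (a small sketch, not part of the original flow), we can verify that each standardized column now has roughly zero mean and unit standard deviation:

# Optional check: each column of X should now have ~0 mean and ~1 std
print(np.round(X.mean(axis=0), 3))
print(np.round(X.std(axis=0), 3))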

Train/Test dataset

We split our dataset into train and test set:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
Train set: (720, 5) (720,)
Test set: (180, 5) (180,)

Let's build our model using LogisticRegression from the scikit-learn package. The implementation of Logistic Regression in scikit-learn supports regularization, a technique used to reduce overfitting in machine learning models. The C parameter below is the inverse of the regularization strength, so smaller values specify stronger regularization.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
LogisticRegression(C=0.01, solver='liblinear')
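To get a feel for the effect of C, one could refit the model with a weaker and a stronger regularization setting and compare the coefficient magnitudes. This is only an illustrative sketch, not part of the original workflow:

# Illustrative sketch: smaller C = stronger regularization = smaller coefficients
for C in (0.01, 1.0):
    model = LogisticRegression(C=C, solver='liblinear').fit(X_train, y_train)
    print(C, np.round(model.coef_, 3))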

Now we can predict using our test set:

yhat = LR.predict(X_test)
predict_proba returns estimates for all classes, ordered by class label. So the first column is the probability of class 0, P(Y=0|X), and the second column is the probability of class 1, P(Y=1|X):
yhat_prob = LR.predict_proba(X_test)
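As a quick sanity check on these probabilities, we could, for example, compute the log loss on the test set (an optional sketch using sklearn.metrics.log_loss):

# Optional: log loss penalizes confident but wrong probability estimates
from sklearn.metrics import log_loss
print('Log loss:', log_loss(y_test, yhat_prob))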

Confusion matrix

Another way of looking at the accuracy of the classifier is to look at the confusion matrix.

from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))
[[ 14  16]
 [  6 144]]
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False,  title='Confusion matrix')
Confusion matrix, without normalization
[[ 14  16]
 [  6 144]]

[Figure: confusion matrix plot for churn=1 and churn=0]

print (classification_report(y_test, yhat))
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       150
           1       0.70      0.47      0.56        30

    accuracy                           0.88       180
   macro avg       0.80      0.71      0.74       180
weighted avg       0.87      0.88      0.87       180

The overall quality of this classifier can be summarized by the weighted average of the F1-scores for both labels, which is 0.87 in our case.

Based on the counts in each cell of the confusion matrix, we can calculate the precision and recall of each label:

  • Precision is a measure of accuracy, given that a class label has been predicted. It is defined as: precision = TP / (TP + FP)

  • Recall is the true positive rate. It is defined as: recall = TP / (TP + FN)

F1 score: Now we are in a position to calculate the F1 score for each label, based on the precision and recall of that label.

The F1 score is the harmonic mean of precision and recall, F1 = 2 * (precision * recall) / (precision + recall); it reaches its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has good values for both recall and precision, as in the worked example below.
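As a quick sanity check, we can reproduce the churn=1 row of the classification report directly from the confusion-matrix counts above (TP = 14, FN = 16, FP = 6):

# Worked example for the churn=1 label, using the confusion-matrix counts
TP, FN, FP = 14, 16, 6
precision = TP / (TP + FP)                            # 14 / 20 = 0.70
recall = TP / (TP + FN)                               # 14 / 30 ≈ 0.47
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.56
print(precision, recall, f1)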

Logistic Regression with PySpark

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregconsult').getOrCreate()
data = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)
data.printSchema()
root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)

Check out the data

data.describe().show(1)
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
|summary|Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|Onboard_date|Location|Company|Churn|
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
|  count|  900|900|           900|            900|  900|      900|         900|     900|    900|  900|
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
only showing top 1 row
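Before building features, it can also help to look at the class balance of the label. A small optional check (the printed counts will depend on the data):

# Optional: how many churned vs. non-churned customers are in the data?
data.groupBy('Churn').count().show()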

Format for MLlib

We’ll use the numerical columns. We’ll include Account_Manager because it’s easy enough to add, but keep in mind that it probably won’t be much of a signal, because the agency mentioned it is randomly assigned!

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites'],outputCol='features')
output = assembler.transform(data)
final_data = output.select('features','churn')
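If you want to confirm what the assembler produced, here is a quick optional peek at the assembled data (a sketch, not required for the rest of the flow):

# Optional: inspect the assembled feature vectors alongside the label
final_data.printSchema()
final_data.show(3, truncate=False)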

Test Train Split

train_churn,test_churn = final_data.randomSplit([0.7,0.3])
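If you want the split to be reproducible across runs, randomSplit also accepts a seed. An optional variant (the seed value is arbitrary):

# Optional: fix the seed so the 70/30 split is reproducible
train_churn, test_churn = final_data.randomSplit([0.7, 0.3], seed=42)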

Fit the model

from pyspark.ml.classification import LogisticRegression
lr_churn = LogisticRegression(labelCol='churn')
fitted_churn_model = lr_churn.fit(train_churn)
training_sum = fitted_churn_model.summary
training_sum.predictions.describe().show()
+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                613|                613|
|   mean|0.17781402936378465|0.12887438825448613|
| stddev|  0.382668372099746|   0.33533449140226|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+

Evaluate results

Let’s evaluate the results on the data set we were given (using the test data)

from pyspark.ml.evaluation import BinaryClassificationEvaluator
pred_and_labels = fitted_churn_model.evaluate(test_churn)
pred_and_labels.predictions.show()
+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.05430795643892...|[0.98294832241403...|       0.0|
|[26.0,8939.61,0.0...|    0|[5.77405037803505...|[0.99690247758406...|       0.0|
|[28.0,8670.98,0.0...|    0|[7.23826268284963...|[0.99928195691507...|       0.0|
|[28.0,9090.43,1.0...|    0|[0.94821948475399...|[0.72075696108881...|       0.0|
|[29.0,9617.59,0.0...|    0|[3.91324626692407...|[0.98041565826455...|       0.0|
|[29.0,10203.18,1....|    0|[3.26598082266084...|[0.96324313418667...|       0.0|
|[29.0,11274.46,1....|    0|[4.17445848712145...|[0.98484954751144...|       0.0|
|[29.0,13255.05,1....|    0|[3.93587227636939...|[0.98084540599653...|       0.0|
|[30.0,8403.78,1.0...|    0|[5.44208726208492...|[0.99568823673280...|       0.0|
|[30.0,8677.28,1.0...|    0|[3.43610521361442...|[0.96881405579738...|       0.0|
|[30.0,10183.98,1....|    0|[2.48028976858033...|[0.92274845618277...|       0.0|
|[31.0,5304.6,0.0,...|    0|[2.67870443526006...|[0.93575828504198...|       0.0|
|[31.0,5387.75,0.0...|    0|[1.70594375589089...|[0.84630943026440...|       0.0|
|[31.0,10058.87,1....|    0|[3.92946303722257...|[0.98072461932954...|       0.0|
|[31.0,10182.6,1.0...|    0|[4.50159146584151...|[0.98903033715124...|       0.0|
|[31.0,11297.57,1....|    1|[0.60412216064385...|[0.64659882437760...|       0.0|
|[31.0,12264.68,1....|    0|[3.22109124776813...|[0.96162030913584...|       0.0|
|[32.0,5756.12,0.0...|    0|[3.45850939238844...|[0.96948389815194...|       0.0|
|[32.0,6367.22,1.0...|    0|[2.58844173988655...|[0.93011399503796...|       0.0|
|[32.0,7896.65,0.0...|    0|[2.74301964391580...|[0.93951791284489...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 20 rows

Using AUC

churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')
auc = churn_eval.evaluate(pred_and_labels.predictions)
auc
0.7926829268292683
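Note that the evaluator above scores the hard 0/1 predictions. An optional variant (a sketch) is to evaluate on the model's raw scores instead, which typically gives a more informative AUC:

# Optional: use the rawPrediction column rather than the thresholded prediction
churn_eval_raw = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='churn')
auc_raw = churn_eval_raw.evaluate(pred_and_labels.predictions)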

Predict on brand new unlabeled data

final_lr_model = lr_churn.fit(final_data)
new_customers = spark.read.csv('new_customers.csv', inferSchema=True, header=True)
new_customers.printSchema()
root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)

test_new_customers = assembler.transform(new_customers)
test_new_customers.printSchema()
root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)

final_results = final_lr_model.transform(test_new_customers)
final_results.select('Company','prediction').show()
+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+
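If we also want to see how confident the model is about each new customer, we could additionally select the probability column (an optional sketch):

# Optional: show class probabilities alongside the hard predictions
final_results.select('Company', 'probability', 'prediction').show(truncate=False)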

Now we know that we should assign Account Managers to Cannon-Benson, Barron-Robertson, Sexton-Golden, and Parks-Robbins!

You can download the notebook here

Congratulations! We have practiced Logistic Regression with Python and Spark.
