Logistic Regression with Python and Spark
Customer churn with Logistic Regression
A marketing agency is concerned about the number of customers who stop buying its service, and it needs to understand who is leaving. Imagine that you are an analyst at this company: your task is to find the customers most at risk of churning so that an account manager can be assigned to them.
The installation of Python and PySpark, and an introduction to Logistic Regression, are given here.
Logistic Regression with Python
Let's first import the required libraries:
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
%matplotlib inline
import matplotlib.pyplot as plt
Understanding the Data
churn_df = pd.read_csv("customer_churn.csv")
churn_df.head(1)
|   | Names | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Onboard_date | Location | Company | Churn |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cameron Williams | 42.0 | 11066.8 | 0 | 7.22 | 8.0 | 2013-08-30 07:00:40 | 10265 Elizabeth Mission Barkerburgh, AK 89518 | Harvey LLC | 1 |
The data is saved as customer_churn.csv. Here are the fields and their definitions:
- Names: Name of the latest contact at the company
- Age: Customer age
- Total_Purchase: Total ads purchased
- Account_Manager: Binary; 0 = no manager, 1 = account manager assigned
- Years: Total years as a customer
- Num_Sites: Number of websites that use the service
- Onboard_date: Date that the latest contact was onboarded
- Location: Client HQ address
- Company: Name of the client company
Let's select some features for modeling. We also change the target data type to integer, as required by the scikit-learn algorithm:
churn_df.columns
Index(['Names', 'Age', 'Total_Purchase', 'Account_Manager', 'Years',
'Num_Sites', 'Onboard_date', 'Location', 'Company', 'Churn'],
dtype='object')
inputCols=['Age',
'Total_Purchase',
'Account_Manager',
'Years',
'Num_Sites','Churn']
churn_df = churn_df[inputCols]
churn_df['Churn'] = churn_df['Churn'].astype('int')
churn_df.head()
|   | Age | Total_Purchase | Account_Manager | Years | Num_Sites | Churn |
|---|---|---|---|---|---|---|
| 0 | 42.0 | 11066.80 | 0 | 7.22 | 8.0 | 1 |
| 1 | 41.0 | 11916.22 | 0 | 6.50 | 11.0 | 1 |
| 2 | 38.0 | 12884.75 | 0 | 6.67 | 12.0 | 1 |
| 3 | 42.0 | 8010.76 | 0 | 6.71 | 10.0 | 1 |
| 4 | 37.0 | 9191.58 | 0 | 5.56 | 9.0 | 1 |
Let's define X and y for our dataset:
X = np.asarray(churn_df[['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']])
X[0:5]
array([[4.200000e+01, 1.106680e+04, 0.000000e+00, 7.220000e+00,
8.000000e+00],
[4.100000e+01, 1.191622e+04, 0.000000e+00, 6.500000e+00,
1.100000e+01],
[3.800000e+01, 1.288475e+04, 0.000000e+00, 6.670000e+00,
1.200000e+01],
[4.200000e+01, 8.010760e+03, 0.000000e+00, 6.710000e+00,
1.000000e+01],
[3.700000e+01, 9.191580e+03, 0.000000e+00, 5.560000e+00,
9.000000e+00]])
y = np.asarray(churn_df['Churn'])
y[0:5]
array([1, 1, 1, 1, 1])
Also, we normalize the dataset:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
array([[ 0.0299361 , 0.41705373, -0.96290958, 1.52844634, -0.33323478],
[-0.13335172, 0.76990459, -0.96290958, 0.96318219, 1.36758544],
[-0.6232152 , 1.172234 , -0.96290958, 1.09664734, 1.93452551],
[ 0.0299361 , -0.85243173, -0.96290958, 1.1280509 , 0.80064537],
[-0.78650303, -0.36191661, -0.96290958, 0.22519844, 0.2337053 ]])
## Train/Test dataset
We split our dataset into train and test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Train set: (720, 5) (720,)
Test set: (180, 5) (180,)
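Note that we fit the StandardScaler on the full dataset before splitting, which is simple but lets test-set statistics leak into training. Below is a minimal sketch of a leakage-free variant, fitting the scaler on the training rows only (the variable names here are just illustrative):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

raw_X = np.asarray(churn_df[['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']])
raw_y = np.asarray(churn_df['Churn'])

X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(raw_X, raw_y, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_tr_raw)   # statistics computed on the training split only
X_tr = scaler.transform(X_tr_raw)
X_te = scaler.transform(X_te_raw)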
Let's build our model using LogisticRegression from the scikit-learn package. The implementation of Logistic Regression in scikit-learn supports regularization, a technique used to address overfitting in machine learning models.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
LR
LogisticRegression(C=0.01, solver='liblinear')
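The parameter C is the inverse of the regularization strength: smaller values mean stronger regularization. As a rough illustration (not part of the original walkthrough), we could compare a few values of C with cross-validation on the training set:

from sklearn.model_selection import cross_val_score

for C in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, solver='liblinear')
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print('C=%s: mean CV F1 = %.3f' % (C, scores.mean()))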
Now we can predict using our test set:
yhat = LR.predict(X_test)
predict_proba returns probability estimates for all classes, ordered by class label. In scikit-learn the labels are sorted, so the first column is the probability of class 0, P(Y=0 | X), and the second column is the probability of class 1, P(Y=1 | X):
yhat_prob = LR.predict_proba(X_test)
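A quick sketch to confirm the column order: LR.classes_ lists the labels in the same order as the columns of predict_proba, so the churn probability is the second column.

print(LR.classes_)                     # array([0, 1]): column order of predict_proba
print(yhat_prob[0:5])                  # column 0 -> P(churn=0 | X), column 1 -> P(churn=1 | X)
churn_probability = yhat_prob[:, 1]    # probability of churn for each test customer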
Confusion matrix
Another way of looking at the accuracy of the classifier is the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, yhat, labels=[1,0]))
[[ 14  16]
 [  6 144]]
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'], normalize=False, title='Confusion matrix')
Confusion matrix, without normalization
[[ 14  16]
 [  6 144]]
print (classification_report(y_test, yhat))
              precision    recall  f1-score   support

           0       0.90      0.96      0.93       150
           1       0.70      0.47      0.56        30

    accuracy                           0.88       180
   macro avg       0.80      0.71      0.74       180
weighted avg       0.87      0.88      0.87       180
The weighted average F1-score across both labels is 0.87 in our case, which gives a single summary of the classifier's performance.
Based on the counts in each cell of the confusion matrix, we can calculate the precision and recall of each label:

- Precision is a measure of the accuracy provided that a class label has been predicted. It is defined as: precision = TP / (TP + FP)
- Recall is the true positive rate. It is defined as: recall = TP / (TP + FN)
F1 score: now we are in a position to calculate the F1 score for each label, based on the precision and recall of that label.
The F1 score is the harmonic mean of precision and recall, reaching its best value at 1 (perfect precision and recall) and its worst at 0. It is a good way to show that a classifier has good values for both recall and precision.
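As a small sketch, the report values above can be reproduced by hand from the confusion matrix (for the churn=1 label: TP = 14, FN = 16, FP = 6, TN = 144):

TP, FN, FP, TN = 14, 16, 6, 144

precision = TP / (TP + FP)                                 # 14 / 20 = 0.70
recall    = TP / (TP + FN)                                 # 14 / 30 ≈ 0.47
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.56
print(precision, recall, f1)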
Logistic Regression with PySpark
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregconsult').getOrCreate()
data = spark.read.csv('customer_churn.csv',inferSchema=True, header=True)
data.printSchema()
root
|-- Names: string (nullable = true)
|-- Age: double (nullable = true)
|-- Total_Purchase: double (nullable = true)
|-- Account_Manager: integer (nullable = true)
|-- Years: double (nullable = true)
|-- Num_Sites: double (nullable = true)
|-- Onboard_date: string (nullable = true)
|-- Location: string (nullable = true)
|-- Company: string (nullable = true)
|-- Churn: integer (nullable = true)
Check out the data
data.describe().show(1)
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
|summary|Names|Age|Total_Purchase|Account_Manager|Years|Num_Sites|Onboard_date|Location|Company|Churn|
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
| count| 900|900| 900| 900| 900| 900| 900| 900| 900| 900|
+-------+-----+---+--------------+---------------+-----+---------+------------+--------+-------+-----+
only showing top 1 row
Format for MLlib
We'll use the numerical columns. We'll include Account_Manager because it's easy enough, but keep in mind that it probably won't be much of a signal, because the agency mentioned it's randomly assigned!
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites'],outputCol='features')
output = assembler.transform(data)
final_data = output.select('features','churn')
Test Train Split
train_churn,test_churn = final_data.randomSplit([0.7,0.3])
Fit the model
from pyspark.ml.classification import LogisticRegression
lr_churn = LogisticRegression(labelCol='churn')
fitted_churn_model = lr_churn.fit(train_churn)
training_sum = fitted_churn_model.summary
training_sum.predictions.describe().show()
+-------+-------------------+-------------------+
|summary| churn| prediction|
+-------+-------------------+-------------------+
| count| 613| 613|
| mean|0.17781402936378465|0.12887438825448613|
| stddev| 0.382668372099746| 0.33533449140226|
| min| 0.0| 0.0|
| max| 1.0| 1.0|
+-------+-------------------+-------------------+
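Assuming a reasonably recent Spark version (2.3 or later), the training summary also exposes a few metrics directly; a quick sketch:

print(training_sum.accuracy)       # fraction of training rows predicted correctly
print(training_sum.areaUnderROC)   # area under the ROC curve on the training data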
Evaluate results
Let’s evaluate the results on the dataset we were given, using the test data:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
pred_and_labels = fitted_churn_model.evaluate(test_churn)
pred_and_labels.predictions.show()
+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.05430795643892...|[0.98294832241403...|       0.0|
|[26.0,8939.61,0.0...|    0|[5.77405037803505...|[0.99690247758406...|       0.0|
|[28.0,8670.98,0.0...|    0|[7.23826268284963...|[0.99928195691507...|       0.0|
|[28.0,9090.43,1.0...|    0|[0.94821948475399...|[0.72075696108881...|       0.0|
|[29.0,9617.59,0.0...|    0|[3.91324626692407...|[0.98041565826455...|       0.0|
|[29.0,10203.18,1....|    0|[3.26598082266084...|[0.96324313418667...|       0.0|
|[29.0,11274.46,1....|    0|[4.17445848712145...|[0.98484954751144...|       0.0|
|[29.0,13255.05,1....|    0|[3.93587227636939...|[0.98084540599653...|       0.0|
|[30.0,8403.78,1.0...|    0|[5.44208726208492...|[0.99568823673280...|       0.0|
|[30.0,8677.28,1.0...|    0|[3.43610521361442...|[0.96881405579738...|       0.0|
|[30.0,10183.98,1....|    0|[2.48028976858033...|[0.92274845618277...|       0.0|
|[31.0,5304.6,0.0,...|    0|[2.67870443526006...|[0.93575828504198...|       0.0|
|[31.0,5387.75,0.0...|    0|[1.70594375589089...|[0.84630943026440...|       0.0|
|[31.0,10058.87,1....|    0|[3.92946303722257...|[0.98072461932954...|       0.0|
|[31.0,10182.6,1.0...|    0|[4.50159146584151...|[0.98903033715124...|       0.0|
|[31.0,11297.57,1....|    1|[0.60412216064385...|[0.64659882437760...|       0.0|
|[31.0,12264.68,1....|    0|[3.22109124776813...|[0.96162030913584...|       0.0|
|[32.0,5756.12,0.0...|    0|[3.45850939238844...|[0.96948389815194...|       0.0|
|[32.0,6367.22,1.0...|    0|[2.58844173988655...|[0.93011399503796...|       0.0|
|[32.0,7896.65,0.0...|    0|[2.74301964391580...|[0.93951791284489...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 20 rows
Using AUC
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')
auc = churn_eval.evaluate(pred_and_labels.predictions)
auc
0.7926829268292683
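Note that the evaluator above scores the hard 0/1 predictions. Scoring the continuous rawPrediction column instead (the evaluator's default) usually gives a more informative ROC AUC; a quick sketch:

churn_eval_raw = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',
                                               labelCol='churn')
auc_raw = churn_eval_raw.evaluate(pred_and_labels.predictions)
auc_raw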
Predict on brand new unlabeled data
final_lr_model = lr_churn.fit(final_data)
new_customers = spark.read.csv('new_customers.csv',inferSchema=True, header=True)
new_customers.printSchema()
root
|-- Names: string (nullable = true)
|-- Age: double (nullable = true)
|-- Total_Purchase: double (nullable = true)
|-- Account_Manager: integer (nullable = true)
|-- Years: double (nullable = true)
|-- Num_Sites: double (nullable = true)
|-- Onboard_date: string (nullable = true)
|-- Location: string (nullable = true)
|-- Company: string (nullable = true)
test_new_customers = assembler.transform(new_customers)
test_new_customers.printSchema()
root
|-- Names: string (nullable = true)
|-- Age: double (nullable = true)
|-- Total_Purchase: double (nullable = true)
|-- Account_Manager: integer (nullable = true)
|-- Years: double (nullable = true)
|-- Num_Sites: double (nullable = true)
|-- Onboard_date: string (nullable = true)
|-- Location: string (nullable = true)
|-- Company: string (nullable = true)
|-- features: vector (nullable = true)
final_results = final_lr_model.transform(test_new_customers)
final_results.select('Company','prediction').show()
+----------------+----------+
| Company|prediction|
+----------------+----------+
| King Ltd| 0.0|
| Cannon-Benson| 1.0|
|Barron-Robertson| 1.0|
| Sexton-Golden| 1.0|
| Wood LLC| 0.0|
| Parks-Robbins| 1.0|
+----------------+----------+
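To see how confident the model is about each of these companies, we can also inspect the probability column (a quick sketch; the second element of each probability vector is the probability of churn = 1):

final_results.select('Company', 'probability', 'prediction').show(truncate=False)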
Now we know that we should assign Account Managers to Cannon-Benson, Barron-Robertson, Sexton-Golden, and Parks-Robbins!
You can download the notebook here
Congratulations! We have practiced Logistic Regression with Python and Spark.