This analysis was a technical exercise I completed as part of a job application.
It’s a brief demonstration of some of my statistics and programming skills and was written in a Jupyter Notebook. This blog is generated with the Pelican blogging framework, which makes it straightforward to convert a notebook into a blog post.

Data Exploration Exercise
Using whichever methods and libraries you prefer, create a notebook with the following:
- Data preparation and Data exploration
- Identify the three most significant data features which drive the credit risk
- Modeling the credit risk
- Model validation and evaluation using the methods that you find correct for the problem
Your solution should include instructions and be self-contained. For instance, if you choose a Python notebook, it should install all the dependencies required to run it.
Import and preparation
# display more than 1 output per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%%capture
import sys
!{sys.executable} -m pip install --upgrade pip
# note: the PyPI package is scikit-learn; 'sklearn' is a deprecated dummy package
!{sys.executable} -m pip install pandas numpy scikit-learn matplotlib seaborn statsmodels
import pandas as pd
pd.options.display.max_columns = None # show all columns in a pandas dataframe
data = pd.read_csv('credit-g.csv')
data.head()
Data Exploration
### overview of the dataset
data.shape
data.columns
data.nunique() # unique values per column
print(f'class: {len(data[data["class"]=="good"])} "good" rows')
print(f'class: {len(data[data["class"]=="bad"])} "bad" rows')
In order to visually inspect the data, the categorical features need to be converted from strings to the pandas category dtype; the same conversion is required to train the model. The data is therefore formatted before further exploration.
Outcome distribution
Ideally the outcome classes would be approximately balanced. In this dataset the split is 30% “bad” and 70% “good”. The small number of bad credit assessments may limit the model's ability to predict the “bad” class accurately relative to the “good” class, simply because there are fewer training examples. This could result in more false positives (bad risks classified as good) than would typically be expected.
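Two common mitigations for this kind of imbalance are stratifying the train/test split and reweighting the classes. A minimal sketch on synthetic data (the arrays `X` and `y` and the logistic model are illustrative stand-ins, not taken from the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 700 "good" (1) and 300 "bad" (0) rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 700 + [0] * 300)

# stratify=y preserves the 70/30 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight='balanced' upweights errors on the minority "bad" class,
# so the model is not rewarded for simply predicting "good" every time
clf = LogisticRegression(class_weight='balanced').fit(X_train, y_train)

print(round(y_train.mean(), 2), round(y_test.mean(), 2))  # → 0.7 0.7
```

Stratification matters most with small minority classes, where an unlucky random split could leave the test set with very few “bad” examples.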
Data formatting
# remove whitespace around column names
data.columns = [col.strip() for col in data.columns]
# Categorical variables have a limited and usually fixed number of possible values.
# Categorical data might have an order (e.g. ‘strongly agree', ‘agree’, 'disagree', 'strongly disagree')
unordered_category_cols = [
'credit_history',
'purpose',
'personal_status',
'other_parties',
'property_magnitude',
'other_payment_plans',
'housing',
'job',
'own_telephone',
'foreign_worker',
'class'
]
for col in unordered_category_cols:
data[col] = data[col].astype('category')
ordered_category_cols = [
('checking_status', ["no checking", "<0", "0<=X<200", ">=200"], True),
('savings_status', ['no known savings', '<100', '100<=X<500', '500<=X<1000', '>=1000'], True),
('employment', ['unemployed', '<1', '1<=X<4', '4<=X<7', '>=7'], True),
]
for col in ordered_category_cols:
data[col[0]]=pd.Categorical(data[col[0]], categories=col[1], ordered=col[2])
# convert categories to numerical codes, for SelectKBest
cat_columns = data.select_dtypes(['category']).columns
data[cat_columns] = data[cat_columns].apply(lambda x: x.cat.codes)
# all columns are now either categorical and encoded as an int (ordered or unordered) or numerical.
data.dtypes
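With every column now numeric, `SelectKBest` can rank the features against the outcome, as the exercise's "three most significant features" task requires. A hedged sketch on synthetic data (the arrays and the `f_classif` score function are illustrative assumptions, not the notebook's actual feature-selection step):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the encoded dataframe: four integer-coded features,
# where only feature 0 actually drives the binary target.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(200, 4)).astype(float)
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 2).astype(int)

# score each feature against the target and keep the k=3 highest-scoring ones
selector = SelectKBest(f_classif, k=3).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 3 selected features
```

On the real dataframe the call would be `SelectKBest(...).fit(data.drop(columns='class'), data['class'])`; `f_classif` (ANOVA F-test) is one reasonable score function for a categorical target, `mutual_info_classif` is another.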
import seaborn as sns

# create the default pairplot, coloured by the outcome class (this will take a while)
pairplot = sns.pairplot(
data,
hue="class",
diag_kind = 'kde',
plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
height = 3
)
fig = pairplot.fig
fig.savefig('pairplot.png', dpi=200) # default dpi is 100