데이터 한 그릇

Machine Learning

Pima Indians Diabetes Prediction (Model Evaluation)

장사이언스 2021. 3. 31. 00:34
Pima Indians Diabetes Prediction

Column information

1) Pregnancies => Number of times pregnant

2) Glucose => Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3) BloodPressure => Diastolic blood pressure (mm Hg)

4) SkinThickness => Triceps skin fold thickness (mm)

5) Insulin => 2-hour serum insulin (mu U/ml)

6) BMI => Body mass index (weight in kg / (height in m)^2)

7) DiabetesPedigreeFunction => Diabetes pedigree function

8) Age => Age (years)

9) Outcome => Class variable (0 or 1); 268 of 768 are 1, the others are 0

EDA

In [253]:
import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escape sequences
pima_df = pd.read_csv(r'C:\ca_da\DataHandling/diabetes.csv')
pima_df
Out[253]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

In [254]:
pima_df.head()
Out[254]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [255]:
pima_df.tail(10)
Out[255]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
758 1 106 76 0 0 37.5 0.197 26 0
759 6 190 92 0 0 35.5 0.278 66 1
760 2 88 58 26 16 28.4 0.766 22 0
761 9 170 74 31 0 44.0 0.403 43 1
762 9 89 62 0 0 22.5 0.142 33 0
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0
In [256]:
# 768 x 9 DataFrame

# All dtypes are int or float

# No null values

pima_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [257]:
pima_df.describe()
Out[257]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [258]:
pima_df['Pregnancies'].hist()
'Mean: {:.2f}'.format(pima_df['Pregnancies'].mean())
Out[258]:
'Mean: 3.85'
In [259]:
pima_df['Glucose'].hist()
'Mean: {:.2f}'.format(pima_df['Glucose'].mean())
Out[259]:
'Mean: 120.89'
In [260]:
pima_df['BloodPressure'].hist()
Out[260]:
<matplotlib.axes._subplots.AxesSubplot at 0x128100e8988>
In [261]:
pima_df['SkinThickness'].hist()
Out[261]:
<matplotlib.axes._subplots.AxesSubplot at 0x12810247fc8>
In [262]:
pima_df['Insulin'].hist()
Out[262]:
<matplotlib.axes._subplots.AxesSubplot at 0x12807147e48>
In [263]:
pima_df['BMI'].hist()
Out[263]:
<matplotlib.axes._subplots.AxesSubplot at 0x1280fd9cd88>
In [264]:
pima_df['DiabetesPedigreeFunction'].hist()
Out[264]:
<matplotlib.axes._subplots.AxesSubplot at 0x1280fe20fc8>
In [265]:
pima_df['Age'].hist()
Out[265]:
<matplotlib.axes._subplots.AxesSubplot at 0x128113c24c8>
In [266]:
pima_df['Outcome'].hist()
Out[266]:
<matplotlib.axes._subplots.AxesSubplot at 0x128114448c8>
In [267]:
pima_df['Outcome'].value_counts().plot(kind = 'bar')
Out[267]:
<matplotlib.axes._subplots.AxesSubplot at 0x128114b4c88>
In [268]:
pima_df.corr(method = 'pearson')
Out[268]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000000 0.129459 0.141282 -0.081672 -0.073535 0.017683 -0.033523 0.544341 0.221898
Glucose 0.129459 1.000000 0.152590 0.057328 0.331357 0.221071 0.137337 0.263514 0.466581
BloodPressure 0.141282 0.152590 1.000000 0.207371 0.088933 0.281805 0.041265 0.239528 0.065068
SkinThickness -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 -0.113970 0.074752
Insulin -0.073535 0.331357 0.088933 0.436783 1.000000 0.197859 0.185071 -0.042163 0.130548
BMI 0.017683 0.221071 0.281805 0.392573 0.197859 1.000000 0.140647 0.036242 0.292695
DiabetesPedigreeFunction -0.033523 0.137337 0.041265 0.183928 0.185071 0.140647 1.000000 0.033561 0.173844
Age 0.544341 0.263514 0.239528 -0.113970 -0.042163 0.036242 0.033561 1.000000 0.238356
Outcome 0.221898 0.466581 0.065068 0.074752 0.130548 0.292695 0.173844 0.238356 1.000000
In [269]:
import seaborn as sns

sns.heatmap(pima_df.corr(method = 'pearson'), annot = True)
Out[269]:
<matplotlib.axes._subplots.AxesSubplot at 0x12811545048>
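For reference, the coefficient that `corr(method = 'pearson')` reports for each pair of columns can be sketched in plain Python. The helper name `pearson` is hypothetical, and the five-value sample is just the Glucose and Outcome columns from `pima_df.head()` above, so it will not reproduce the 0.466581 computed on all 768 rows.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the dataset (illustrative sample only)
glucose = [148, 85, 183, 89, 137]
outcome = [1, 0, 1, 0, 1]

print(round(pearson(glucose, outcome), 3))
```

A value near +1 or -1 means a strong linear relationship and a value near 0 means little linear relationship, which is how the heatmap cells can be read.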

Modeling

Logistic regression

In [270]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score,f1_score,confusion_matrix
In [271]:
X = pima_df.iloc[:,:-1]
Y = pima_df.iloc[:,-1]

X_train,X_test,Y_train,Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 156, stratify = Y)


Logistic = LogisticRegression()
Logistic.fit(X_train,Y_train)
pred = Logistic.predict(X_test)
pred
pred_proba = Logistic.predict_proba(X_test)[:,1]
C:\ca_da\anaconda\envs\ca-da\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
In [272]:
Logistic.score(X_train,Y_train)
Out[272]:
0.7833876221498371
In [273]:
def get_clf_eval(Y_test, pred=None, pred_proba=None):
    # Confusion matrix plus the usual binary-classification metrics
    confusion = confusion_matrix(Y_test, pred)
    accuracy = accuracy_score(Y_test, pred)
    precision = precision_score(Y_test, pred)
    recall = recall_score(Y_test, pred)
    f1 = f1_score(Y_test, pred)
    # AUC is computed from the predicted probability of class 1, not from the hard labels
    roc_auc = roc_auc_score(Y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    print('\n Accuracy : {:.2f} \n Precision : {:.2f} \n Recall : {:.2f} \n F1 : {:.2f} \n AUC : {:.2f}'.format(accuracy, precision, recall, f1, roc_auc))
In [274]:
get_clf_eval(Y_test, pred, pred_proba)
Confusion matrix
[[88 12]
 [23 31]]

 Accuracy : 0.77 
 Precision : 0.72 
 Recall : 0.57 
 F1 : 0.64 
 AUC : 0.79
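As a sanity check, the first four metrics can be recomputed by hand from the confusion matrix printed above. This is a minimal sketch that assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); AUC is left out because it needs the predicted probabilities, not just the counts.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
confusion = [[88, 12],
             [23, 31]]
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # of the predicted positives, how many are real
recall = tp / (tp + fn)                      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print('accuracy  : {:.2f}'.format(accuracy))   # 0.77
print('precision : {:.2f}'.format(precision))  # 0.72
print('recall    : {:.2f}'.format(recall))     # 0.57
print('f1        : {:.2f}'.format(f1))         # 0.64
```

The recall of 0.57 is the weak spot: 23 of the 54 diabetic patients in the test set are missed at the default threshold of 0.5.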
In [275]:
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
pred_proba_c1 = Logistic.predict_proba(X_test)[:,1]
In [276]:
precision_recall_curve(Y_test,pred_proba_c1)
Out[276]:
(array([0.35064935, 0.34640523, 0.34868421, 0.34437086, 0.34666667,
        0.34228188, 0.34459459, 0.34693878, 0.34931507, 0.35172414,
        0.35416667, 0.35664336, 0.35915493, 0.36170213, 0.36428571,
        0.36690647, 0.36956522, 0.37226277, 0.375     , 0.37777778,
        0.38059701, 0.38345865, 0.38636364, 0.38931298, 0.39230769,
        0.39534884, 0.3984375 , 0.4015748 , 0.4047619 , 0.4       ,
        0.40322581, 0.40650407, 0.40983607, 0.41322314, 0.41666667,
        0.42016807, 0.42372881, 0.42735043, 0.43103448, 0.43478261,
        0.43859649, 0.44247788, 0.44642857, 0.45045045, 0.45454545,
        0.4587156 , 0.46296296, 0.46728972, 0.47169811, 0.46666667,
        0.47115385, 0.46601942, 0.47058824, 0.47524752, 0.48      ,
        0.48484848, 0.48979592, 0.48453608, 0.48958333, 0.49473684,
        0.4893617 , 0.49462366, 0.5       , 0.49450549, 0.5       ,
        0.50561798, 0.51136364, 0.51724138, 0.51162791, 0.50588235,
        0.5       , 0.5060241 , 0.51219512, 0.51851852, 0.525     ,
        0.51898734, 0.52564103, 0.53246753, 0.53947368, 0.54666667,
        0.54054054, 0.54794521, 0.55555556, 0.56338028, 0.57142857,
        0.57971014, 0.58823529, 0.58208955, 0.57575758, 0.58461538,
        0.578125  , 0.58730159, 0.59677419, 0.60655738, 0.61666667,
        0.61016949, 0.62068966, 0.63157895, 0.64285714, 0.65454545,
        0.64814815, 0.66037736, 0.67307692, 0.66666667, 0.68      ,
        0.69387755, 0.6875    , 0.70212766, 0.69565217, 0.71111111,
        0.70454545, 0.72093023, 0.73809524, 0.73170732, 0.725     ,
        0.71794872, 0.71052632, 0.7027027 , 0.72222222, 0.74285714,
        0.73529412, 0.75757576, 0.78125   , 0.77419355, 0.76666667,
        0.79310345, 0.82142857, 0.85185185, 0.84615385, 0.84      ,
        0.875     , 0.91304348, 0.90909091, 0.9047619 , 0.9       ,
        0.89473684, 0.88888889, 0.88235294, 0.875     , 0.86666667,
        0.92857143, 0.92307692, 0.91666667, 0.90909091, 0.9       ,
        0.88888889, 0.875     , 0.85714286, 0.83333333, 0.8       ,
        0.75      , 0.66666667, 0.5       , 0.        , 1.        ]),
 array([1.        , 0.98148148, 0.98148148, 0.96296296, 0.96296296,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.90740741,
        0.90740741, 0.88888889, 0.88888889, 0.88888889, 0.88888889,
        0.88888889, 0.88888889, 0.87037037, 0.87037037, 0.87037037,
        0.85185185, 0.85185185, 0.85185185, 0.83333333, 0.83333333,
        0.83333333, 0.83333333, 0.83333333, 0.81481481, 0.7962963 ,
        0.77777778, 0.77777778, 0.77777778, 0.77777778, 0.77777778,
        0.75925926, 0.75925926, 0.75925926, 0.75925926, 0.75925926,
        0.74074074, 0.74074074, 0.74074074, 0.74074074, 0.74074074,
        0.74074074, 0.74074074, 0.72222222, 0.7037037 , 0.7037037 ,
        0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.68518519,
        0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
        0.64814815, 0.64814815, 0.64814815, 0.62962963, 0.62962963,
        0.62962963, 0.61111111, 0.61111111, 0.59259259, 0.59259259,
        0.57407407, 0.57407407, 0.57407407, 0.55555556, 0.53703704,
        0.51851852, 0.5       , 0.48148148, 0.48148148, 0.48148148,
        0.46296296, 0.46296296, 0.46296296, 0.44444444, 0.42592593,
        0.42592593, 0.42592593, 0.42592593, 0.40740741, 0.38888889,
        0.38888889, 0.38888889, 0.37037037, 0.35185185, 0.33333333,
        0.31481481, 0.2962963 , 0.27777778, 0.25925926, 0.24074074,
        0.24074074, 0.22222222, 0.2037037 , 0.18518519, 0.16666667,
        0.14814815, 0.12962963, 0.11111111, 0.09259259, 0.07407407,
        0.05555556, 0.03703704, 0.01851852, 0.        , 0.        ]),
 array([0.01673088, 0.02178327, 0.03585348, 0.04204322, 0.0452228 ,
        0.05198571, 0.05475138, 0.05804483, 0.05810688, 0.06556026,
        0.06593609, 0.06623214, 0.06751545, 0.06794042, 0.06814109,
        0.06947113, 0.07472193, 0.07988302, 0.08304228, 0.08480979,
        0.08553367, 0.08959614, 0.09015531, 0.09202543, 0.09686889,
        0.09991602, 0.10163466, 0.10230041, 0.10243785, 0.10301033,
        0.10483266, 0.10780982, 0.11168524, 0.11278688, 0.12144879,
        0.12718674, 0.13034447, 0.13508315, 0.1371848 , 0.13986084,
        0.14005381, 0.14211412, 0.14395622, 0.15102764, 0.15172204,
        0.1549306 , 0.15558546, 0.15877689, 0.16080906, 0.16301558,
        0.1652011 , 0.17034847, 0.18053371, 0.19512   , 0.19730517,
        0.19942034, 0.20076455, 0.20383642, 0.20551999, 0.20672576,
        0.2083626 , 0.21079054, 0.21430139, 0.22739734, 0.23133405,
        0.2390121 , 0.25119356, 0.25293632, 0.25504938, 0.26221549,
        0.2672248 , 0.27294676, 0.27898923, 0.28101273, 0.2833532 ,
        0.2948495 , 0.30150228, 0.31265525, 0.31506349, 0.31558217,
        0.32403026, 0.33026969, 0.33909671, 0.33975528, 0.34285471,
        0.34348097, 0.34911358, 0.35055791, 0.35226718, 0.35794658,
        0.36651573, 0.36856541, 0.37737665, 0.37851504, 0.39402324,
        0.40578985, 0.40659636, 0.40684379, 0.40815079, 0.40909755,
        0.4104481 , 0.42973566, 0.44646487, 0.46233474, 0.46237492,
        0.4708803 , 0.47184877, 0.47859752, 0.48978564, 0.49417445,
        0.49888844, 0.50467892, 0.52232442, 0.53108412, 0.53284273,
        0.53479487, 0.5566477 , 0.56013958, 0.57079051, 0.58831196,
        0.60810433, 0.64480311, 0.65039534, 0.6556265 , 0.65748997,
        0.65915666, 0.67448579, 0.67827177, 0.67850542, 0.68442358,
        0.69627082, 0.70464069, 0.71103135, 0.71799077, 0.72075925,
        0.73447005, 0.73926469, 0.74407792, 0.75384527, 0.76032622,
        0.76158012, 0.7624241 , 0.76692943, 0.77351532, 0.78508876,
        0.80784319, 0.81180363, 0.82622197, 0.83345045, 0.8476989 ,
        0.85505117, 0.87843049, 0.91769695, 0.98964115]))
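The three arrays above are precision, recall, and thresholds, in that order. Each precision and recall entry is the score obtained when every sample whose probability is at least the matching threshold is predicted positive. The precision and recall arrays carry one extra final entry (precision 1, recall 0) that has no threshold. The sketch below recomputes one such point in plain Python. The function name `precision_recall_at` and the toy labels and scores are hypothetical, not taken from the dataset.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # With no predicted positives, precision is set to 1.0, matching the curve's final point
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data for illustration only
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

for threshold in (0.3, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print('threshold={:.1f}  precision={:.2f}  recall={:.2f}'.format(threshold, precision, recall))
```

Raising the threshold generally trades recall for precision, which is the pattern visible in the arrays above.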
In [277]:
import matplotlib.pyplot as plt
import numpy as np
In [278]:
def precision_recall_curve_plot(Y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision/recall ndarrays for each threshold.
    precisions, recalls, thresholds = precision_recall_curve(Y_test, pred_proba_c1)

    # Plot precision and recall (Y axis) against the threshold (X axis); precision is dashed.
    # precisions and recalls have one more element than thresholds, so slice them to match.
    plt.figure(figsize=(8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle='--', label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary], label='recall')

    # Set the X-axis ticks to steps of 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))

    # Axis labels, legend, and grid
    plt.xlabel('Threshold value'); plt.ylabel('Precision and Recall value')
    plt.legend(); plt.grid()
    plt.show()
In [279]:
# start and end are local to precision_recall_curve_plot, so the 0 to 1 range is written out here
np.round(np.arange(0, 1, 0.1), 2)
Out[279]:
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
In [280]:
precision_recall_curve_plot(Y_test,pred_proba_c1)
In [281]:
pima_df.head(2)
Out[281]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
In [282]:
# Columns to check for zeros: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI
zero_mean_column = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI']

total_data_count = pima_df['Pregnancies'].count()

# Print the share of zero values in each column
for i in zero_mean_column:
    zero_count = pima_df[pima_df[i]==0][i].count()
    print('{} total rows : {}, share of zeros : {:.2f}%'.format(i, total_data_count, 100*zero_count/total_data_count))
    
Pregnancies total rows : 768, share of zeros : 14.45%
Glucose total rows : 768, share of zeros : 0.65%
BloodPressure total rows : 768, share of zeros : 4.56%
SkinThickness total rows : 768, share of zeros : 29.56%
Insulin total rows : 768, share of zeros : 48.70%
BMI total rows : 768, share of zeros : 1.43%
In [283]:
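# The share of zeros in SkinThickness and Insulin is very high

zero_mean_column_mean = pima_df[zero_mean_column].mean()
In [284]:
# Replace each 0 with its column mean (the mean is computed with the zeros still included).
# Note: Pregnancies is in the list, so a genuine 0 there is overwritten too.
pima_df[zero_mean_column] = pima_df[zero_mean_column].replace(0,zero_mean_column_mean)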
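Because the column means above are taken with the zeros still in place, the fill value is pulled toward zero. One possible alternative, shown here only as a sketch and not as what the notebook does, is to average the nonzero entries alone. The helper name `replace_zeros_with_nonzero_mean` is hypothetical, and the sample is the Insulin column from the first five rows.

```python
def replace_zeros_with_nonzero_mean(values):
    """Replace each 0 with the mean of the nonzero entries (alternative to the cell above)."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        # Nothing to average, so leave the values untouched
        return list(values)
    fill = sum(nonzero) / len(nonzero)
    return [fill if v == 0 else v for v in values]

# Insulin values from the first five rows of the dataset
insulin = [0, 0, 0, 94, 168]

print(sum(insulin) / len(insulin))               # mean with zeros included: 52.4
print(replace_zeros_with_nonzero_mean(insulin))  # zeros become 131.0
```

On this sample the two fill values differ by more than a factor of two, which suggests the choice matters most for Insulin and SkinThickness, where zeros are most common.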
In [285]:
from sklearn.preprocessing import StandardScaler
X = pima_df.iloc[:,:-1]
Y = pima_df.iloc[:,-1]

scaler = StandardScaler()
x_scaled = scaler.fit_transform(X)

X_train,X_test,Y_train,Y_test = train_test_split(x_scaled,Y, test_size = 0.2, random_state = 156, stratify = Y)

Logistic = LogisticRegression()
Logistic.fit(X_train,Y_train)
pred = Logistic.predict(X_test)
pred_proba = Logistic.predict_proba(X_test)[:,1]

get_clf_eval(Y_test,pred,pred_proba)
Confusion matrix
[[89 11]
 [20 34]]

 Accuracy : 0.80 
 Precision : 0.76 
 Recall : 0.63 
 F1 : 0.69 
 AUC : 0.85
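One caveat about the cell above: `fit_transform` runs on all of `X` before the split, so the test rows also contribute to the mean and standard deviation used for scaling. A common alternative is to compute those statistics on the training rows only and reuse them for the test rows. The plain-Python sketch below shows the idea. The helper names and the four-to-one split of the first five Glucose values are illustrative assumptions, and the population standard deviation (dividing by n) is used to mirror `StandardScaler`.

```python
import math

def fit_standardizer(train_values):
    """Return the mean and population standard deviation of the training values."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Illustrative split of the first five Glucose values
train_glucose = [148, 85, 183, 89]
test_glucose = [137]

mean, std = fit_standardizer(train_glucose)            # statistics come from the training rows only
train_scaled = standardize(train_glucose, mean, std)
test_scaled = standardize(test_glucose, mean, std)     # test rows reuse the training statistics

print(train_scaled)
print(test_scaled)
```

With scikit-learn the same pattern would be `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)` after the split. The metrics printed above come from the original ordering and may shift slightly under this variant.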

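What I learned

1) I evaluated the model with accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, and confusion_matrix from sklearn.metrics (accuracy, precision, recall, F1 score, ROC AUC, and the confusion matrix), and I think I learned how to interpret them.

2) Passing the test labels and the class-1 probabilities predicted for X_test to sklearn's precision_recall_curve returns precision, recall, and threshold arrays (precision and recall at each threshold). Plotting them with a small utility function makes it possible to estimate the threshold that best balances precision and recall.

3) StandardScaler's fit_transform rescales the features to zero mean and unit variance, which suits logistic regression better. Together with the zero replacement, this raised accuracy from 0.77 to 0.80 and AUC from 0.79 to 0.85.

4) Regrets

Binarizer from sklearn.preprocessing can adjust the threshold. With Binarizer(threshold = ?) I wanted to tabulate the metrics at several thresholds and pin down exactly what I had only eyeballed on the precision_recall_curve plot above.

But the code for this keeps throwing an error...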
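A threshold table of this kind can be sketched in plain Python. `Binarizer` expects a 2-D array, so a 1-D `pred_proba` is usually passed as `pred_proba.reshape(-1, 1)`. Whether that is the error hit here is only a guess. The sketch applies the same rule as `Binarizer` (values strictly greater than the threshold become 1); the function names and the toy labels and probabilities are hypothetical.

```python
def binarize(probabilities, threshold):
    """Same rule as Binarizer: 1 if the value is greater than the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def threshold_table(y_true, probabilities, thresholds):
    """One (threshold, precision, recall) row per candidate threshold."""
    rows = []
    for t in thresholds:
        precision, recall = precision_recall(y_true, binarize(probabilities, t))
        rows.append((t, round(precision, 2), round(recall, 2)))
    return rows

# Toy data for illustration only; in the notebook these would be Y_test and pred_proba
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probabilities = [0.10, 0.32, 0.41, 0.45, 0.52, 0.68, 0.71, 0.93]

for row in threshold_table(y_true, probabilities, [0.3, 0.4, 0.5, 0.6]):
    print('threshold={:.1f}  precision={:.2f}  recall={:.2f}'.format(*row))
```

Reading down such a table gives exact numbers for the crossover that the precision-recall plot only shows approximately.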
