Pima Indians Diabetes Prediction

Column information

1) Pregnancies => Number of times pregnant

2) Glucose => Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3) BloodPressure => Diastolic blood pressure (mm Hg)

4) SkinThickness => Triceps skin fold thickness (mm)

5) Insulin => 2-hour serum insulin (mu U/ml)

6) BMI => Body mass index (weight in kg/(height in m)^2)

7) DiabetesPedigreeFunction => Diabetes pedigree function

8) Age => Age (years)

9) Outcome => Class variable (0 or 1); 268 of 768 are 1, the others are 0

EDA

import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences
pima_df = pd.read_csv(r'C:\ca_da\DataHandling/diabetes.csv')
pima_df

pima_df.head()

pima_df.tail(10)

# 768 x 9 DataFrame

# All dtypes are int or float

# No null values

pima_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

pima_df.describe()

pima_df['Pregnancies'].hist()
'Mean: {:.2f}'.format(pima_df['Pregnancies'].mean())

'Mean: 3.85'

pima_df['Glucose'].hist()
'Mean: {:.2f}'.format(pima_df['Glucose'].mean())

'Mean: 120.89'

pima_df['BloodPressure'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x128100e8988>

pima_df['SkinThickness'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x12810247fc8>

pima_df['Insulin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x12807147e48>

pima_df['BMI'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1280fd9cd88>

pima_df['DiabetesPedigreeFunction'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1280fe20fc8>

pima_df['Age'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x128113c24c8>

pima_df['Outcome'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x128114448c8>

pima_df['Outcome'].value_counts().plot(kind = 'bar')

<matplotlib.axes._subplots.AxesSubplot at 0x128114b4c88>

pima_df.corr(method = 'pearson')

import seaborn as sns

sns.heatmap(pima_df.corr(method = 'pearson'), annot = True)

<matplotlib.axes._subplots.AxesSubplot at 0x12811545048>

Modeling

Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score,f1_score,confusion_matrix

X = pima_df.iloc[:,:-1]
Y = pima_df.iloc[:,-1]

X_train,X_test,Y_train,Y_test = train_test_split(X,Y, test_size = 0.2, random_state = 156, stratify = Y)


Logistic = LogisticRegression()
Logistic.fit(X_train,Y_train)
pred = Logistic.predict(X_test)
pred
pred_proba = Logistic.predict_proba(X_test)[:,1]

C:\ca_da\anaconda\envs\ca-da\lib\site-packages\sklearn\linear_model\_logistic.py:764: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
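
The warning above suggests two remedies: raise max_iter or scale the features. A minimal sketch of both, assuming synthetic data from make_classification as a stand-in for the Pima features (X_demo, y_demo and the other names here are illustrative, not from the notebook). Putting StandardScaler inside a pipeline also means the scaler is fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features (assumption: this is NOT the real Pima data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scaling happens inside the pipeline, so fit() only ever sees X_tr;
# max_iter is raised from the default of 100 to give lbfgs more room
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print('test accuracy: {:.2f}'.format(test_accuracy))
```

On the real data the same pattern would be clf.fit(X_train, Y_train) with the unscaled X_train; the resulting scores may differ from the ones reported in this post.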

Logistic.score(X_train,Y_train)

0.7833876221498371

def get_clf_eval(Y_test, pred=None, pred_proba=None):
    # The confusion matrix and the threshold-based metrics use the hard predictions
    confusion = confusion_matrix(Y_test, pred)
    accuracy = accuracy_score(Y_test, pred)
    precision = precision_score(Y_test, pred)
    recall = recall_score(Y_test, pred)
    f1 = f1_score(Y_test, pred)
    # ROC AUC is computed from the positive-class probabilities, not the labels
    roc_auc = roc_auc_score(Y_test, pred_proba)
    print('Confusion matrix')
    print(confusion)
    print('\n Accuracy : {:.2f} \n Precision : {:.2f} \n Recall : {:.2f} \n F1 : {:.2f} \n AUC : {:.2f}'.format(accuracy, precision, recall, f1, roc_auc))

get_clf_eval(Y_test, pred, pred_proba)

Confusion matrix
[[88 12]
 [23 31]]

 Accuracy : 0.77 
 Precision : 0.72 
 Recall : 0.57 
 F1 : 0.64 
 AUC : 0.79
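
To see where these numbers come from, the metrics can be recomputed by hand from the four cells of the confusion matrix above. This is a small sketch in plain Python; metrics_from_confusion is a hypothetical helper, not part of scikit-learn. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Recompute the basic classification metrics from confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total        # share of all predictions that are correct
    precision = tp / (tp + fp)          # of predicted positives, how many are real
    recall = tp / (tp + fn)             # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Cells taken from the first confusion matrix in this post: [[88 12], [23 31]]
accuracy, precision, recall, f1 = metrics_from_confusion(tn=88, fp=12, fn=23, tp=31)
print('accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}'.format(
    accuracy, precision, recall, f1))
```

Recall is the weakest figure here: 23 of the 54 actual positives are missed, which is why the threshold analysis that follows is worth doing.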

from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
pred_proba_c1 = Logistic.predict_proba(X_test)[:,1]

precision_recall_curve(Y_test,pred_proba_c1)

(array([0.35064935, 0.34640523, 0.34868421, 0.34437086, 0.34666667,
        0.34228188, 0.34459459, 0.34693878, 0.34931507, 0.35172414,
        0.35416667, 0.35664336, 0.35915493, 0.36170213, 0.36428571,
        0.36690647, 0.36956522, 0.37226277, 0.375     , 0.37777778,
        0.38059701, 0.38345865, 0.38636364, 0.38931298, 0.39230769,
        0.39534884, 0.3984375 , 0.4015748 , 0.4047619 , 0.4       ,
        0.40322581, 0.40650407, 0.40983607, 0.41322314, 0.41666667,
        0.42016807, 0.42372881, 0.42735043, 0.43103448, 0.43478261,
        0.43859649, 0.44247788, 0.44642857, 0.45045045, 0.45454545,
        0.4587156 , 0.46296296, 0.46728972, 0.47169811, 0.46666667,
        0.47115385, 0.46601942, 0.47058824, 0.47524752, 0.48      ,
        0.48484848, 0.48979592, 0.48453608, 0.48958333, 0.49473684,
        0.4893617 , 0.49462366, 0.5       , 0.49450549, 0.5       ,
        0.50561798, 0.51136364, 0.51724138, 0.51162791, 0.50588235,
        0.5       , 0.5060241 , 0.51219512, 0.51851852, 0.525     ,
        0.51898734, 0.52564103, 0.53246753, 0.53947368, 0.54666667,
        0.54054054, 0.54794521, 0.55555556, 0.56338028, 0.57142857,
        0.57971014, 0.58823529, 0.58208955, 0.57575758, 0.58461538,
        0.578125  , 0.58730159, 0.59677419, 0.60655738, 0.61666667,
        0.61016949, 0.62068966, 0.63157895, 0.64285714, 0.65454545,
        0.64814815, 0.66037736, 0.67307692, 0.66666667, 0.68      ,
        0.69387755, 0.6875    , 0.70212766, 0.69565217, 0.71111111,
        0.70454545, 0.72093023, 0.73809524, 0.73170732, 0.725     ,
        0.71794872, 0.71052632, 0.7027027 , 0.72222222, 0.74285714,
        0.73529412, 0.75757576, 0.78125   , 0.77419355, 0.76666667,
        0.79310345, 0.82142857, 0.85185185, 0.84615385, 0.84      ,
        0.875     , 0.91304348, 0.90909091, 0.9047619 , 0.9       ,
        0.89473684, 0.88888889, 0.88235294, 0.875     , 0.86666667,
        0.92857143, 0.92307692, 0.91666667, 0.90909091, 0.9       ,
        0.88888889, 0.875     , 0.85714286, 0.83333333, 0.8       ,
        0.75      , 0.66666667, 0.5       , 0.        , 1.        ]),
 array([1.        , 0.98148148, 0.98148148, 0.96296296, 0.96296296,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.94444444,
        0.94444444, 0.94444444, 0.94444444, 0.94444444, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.92592593,
        0.92592593, 0.92592593, 0.92592593, 0.92592593, 0.90740741,
        0.90740741, 0.88888889, 0.88888889, 0.88888889, 0.88888889,
        0.88888889, 0.88888889, 0.87037037, 0.87037037, 0.87037037,
        0.85185185, 0.85185185, 0.85185185, 0.83333333, 0.83333333,
        0.83333333, 0.83333333, 0.83333333, 0.81481481, 0.7962963 ,
        0.77777778, 0.77777778, 0.77777778, 0.77777778, 0.77777778,
        0.75925926, 0.75925926, 0.75925926, 0.75925926, 0.75925926,
        0.74074074, 0.74074074, 0.74074074, 0.74074074, 0.74074074,
        0.74074074, 0.74074074, 0.72222222, 0.7037037 , 0.7037037 ,
        0.68518519, 0.68518519, 0.68518519, 0.68518519, 0.68518519,
        0.66666667, 0.66666667, 0.66666667, 0.66666667, 0.66666667,
        0.64814815, 0.64814815, 0.64814815, 0.62962963, 0.62962963,
        0.62962963, 0.61111111, 0.61111111, 0.59259259, 0.59259259,
        0.57407407, 0.57407407, 0.57407407, 0.55555556, 0.53703704,
        0.51851852, 0.5       , 0.48148148, 0.48148148, 0.48148148,
        0.46296296, 0.46296296, 0.46296296, 0.44444444, 0.42592593,
        0.42592593, 0.42592593, 0.42592593, 0.40740741, 0.38888889,
        0.38888889, 0.38888889, 0.37037037, 0.35185185, 0.33333333,
        0.31481481, 0.2962963 , 0.27777778, 0.25925926, 0.24074074,
        0.24074074, 0.22222222, 0.2037037 , 0.18518519, 0.16666667,
        0.14814815, 0.12962963, 0.11111111, 0.09259259, 0.07407407,
        0.05555556, 0.03703704, 0.01851852, 0.        , 0.        ]),
 array([0.01673088, 0.02178327, 0.03585348, 0.04204322, 0.0452228 ,
        0.05198571, 0.05475138, 0.05804483, 0.05810688, 0.06556026,
        0.06593609, 0.06623214, 0.06751545, 0.06794042, 0.06814109,
        0.06947113, 0.07472193, 0.07988302, 0.08304228, 0.08480979,
        0.08553367, 0.08959614, 0.09015531, 0.09202543, 0.09686889,
        0.09991602, 0.10163466, 0.10230041, 0.10243785, 0.10301033,
        0.10483266, 0.10780982, 0.11168524, 0.11278688, 0.12144879,
        0.12718674, 0.13034447, 0.13508315, 0.1371848 , 0.13986084,
        0.14005381, 0.14211412, 0.14395622, 0.15102764, 0.15172204,
        0.1549306 , 0.15558546, 0.15877689, 0.16080906, 0.16301558,
        0.1652011 , 0.17034847, 0.18053371, 0.19512   , 0.19730517,
        0.19942034, 0.20076455, 0.20383642, 0.20551999, 0.20672576,
        0.2083626 , 0.21079054, 0.21430139, 0.22739734, 0.23133405,
        0.2390121 , 0.25119356, 0.25293632, 0.25504938, 0.26221549,
        0.2672248 , 0.27294676, 0.27898923, 0.28101273, 0.2833532 ,
        0.2948495 , 0.30150228, 0.31265525, 0.31506349, 0.31558217,
        0.32403026, 0.33026969, 0.33909671, 0.33975528, 0.34285471,
        0.34348097, 0.34911358, 0.35055791, 0.35226718, 0.35794658,
        0.36651573, 0.36856541, 0.37737665, 0.37851504, 0.39402324,
        0.40578985, 0.40659636, 0.40684379, 0.40815079, 0.40909755,
        0.4104481 , 0.42973566, 0.44646487, 0.46233474, 0.46237492,
        0.4708803 , 0.47184877, 0.47859752, 0.48978564, 0.49417445,
        0.49888844, 0.50467892, 0.52232442, 0.53108412, 0.53284273,
        0.53479487, 0.5566477 , 0.56013958, 0.57079051, 0.58831196,
        0.60810433, 0.64480311, 0.65039534, 0.6556265 , 0.65748997,
        0.65915666, 0.67448579, 0.67827177, 0.67850542, 0.68442358,
        0.69627082, 0.70464069, 0.71103135, 0.71799077, 0.72075925,
        0.73447005, 0.73926469, 0.74407792, 0.75384527, 0.76032622,
        0.76158012, 0.7624241 , 0.76692943, 0.77351532, 0.78508876,
        0.80784319, 0.81180363, 0.82622197, 0.83345045, 0.8476989 ,
        0.85505117, 0.87843049, 0.91769695, 0.98964115]))

import numpy as np
import matplotlib.pyplot as plt

def precision_recall_curve_plot(Y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision/recall ndarrays for each threshold.
    precisions,recalls, thresholds = precision_recall_curve(Y_test,pred_proba_c1)
    
    # Plot thresholds on the X axis and precision/recall on the Y axis; precision is drawn dashed.
    # precisions and recalls have one more element than thresholds, so they are sliced to match.
    plt.figure(figsize =(8,6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[0:threshold_boundary], linestyle = '--',label='precision')
    plt.plot(thresholds, recalls[0:threshold_boundary],label='recall')
    
    # Set the X-axis ticks for the threshold in steps of 0.1
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1),2))
    
    # Set the X/Y axis labels, the legend, and the grid
    plt.xlabel('Threshold value');plt.ylabel('Precision and Recall value')
    plt.legend();plt.grid()
    plt.show()

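# start and end are local to precision_recall_curve_plot, so they are set here by hand
# (assumed to be 0 and 1, which matches the tick array shown below)
start, end = 0, 1
np.round(np.arange(start,end,0.1),2)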

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

precision_recall_curve_plot(Y_test,pred_proba_c1)
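
Instead of reading a threshold off the plot by eye, the arrays returned by precision_recall_curve can be searched directly. The sketch below uses a tiny made-up example (y_true and scores are illustrative values, not the Pima predictions) and picks the threshold with the highest F1; F1 is only one possible criterion, chosen here as an assumption.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and positive-class scores (illustrative only)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions/recalls carry one extra trailing point (precision=1, recall=0)
# that has no threshold, so drop it before pairing with thresholds
p = precisions[:-1]
r = recalls[:-1]

# F1 at every threshold, guarding against a zero denominator
denom = p + r
f1_scores = np.where(denom > 0, 2 * p * r / np.where(denom > 0, denom, 1), 0.0)

best_index = int(np.argmax(f1_scores))
best_threshold = thresholds[best_index]
print('best threshold by F1: {:.2f}'.format(best_threshold))
```

With the notebook's variables this would be called as precision_recall_curve(Y_test, pred_proba_c1).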

pima_df.head(2)

pima_df.describe()
# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have a minimum of 0
pima_df[pima_df['Pregnancies']==0]['Pregnancies'].count()

zero_mean_column = ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI']

total_data_count = pima_df['Pregnancies'].count()

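for i in zero_mean_column:
    zero_count = pima_df[pima_df[i]==0][i].count()
    print('{}: total rows = {}, share of zeros = {:.2f}%'.format(i, total_data_count, 100*zero_count/total_data_count))

Pregnancies: total rows = 768, share of zeros = 14.45%
Glucose: total rows = 768, share of zeros = 0.65%
BloodPressure: total rows = 768, share of zeros = 4.56%
SkinThickness: total rows = 768, share of zeros = 29.56%
Insulin: total rows = 768, share of zeros = 48.70%
BMI: total rows = 768, share of zeros = 1.43%

# SkinThickness and Insulin have a very high share of zeros.
# Note: 0 is a valid value for Pregnancies; in the other columns a 0 is physiologically implausible and is usually treated as a missing measurement.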

zero_mean_column_mean = pima_df[zero_mean_column].mean()

zero_mean_column_mean
pima_df[zero_mean_column] = pima_df[zero_mean_column].replace(0,zero_mean_column_mean)
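
The replacement above uses a column mean that still includes the zeros, and it also overwrites genuine zero pregnancies. One possible alternative, sketched on a made-up three-row frame (toy and zero_as_missing are illustrative names, and this is not what produced the results in this post), is to mark the zeros as NaN first so that the mean is taken over the measured values only:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are invented for the example
toy = pd.DataFrame({
    'Pregnancies': [0, 2, 5],
    'Glucose': [0, 100, 140],
    'Insulin': [0, 0, 90],
})

# Columns where 0 is assumed to mean "not measured"; Pregnancies is left out
zero_as_missing = ['Glucose', 'Insulin']

imputed = toy.copy()
# Step 1: turn the placeholder zeros into NaN so they do not drag the mean down
imputed[zero_as_missing] = imputed[zero_as_missing].replace(0, np.nan)
# Step 2: fill each column with the mean of its remaining (non-missing) values
imputed[zero_as_missing] = imputed[zero_as_missing].fillna(
    imputed[zero_as_missing].mean()
)
print(imputed)
```

Applying this to pima_df would change the imputed values and therefore the metrics reported further down.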

from sklearn.preprocessing import StandardScaler
X = pima_df.iloc[:,:-1]
Y = pima_df.iloc[:,-1]

scaler = StandardScaler()
x_scaled = scaler.fit_transform(X)

X_train,X_test,Y_train,Y_test = train_test_split(x_scaled,Y, test_size = 0.2, random_state = 156, stratify = Y)

Logistic = LogisticRegression()
Logistic.fit(X_train,Y_train)
pred = Logistic.predict(X_test)
pred_proba = Logistic.predict_proba(X_test)[:,1]

get_clf_eval(Y_test,pred,pred_proba)

Confusion matrix
[[89 11]
 [20 34]]

 Accuracy : 0.80 
 Precision : 0.76 
 Recall : 0.63 
 F1 : 0.69 
 AUC : 0.85

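Lessons learned

1) I evaluated the model with accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, and confusion_matrix from sklearn.metrics (accuracy, precision, recall, F1 score, ROC AUC, and the confusion matrix), and I feel I learned how to interpret them.

2) Passing the test labels and the predicted class-1 probability of each test observation to sklearn's precision_recall_curve returns precision, recall, and threshold arrays (how precision and recall change with the threshold). Plotting them with a small utility function makes it possible to estimate a threshold that balances precision and recall.

3) StandardScaler's fit_transform standardizes the features (zero mean, unit variance), which makes them easier for the model to learn from.

4) What I would have liked to do better

Binarizer from sklearn.preprocessing lets you adjust the threshold. With Binarizer(threshold=?) I wanted to tabulate the metrics at several thresholds, to pin down precisely what I could only eyeball on the precision_recall_curve plot above.

However, my code for this kept raising errors...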
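
A rough sketch of such a threshold table follows. The post does not show the error, so the cause is a guess: a common stumbling block is that Binarizer expects a 2-D array, so the 1-D probability vector has to be reshaped into a single column first. threshold_table and the toy labels and probabilities are hypothetical, used only for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.preprocessing import Binarizer

def threshold_table(y_true, proba_pos, thresholds):
    """Return (threshold, precision, recall, f1) rows for each threshold."""
    # Binarizer works on 2-D input, so make the probabilities a single column
    proba_col = np.asarray(proba_pos, dtype=float).reshape(-1, 1)
    rows = []
    for t in thresholds:
        # Values strictly greater than t become 1, everything else 0
        pred = Binarizer(threshold=t).fit_transform(proba_col).ravel().astype(int)
        rows.append((
            t,
            precision_score(y_true, pred, zero_division=0),
            recall_score(y_true, pred, zero_division=0),
            f1_score(y_true, pred, zero_division=0),
        ))
    return rows

# Toy labels and positive-class probabilities (illustrative only)
y_toy = [0, 0, 1, 1, 0, 1]
proba_toy = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

table = threshold_table(y_toy, proba_toy, thresholds=[0.3, 0.5, 0.7])
for t, prec, rec, f1 in table:
    print('threshold={:.1f} precision={:.2f} recall={:.2f} f1={:.2f}'.format(t, prec, rec, f1))
```

With the notebook's variables the call would look like threshold_table(Y_test, pred_proba, [0.3, 0.4, 0.5, 0.6]).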

Output of pima_df.describe():

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

Output of pima_df.corr(method = 'pearson'):

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
Pregnancies	1.000000	0.129459	0.141282	-0.081672	-0.073535	0.017683	-0.033523	0.544341	0.221898
Glucose	0.129459	1.000000	0.152590	0.057328	0.331357	0.221071	0.137337	0.263514	0.466581
BloodPressure	0.141282	0.152590	1.000000	0.207371	0.088933	0.281805	0.041265	0.239528	0.065068
SkinThickness	-0.081672	0.057328	0.207371	1.000000	0.436783	0.392573	0.183928	-0.113970	0.074752
Insulin	-0.073535	0.331357	0.088933	0.436783	1.000000	0.197859	0.185071	-0.042163	0.130548
BMI	0.017683	0.221071	0.281805	0.392573	0.197859	1.000000	0.140647	0.036242	0.292695
DiabetesPedigreeFunction	-0.033523	0.137337	0.041265	0.183928	0.185071	0.140647	1.000000	0.033561	0.173844
Age	0.544341	0.263514	0.239528	-0.113970	-0.042163	0.036242	0.033561	1.000000	0.238356
Outcome	0.221898	0.466581	0.065068	0.074752	0.130548	0.292695	0.173844	0.238356	1.000000

데이터 한 그릇
