Python Data Science 11: Case Study: Hepatitis C Prediction Analysis [Classification Case]

Using the given dataset, apply a suitable classification algorithm to predict the label column.

import pandas as pd

Dataset download

Reading the data

data = pd.read_csv('HepatitisCdata.csv', index_col=0)
data.head()
        Category  Age Sex   ALB   ALP   ALT   AST   BIL    CHE  CHOL   CREA   GGT  PROT
1  0=Blood Donor   32   m  38.5  52.5   7.7  22.1   7.5   6.93  3.23  106.0  12.1  69.0
2  0=Blood Donor   32   m  38.5  70.3  18.0  24.7   3.9  11.17  4.80   74.0  15.6  76.5
3  0=Blood Donor   32   m  46.9  74.7  36.2  52.6   6.1   8.84  5.20   86.0  33.2  79.3
4  0=Blood Donor   32   m  43.2  52.0  30.6  22.6  18.9   7.33  4.74   80.0  33.8  75.7
5  0=Blood Donor   32   m  39.2  74.1  32.6  24.8   9.6   9.15  4.32   76.0  29.9  68.7

Data exploration

View summary information about the data

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 615 entries, 1 to 615
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Category  615 non-null    object 
 1   Age       615 non-null    int64  
 2   Sex       615 non-null    object 
 3   ALB       614 non-null    float64
 4   ALP       597 non-null    float64
 5   ALT       614 non-null    float64
 6   AST       615 non-null    float64
 7   BIL       615 non-null    float64
 8   CHE       615 non-null    float64
 9   CHOL      605 non-null    float64
 10  CREA      615 non-null    float64
 11  GGT       615 non-null    float64
 12  PROT      614 non-null    float64
dtypes: float64(10), int64(1), object(2)
memory usage: 67.3+ KB
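1. The original dataset contains missing values, in the ALB, ALP, ALT, CHOL, and PROT columns.
2. Category and Sex are categorical columns (dtype object).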

Check whether the dataset contains duplicate rows

data.shape
(615, 13)
data.drop_duplicates().shape
(615, 13)

The shape is unchanged, so the dataset contains no duplicate rows.

Check the number of missing values in each column

data.isna().sum()
Category     0
Age          0
Sex          0
ALB          1
ALP         18
ALT          1
AST          0
BIL          0
CHE          0
CHOL        10
CREA         0
GGT          0
PROT         1
dtype: int64

Only a few values are missing, so here we handle them by dropping the rows that contain them.

Data preprocessing

Handling missing values

data.dropna(inplace=True)
data.shape
(589, 13)
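
The drop removes 26 rows (615 - 589), fewer than the 31 missing cells counted by isna().sum(), because some rows are missing more than one value. A minimal sketch of that check on a toy frame; the values below are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, 52.0],
    "CHOL": [3.23, 4.80, np.nan, 4.74],
})

# Total number of missing cells across all columns.
missing_cells = int(toy.isna().sum().sum())

# dropna() removes a row if any of its cells is missing,
# so the rows lost can be fewer than the missing cells.
rows_with_na = int(toy.isna().any(axis=1).sum())
rows_after_drop = len(toy.dropna())

print(missing_cells, rows_with_na, rows_after_drop)  # 3 2 2
```

Running the same two counts on the real frame before calling dropna() shows how much data the deletion will cost.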

Handling categorical data

Category column

data['Category'].value_counts()
0=Blood Donor             526
3=Cirrhosis                24
1=Hepatitis                20
2=Fibrosis                 12
0s=suspect Blood Donor      7
Name: Category, dtype: int64
def get_category(str_):
    # '0s=suspect Blood Donor' gets its own code, 4
    if '0s' in str_:
        return 4
    else:
        # Otherwise the leading digit of the label is the class code
        return int(str_[0])

data['Category'] = data['Category'].apply(get_category)
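
The same encoding can also be written as an explicit lookup table, which keeps all label codes visible in one place and turns an unexpected label into NaN instead of an exception from int(). A sketch, assuming only the five labels shown by value_counts() above; the names category_codes and labels are introduced here for illustration:

```python
import pandas as pd

# Hypothetical lookup table mirroring what get_category returns.
category_codes = {
    "0=Blood Donor": 0,
    "1=Hepatitis": 1,
    "2=Fibrosis": 2,
    "3=Cirrhosis": 3,
    "0s=suspect Blood Donor": 4,
}

# Small example series instead of the real column.
labels = pd.Series(["0=Blood Donor", "0s=suspect Blood Donor", "3=Cirrhosis"])
encoded = labels.map(category_codes)

print(encoded.tolist())  # [0, 4, 3]
```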

Sex column

data['Sex'].value_counts()
m    363
f    226
Name: Sex, dtype: int64
def get_Sex(str_):
    # Encode 'm' as 0 and everything else ('f') as 1
    if str_ == 'm':
        return 0
    else:
        return 1

data['Sex'] = data['Sex'].apply(get_Sex)

Separate the features from the label

x = data.iloc[:, 1:]  # every column after Category
y = data.iloc[:, 0]   # the Category column is the label
# Split into training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
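
Because the classes are heavily imbalanced (526 of the 589 rows are blood donors and only 7 are suspect donors), a plain random split can leave a rare class almost or entirely out of the test set, and without a seed the scores change from run to run. One possible variant is a stratified, seeded split. The sketch below uses synthetic data rather than the real dataset; the seed 42 and the names x_demo and y_demo are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (not the actual dataset).
rng = np.random.default_rng(0)
y_demo = pd.Series([0] * 80 + [1] * 10 + [2] * 10)
x_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])

# stratify=y_demo keeps the class proportions the same in both splits;
# random_state makes the split reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

test_counts = y_te.value_counts().sort_index().to_dict()
print(test_counts)  # {0: 16, 1: 2, 2: 2}
```

Note that stratification requires every class to have at least two members, which holds for the class counts listed above.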

Building the models and making predictions

Decision tree model

from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(x_train, y_train)
DecisionTreeClassifier()
tree_model.score(x_test, y_test)
0.8983050847457628

Support vector machine model

from sklearn.svm import SVC

svm_model = SVC()
svm_model.fit(x_train, y_train)
SVC()
svm_model.score(x_test, y_test)
0.8983050847457628
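
SVC with its default RBF kernel is distance-based, and the features here live on quite different scales (in the first row above, CREA is 106.0 while CHE is 6.93), so standardizing the inputs is a common step worth trying. A minimal sketch on synthetic data, not on the hepatitis dataset; the names X_demo, y_demo, and scaled_svm are introduced here for illustration, and no claim is made about how much the real score would change:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation when scoring the test data.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
score = scaled_svm.score(X_te, y_te)
print(score)
```

Wrapping the scaler and the model in one pipeline avoids leaking test-set statistics into the scaling step.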

Neural network model

from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier()
mlp_model.fit(x_train, y_train)
D:\Users\Python\Anaconda3.8\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

MLPClassifier()
mlp_model.score(x_test, y_test)
0.9152542372881356
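
The ConvergenceWarning above means the optimizer stopped at the default limit of 200 iterations before converging. Two common responses are to standardize the inputs and to raise max_iter. The sketch below shows both on synthetic data; max_iter=1000 and the names X_demo, y_demo, and mlp_pipeline are arbitrary illustrative choices, not settings taken from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the hepatitis dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# Scale the inputs and allow the optimizer more iterations than the default 200.
mlp_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=1000, random_state=0),
)
mlp_pipeline.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually ran.
n_iter = mlp_pipeline.named_steps["mlpclassifier"].n_iter_
print(n_iter)
```

If n_iter_ comes out below max_iter, the optimizer stopped on its own tolerance criterion rather than on the iteration cap.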

Random forest model

from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
forest_model.fit(x_train, y_train)
RandomForestClassifier()
forest_model.score(x_test, y_test)
0.9152542372881356
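
These accuracies should be read against a majority-class baseline: after cleaning, 526 of the 589 rows are blood donors, so a model that always predicts class 0 already reaches about 0.893 accuracy on the full data. The sketch below reproduces that baseline from the class counts reported by value_counts() and contrasts it with balanced accuracy (the mean of per-class recall). The class mix of the actual test split is not shown in the post, so its baseline may differ slightly:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels rebuilt from the class counts after dropna(), using the encoded values.
y_all = np.array([0] * 526 + [3] * 24 + [1] * 20 + [2] * 12 + [4] * 7)
X_dummy = np.zeros((len(y_all), 1))  # the baseline ignores the features

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_all, y_pred)
bal_acc = balanced_accuracy_score(y_all, y_pred)
print(round(acc, 3), round(bal_acc, 3))  # 0.893 0.2
```

Scores of 0.898 to 0.915 are therefore only slightly above the baseline, and a per-class view such as sklearn.metrics.classification_report on the test predictions would show how well the disease classes themselves are recognized.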