Using the provided dataset, choose a suitable classification algorithm and classify the label column.
Dataset download
Reading the data
```python
import pandas as pd

data = pd.read_csv('HepatitisCdata.csv', index_col=0)
```
|   | Category | Age | Sex | ALB | ALP | ALT | AST | BIL | CHE | CHOL | CREA | GGT | PROT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0=Blood Donor | 32 | m | 38.5 | 52.5 | 7.7 | 22.1 | 7.5 | 6.93 | 3.23 | 106.0 | 12.1 | 69.0 |
| 2 | 0=Blood Donor | 32 | m | 38.5 | 70.3 | 18.0 | 24.7 | 3.9 | 11.17 | 4.80 | 74.0 | 15.6 | 76.5 |
| 3 | 0=Blood Donor | 32 | m | 46.9 | 74.7 | 36.2 | 52.6 | 6.1 | 8.84 | 5.20 | 86.0 | 33.2 | 79.3 |
| 4 | 0=Blood Donor | 32 | m | 43.2 | 52.0 | 30.6 | 22.6 | 18.9 | 7.33 | 4.74 | 80.0 | 33.8 | 75.7 |
| 5 | 0=Blood Donor | 32 | m | 39.2 | 74.1 | 32.6 | 24.8 | 9.6 | 9.15 | 4.32 | 76.0 | 29.9 | 68.7 |
Data exploration
View summary information for the data
<class 'pandas.core.frame.DataFrame'>
Int64Index: 615 entries, 1 to 615
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Category 615 non-null object
1 Age 615 non-null int64
2 Sex 615 non-null object
3 ALB 614 non-null float64
4 ALP 597 non-null float64
5 ALT 614 non-null float64
6 AST 615 non-null float64
7 BIL 615 non-null float64
8 CHE 615 non-null float64
9 CHOL 605 non-null float64
10 CREA 615 non-null float64
11 GGT 615 non-null float64
12 PROT 614 non-null float64
dtypes: float64(10), int64(1), object(2)
memory usage: 67.3+ KB
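1. The raw dataset contains missing values (in the ALB, ALP, ALT, CHOL, and PROT columns).
2. The Category and Sex columns are categorical.
Check whether the dataset contains duplicate rows by comparing its shape before and after dropping duplicates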
(615, 13)
```python
data.drop_duplicates().shape
```
(615, 13)
The shape is unchanged, so the dataset contains no duplicate rows.
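The same check can be made without comparing shapes. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real dataset) and counts duplicate rows directly with `DataFrame.duplicated()`.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one
toy = pd.DataFrame({
    "Age": [32, 45, 32],
    "Sex": ["m", "f", "m"],
    "ALB": [38.5, 41.0, 38.5],
})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(toy.duplicated().sum())

# Dropping duplicates removes exactly that many rows
deduplicated = toy.drop_duplicates()
print(n_duplicates, toy.shape, deduplicated.shape)
```

A count of zero on the real data would correspond to the unchanged shape reported above.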
Check the number of missing values in each column
Category 0
Age 0
Sex 0
ALB 1
ALP 18
ALT 1
AST 0
BIL 0
CHE 0
CHOL 10
CREA 0
GGT 0
PROT 1
dtype: int64
Only a few values are missing, so the rows that contain them are simply dropped.
Data preprocessing
Handling missing values
```python
data.dropna(inplace=True)
data.shape
```
(589, 13)
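The per-column count and the row-wise deletion used above can be sketched on a small made-up frame. The frame `toy` and its values are hypothetical. The last line shows median imputation as one possible alternative to deletion; it is not what this post does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, 52.0],
    "AST": [22.1, 24.7, 52.6, 22.6],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Row-wise deletion: keep only rows without any missing value
cleaned = toy.dropna()

# Alternative (assumption, not used in this post): fill with column medians
imputed = toy.fillna(toy.median())
print(missing_per_column, cleaned.shape, imputed.shape)
```

Deletion is simple but discards whole rows; imputation keeps every row at the cost of inventing values.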
Handling categorical data
Category column
```python
data['Category'].value_counts()
```
0=Blood Donor 526
3=Cirrhosis 24
1=Hepatitis 20
2=Fibrosis 12
0s=suspect Blood Donor 7
Name: Category, dtype: int64
```python
def get_category(str_):
    # '0s=suspect Blood Donor' gets its own class code, 4
    if '0s' in str_:
        return 4
    else:
        # Otherwise the leading digit is the class code
        return int(str_[0])

data['Category'] = data['Category'].apply(get_category)
```
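An explicit dictionary is another way to express the same encoding. The sketch below is an alternative, not the code used in this post; the names `category_codes` and `labels` are made up for illustration, and the label strings are taken from the `value_counts()` output above.

```python
import pandas as pd

# Explicit mapping from the original label strings to integer codes
category_codes = {
    "0=Blood Donor": 0,
    "1=Hepatitis": 1,
    "2=Fibrosis": 2,
    "3=Cirrhosis": 3,
    "0s=suspect Blood Donor": 4,
}

# Hypothetical sample of labels
labels = pd.Series(["0=Blood Donor", "0s=suspect Blood Donor", "3=Cirrhosis"])

# map() looks up each label; a label missing from the dict would become NaN
encoded = labels.map(category_codes)
print(encoded.tolist())
```

With a dictionary, an unexpected label shows up as NaN instead of being silently parsed from its first character.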
Sex column
```python
data['Sex'].value_counts()
```
m 363
f 226
Name: Sex, dtype: int64
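```python
def get_Sex(str_):
    # Encode male as 0 and female as 1
    if str_ == 'm':
        return 0
    else:
        return 1

data['Sex'] = data['Sex'].apply(get_Sex)
```
Separate the features and the label
```python
x = data.iloc[:, 1:]  # every column except Category
y = data.iloc[:, 0]   # the Category column
```
Split the data into training and test sets, holding out 20% for testing
```python
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
```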
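The split above is random on every run, so the scores reported later will vary from run to run. The classes are also very unbalanced (526 of 589 rows are blood donors), so a plain random split can leave a rare class almost absent from the test set. A minimal sketch of a reproducible, stratified split follows; the synthetic arrays and the seed value 42 are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, unbalanced stand-in data (not the real dataset)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 80 + [1] * 10 + [2] * 10)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class counts in the test part
test_counts = np.bincount(y_te).tolist()
print(test_counts)
```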
Model building and prediction
Decision tree model
```python
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(x_train, y_train)
```
DecisionTreeClassifier()
```python
tree_model.score(x_test, y_test)
```
0.8983050847457628
Support vector machine model
```python
from sklearn.svm import SVC

# Create the model before fitting it
svm_model = SVC()
svm_model.fit(x_train, y_train)
```
SVC()
```python
svm_model.score(x_test, y_test)
```
0.8983050847457628
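An SVM with the default RBF kernel is sensitive to feature scale, and the features here live on quite different ranges (in the first row, CREA is 106.0 while CHOL is 3.23). One common remedy is to standardize the features inside a pipeline. The sketch below runs on synthetic data from `make_classification`, so its score says nothing about the hepatitis dataset; `scaled_svm` is a made-up name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with 12 features, like the real feature matrix
x_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training part only, then applied before SVC
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(x_tr, y_tr)
accuracy = scaled_svm.score(x_te, y_te)
print(accuracy)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training.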
Neural network model
```python
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier()
mlp_model.fit(x_train, y_train)
```
D:\Users\Python\Anaconda3.8\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
warnings.warn(
MLPClassifier()
```python
mlp_model.score(x_test, y_test)
```
0.9152542372881356
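The ConvergenceWarning above says that training stopped at the default limit of 200 iterations before the optimizer converged. Two usual adjustments are to standardize the inputs and to raise `max_iter`. The sketch below shows both on synthetic data; the limit of 1000 is an arbitrary assumption, and whether it is enough for the real dataset has not been checked here.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real dataset)
x_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

# Standardized inputs and a higher iteration limit than the default 200
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=1000, random_state=0),
)
scaled_mlp.fit(x_demo, y_demo)

# n_iter_ reports how many iterations the optimizer actually ran
iterations_used = scaled_mlp.named_steps["mlpclassifier"].n_iter_
print(iterations_used)
```

If `n_iter_` stays below `max_iter`, training stopped because the loss stopped improving, not because the limit was hit.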
Random forest model
```python
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
forest_model.fit(x_train, y_train)
```
RandomForestClassifier()
```python
forest_model.score(x_test, y_test)
```
0.9152542372881356
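These accuracies are easier to read against a baseline. After cleaning, 526 of the 589 rows are blood donors, so a model that always predicts class 0 would already score about 0.89 on the full data. The sketch below illustrates the effect on made-up labels with a 90/10 split, using `DummyClassifier` as the baseline and balanced accuracy as a metric that weights every class equally. None of these numbers come from the hepatitis dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up unbalanced labels: 90 majority rows, 10 minority rows
y_demo = np.array([0] * 90 + [1] * 10)
x_demo = np.zeros((100, 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(x_demo, y_demo)
pred = baseline.predict(x_demo)

# Plain accuracy looks high; balanced accuracy averages per-class recall
plain_accuracy = accuracy_score(y_demo, pred)
balanced = balanced_accuracy_score(y_demo, pred)
print(plain_accuracy, balanced)
```

Comparing each model against such a baseline, or reporting per-class metrics, shows whether it has learned anything about the rare disease classes.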