Use suitable classification algorithms on the provided dataset to classify the label column.

import pandas as pd

Loading the data

data = pd.read_csv('HepatitisCdata.csv', index_col=0)

data.head()

	Category	Age	Sex	ALB	ALP	ALT	AST	BIL	CHE	CHOL	CREA	GGT	PROT
1	0=Blood Donor	32	m	38.5	52.5	7.7	22.1	7.5	6.93	3.23	106.0	12.1	69.0
2	0=Blood Donor	32	m	38.5	70.3	18.0	24.7	3.9	11.17	4.80	74.0	15.6	76.5
3	0=Blood Donor	32	m	46.9	74.7	36.2	52.6	6.1	8.84	5.20	86.0	33.2	79.3
4	0=Blood Donor	32	m	43.2	52.0	30.6	22.6	18.9	7.33	4.74	80.0	33.8	75.7
5	0=Blood Donor	32	m	39.2	74.1	32.6	24.8	9.6	9.15	4.32	76.0	29.9	68.7

Data exploration

Viewing a summary of the data

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 615 entries, 1 to 615
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Category  615 non-null    object 
 1   Age       615 non-null    int64  
 2   Sex       615 non-null    object 
 3   ALB       614 non-null    float64
 4   ALP       597 non-null    float64
 5   ALT       614 non-null    float64
 6   AST       615 non-null    float64
 7   BIL       615 non-null    float64
 8   CHE       615 non-null    float64
 9   CHOL      605 non-null    float64
 10  CREA      615 non-null    float64
 11  GGT       615 non-null    float64
 12  PROT      614 non-null    float64
dtypes: float64(10), int64(1), object(2)
memory usage: 67.3+ KB

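1. Five columns in the raw dataset contain missing values (ALB, ALP, ALT, CHOL, PROT)
2. Category and Sex are categorical columns (dtype object)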

Checking the dataset for duplicates

data.shape

(615, 13)

data.drop_duplicates().shape

(615, 13)

The shape is unchanged, so the dataset contains no duplicate rows.
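The same check can be made by counting duplicates directly instead of comparing shapes. A minimal sketch on a toy frame (the values below are made up for illustration and are not taken from HepatitisCdata.csv):

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical values)
toy = pd.DataFrame({"Age": [32, 32, 45], "Sex": ["m", "m", "f"]})

# duplicated() flags every row that repeats an earlier row
n_duplicates = toy.duplicated().sum()
print(n_duplicates)
```

On the real data, `data.duplicated().sum()` would be the equivalent call.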

Checking which columns have missing values

data.isna().sum()

Category     0
Age          0
Sex          0
ALB          1
ALP         18
ALT          1
AST          0
BIL          0
CHE          0
CHOL        10
CREA         0
GGT          0
PROT         1
dtype: int64

Few values are missing (at most 18 in any one column), so the rows that contain them are simply dropped.
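Before dropping rows, it can help to see how many rows would be lost, since missing values in different columns may or may not fall on the same rows. A minimal sketch on a toy frame (hypothetical values; the column names are borrowed from the dataset only for readability):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table (values are made up)
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, 70.3, np.nan, 52.0],
    "AST": [22.1, 24.7, 52.6, 22.6],
})

# True for every row that has at least one missing value
rows_with_na = toy.isna().any(axis=1)

# Share of rows that dropna() would remove
share_dropped = rows_with_na.mean()
print(f"{rows_with_na.sum()} of {len(toy)} rows ({share_dropped:.0%}) contain a missing value")

cleaned = toy.dropna()
```

If the share turned out to be large, filling the gaps (for example with a column median) would be an alternative to deletion; that is not what this post does.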

Data preprocessing

Handling missing values

data.dropna(inplace=True)
data.shape

(589, 13)

Handling categorical data

The Category column

data['Category'].value_counts()

0=Blood Donor             526
3=Cirrhosis                24
1=Hepatitis                20
2=Fibrosis                 12
0s=suspect Blood Donor      7
Name: Category, dtype: int64

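def get_category(str_):
    # '0s=suspect Blood Donor' also starts with '0', so give it its own code (4)
    if '0s' in str_:
        return 4
    else:
        # Every other label starts with its class digit, e.g. '1=Hepatitis' -> 1
        return int(str_[0])

data['Category'] = data['Category'].apply(get_category)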
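The same encoding can also be written as an explicit lookup table, which makes the label-to-code mapping visible at a glance and surfaces any unexpected label. A minimal sketch, assuming the five labels listed by `value_counts()` above; `CATEGORY_CODES` is a hypothetical name introduced here:

```python
import pandas as pd

# Hypothetical lookup table mirroring get_category:
# the leading digit, with the "0s" label sent to 4
CATEGORY_CODES = {
    "0=Blood Donor": 0,
    "0s=suspect Blood Donor": 4,
    "1=Hepatitis": 1,
    "2=Fibrosis": 2,
    "3=Cirrhosis": 3,
}

labels = pd.Series(["0=Blood Donor", "3=Cirrhosis", "0s=suspect Blood Donor"])
codes = labels.map(CATEGORY_CODES)

# map() yields NaN for labels missing from the dict, so check before casting
if codes.isna().any():
    raise ValueError("unexpected category label")
codes = codes.astype(int)
print(codes.tolist())
```

Note that the resulting codes are plain class identifiers; their numeric order carries no meaning for the classifiers used later.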

The Sex column

data['Sex'].value_counts()

m    363
f    226
Name: Sex, dtype: int64

def get_Sex(str_):
    # Encode sex as 0 for male ('m') and 1 for female ('f')
    if str_ == 'm':
        return 0
    else:
        return 1

data['Sex'] = data['Sex'].apply(get_Sex)

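Separating features and labels

x = data.iloc[:, 1:]  # every column after Category is a feature
y = data.iloc[:, 0]   # Category is the label

# Split into training and test sets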
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
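Because the classes are heavily imbalanced (526 blood donors against 7 suspect donors), a purely random split can leave a rare class almost absent from the test set, and without a seed the scores change on every run. A minimal sketch of a stratified, seeded split, using synthetic data in place of the real table (the `_demo` names and the seed 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples, 3 features, imbalanced labels (90 vs 10)
x_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 90 + [1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep class proportions the same in both parts
    random_state=42,   # fixed seed so the split is repeatable
)

# Class counts in the training and test parts
print(np.bincount(y_tr), np.bincount(y_te))
```

Applied to this post, that would mean passing `stratify=y` and a `random_state` to the call above; the scores reported below were produced without either.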

Building models and making predictions

Decision tree model

from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(x_train, y_train)

DecisionTreeClassifier()

tree_model.score(x_test, y_test)

0.8983050847457628
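For context, 526 of the 589 cleaned rows are blood donors, so a model that always predicts class 0 would already be right about 89% of the time on average. Accuracy alone therefore says little here. A minimal sketch of a majority-class baseline and of balanced accuracy, on synthetic labels with a similar imbalance (the 90/10 split is an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels: 90 majority-class samples, 10 minority-class samples
y_true = np.array([0] * 90 + [1] * 10)
x_dummy = np.zeros((100, 1))  # the dummy model ignores the features

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(x_dummy, y_true)
y_pred = baseline.predict(x_dummy)

acc = accuracy_score(y_true, y_pred)               # share of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
print(acc, bal_acc)
```

Here the baseline reaches high accuracy while its balanced accuracy stays at chance level, which is why a per-class metric is worth reporting next to `score()`.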

Support vector machine model

from sklearn.svm import SVC

svm_model = SVC()

svm_model.fit(x_train, y_train)

SVC()

svm_model.score(x_test, y_test)

0.8983050847457628
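`SVC` with its default RBF kernel is sensitive to feature scale, and the features here differ widely in magnitude (CREA is around 100 while CHE is below 12 in the first rows). Standardizing the inputs first may help. A minimal sketch, assuming synthetic data from `make_classification` in place of the real table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 12 features, like the real feature matrix
x_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)
x_demo[:, 0] *= 1000  # exaggerate one feature's scale on purpose

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# The scaler is fitted on the training part only, then applied to the test part
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(x_tr, y_tr)
score = scaled_svm.score(x_te, y_te)
print(score)
```

Wrapping both steps in a pipeline keeps test-set statistics out of the scaler.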

Neural network model

from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier()

mlp_model.fit(x_train, y_train)

D:\Users\Python\Anaconda3.8\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:692: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(

MLPClassifier()

mlp_model.score(x_test, y_test)

0.9152542372881356
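The ConvergenceWarning above means training stopped at the default limit of 200 iterations before the optimizer converged. Two common responses are to standardize the inputs and to raise `max_iter`. A minimal sketch on synthetic data (the limit of 1000 is an arbitrary choice, and the model may still stop at that limit):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix
x_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)

mlp_pipeline = make_pipeline(
    StandardScaler(),                              # put features on a common scale
    MLPClassifier(max_iter=1000, random_state=0),  # allow more training iterations
)
mlp_pipeline.fit(x_demo, y_demo)

# Number of iterations the optimizer actually ran
n_iter = mlp_pipeline.named_steps["mlpclassifier"].n_iter_
print(n_iter)
```

Comparing `n_iter_` with `max_iter` shows whether training stopped early or ran into the limit again.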

Random forest model

from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()

forest_model.fit(x_train, y_train)

RandomForestClassifier()

forest_model.score(x_test, y_test)

0.9152542372881356
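All four scores above come from one random test split of 118 rows (20% of 589), so the gap between 0.898 and 0.915 amounts to two test samples. Cross-validation gives a steadier basis for comparing models. A minimal sketch, assuming synthetic data from `make_classification` in place of the real table and showing only two of the four models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and labels
x_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, x_demo, y_demo, cv=cv)  # accuracy per fold
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds indicates how much of a difference between two models could be due to the split alone.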

Ming-Log's Blog

Python Data Science_11_Case Study: Hepatitis C Prediction Analysis [Classification Case]