
Python Data Science 9: Decision Tree Algorithm Basics

Titanic Survivor Prediction

Loading the Dataset

import pandas as pd
data = pd.read_csv('titanic_data.csv', index_col='PassengerId')

Data Exploration

data.head()
             Survived  Pclass     Sex   Age
PassengerId
1                   0       3    male  22.0
2                   1       1  female  38.0
3                   1       3  female  26.0
4                   1       1  female  35.0
5                   0       3    male  35.0

Handling Duplicates

# Check the DataFrame's shape before removing duplicates
data.shape
(891, 4)
data.drop_duplicates(inplace=True)
# Check the DataFrame's shape after removing duplicates
data.shape
(349, 4)
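
One thing worth checking at this step: `drop_duplicates` compares column values only and ignores the index. With `PassengerId` as the index, two different passengers who share the same `Survived`, `Pclass`, `Sex` and `Age` are treated as duplicates. The sketch below uses a small made-up frame (the values are illustrative, not read from `titanic_data.csv`) to show the effect.

```python
import pandas as pd

# Hypothetical toy frame: passengers 2 and 3 are different people
# but happen to share every column value.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 1],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, 38.0],
    },
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# duplicated() looks at column values only; the index is not compared.
print(toy.duplicated().sum())   # number of rows flagged as duplicates
deduped = toy.drop_duplicates()
print(deduped.shape)
```

This is likely why the row count falls from 891 to 349 here. Whether such rows should really be discarded is a modelling decision; if every passenger should be kept, the duplicate check could be skipped.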

Checking for Missing Values

data.isna().sum()
Survived     0
Pclass       0
Sex          0
Age         10
dtype: int64

The Age column contains 10 missing values.

Viewing the DataFrame Summary

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 1 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  349 non-null    int64  
 1   Pclass    349 non-null    int64  
 2   Sex       349 non-null    object 
 3   Age       339 non-null    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 13.6+ KB

The Sex column holds qualitative (categorical) data, which needs to be converted to quantitative (numeric) data later.

Data Preprocessing

Handling Missing Values

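When missing values are few relative to the total number of samples, the affected rows can simply be dropped. When they make up a large share of the samples, dropping them would discard too much data and should be avoided.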

# Drop rows with missing values
data.dropna(inplace=True)
data.isna().sum()
Survived    0
Pclass      0
Sex         0
Age         0
dtype: int64

No missing values remain.
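
The rule of thumb for missing values can be made concrete by computing the fraction of missing entries before choosing a strategy. The following is a minimal sketch on a made-up frame; the 5% threshold and the median fill are illustrative assumptions, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy data with one missing Age out of four rows.
toy = pd.DataFrame({"Pclass": [3, 1, 3, 2], "Age": [22.0, None, 26.0, 35.0]})

missing_ratio = toy["Age"].isna().mean()  # fraction of rows with missing Age
THRESHOLD = 0.05                          # assumed cut-off, tune per project

if missing_ratio <= THRESHOLD:
    # Few gaps: dropping the affected rows loses little information.
    cleaned = toy.dropna(subset=["Age"])
else:
    # Many gaps: fill with the column median instead of discarding rows.
    cleaned = toy.fillna({"Age": toy["Age"].median()})

print(missing_ratio, len(cleaned))
```

In this tutorial's data, 10 of 349 rows (about 2.9%) lack Age, so dropping them is consistent with the rule.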

Converting Qualitative Data to Quantitative Data

Convert the categorical data to numeric data.

def Sex2num(str_):
    # Encode 'male' as 0 and any other value (here, 'female') as 1
    if str_ == 'male':
        return 0
    return 1

data['Sex'] = data['Sex'].apply(Sex2num)
data.head()
             Survived  Pclass  Sex   Age
PassengerId
1                   0       3    0  22.0
2                   1       1    1  38.0
3                   1       3    1  26.0
4                   1       1    1  35.0
5                   0       3    0  35.0
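
The same conversion can also be written with `Series.map` and a dictionary. This makes the encoding explicit and turns any unexpected label into `NaN` rather than silently mapping it to 1. A small sketch on made-up values:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # illustrative values

# Explicit lookup table; labels missing from it become NaN.
sex_codes = {"male": 0, "female": 1}
encoded = sex.map(sex_codes)

print(encoded.tolist())
```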

Separating Features and Labels

# Features: the last three columns (Pclass, Sex, Age)
x = data.iloc[:, -3:]
# Label: the first column (Survived)
y = data.iloc[:, 0]
print(x.shape)
print(y.shape)
(339, 3)
(339,)

Splitting the Dataset into Training and Test Sets

# Split the dataset
# Use the hold-out method to split the data into a training set and a test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print(x_train.shape)
print(x_test.shape)
(271, 3)
(68, 3)
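
To see what the hold-out method does, here is a pure-Python sketch of the idea: shuffle the row indices and set aside a fraction of them for testing. The function name `holdout_split` and the fixed seed are illustrative, and this is not scikit-learn's implementation. Rounding the test share up reproduces the 271 training rows and 68 test rows printed for the 339 samples.

```python
import math
import random

def holdout_split(n_samples, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # seeded for reproducibility
    n_test = math.ceil(test_size * n_samples)  # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(339, test_size=0.2)
print(len(train_idx), len(test_idx))
```

With scikit-learn itself, passing `random_state` to `train_test_split` makes the split repeatable across runs.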

Building and Training the Model

# Import the classification tree model
from sklearn.tree import DecisionTreeClassifier
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(x_train, y_train)
DecisionTreeClassifier()
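
By default, `DecisionTreeClassifier` chooses splits using Gini impurity. As a rough illustration of that criterion (a hand-rolled sketch, not scikit-learn's code), the snippet below scores a split on `Sex` for the five rows printed by `data.head()`.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Survived and Sex (male=0, female=1) for the first five passengers.
survived = [0, 1, 1, 1, 0]
sex = [0, 1, 1, 1, 0]

left = [s for s, x in zip(survived, sex) if x == 0]   # male passengers
right = [s for s, x in zip(survived, sex) if x == 1]  # female passengers

parent_impurity = gini(survived)
# Impurity after the split, weighted by the size of each branch.
split_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(survived)

print(round(parent_impurity, 2), split_impurity)
```

On these five rows the split is perfect (impurity drops from 0.48 to 0), which the full dataset will not reproduce.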

Testing the Model

y_pred = decision_tree_model.predict(x_test)

Printing the Classification Report

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.58      0.61      0.59        36
           1       0.53      0.50      0.52        32

    accuracy                           0.56        68
   macro avg       0.56      0.56      0.56        68
weighted avg       0.56      0.56      0.56        68
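
The figures in the report follow directly from the confusion matrix. The counts below are inferred from the report's recall and support values (22 of 36 non-survivors and 16 of 32 survivors classified correctly). They are a reconstruction for illustration, not output from the original notebook.

```python
# Inferred confusion counts: keys are (true class, predicted class).
confusion = {
    (0, 0): 22, (0, 1): 14,   # true class 0: 36 samples
    (1, 0): 16, (1, 1): 16,   # true class 1: 32 samples
}

def scores(cls):
    """Precision, recall and F1 for one class."""
    tp = confusion[(cls, cls)]
    predicted = sum(v for (_, p), v in confusion.items() if p == cls)
    actual = sum(v for (t, _), v in confusion.items() if t == cls)
    precision, recall = tp / predicted, tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / sum(confusion.values())
for cls in (0, 1):
    print(cls, [round(v, 2) for v in scores(cls)])
print("accuracy", round(accuracy, 2))
```

Rounded to two decimals, these values match the precision, recall, f1-score and accuracy columns of the report.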