Titanic Survivor Prediction
Loading the dataset
    import pandas as pd

    data = pd.read_csv('titanic_data.csv', index_col='PassengerId')
Exploring the data
             Survived  Pclass     Sex   Age
PassengerId
1                   0       3    male  22.0
2                   1       1  female  38.0
3                   1       3  female  26.0
4                   1       1  female  35.0
5                   0       3    male  35.0
Handling duplicates
(891, 4)
    data.drop_duplicates(inplace=True)
(349, 4)
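Because `PassengerId` is the index, `drop_duplicates` compares only the four data columns, so different passengers who happen to share the same values are treated as duplicates. That would explain why the row count falls from 891 to 349. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: passengers 1 and 3 are different people
# but have identical values in every data column.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 0],
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "male"],
        "Age": [22.0, 38.0, 22.0],
    },
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# The index is ignored when rows are compared, so passenger 3 is dropped.
deduped = toy.drop_duplicates()
print(deduped.shape)

# Moving the index back into a column makes every row unique again.
kept = toy.reset_index().drop_duplicates()
print(kept.shape)
```

Whether such rows should count as duplicates depends on the goal; the rest of this walkthrough continues with the deduplicated frame.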
Checking the DataFrame for missing values
Survived 0
Pclass 0
Sex 0
Age 10
dtype: int64
The Age column contains missing values (10 of them).
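Per-column counts like these are what `isnull().sum()` returns for a DataFrame. The cell that produced the output is not shown here, so the following is only a sketch of one way to obtain it, using a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing Age value.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, np.nan, 26.0],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```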
Viewing the DataFrame summary
<class 'pandas.core.frame.DataFrame'>
Int64Index: 349 entries, 1 to 890
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  349 non-null    int64
 1   Pclass    349 non-null    int64
 2   Sex       349 non-null    object
 3   Age       339 non-null    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 13.6+ KB
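The Sex column holds qualitative (categorical) data, so it will need to be converted to numeric data later.
Data preprocessing
Handling missing values
When missing values are few relative to the total number of samples, the affected rows can simply be dropped; when they are numerous, dropping them would discard too much data. Here only 10 of 349 rows lack an Age, so they are dropped.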
    data.dropna(inplace=True)
Survived 0
Pclass 0
Sex 0
Age 0
dtype: int64
No missing values remain.
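When too many values are missing to drop, one common alternative is to fill them in instead. The sketch below computes the missing fraction and, if it exceeds a threshold, fills `Age` with the column median. The toy data and the 5% threshold are illustrative assumptions, not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with 2 of 5 values missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Fraction of rows with a missing Age.
missing_ratio = toy["Age"].isnull().mean()
threshold = 0.05  # assumed cut-off, chosen only for illustration

if missing_ratio <= threshold:
    # Few missing values: dropping the rows loses little data.
    cleaned = toy.dropna()
else:
    # Many missing values: fill them with the median of the observed ages.
    cleaned = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))

print(missing_ratio)
print(cleaned)
```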
Converting qualitative data to quantitative data
Convert the categorical Sex column to numeric values.
    def Sex2num(str_):
        # Encode 'male' as 0 and any other value as 1
        if str_ == 'male':
            return 0
        return 1

    data['Sex'] = data['Sex'].apply(Sex2num)
             Survived  Pclass  Sex   Age
PassengerId
1                   0       3    0  22.0
2                   1       1    1  38.0
3                   1       3    1  26.0
4                   1       1    1  35.0
5                   0       3    0  35.0
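The same encoding can also be written as a dictionary lookup with `Series.map`, which leaves unexpected labels as `NaN` rather than silently encoding them as 1. A sketch on toy data; `sex_codes` is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical toy column with the two labels seen in the dataset.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Explicit label-to-number mapping: male -> 0, female -> 1.
sex_codes = {"male": 0, "female": 1}
encoded = sex.map(sex_codes)
print(encoded.tolist())

# A label missing from the mapping becomes NaN, which makes bad input visible.
unexpected = pd.Series(["unknown"]).map(sex_codes)
print(unexpected.isnull().all())
```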
Separating features and labels
    # Features: the last three columns (Pclass, Sex, Age); label: the first column (Survived)
    x = data.iloc[:, -3:]
    y = data.iloc[:, 0]

    print(x.shape)
    print(y.shape)
(339, 3)
(339,)
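Positional slicing with `iloc` depends on the column order. An equivalent sketch that selects the columns by name is shown below; the toy frame only mirrors the column layout used here and its values are made up.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as the data above.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 3],
        "Sex": [0, 1, 1],
        "Age": [22.0, 38.0, 26.0],
    }
)

# Select features and label by name instead of by position.
x = toy[["Pclass", "Sex", "Age"]]
y = toy["Survived"]

# With this column order, both approaches give the same result.
same_features = x.equals(toy.iloc[:, -3:])
same_label = y.equals(toy.iloc[:, 0])
print(x.shape, y.shape, same_features, same_label)
```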
Splitting the dataset into training and test sets
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    print(x_train.shape)
    print(x_test.shape)
(271, 3)
(68, 3)
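`train_test_split` shuffles the rows randomly, so the split (and any score computed from it) changes from run to run. Passing `random_state` makes the split repeatable, and `stratify` keeps the class ratio similar in both parts. A sketch on synthetic data with the same shape as the features above; the seed 42 and the label counts are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real features and labels (339 rows, 3 columns).
rng = np.random.default_rng(0)
x = rng.normal(size=(339, 3))
y = np.array([0] * 200 + [1] * 139)

# Fixed seed for repeatability; stratify on y to preserve the class ratio.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print(x_train.shape, x_test.shape)

# Calling it again with the same seed reproduces the same test rows.
_, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print(np.array_equal(x_test, x_test_again))
```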
Building and training the model
    from sklearn.tree import DecisionTreeClassifier

    decision_tree_model = DecisionTreeClassifier()
    decision_tree_model.fit(x_train, y_train)
DecisionTreeClassifier()
Testing the model
    y_pred = decision_tree_model.predict(x_test)
Printing the classification report
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.58      0.61      0.59        36
           1       0.53      0.50      0.52        32

    accuracy                           0.56        68
   macro avg       0.56      0.56      0.56        68
weighted avg       0.56      0.56      0.56        68
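To see where the numbers in the report come from, precision, recall and accuracy can be recomputed from a confusion matrix. The labels below are not the actual predictions; they are counts inferred from the rounded values in the report (22 of 36 class-0 rows and 16 of 32 class-1 rows predicted correctly), so treat this as an illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Labels constructed to be consistent with the rounded report (an inference).
y_true = np.array([0] * 36 + [1] * 32)
y_hat = np.array([0] * 22 + [1] * 14 + [0] * 16 + [1] * 16)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # of rows predicted as survived, share that did
recall_1 = tp / (tp + fn)        # of rows that survived, share that was found
accuracy = (tp + tn) / cm.sum()  # share of all rows predicted correctly

print(cm)
print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```

An accuracy of 0.56 on a roughly balanced test set is only slightly better than guessing, so the model has learned little from these three features as configured.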