
Python Data Science 21: Case Study, Scraping Bilibili Data [Comprehensive Crawler Project]

Fetching a single video

Sending a GET request

Target URL

import requests

url = 'https://www.bilibili.com/video/BV1bU4y1775K'
response = requests.get(url)

Extracting the video title

import re

# The title sits between "title":" and ","pubdate": in the page source
title = re.findall('"title":"(.*?)","pubdate":', response.text)[0]
title
'今天是袁隆平诞辰91周年,怀念…'

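Windows file names cannot contain any of the characters \ / : * ? " < > |, so these must be removed from the title. The pattern below also strips spaces.

# Raw string: \\ inside the character class matches a literal backslash
title = re.sub(r'[/\\:*?"<>| ]', '', title)
title
'今天是袁隆平诞辰91周年,怀念…'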
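The clean-up step can also be packaged as a small reusable helper. The following is a minimal sketch rather than the author's code: the name `sanitize_filename` and the `replacement` parameter are hypothetical, and unlike the snippet above it keeps interior spaces and only trims the ends.

```python
import re

# Characters that Windows does not allow in file names: \ / : * ? " < > |
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(name: str, replacement: str = "") -> str:
    """Return `name` with Windows-forbidden characters replaced."""
    cleaned = _FORBIDDEN.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.strip().rstrip(".")


print(sanitize_filename('a/b\\c:d*e?f"g<h>i|j'))
```

Passing `replacement="_"` instead of the default keeps word boundaries visible in the resulting file name.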

Parsing the video URL

json_file = re.findall('<script>window.__playinfo__=(.*?)</script>', response.text)[0]

Inspecting the result shows that it is JSON-formatted data, so it can be converted to a dictionary for further parsing.

import json

# Convert the JSON string to a dictionary
res_json = json.loads(json_file)
res_json['data']['dash'].keys()
dict_keys(['duration', 'minBufferTime', 'min_buffer_time', 'video', 'audio', 'dolby', 'flac'])

Listing the keys of dash shows separate video and audio fields, which means that Bilibili serves the audio and video streams separately.

video_url = res_json['data']['dash']['video'][0]['baseUrl']
video_url
'https://upos-sz-mirror08ct.bilivideo.com/upgcxcode/93/00/404120093/404120093_nb2-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1663151930&gen=playurlv2&os=08ctbv&oi=2003056842&trid=eca493d85bed4bfabaa956f7e6c92ce0u&mid=0&platform=pc&upsig=6a6615ff1df40c1f003e64874e27383c&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform&bvc=vod&nettype=0&orderid=0,3&agrr=1&bw=110694&logo=80000000'

Parsing the audio URL

audio_url = res_json['data']['dash']['audio'][0]['baseUrl']
audio_url
'https://upos-sz-estgoss.bilivideo.com/upgcxcode/93/00/404120093/404120093_nb2-1-30280.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1663151930&gen=playurlv2&os=upos&oi=2003056842&trid=eca493d85bed4bfabaa956f7e6c92ce0u&mid=0&platform=pc&upsig=15941e8938cb1bff44b5ecf719123273&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform&bvc=vod&nettype=0&orderid=0,3&agrr=1&bw=39999&logo=80000000'
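The two lookups can be combined into one function that fails with a clear message when the page layout changes. This is a sketch under stated assumptions: `SAMPLE_HTML` is a hypothetical, heavily shortened stand-in for the real page source, and `extract_stream_urls` is an illustrative name, not part of any library.

```python
import json
import re

# Hypothetical, heavily shortened page source used only for illustration.
SAMPLE_HTML = (
    '<html><script>window.__playinfo__='
    '{"data":{"dash":{"video":[{"baseUrl":"https://example.com/v.m4s"}],'
    '"audio":[{"baseUrl":"https://example.com/a.m4s"}]}}}'
    '</script></html>'
)

PLAYINFO_RE = re.compile(r'<script>window\.__playinfo__=(.*?)</script>', re.S)


def extract_stream_urls(html: str) -> tuple:
    """Return (video_url, audio_url) for the first listed DASH streams."""
    match = PLAYINFO_RE.search(html)
    if match is None:
        raise ValueError("window.__playinfo__ not found in page source")
    dash = json.loads(match.group(1))["data"]["dash"]
    return dash["video"][0]["baseUrl"], dash["audio"][0]["baseUrl"]


print(extract_stream_urls(SAMPLE_HTML))
```

With a real page, the call would be `extract_stream_urls(response.text)`.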

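As when scraping images earlier, we first tried pasting the audio and video URLs straight into the browser's address bar. The server returned 403 Forbidden, which means the resource exists but cannot be accessed directly from the browser.

Adding the anti-hotlinking referer to the request headers

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
}
video_res = requests.get(video_url, headers=headers)
video_res
<Response [403]>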

Without the referer the data cannot be fetched either: the request returns the same 403 error.

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': url  # the video page the media request claims to come from
}
video_res = requests.get(video_url, headers=headers)
video_res
<Response [200]>
# Save the video stream
with open(title+'.mp4', 'wb') as f:
    f.write(video_res.content)
audio_res = requests.get(audio_url, headers=headers)
audio_res
# Save the audio stream; the URL points to an .m4s segment, and the .mp3
# extension is used here only to tell it apart from the video file
with open(title+'.mp3', 'wb') as f:
    f.write(audio_res.content)
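The downloads above read each response through `.content`, which loads the whole file into memory. For long videos a streaming variant may be preferable. The sketch below is an assumption-laden alternative, not the author's method: `build_headers` and `download_stream` are hypothetical names, the user agent is shortened for illustration, and `session` is expected to behave like `requests.Session`.

```python
from pathlib import Path

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened, illustrative


def build_headers(page_url: str) -> dict:
    """Headers for a media request; the referer is the video page URL."""
    return {"user-agent": USER_AGENT, "referer": page_url}


def download_stream(session, media_url, page_url, path, chunk_size=1 << 20):
    """Stream `media_url` to `path` in chunks and return the bytes written.

    `session` is expected to behave like `requests.Session`.
    """
    written = 0
    with session.get(media_url, headers=build_headers(page_url),
                     stream=True, timeout=30) as res:
        res.raise_for_status()  # fail loudly on 403 and other HTTP errors
        with Path(path).open("wb") as f:
            for chunk in res.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

A call would look like `download_stream(requests.Session(), video_url, url, title + '.mp4')`. Taking the session as an argument also lets several downloads reuse one connection pool.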

Wrapping the code in a function

def get_audio_and_vedio(url, title=None):
    """Download the video and audio streams of one Bilibili video page."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'referer': url
    }
    response = requests.get(url)
    # Fall back to the page title when the caller does not supply one
    if not title:
        title = re.findall('"title":"(.*?)","pubdate":', response.text)[0]
    # Strip spaces and characters that Windows does not allow in file names
    title = re.sub(r'[/\\:*?"<>| ]', '', title)
    json_file = re.findall('<script>window.__playinfo__=(.*?)</script>', response.text)[0]
    # Convert the JSON string to a dictionary
    res_json = json.loads(json_file)
    video_url = res_json['data']['dash']['video'][0]['baseUrl']
    audio_url = res_json['data']['dash']['audio'][0]['baseUrl']
    video_res = requests.get(video_url, headers=headers)
    audio_res = requests.get(audio_url, headers=headers)
    with open(title+'.mp4', 'wb') as f:
        f.write(video_res.content)
    with open(title+'.mp3', 'wb') as f:
        f.write(audio_res.content)

get_audio_and_vedio('https://www.bilibili.com/video/BV1Qe4y1b7y4?vd_source=7f9e6b6e1c2f8486b2f6f3d6520c63fb')

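Fetching every part of a multi-part video

Target URL

Getting the list of all parts

import json
import re

import requests

start_url = 'https://www.bilibili.com/video/BV1ps411x7rm'
response = requests.get(start_url)

# Wrap the matched "pages":[...] fragment in braces so that it parses as a JSON object
res_json = json.loads('{' + re.findall(r'("pages":\[.*?\])', response.text)[0] + '}')
print('Number of parts in this video:', len(res_json['pages']))
Number of parts in this video: 31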
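The non-greedy pattern `\[.*?\]` stops at the first `]`, so it would cut the array short if a part title itself contained a closing bracket. One possible alternative, sketched here under assumptions, is to let the standard library's `json.JSONDecoder.raw_decode` parse the array from its starting position. `SAMPLE_HTML` is a hypothetical, shortened page fragment and `list_parts` is an illustrative name.

```python
import json

# Hypothetical, shortened page source; real pages embed a much larger object.
SAMPLE_HTML = '{"videoData":{"pages":[{"part":"Intro [draft]"},{"part":"Numpy"}],"x":1}}'


def list_parts(html: str, start_url: str) -> list:
    """Return (url, part_title) pairs for every entry of the "pages" array."""
    marker = '"pages":'
    start = html.find(marker)
    if start == -1:
        raise ValueError('"pages" array not found in page source')
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it, so brackets inside titles are harmless.
    pages, _ = json.JSONDecoder().raw_decode(html, start + len(marker))
    return [(f"{start_url}?p={i}", page["part"]) for i, page in enumerate(pages, 1)]


for part_url, part_title in list_parts(SAMPLE_HTML, "https://www.bilibili.com/video/BV1ps411x7rm"):
    print(part_url, part_title)
```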

Downloading all parts in a batch

# Part numbers in the URL start at 1, list indices at 0
for i in range(len(res_json['pages'])):
    url = start_url + f'?p={i+1}'
    print('Fetching:', url)
    get_audio_and_vedio(url, res_json['pages'][i]['part'])
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=1
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=2
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=3
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=4
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=5
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=6
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=7
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=8
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=9
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=10
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=11
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=12
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=13
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=14
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=15
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=16
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=17
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=18
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=19
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=20
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=21
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=22
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=23
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=24
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=25
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=26
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=27
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=28
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=29
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=30
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=31

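Merging all audio and video files in a batch

First download ffmpeg from its official website and add its bin directory to the system PATH environment variable.

import os

# Collect every audio file in the current folder
a_list = [i for i in os.listdir('.') if i.endswith('.mp3')]
a_list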
['10面向对象vsmatlabstyle.mp3',
 '11子图subplot.mp3',
 '12多图figure.mp3',
 '13网格.mp3',
 '14图例legend.mp3',
 '15坐标轴范围.mp3',
 '16坐标轴刻度.mp3',
 '17添加坐标轴.mp3',
 '18注释.mp3',
 '19文字.mp3',
 '1课程简介和环境搭建.mp3',
 '20Tex公式.mp3',
 '21工具栏.mp3',
 '22区域填充.mp3',
 '23形状.mp3',
 '24样式美化.mp3',
 '25极坐标.mp3',
 '26函数积分图一.mp3',
 '27函数积分图二.mp3',
 '28散点条形图一.mp3',
 '29散点条形图二.mp3',
 '2Numpy简介.mp3',
 '30球员能力图一.mp3',
 '31球员能力图二.mp3',
 '3散点图.mp3',
 '4折线图.mp3',
 '5条形图.mp3',
 '6直方图.mp3',
 '7饼状图.mp3',
 '8箱形图.mp3',
 '9颜色和样式.mp3']
# Collect every video file in the current folder
v_list = [i for i in os.listdir('.') if i.endswith('.mp4')]
v_list
['10面向对象vsmatlabstyle.mp4',
 '11子图subplot.mp4',
 '12多图figure.mp4',
 '13网格.mp4',
 '14图例legend.mp4',
 '15坐标轴范围.mp4',
 '16坐标轴刻度.mp4',
 '17添加坐标轴.mp4',
 '18注释.mp4',
 '19文字.mp4',
 '1课程简介和环境搭建.mp4',
 '20Tex公式.mp4',
 '21工具栏.mp4',
 '22区域填充.mp4',
 '23形状.mp4',
 '24样式美化.mp4',
 '25极坐标.mp4',
 '26函数积分图一.mp4',
 '27函数积分图二.mp4',
 '28散点条形图一.mp4',
 '29散点条形图二.mp4',
 '2Numpy简介.mp4',
 '30球员能力图一.mp4',
 '31球员能力图二.mp4',
 '3散点图.mp4',
 '4折线图.mp4',
 '5条形图.mp4',
 '6直方图.mp4',
 '7饼状图.mp4',
 '8箱形图.mp4',
 '9颜色和样式.mp4']
outpath = 'result'
os.makedirs(outpath, exist_ok=True)  # ffmpeg does not create the output folder

# As listed above, both lists hold the same names in the same order,
# so index i pairs each audio file with its video file
for i in range(len(a_list)):
    file1 = a_list[i]
    file2 = v_list[i]
    result = outpath + '/' + file2
    print(f'Merging: {file1} and {file2}')
    # Copy both streams without re-encoding; quote the names in case they contain spaces
    os.system(f'ffmpeg -i "{file1}" -i "{file2}" -acodec copy -vcodec copy "{result}"')
Merging: 10面向对象vsmatlabstyle.mp3 and 10面向对象vsmatlabstyle.mp4
Merging: 11子图subplot.mp3 and 11子图subplot.mp4
Merging: 12多图figure.mp3 and 12多图figure.mp4
Merging: 13网格.mp3 and 13网格.mp4
Merging: 14图例legend.mp3 and 14图例legend.mp4
Merging: 15坐标轴范围.mp3 and 15坐标轴范围.mp4
Merging: 16坐标轴刻度.mp3 and 16坐标轴刻度.mp4
Merging: 17添加坐标轴.mp3 and 17添加坐标轴.mp4
Merging: 18注释.mp3 and 18注释.mp4
Merging: 19文字.mp3 and 19文字.mp4
Merging: 1课程简介和环境搭建.mp3 and 1课程简介和环境搭建.mp4
Merging: 20Tex公式.mp3 and 20Tex公式.mp4
Merging: 21工具栏.mp3 and 21工具栏.mp4
Merging: 22区域填充.mp3 and 22区域填充.mp4
Merging: 23形状.mp3 and 23形状.mp4
Merging: 24样式美化.mp3 and 24样式美化.mp4
Merging: 25极坐标.mp3 and 25极坐标.mp4
Merging: 26函数积分图一.mp3 and 26函数积分图一.mp4
Merging: 27函数积分图二.mp3 and 27函数积分图二.mp4
Merging: 28散点条形图一.mp3 and 28散点条形图一.mp4
Merging: 29散点条形图二.mp3 and 29散点条形图二.mp4
Merging: 2Numpy简介.mp3 and 2Numpy简介.mp4
Merging: 30球员能力图一.mp3 and 30球员能力图一.mp4
Merging: 31球员能力图二.mp3 and 31球员能力图二.mp4
Merging: 3散点图.mp3 and 3散点图.mp4
Merging: 4折线图.mp3 and 4折线图.mp4
Merging: 5条形图.mp3 and 5条形图.mp4
Merging: 6直方图.mp3 and 6直方图.mp4
Merging: 7饼状图.mp3 and 7饼状图.mp4
Merging: 8箱形图.mp3 and 8箱形图.mp4
Merging: 9颜色和样式.mp3 and 9颜色和样式.mp4
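The merge loop pairs files by list position, which only works while both listings come back in the same order and no download is missing. A sketch of a more defensive variant follows. It pairs files by their shared name stem and calls ffmpeg through `subprocess.run` with an argument list, so no shell quoting is needed. The function names `pair_by_stem`, `build_merge_command` and `merge_all` are hypothetical, and `-c:a copy -c:v copy` is used as the stream-copy equivalent of the flags in the loop.

```python
import subprocess
from pathlib import Path


def pair_by_stem(folder="."):
    """Return (audio, video) path pairs that share a file name stem."""
    folder = Path(folder)
    videos = {p.stem: p for p in folder.glob("*.mp4")}
    return [(a, videos[a.stem]) for a in sorted(folder.glob("*.mp3")) if a.stem in videos]


def build_merge_command(audio, video, output):
    """ffmpeg argument list that copies both streams without re-encoding."""
    return ["ffmpeg", "-i", str(audio), "-i", str(video),
            "-c:a", "copy", "-c:v", "copy", str(output)]


def merge_all(folder=".", outdir="result"):
    """Merge every matching audio/video pair in `folder` into `outdir`."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for audio, video in pair_by_stem(folder):
        # A list of arguments avoids shell quoting problems with spaces.
        subprocess.run(build_merge_command(audio, video, out / video.name), check=True)
```

Calling `merge_all()` from the download folder would process every pair, and `check=True` raises an exception as soon as one ffmpeg run fails instead of continuing silently.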

Scraping danmaku (bullet comment) data

Target URL

import requests

Requesting the video page

url = 'https://www.bilibili.com/video/BV1BT411M7bJ'
print(url)
https://www.bilibili.com/video/BV1BT411M7bJ
res = requests.get(url)

Extracting the oid

import re

# The oid is the value stored under "cids":{"1": ...} in the page source
oid = re.findall('"cids":{"1":(.*?)}', res.text)[0]
oid
'832061733'

Listing the dates in a given month that have danmaku

month = '2022-09'
base_url1 = 'https://api.bilibili.com/x/v2/dm/history/index'
params1 = {
    'month': month,
    'type': '1',
    'oid': oid
}
res2 = requests.get(base_url1, params=params1)
res2_json = res2.json()
res2_json
{'code': -101, 'message': '账号未登录', 'ttl': 1}

The message '账号未登录' ("account not logged in") shows that this endpoint requires a login, so we send the request again with a cookie.

headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
'cookie': "buvid3=52C61008-A32D-4BCE-8246-0C9CD437E254143093infoc; rpdid=|(u))lkuRk)u0J'uY|lJmm)Rk; LIVE_BUVID=AUTO4216148255721396; _uuid=9A3BB763-7225-78CC-94D7-3938FD109A91A19781infoc; video_page_version=v_old_home; CURRENT_BLACKGAP=0; buvid4=60F4F6A8-BD07-75CE-6EB7-5118460C275528572-022021016-+6moO9JdWYAEsjUJboQT0Q%3D%3D; buvid_fp_plain=undefined; buvid_fp=ecab3369621b9b28c528e5895b39564f; DedeUserID=32636793; DedeUserID__ckMd5=622f1525d2637f75; i-wanna-go-back=-1; b_ut=5; nostalgia_conf=-1; blackside_state=1; fingerprint3=ef6bf6051317ad42936b6778505e3eba; fingerprint=874001425f189070546c39e2ee1eb8a1; CURRENT_QUALITY=80; b_nut=100; SESSDATA=37d640c4%2C1678677109%2C4f3a2%2A91; bili_jct=16f0e7adfa1ebe155f44899fb75f531b; sid=7yzfgwao; theme_style=light; bp_video_offset_32636793=705754682710032500; b_lsid=6D74EB61_1833ECBA581; go_old_video=-1; innersign=1; CURRENT_FNVAL=4048; PVID=9"
}
res2 = requests.get(base_url1, params=params1, headers=headers)
res2_json = res2.json()
# The data field lists the dates that have danmaku
data_times = res2_json['data']
data_times
['2022-09-14', '2022-09-15']
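Because the failed request still returned valid JSON, a missing login only surfaces as a `KeyError` on `res2_json['data']`. A small guard can make the failure explicit. This is a sketch under a labeled assumption: it treats a `code` of 0 as success, which the responses shown here do not confirm, and the names `BilibiliApiError` and `unwrap` are hypothetical.

```python
class BilibiliApiError(RuntimeError):
    """Raised when an API payload reports a non-zero status code."""


def unwrap(payload: dict):
    """Return payload['data'], or raise if the API reported an error.

    Assumption: a `code` of 0 means success.
    """
    code = payload.get("code")
    if code != 0:
        raise BilibiliApiError(f"API error {code}: {payload.get('message')}")
    return payload.get("data")


try:
    unwrap({'code': -101, 'message': '账号未登录', 'ttl': 1})
except BilibiliApiError as exc:
    print(exc)
```

With this helper, `data_times = unwrap(res2.json())` would stop immediately with the server's own message when the cookie is missing or expired.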

Fetching the danmaku payload for each date

base_url2 = 'https://api.bilibili.com/x/v2/dm/web/history/seg.so'
text = []
for date in data_times:
    params2 = {
        'type': 1,
        'oid': oid,
        'date': date
    }
    res3 = requests.get(base_url2, params=params2, headers=headers)
    # Remove spaces, then keep only runs of Chinese characters
    tmp_text = re.findall('[\u4e00-\u9fa5]+', res3.text.replace(' ', ''))
    text.extend(tmp_text)
len(text)
2162
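To see what the character-range pattern keeps and discards, it helps to run it on a short string. The sketch below uses a hypothetical `sample` fragment, not real response data, and wraps the same pattern in an illustrative function `extract_chinese`.

```python
import re

# Same character range as above: runs of common Chinese characters
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def extract_chinese(raw: str) -> list:
    """Return every run of consecutive Chinese characters in `raw`."""
    return CHINESE_RUN.findall(raw.replace(' ', ''))


# Hypothetical fragment standing in for a decoded response body.
sample = 'id:1 弹幕测试abc 你 好\n123世界'
print(extract_chinese(sample))
```

Note that removing spaces first joins "你 好" into a single match, and that danmaku written only in Latin letters, digits or emoji are dropped by this approach.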