Downloading a single video
Sending a GET request
Target URL
    import json
    import re

    import requests

    url = 'https://www.bilibili.com/video/BV1bU4y1775K'

    response = requests.get(url)
Getting the video title
    # The title sits between "title":" and ","pubdate": in the JSON embedded in the page
    title = re.findall('"title":"(.*?)","pubdate":', response.text)[0]
    title
'今天是袁隆平诞辰91周年,怀念…'
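Windows file names cannot contain any of the characters \ / : * ? " < > |, so they have to be removed from the title. The pattern below strips spaces as well.
    # Raw string: without the r prefix, '\\:' reaches the regex engine as an escaped colon, so backslashes are never matched
    title = re.sub(r'[/\\:*?"<>| ]', '', title)
    title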
'今天是袁隆平诞辰91周年,怀念…'
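The clean-up step above can be wrapped in a small reusable helper. This is only a sketch: the name safe_filename, the 200-character cap and the 'untitled' fallback are assumptions made for illustration, not part of the original code. Unlike the pattern above, it keeps inner spaces.

```python
import re

# Characters Windows rejects in file names: \ / : * ? " < > |
_INVALID = re.compile(r'[\\/:*?"<>|]')

def safe_filename(name: str, max_len: int = 200) -> str:
    """Return `name` without Windows-reserved characters (hypothetical helper)."""
    cleaned = _INVALID.sub('', name).strip()
    # Windows also refuses names that end in a dot or a space
    cleaned = cleaned.rstrip('. ')
    # Fall back to a placeholder if nothing usable is left
    return cleaned[:max_len] or 'untitled'

print(safe_filename('a/b:c*d?.mp4'))
print(safe_filename('x\\y|z'))
print(safe_filename('???'))
```

Compiling the pattern once keeps the helper cheap when it is called for every episode of a series.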
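Parsing the video URL
    # Grab the JSON assigned to window.__playinfo__ in an inline <script> tag
    json_file = re.findall('<script>window.__playinfo__=(.*?)</script>', response.text)[0]
The captured text turns out to be JSON, so it can be converted to a dict for the parsing that follows.
    res_json = json.loads(json_file)
    res_json['data']['dash'].keys()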
dict_keys(['duration', 'minBufferTime', 'min_buffer_time', 'video', 'audio', 'dolby', 'flac'])
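The two steps above (regex capture, then json.loads) fail with a bare IndexError when the page does not contain the script tag. A minimal sketch of a helper that reports the problem more clearly is shown here; extract_playinfo is a hypothetical name and the sample page is synthetic, standing in for response.text.

```python
import json
import re

def extract_playinfo(html: str) -> dict:
    """Return the parsed window.__playinfo__ object, or raise ValueError."""
    match = re.search(r'<script>window\.__playinfo__=(.*?)</script>', html, re.S)
    if match is None:
        raise ValueError('window.__playinfo__ not found in page')
    return json.loads(match.group(1))

# Tiny synthetic page standing in for response.text
sample = ('<html><script>window.__playinfo__='
          '{"data":{"dash":{"video":[],"audio":[]}}}</script></html>')
info = extract_playinfo(sample)
print(sorted(info['data']['dash'].keys()))
```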
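The keys of dash include separate video and audio fields, which shows that Bilibili serves the audio and the video as separate streams.
    # Take the first video stream in the list
    video_url = res_json['data']['dash']['video'][0]['baseUrl']
    video_url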
'https://upos-sz-mirror08ct.bilivideo.com/upgcxcode/93/00/404120093/404120093_nb2-1-30064.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1663151930&gen=playurlv2&os=08ctbv&oi=2003056842&trid=eca493d85bed4bfabaa956f7e6c92ce0u&mid=0&platform=pc&upsig=6a6615ff1df40c1f003e64874e27383c&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform&bvc=vod&nettype=0&orderid=0,3&agrr=1&bw=110694&logo=80000000'
Parsing the audio URL
    # Take the first audio stream in the list
    audio_url = res_json['data']['dash']['audio'][0]['baseUrl']
    audio_url
'https://upos-sz-estgoss.bilivideo.com/upgcxcode/93/00/404120093/404120093_nb2-1-30280.m4s?e=ig8euxZM2rNcNbdlhoNvNC8BqJIzNbfqXBvEqxTEto8BTrNvN0GvT90W5JZMkX_YN0MvXg8gNEV4NC8xNEV4N03eN0B5tZlqNxTEto8BTrNvNeZVuJ10Kj_g2UB02J0mN0B5tZlqNCNEto8BTrNvNC7MTX502C8f2jmMQJ6mqF2fka1mqx6gqj0eN0B599M=&uipk=5&nbs=1&deadline=1663151930&gen=playurlv2&os=upos&oi=2003056842&trid=eca493d85bed4bfabaa956f7e6c92ce0u&mid=0&platform=pc&upsig=15941e8938cb1bff44b5ecf719123273&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,mid,platform&bvc=vod&nettype=0&orderid=0,3&agrr=1&bw=39999&logo=80000000'
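The code above always takes entry [0] of the video and audio lists. If the entries carry a numeric quality indicator, the stream could be chosen explicitly instead. The sketch below assumes each entry has a 'bandwidth' key; that key name, the 'id' values and the example URLs are assumptions for illustration and should be checked against the real res_json['data']['dash'] contents.

```python
def pick_stream(streams, key='bandwidth'):
    """Return the entry with the largest value for `key` (assumed numeric)."""
    if not streams:
        raise ValueError('no streams available')
    # Entries without the key are treated as bandwidth 0
    return max(streams, key=lambda s: s.get(key, 0))

# Synthetic entries; real ones would come from res_json['data']['dash']['video']
video_streams = [
    {'id': 32, 'bandwidth': 110694, 'baseUrl': 'https://example.invalid/low.m4s'},
    {'id': 80, 'bandwidth': 950000, 'baseUrl': 'https://example.invalid/high.m4s'},
]
best = pick_stream(video_streams)
print(best['baseUrl'])
```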
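As with the image scraping earlier, we tried pasting the audio and video URLs straight into the browser's address bar. The server answered 403 Forbidden, which indicates that the resource exists but cannot be accessed directly from the browser.
Adding the anti-hotlinking Referer to the request headers
    # First attempt: a user-agent only, no Referer
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    }

    video_res = requests.get(video_url, headers=headers)
    video_res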
<Response [403]>
Without the Referer header the data cannot be fetched: the request returns the same 403 error.
    # Second attempt: send the video page URL as the Referer
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'referer': url
    }

    video_res = requests.get(video_url, headers=headers)
    video_res
<Response [200]>
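Reading video_res.content loads the whole file into memory. For long videos, writing the body chunk by chunk is gentler. The sketch below shows only the writing side with a hypothetical helper, save_stream, and in-memory demo data; the commented lines indicate how it might be fed from a streamed requests response.

```python
import os
import tempfile
from typing import Iterable

def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to `path` and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written

# Demo with in-memory chunks instead of a network response
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
n = save_stream([b'abc', b'', b'de'], demo_path)
print(n)

# With requests, the chunks would come from a streamed response, e.g.:
#   with requests.get(video_url, headers=headers, stream=True) as r:
#       r.raise_for_status()
#       save_stream(r.iter_content(chunk_size=1 << 20), title + '.mp4')
```

The 1 MiB chunk size in the comment is an arbitrary choice.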
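    # Save the video stream
    with open(title + '.mp4', 'wb') as f:
        f.write(video_res.content)

    audio_res = requests.get(audio_url, headers=headers)
    audio_res

    # The audio is served as an .m4s segment (see audio_url); the .mp3 extension is only a file-name label
    with open(title + '.mp3', 'wb') as f:
        f.write(audio_res.content)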
Wrapping the code in a function
    def get_audio_and_vedio(url, title=None):
        """Save the video and audio streams of one page as <title>.mp4 and <title>.mp3."""
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
            'referer': url
        }
        response = requests.get(url)
        if not title:
            title = re.findall('"title":"(.*?)","pubdate":', response.text)[0]
        # Strip characters that Windows forbids in file names, plus spaces
        title = re.sub(r'[/\\:*?"<>| ]', '', title)
        json_file = re.findall('<script>window.__playinfo__=(.*?)</script>', response.text)[0]
        res_json = json.loads(json_file)
        video_url = res_json['data']['dash']['video'][0]['baseUrl']
        audio_url = res_json['data']['dash']['audio'][0]['baseUrl']
        video_res = requests.get(video_url, headers=headers)
        audio_res = requests.get(audio_url, headers=headers)
        with open(title + '.mp4', 'wb') as f:
            f.write(video_res.content)
        with open(title + '.mp3', 'wb') as f:
            f.write(audio_res.content)

    get_audio_and_vedio('https://www.bilibili.com/video/BV1Qe4y1b7y4?vd_source=7f9e6b6e1c2f8486b2f6f3d6520c63fb')
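Downloading every episode of a multi-part video
Target URL
Getting the information for all episodes
    start_url = 'https://www.bilibili.com/video/BV1ps411x7rm'

    response = requests.get(start_url)

    import json

    # The page embeds a "pages" array with one entry per episode; wrap it in braces to get valid JSON
    res_json = json.loads('{' + re.findall(r'("pages":\[.*?\])', response.text)[0] + '}')
    print('Number of episodes in this video:', len(res_json['pages']))
Number of episodes in this video: 31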
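The non-greedy pattern above stops at the first ']' it meets, so an episode title that itself contains ']' would cut the array short. One possible alternative, sketched below, lets the json module find the end of the array with JSONDecoder.raw_decode. The helper name extract_pages and the sample string (including its "page" and "owner" fields) are made up for illustration; only the "pages" and "part" keys come from the code above.

```python
import json

def extract_pages(html: str) -> list:
    """Decode the JSON array that follows the first '"pages":' marker."""
    marker = '"pages":'
    start = html.find(marker)
    if start == -1:
        raise ValueError('"pages" array not found')
    # raw_decode parses one complete JSON value and ignores whatever follows it
    pages, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return pages

# Synthetic snippet; note the ']' inside the first title
sample = ('x={"videoData":{"pages":[{"page":1,"part":"intro [part 1]"},'
          '{"page":2,"part":"numpy"}],"owner":{}}}')
pages = extract_pages(sample)
print(len(pages), pages[0]['part'])
```

raw_decode expects the value to start exactly at the given index, which holds here because the marker is immediately followed by '['.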
Fetching all episodes in batch
    # Episode n lives at start_url?p=n; its title is stored under the 'part' key
    for i in range(len(res_json['pages'])):
        url = start_url + f'?p={i+1}'
        print('Fetching:', url)
        get_audio_and_vedio(url, res_json['pages'][i]['part'])
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=1
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=2
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=3
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=4
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=5
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=6
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=7
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=8
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=9
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=10
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=11
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=12
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=13
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=14
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=15
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=16
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=17
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=18
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=19
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=20
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=21
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=22
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=23
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=24
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=25
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=26
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=27
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=28
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=29
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=30
Fetching: https://www.bilibili.com/video/BV1ps411x7rm?p=31
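Merging all the audio and video files in batch
First, download ffmpeg from its official website and add its bin directory to the system PATH environment variable.
    import os

    # os.listdir returns names in arbitrary order; sorted() makes the order deterministic
    a_list = sorted(i for i in os.listdir('.') if i.endswith('.mp3'))
    a_list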
['10面向对象vsmatlabstyle.mp3',
'11子图subplot.mp3',
'12多图figure.mp3',
'13网格.mp3',
'14图例legend.mp3',
'15坐标轴范围.mp3',
'16坐标轴刻度.mp3',
'17添加坐标轴.mp3',
'18注释.mp3',
'19文字.mp3',
'1课程简介和环境搭建.mp3',
'20Tex公式.mp3',
'21工具栏.mp3',
'22区域填充.mp3',
'23形状.mp3',
'24样式美化.mp3',
'25极坐标.mp3',
'26函数积分图一.mp3',
'27函数积分图二.mp3',
'28散点条形图一.mp3',
'29散点条形图二.mp3',
'2Numpy简介.mp3',
'30球员能力图一.mp3',
'31球员能力图二.mp3',
'3散点图.mp3',
'4折线图.mp3',
'5条形图.mp3',
'6直方图.mp3',
'7饼状图.mp3',
'8箱形图.mp3',
'9颜色和样式.mp3']
    # Same filter for the video files, sorted the same way so the two lists line up
    v_list = sorted(i for i in os.listdir('.') if i.endswith('.mp4'))
    v_list
['10面向对象vsmatlabstyle.mp4',
'11子图subplot.mp4',
'12多图figure.mp4',
'13网格.mp4',
'14图例legend.mp4',
'15坐标轴范围.mp4',
'16坐标轴刻度.mp4',
'17添加坐标轴.mp4',
'18注释.mp4',
'19文字.mp4',
'1课程简介和环境搭建.mp4',
'20Tex公式.mp4',
'21工具栏.mp4',
'22区域填充.mp4',
'23形状.mp4',
'24样式美化.mp4',
'25极坐标.mp4',
'26函数积分图一.mp4',
'27函数积分图二.mp4',
'28散点条形图一.mp4',
'29散点条形图二.mp4',
'2Numpy简介.mp4',
'30球员能力图一.mp4',
'31球员能力图二.mp4',
'3散点图.mp4',
'4折线图.mp4',
'5条形图.mp4',
'6直方图.mp4',
'7饼状图.mp4',
'8箱形图.mp4',
'9颜色和样式.mp4']
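Both listings above are in plain string order, which is why '10…' appears before '1课程…', and any positional pairing of the two lists relies on them containing exactly the same stems. A sketch of a more defensive approach is to pair the files by name and sort the pairs numerically. The helper names natural_key and pair_by_stem are hypothetical; the demo file names are taken from the listings above.

```python
import os
import re

def natural_key(name: str):
    """Sort key that compares digit runs numerically, so '2x' sorts before '10x'."""
    parts = re.split(r'(\d+)', name)
    # With a capturing group, odd positions hold the digit runs
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def pair_by_stem(audio_files, video_files):
    """Return (audio, video) pairs whose names match once the extension is dropped."""
    videos = {os.path.splitext(v)[0]: v for v in video_files}
    pairs = [(a, videos[os.path.splitext(a)[0]])
             for a in audio_files if os.path.splitext(a)[0] in videos]
    return sorted(pairs, key=lambda p: natural_key(p[0]))

a_demo = ['10面向对象vsmatlabstyle.mp3', '1课程简介和环境搭建.mp3', '2Numpy简介.mp3']
v_demo = ['2Numpy简介.mp4', '10面向对象vsmatlabstyle.mp4', '1课程简介和环境搭建.mp4']
for pair in pair_by_stem(a_demo, v_demo):
    print(pair)
```

Audio files without a matching video are silently skipped in this sketch; a real script might prefer to report them.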
    outpath = 'merged'  # assumed output directory name; it must differ from the input directory
    os.makedirs(outpath, exist_ok=True)
    for file1, file2 in zip(a_list, v_list):
        result = outpath + '/' + file2
        print(f'Merging: {file1} and {file2}')
        # -acodec copy / -vcodec copy keep both streams as they are, without re-encoding
        os.system(f'ffmpeg -i "{file1}" -i "{file2}" -acodec copy -vcodec copy "{result}"')
Merging: 10面向对象vsmatlabstyle.mp3 and 10面向对象vsmatlabstyle.mp4
Merging: 11子图subplot.mp3 and 11子图subplot.mp4
Merging: 12多图figure.mp3 and 12多图figure.mp4
Merging: 13网格.mp3 and 13网格.mp4
Merging: 14图例legend.mp3 and 14图例legend.mp4
Merging: 15坐标轴范围.mp3 and 15坐标轴范围.mp4
Merging: 16坐标轴刻度.mp3 and 16坐标轴刻度.mp4
Merging: 17添加坐标轴.mp3 and 17添加坐标轴.mp4
Merging: 18注释.mp3 and 18注释.mp4
Merging: 19文字.mp3 and 19文字.mp4
Merging: 1课程简介和环境搭建.mp3 and 1课程简介和环境搭建.mp4
Merging: 20Tex公式.mp3 and 20Tex公式.mp4
Merging: 21工具栏.mp3 and 21工具栏.mp4
Merging: 22区域填充.mp3 and 22区域填充.mp4
Merging: 23形状.mp3 and 23形状.mp4
Merging: 24样式美化.mp3 and 24样式美化.mp4
Merging: 25极坐标.mp3 and 25极坐标.mp4
Merging: 26函数积分图一.mp3 and 26函数积分图一.mp4
Merging: 27函数积分图二.mp3 and 27函数积分图二.mp4
Merging: 28散点条形图一.mp3 and 28散点条形图一.mp4
Merging: 29散点条形图二.mp3 and 29散点条形图二.mp4
Merging: 2Numpy简介.mp3 and 2Numpy简介.mp4
Merging: 30球员能力图一.mp3 and 30球员能力图一.mp4
Merging: 31球员能力图二.mp3 and 31球员能力图二.mp4
Merging: 3散点图.mp3 and 3散点图.mp4
Merging: 4折线图.mp3 and 4折线图.mp4
Merging: 5条形图.mp3 and 5条形图.mp4
Merging: 6直方图.mp3 and 6直方图.mp4
Merging: 7饼状图.mp3 and 7饼状图.mp4
Merging: 8箱形图.mp3 and 8箱形图.mp4
Merging: 9颜色和样式.mp3 and 9颜色和样式.mp4
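The merge step above hands ffmpeg a command string through os.system, which becomes fragile once file names contain spaces or quotes, and it ignores failures. A possible alternative, sketched here, passes an argument list to subprocess.run. The helper names build_ffmpeg_cmd and merge and the demo file names are illustrative only; the ffmpeg options used (-y, -i, -c:a copy, -c:v copy) are standard ones.

```python
import subprocess

def build_ffmpeg_cmd(audio: str, video: str, output: str) -> list:
    """Return an ffmpeg argument list that combines audio and video without re-encoding."""
    return ['ffmpeg', '-y',                  # -y: overwrite the output file if it exists
            '-i', audio, '-i', video,
            '-c:a', 'copy', '-c:v', 'copy',  # copy both streams unchanged
            output]

def merge(audio: str, video: str, output: str) -> None:
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status
    subprocess.run(build_ffmpeg_cmd(audio, video, output), check=True)

# Only build and show the command here; calling merge() requires ffmpeg on PATH
cmd = build_ffmpeg_cmd('1 intro.mp3', '1 intro.mp4', 'merged/1 intro.mp4')
print(cmd)
```

Because each argument is a separate list element, no shell quoting is needed even for names with spaces.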
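Scraping danmaku (bullet comment) data
Target URL
Requesting the video page
    url = 'https://www.bilibili.com/video/BV1BT411M7bJ'
    print(url)
    res = requests.get(url)  # page source, used in the next step to find the oid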
https://www.bilibili.com/video/BV1BT411M7bJ
Extracting the oid
    # The oid is the value stored under "cids": {"1": ...} in the page source
    oid = re.findall('"cids":{"1":(.*?)}', res.text)[0]
    oid
'832061733'
For a given month, getting every date that has danmaku
    base_url1 = 'https://api.bilibili.com/x/v2/dm/history/index'

    month = '2022-09'  # assumed value in YYYY-MM form, chosen to match the dates returned further down
    params1 = {
        'month': month,
        'type': '1',
        'oid': oid
    }

    res2 = requests.get(base_url1, params=params1)

    res2_json = res2.json()
    res2_json
{'code': -101, 'message': '账号未登录', 'ttl': 1}
The response (code -101, message "account not logged in") shows that this endpoint requires a login, so we send the request again with a cookie.
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
        'cookie': "buvid3=52C61008-A32D-4BCE-8246-0C9CD437E254143093infoc; rpdid=|(u))lkuRk)u0J'uY|lJmm)Rk; LIVE_BUVID=AUTO4216148255721396; _uuid=9A3BB763-7225-78CC-94D7-3938FD109A91A19781infoc; video_page_version=v_old_home; CURRENT_BLACKGAP=0; buvid4=60F4F6A8-BD07-75CE-6EB7-5118460C275528572-022021016-+6moO9JdWYAEsjUJboQT0Q%3D%3D; buvid_fp_plain=undefined; buvid_fp=ecab3369621b9b28c528e5895b39564f; DedeUserID=32636793; DedeUserID__ckMd5=622f1525d2637f75; i-wanna-go-back=-1; b_ut=5; nostalgia_conf=-1; blackside_state=1; fingerprint3=ef6bf6051317ad42936b6778505e3eba; fingerprint=874001425f189070546c39e2ee1eb8a1; CURRENT_QUALITY=80; b_nut=100; SESSDATA=37d640c4%2C1678677109%2C4f3a2%2A91; bili_jct=16f0e7adfa1ebe155f44899fb75f531b; sid=7yzfgwao; theme_style=light; bp_video_offset_32636793=705754682710032500; b_lsid=6D74EB61_1833ECBA581; go_old_video=-1; innersign=1; CURRENT_FNVAL=4048; PVID=9"
    }

    res2 = requests.get(base_url1, params=params1, headers=headers)

    res2_json = res2.json()
    data_times = res2_json['data']  # list of dates that have danmaku
    data_times
['2022-09-14', '2022-09-15']
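Both responses above share the same envelope with 'code', 'message' and 'ttl' fields, so the login failure can be detected before 'data' is read. The sketch below assumes that code 0 means success; BiliApiError and unwrap are hypothetical names, and the successful payload is a synthetic example built from the output above.

```python
class BiliApiError(RuntimeError):
    """Raised when the JSON envelope reports a non-zero code (hypothetical helper)."""

def unwrap(payload: dict):
    """Return payload['data'], or raise if payload['code'] is not 0 (assumed success code)."""
    code = payload.get('code')
    if code != 0:
        raise BiliApiError(f"API error {code}: {payload.get('message')}")
    return payload.get('data')

# Synthetic success envelope, then the failure seen earlier without a cookie
ok = {'code': 0, 'message': '0', 'ttl': 1, 'data': ['2022-09-14', '2022-09-15']}
print(unwrap(ok))

denied = {'code': -101, 'message': '账号未登录', 'ttl': 1}
try:
    unwrap(denied)
except BiliApiError as exc:
    print(exc)
```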
Fetching the danmaku payload for each date
    base_url2 = 'https://api.bilibili.com/x/v2/dm/web/history/seg.so'

    text = []
    for i in range(len(data_times)):
        params2 = {
            'type': 1,
            'oid': oid,
            'date': data_times[i]
        }
        res3 = requests.get(base_url2, params=params2, headers=headers)
        # Keep only the runs of Chinese characters found in the response body
        tmp_text = re.findall('[\u4e00-\u9fa5]+', res3.text.replace(' ', ''))
        text.extend(tmp_text)
    len(text)  # number of text fragments collected
2162
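To make the extraction step concrete: the pattern [\u4e00-\u9fa5]+ matches runs of consecutive Chinese characters and drops everything else, including letters, digits and control bytes. The sketch below applies it to a made-up string and counts repeated fragments with collections.Counter. The sample text and the helper name chinese_fragments are invented for illustration, and the counting step is an extra that is not part of the original code.

```python
import re
from collections import Counter

CJK_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def chinese_fragments(raw: str) -> list:
    """Return every run of consecutive Chinese characters in `raw`, ignoring spaces."""
    return CJK_RUN.findall(raw.replace(' ', ''))

# Made-up sample mixing control bytes, ASCII and Chinese text
sample = '\x0a\x1f abc前方 高能 123!哈哈哈\x12哈哈哈'
frags = chinese_fragments(sample)
print(frags)
print(Counter(frags).most_common(1))
```

Because spaces are removed before matching, words separated only by spaces merge into one fragment, which is also how the loop above behaves.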