发送get请求 1 2 url = "http://hzbmmc.com/"
1 2 response = requests.get(url)
200
状态码为200说明访问成功
'<!DOCTYPE html>\r\n<html lang="en">\r\n\r\n<head>\r\n <meta charset="UTF-8" />\r\n <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n <meta name="referrer" content="no-referrer" />\r\n <link rel="icon" href="./images/v2_qmyicon.jpg" />\r\n <title>â\x80\x9cå\x8d\x8eä¸\xadæ\x9d¯â\x80\x9d大å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9b</title>\r\n <meta name="description" content="为äº\x86æ¿\x80å\x8a±å\xad¦ç\x94\x9få\xad¦ä¹\xa0æ\x95°å\xad¦ç\x9a\x84积æ\x9e\x81æ\x80§ï¼\x8cæ\x8b\x93å±\x95æ\x95°å\xad¦å\x8f\x8aç\x9b¸å\x85³å\xad¦ç§\x91ç\x9f¥è¯\x86é\x9d¢ï¼\x8cæ\x8f\x90é«\x98å\xad¦ç\x94\x9fç\x8b¬ç«\x8bå\x88\x86æ\x9e\x90ã\x80\x81建模ã\x80\x81æ±\x82解ã\x80\x81åº\x94ç\x94¨ã\x80\x81å\x86\x99ä½\x9cç\xad\x89综å\x90\x88è\x83½å\x8a\x9bï¼\x8cæ¹\x96å\x8c\x97ç\x9c\x81å·¥ä¸\x9aä¸\x8eåº\x94ç\x94¨æ\x95°å\xad¦å\xad¦ä¼\x9aä½\x9c为主å\x8a\x9eï¼\x8c举å\x8a\x9eâ\x80\x9cå\x8d\x8eä¸\xadæ\x9d¯â\x80\x9d大å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9bã\x80\x82â\x80\x9cå\x8d\x8eä¸\xadæ\x9d¯â\x80\x9d大å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9bç³»å\x8e\x9få\x8d\x8eä¸\xadå\x9c°å\x8cºå¤§å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡é\x82\x80请èµ\x9bï¼\x8cå·²è¿\x9eç»\xad举å\x8a\x9eå\x8d\x81äº\x8cå±\x8aï¼\x8c2020å¹´ç\x94±äº\x8eç\x96«æ\x83\x85å\x8e\x9få\x9b\xa0ï¼\x8cæ\x9a\x82å\x81\x9cä¸\x80å¹´ã\x80\x822021å¹´æ\x9b´å\x90\x8d为â\x80\x9cå\x8d\x8eä¸\xadæ\x9d¯â\x80\x9d大å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9bã\x80\x82">\r\n <meta name="Keywords" content="å\x8d\x8eä¸\xadæ\x9d¯ï¼\x8cå\x8d\x8eä¸\xadæ\x9d¯å¤§å\xad¦ç\x94\x9fæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9bï¼\x8cæ\x95°å\xad¦å»ºæ¨¡ï¼\x8cæ\x95°å\xad¦å»ºæ¨¡æ\x8c\x91æ\x88\x98èµ\x9bï¼\x8c泰迪æ\x99ºè\x83½ç§\x91æ\x8a\x80ï¼\x88æ\xad¦æ±\x89ï¼\x89æ\x9c\x89é\x99\x90å\x85¬å\x8f¸ï¼\x8c泰迪æ\x99ºè\x83½ç§\x91æ\x8a\x80">\r\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n <link rel="stylesheet" type="text/css" href="./lib/bootstrap-3.3.7-dist/css/bootstrap.min.css" />\r\n <link rel="stylesheet" href="./lib/swiper/css/swiper-bundle.min.css">\r\n <link rel="stylesheet" type="text/css" href="./style/index/index.css" />\r\n <link rel="stylesheet" type="text/css" href="./style/public/header/nav.css" />\r\n <script src="./lib/less/less.min.js"></script>\r\n</head>\r\n\r\n<body>\r\n\r\n <!-- 头é\x83¨å\x8cºå\x9f\x9f -->\r\n <header>\r\n <div class="hzb-header w">\r\n <div class="hzb-header-logo">\r\n <img src="./images/v2_qmyf0e.png" class="logo-img" />\r\n </div>\r\n <div id="hzb-header-nav" class="hzb-header-text">\r\n \r\n </div>\r\n <div class="hzb-header-login"></div>\r\n </div>\r\n </header>\r\n\r\n <!-- è½®æ\x92\xadå\x9b¾ -->\r\n <div class="hzb-swiper">\r\n <div class="hzb-swiper-item w">\r\n <div class="swiper-container" id="top-banner">\r\n <div class="swiper-wrapper" id="swiper-box">\r\n </div>\r\n <!-- Add Pagination -->\r\n <div class="swiper-pagination" id="swiper-pagination"></div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <!-- å\x8d\x8eä¸\xadæ\x9d¯å\x8f\x82èµ\x9bé\x98\x9fä¼\x8d -->\r\n <div class="hzb-troops w"></div>\r\n\r\n <!-- å\x86\x85容é\x83¨å\x88\x86 -->\r\n <div class="content">\r\n <div class="w" style="display: flex;">\r\n <div class="item fl" id="click">\r\n <div class="item_title">\r\n <h3>大èµ\x9bç®\x80ä»\x8b</h3>\r\n </div>\r\n <div class="item_n">\r\n <div class="item_p">\r\n <p></p>\r\n </div>\r\n </div>\r\n </div>\r\n <div class="item fr">\r\n <div class="item_title">\r\n <h3>æ\x9c\x80æ\x96°é\x80\x9aç\x9f¥</h3>\r\n <a href="./views/inform/inform.html?navigate=inform">æ\x9b´å¤\x9a ></a>\r\n </div>\r\n <div class="item_n">\r\n <ul id="inform">\r\n \r\n </ul> \r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n <!-- <div class="hzb-item w">\r\n <div class="hzb-left-item-newsflash">\r\n <div class="hzb-left-newsflash-news">\r\n <div class="hzb-left-newsflash-news-hd">\r\n <h5 class="hzb-left-newsflash-news-hd-text">大èµ\x9bç®\x80ä»\x8b</h5>\r\n <a href="./views/organization/brief.html" class="hzb-left-newsflash-news-hd-mode">æ\x9b´å¤\x9a<span\r\n class="glyphicon glyphicon-menu-right"></span></a>\r\n </div>\r\n <div class="hzb-left-newsflash-news-bd">\r\n <ul class="hzb-left-newsflash-news-bd-list">\r\n <li class="hzb-left-newsflash-news-bd-list-item">\r\n <span class="hzb-left-newsflash-news-bd-list-item-text">\r\n </span>\r\n </li>\r\n </ul>\r\n </div>\r\n </div>\r\n </div>\r\n <div class="hzb-right-item-newsflash">\r\n <div class="hzh-right-newsflash-news">\r\n <div class="hzh-right-newsflash-news-hd">\r\n <h5 class="hzh-right-newsflash-news-hd-text">æ\x9c\x80æ\x96°é\x80\x9aç\x9f¥</h5>\r\n <a href="./views/inform/inform.html" class="hzh-right-newsflash-news-hd-mode">æ\x9b´å¤\x9a<span\r\n class="glyphicon glyphicon-menu-right"></span></a>\r\n </div>\r\n <div class="hzh-right-newsflash-news-bd">\r\n \r\n </div>\r\n </div>\r\n </div>\r\n </div> -->\r\n\r\n <!-- å\x90\x88ä½\x9cå\x8d\x95ä½\x8d -->\r\n <div class="hzb-cooperator w">\r\n <div class="hzb-cooperator-text">å\x90\x88ä½\x9cå\x8d\x95ä½\x8d</div>\r\n <div class="hzb-cooperator-images">\r\n <div class="hzb-cooperator-images-item">\r\n <div class="swiper-container" id="swiper-footer">\r\n <div class="swiper-wrapper" id="swiper-wrapper-footer">\r\n <div class="swiper-slide cooperator-images-footer">slider1</div>\r\n <div class="swiper-slide cooperator-images-footer">slider2</div>\r\n <div class="swiper-slide cooperator-images-footer">slider3</div>\r\n </div>\r\n <div class="swiper-button-prev"></div>\r\n <div class="swiper-button-next"></div> \r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <!-- åº\x95é\x83¨ -->\r\n <div class="hzb-bottom">\r\n\r\n </div>\r\n <!-- åº\x95é\x83¨å¤\x87æ¡\x88 -->\r\n <div class="hzb-bottom-records">\r\n <p class="hzb-bottom-records-text w">\r\n ©2021 å\x8d\x8eä¸\xadæ\x9d¯ å¤\x87æ¡\x88å\x8f·ï¼\x9a\r\n <span class="hzb-bottom-records-text-item">\r\n <a style="text-decoration: none; " href="http://beian.miit.gov.cn/">é\x84\x82ICPå¤\x872021001283 </a>\r\n </span> | é\x84\x82å\x85¬ç½\x91å®\x89å¤\x87 13011403590714å\x8f·\r\n </p>\r\n </div>\r\n\r\n <!-- è½®æ\x92\xadå\x9b¾ -->\r\n <script type="text/template" id="BannerList">\r\n {{ each list }}\r\n <div class="swiper-slide">\r\n <img src="{{$value.cover}}" alt="">\r\n </div>\r\n {{ /each }}\r\n </script>\r\n\r\n\r\n\r\n <script src="./lib/jquery/jquery.min.js"></script>\r\n <script src="./js/public/nav.js"></script>\r\n <script src="./js/common/utils.js"></script>\r\n <script src="./lib/bootstrap-3.3.7-dist/js/bootstrap.min.js"></script>\r\n <script src="./lib/template-web/template-web.js"></script>\r\n <script src="./lib/swiper/js/swiper-bundle.min.js"></script>\r\n <script src="./js/index/request.js"></script>\r\n <script src="./js/index/index.js"></script>\r\n \r\n</body>\r\n\r\n</html>'
'ISO-8859-1'
1 2 response.encoding = 'UTF-8'
'<!DOCTYPE html>\r\n<html lang="en">\r\n\r\n<head>\r\n <meta charset="UTF-8" />\r\n <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n <meta name="referrer" content="no-referrer" />\r\n <link rel="icon" href="./images/v2_qmyicon.jpg" />\r\n <title>“华中杯”大学生数学建模挑战赛</title>\r\n <meta name="description" content="为了激励学生学习数学的积极性,拓展数学及相关学科知识面,提高学生独立分析、建模、求解、应用、写作等综合能力,湖北省工业与应用数学学会作为主办,举办“华中杯”大学生数学建模挑战赛。“华中杯”大学生数学建模挑战赛系原华中地区大学生数学建模邀请赛,已连续举办十二届,2020年由于疫情原因,暂停一年。2021年更名为“华中杯”大学生数学建模挑战赛。">\r\n <meta name="Keywords" content="华中杯,华中杯大学生数学建模挑战赛,数学建模,数学建模挑战赛,泰迪智能科技(武汉)有限公司,泰迪智能科技">\r\n <meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n <link rel="stylesheet" type="text/css" href="./lib/bootstrap-3.3.7-dist/css/bootstrap.min.css" />\r\n <link rel="stylesheet" href="./lib/swiper/css/swiper-bundle.min.css">\r\n <link rel="stylesheet" type="text/css" href="./style/index/index.css" />\r\n <link rel="stylesheet" type="text/css" href="./style/public/header/nav.css" />\r\n <script src="./lib/less/less.min.js"></script>\r\n</head>\r\n\r\n<body>\r\n\r\n <!-- 头部区域 -->\r\n <header>\r\n <div class="hzb-header w">\r\n <div class="hzb-header-logo">\r\n <img src="./images/v2_qmyf0e.png" class="logo-img" />\r\n </div>\r\n <div id="hzb-header-nav" class="hzb-header-text">\r\n \r\n </div>\r\n <div class="hzb-header-login"></div>\r\n </div>\r\n </header>\r\n\r\n <!-- 轮播图 -->\r\n <div class="hzb-swiper">\r\n <div class="hzb-swiper-item w">\r\n <div class="swiper-container" id="top-banner">\r\n <div class="swiper-wrapper" id="swiper-box">\r\n </div>\r\n <!-- Add Pagination -->\r\n <div class="swiper-pagination" id="swiper-pagination"></div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <!-- 华中杯参赛队伍 -->\r\n <div class="hzb-troops w"></div>\r\n\r\n <!-- 内容部分 -->\r\n <div class="content">\r\n <div class="w" style="display: flex;">\r\n <div class="item fl" id="click">\r\n <div class="item_title">\r\n <h3>大赛简介</h3>\r\n </div>\r\n <div class="item_n">\r\n <div class="item_p">\r\n <p></p>\r\n </div>\r\n </div>\r\n </div>\r\n <div class="item fr">\r\n <div class="item_title">\r\n <h3>最新通知</h3>\r\n <a href="./views/inform/inform.html?navigate=inform">更多 ></a>\r\n </div>\r\n <div class="item_n">\r\n <ul id="inform">\r\n \r\n </ul> \r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n <!-- <div class="hzb-item w">\r\n <div class="hzb-left-item-newsflash">\r\n <div class="hzb-left-newsflash-news">\r\n <div class="hzb-left-newsflash-news-hd">\r\n <h5 class="hzb-left-newsflash-news-hd-text">大赛简介</h5>\r\n <a href="./views/organization/brief.html" class="hzb-left-newsflash-news-hd-mode">更多<span\r\n class="glyphicon glyphicon-menu-right"></span></a>\r\n </div>\r\n <div class="hzb-left-newsflash-news-bd">\r\n <ul class="hzb-left-newsflash-news-bd-list">\r\n <li class="hzb-left-newsflash-news-bd-list-item">\r\n <span class="hzb-left-newsflash-news-bd-list-item-text">\r\n </span>\r\n </li>\r\n </ul>\r\n </div>\r\n </div>\r\n </div>\r\n <div class="hzb-right-item-newsflash">\r\n <div class="hzh-right-newsflash-news">\r\n <div class="hzh-right-newsflash-news-hd">\r\n <h5 class="hzh-right-newsflash-news-hd-text">最新通知</h5>\r\n <a href="./views/inform/inform.html" class="hzh-right-newsflash-news-hd-mode">更多<span\r\n class="glyphicon glyphicon-menu-right"></span></a>\r\n </div>\r\n <div class="hzh-right-newsflash-news-bd">\r\n \r\n </div>\r\n </div>\r\n </div>\r\n </div> -->\r\n\r\n <!-- 合作单位 -->\r\n <div class="hzb-cooperator w">\r\n <div class="hzb-cooperator-text">合作单位</div>\r\n <div class="hzb-cooperator-images">\r\n <div class="hzb-cooperator-images-item">\r\n <div class="swiper-container" id="swiper-footer">\r\n <div class="swiper-wrapper" id="swiper-wrapper-footer">\r\n <div class="swiper-slide cooperator-images-footer">slider1</div>\r\n <div class="swiper-slide cooperator-images-footer">slider2</div>\r\n <div class="swiper-slide cooperator-images-footer">slider3</div>\r\n </div>\r\n <div class="swiper-button-prev"></div>\r\n <div class="swiper-button-next"></div> \r\n </div>\r\n </div>\r\n </div>\r\n </div>\r\n\r\n <!-- 底部 -->\r\n <div class="hzb-bottom">\r\n\r\n </div>\r\n <!-- 底部备案 -->\r\n <div class="hzb-bottom-records">\r\n <p class="hzb-bottom-records-text w">\r\n ©2021 华中杯 备案号:\r\n <span class="hzb-bottom-records-text-item">\r\n <a style="text-decoration: none; " href="http://beian.miit.gov.cn/">鄂ICP备2021001283 </a>\r\n </span> | 鄂公网安备 13011403590714号\r\n </p>\r\n </div>\r\n\r\n <!-- 轮播图 -->\r\n <script type="text/template" id="BannerList">\r\n {{ each list }}\r\n <div class="swiper-slide">\r\n <img src="{{$value.cover}}" alt="">\r\n </div>\r\n {{ /each }}\r\n </script>\r\n\r\n\r\n\r\n <script src="./lib/jquery/jquery.min.js"></script>\r\n <script src="./js/public/nav.js"></script>\r\n <script src="./js/common/utils.js"></script>\r\n <script src="./lib/bootstrap-3.3.7-dist/js/bootstrap.min.js"></script>\r\n <script src="./lib/template-web/template-web.js"></script>\r\n <script src="./lib/swiper/js/swiper-bundle.min.js"></script>\r\n <script src="./js/index/request.js"></script>\r\n <script src="./js/index/index.js"></script>\r\n \r\n</body>\r\n\r\n</html>'
1 2 3 with open ('hzb.txt' , 'w' , encoding='UTF-8' ) as f: f.write(response.text)
网页解析 解析普通的文本
1 re.findall('<title>(.*?)</title>' , response.text)[0 ]
'“华中杯”大学生数学建模挑战赛'
1 re.findall('<meta name="description" content="(.*?)">\r\n <meta name="Keywo' , response.text)
['为了激励学生学习数学的积极性,拓展数学及相关学科知识面,提高学生独立分析、建模、求解、应用、写作等综合能力,湖北省工业与应用数学学会作为主办,举办“华中杯”大学生数学建模挑战赛。“华中杯”大学生数学建模挑战赛系原华中地区大学生数学建模邀请赛,已连续举办十二届,2020年由于疫情原因,暂停一年。2021年更名为“华中杯”大学生数学建模挑战赛。']
将图片地址解析出来,并保存 1 2 3 img_url = re.findall('<img src="./(.*?)" class="logo-img"' , response.text) img_path = url + img_url[0 ] img_path
'http://hzbmmc.com/images/v2_qmyf0e.png'
1 2 response2 = requests.get(img_path)
1 2 3 4 5 with open ('hzb.png' , 'wb' ) as f: f.write(response2.content)
设置请求头 抓取豆瓣短评 1 requests.get('https://movie.douban.com/subject/35503125/comments?limit=20&status=P&sort=new_score' )
<Response [418]>
错误响应,正常响应应该为200
1 2 3 headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36' }
1 2 response3 = requests.get('https://movie.douban.com/subject/35503125/comments?limit=20&status=P&sort=new_score' , headers = headers)
解析短评 获取一页的所有短评 1 2 comment = re.findall('<span class="short">(.*?)</span>' , response3.text) len (comment)
20
获取指定页面的内容 网址不同页面的规律为: n:页面的页数https://movie.douban.com/subject/35503125/comments?start=(n-1)*20&limit=20&status=P&sort=new_score
1 2 3 4 5 6 7 n = int (input ('请输入要获取的指定页面的评论:' )) url = f'https://movie.douban.com/subject/35503125/comments?start={(n-1 )*20 } &limit=20&status=P&sort=new_score' response3 = requests.get(url, headers = headers) comment = re.findall('<span class="short">(.*?)</span>' , response3.text) comment
请输入要获取的指定页面的评论:13
[]
批量获取指定页面的所有评论 1 2 3 4 5 6 7 8 9 10 11 12 13 n = int (input ('请输入要获取的评论页面数量:' )) comments = [] for i in range (1 , n+1 ): print (f'正在获取第{i} 页的评论数据。' ) url = f'https://movie.douban.com/subject/35503125/comments?start={(i-1 )*20 } &limit=20&status=P&sort=new_score' response3 = requests.get(url, headers = headers) comment = re.findall('<span class="short">(.*?)</span>' , response3.text) print (f'第{i} 页的评论有{len (comment)} 条' ) comments.extend(comment) print ('-' *50 )print (f'爬取完毕,共获取{n} 页的评论数据,共{len (comments)} 条' )
请输入要获取的评论页面数量:13
正在获取第1页的评论数据。
第1页的评论有20条
正在获取第2页的评论数据。
第2页的评论有18条
正在获取第3页的评论数据。
第3页的评论有19条
正在获取第4页的评论数据。
第4页的评论有18条
正在获取第5页的评论数据。
第5页的评论有18条
正在获取第6页的评论数据。
第6页的评论有17条
正在获取第7页的评论数据。
第7页的评论有18条
正在获取第8页的评论数据。
第8页的评论有20条
正在获取第9页的评论数据。
第9页的评论有17条
正在获取第10页的评论数据。
第10页的评论有20条
正在获取第11页的评论数据。
第11页的评论有19条
正在获取第12页的评论数据。
第12页的评论有0条
正在获取第13页的评论数据。
第13页的评论有0条
--------------------------------------------------
爬取完毕,共获取13页的评论数据,共204条
可以看到第12页和第13页的评论数为0,说明在未登录的状态下仅能访问前11页的评论数据。
保存用户的登录信息 1 2 3 4 headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36' , 'Cookie' : 'douban-fav-remind=1; _vwo_uuid_v2=DEB283CE2FCF01818BD91DB0E42F46F11|22c8862909bb03723aae0ddea873763b; __utmv=30149280.18583; gr_user_id=989fb263-49b3-4d65-b8b2-6a24983315ab; ll="118254"; bid=BLjurjtLS3g; Hm_lvt_16a14f3002af32bf3a75dfe352478639=1645777078; __utmz=30149280.1653440226.28.18.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmz=223695111.1662952292.23.16.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __gads=ID=a40ba2b7513008b5-22805c8e30d70093:T=1662952295:RT=1662952295:S=ALNI_MaFVSLmeZninsXuQYULbkw8BWsTYA; __gpi=UID=000005b285c76c49:T=1653386100:RT=1662952295:S=ALNI_MZuqO0l4dGet48Z8GDQ80-qliaqWA; ap_v=0,6.0; __utmc=30149280; __utmc=223695111; dbcl2="185835619:ijKAyzNy4R4"; ck=2cwF; push_noty_num=0; push_doumail_num=0; __utma=30149280.89340240.1604890009.1662963811.1662967806.34; __utmt_douban=1; __utma=223695111.1159646691.1607870887.1662963811.1662967806.25; __utmb=223695111.0.10.1662967806; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1662967806%2C%22https%3A%2F%2Fwww.douban.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utmb=30149280.2.10.1662967806; _pk_id.100001.4cf6=af92e5e3a944480c.1607870887.25.1662967838.1662965915.' }
1 2 3 4 5 6 7 n = int (input ('请输入要获取的指定页面的评论:' )) url = f'https://movie.douban.com/subject/35503125/comments?start={(n-1 )*20 } &limit=20&status=P&sort=new_score' response3 = requests.get(url, headers = headers) name = re.findall('<a href="https://www.douban.com/people/.+/" class="">(.*?)</a>' , response3.text) comment = re.findall(r'<span class="short">(.*?)</span>' , response3.text)
请输入要获取的指定页面的评论:13
19
1 2 3 4 5 txt = """ <p class=" comment-content"> <span class="short">【影院】 很尴尬,更尴尬的是人家在笑,我在独自尴尬…</span> </p>"""
1 re.findall(r'<span class="short">(.*?)</span>' , txt)
[]
1 2 re.findall(r'<span class="short">(.*?)</span>' , txt, re.S)
['【影院】\n很尴尬,更尴尬的是人家在笑,我在独自尴尬…']
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 n = int (input ('请输入要获取的评论页面数量:' )) names = [] comments = [] times = [] for i in range (1 , n+1 ): print (f'正在获取第{i} 页的评论数据。' ) url = f'https://movie.douban.com/subject/35503125/comments?start={(i-1 )*20 } &limit=20&status=P&sort=new_score' response3 = requests.get(url, headers = headers) name = re.findall('<a href="https://www.douban.com/people/.+/" class="">(.*?)</a>' , response3.text) comment = re.findall('<span class="short">(.*?)</span>' , response3.text, re.S) time = re.findall('<span class="comment-time " title="(.*?)">' , response3.text) print (f'第{i} 页的评论有{len (name)} 条' ) names.extend(name) comments.extend(comment) times.extend(time) print ('-' *50 )print (f'爬取完毕,共获取{n} 页的评论数据,共{len (comments)} 条' )
请输入要获取的评论页面数量:20
正在获取第1页的评论数据。
第1页的评论有20条
正在获取第2页的评论数据。
第2页的评论有20条
正在获取第3页的评论数据。
第3页的评论有20条
正在获取第4页的评论数据。
第4页的评论有20条
正在获取第5页的评论数据。
第5页的评论有20条
正在获取第6页的评论数据。
第6页的评论有20条
正在获取第7页的评论数据。
第7页的评论有20条
正在获取第8页的评论数据。
第8页的评论有20条
正在获取第9页的评论数据。
第9页的评论有20条
正在获取第10页的评论数据。
第10页的评论有20条
正在获取第11页的评论数据。
第11页的评论有20条
正在获取第12页的评论数据。
第12页的评论有20条
正在获取第13页的评论数据。
第13页的评论有20条
正在获取第14页的评论数据。
第14页的评论有20条
正在获取第15页的评论数据。
第15页的评论有20条
正在获取第16页的评论数据。
第16页的评论有20条
正在获取第17页的评论数据。
第17页的评论有20条
正在获取第18页的评论数据。
第18页的评论有20条
正在获取第19页的评论数据。
第19页的评论有20条
正在获取第20页的评论数据。
第20页的评论有20条
--------------------------------------------------
爬取完毕,共获取20页的评论数据,共400条
存储数据 存储单个数据到txt文件中 1 2 3 with open ('comment.txt' , 'w' , encoding='UTF-8' ) as f: for i in comments: f.writelines(i + '\n' )
存储多个数据到excel文件
1 2 3 data = pd.DataFrame([names, times, comments], index=['用户昵称' , '评论时间' , '评论内容' ]).T
1 data.to_excel('data.xlsx' , encoding='GBK' )
将获取到的评论数据绘制成词云图 1 2 import jiebafrom wordcloud import WordCloud
1 2 3 4 with open ('stopword.txt' , 'r' ) as f: stop_word = f.readlines() stop_word = [i.strip() for i in stop_word]
1 2 3 4 5 6 def cut_message (str_ ): res = jieba.lcut(str_) res = [i for i in res if i not in stop_word] return res
1 data['切分结果' ] = data['评论内容' ].apply(cut_message)
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ming_log\AppData\Local\Temp\jieba.cache
Loading model cost 0.891 seconds.
Prefix dict has been built successfully.
1 2 all_words = [] _ = [all_words.extend(i) for i in data['切分结果' ]]
1 2 from collections import Counterimport matplotlib.pyplot as plt
1 count_word = Counter(all_words)
1 2 3 4 5 6 7 8 9 mask = plt.imread('duihuakuang.jpg' ) wc = WordCloud(mask=mask, background_color='white' , font_path=r'C:\Windows\Fonts\simhei.ttf' , max_font_size=100 , max_words=100 )
1 2 wc.fit_words(count_word)
<wordcloud.wordcloud.WordCloud at 0x263c33e07f0>
1 2 3 4 5 6 7 8 import matplotlib.pyplot as pltplt.figure(figsize=(5 ,5 ), dpi=500 ) plt.axis('off' ) plt.imshow(wc) plt.show()