Using a Python crawler to build a word cloud from Douban reviews of Dunkirk
Published: 2019-06-26


This is a movie I have been wanting to see recently, so I went to Zhihu to look at the reviews. Since I happen to be learning Python web crawling, I turned it into a small example.

The code is modified from a third-party source. Original article: http://python.jobbole.com/88325/#comment-94754

# coding:utf-8
__author__ = 'hang'

import warnings
warnings.filterwarnings("ignore")

import re
import jieba                        # Chinese word segmentation
import numpy                        # numerical helpers (used for numpy.size)
import pandas as pd
import urllib2
from bs4 import BeautifulSoup as bs

import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
import matplotlib.pyplot as plt
from wordcloud import WordCloud     # word cloud package


# Parse the "now playing" page and collect the ids and names of current movies
def getNowPlayingMovie_list():
    resp = urllib2.urlopen('https://movie.douban.com/nowplaying/hangzhou/')
    html_data = resp.read().decode('utf-8')
    soup = bs(html_data, 'html.parser')
    nowplaying_movie = soup.find_all('div', id='nowplaying')
    nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')
    nowplaying_list = []
    for item in nowplaying_movie_list:
        nowplaying_dict = {}
        nowplaying_dict['id'] = item['data-subject']
        for tag_img_item in item.find_all('img'):
            nowplaying_dict['name'] = tag_img_item['alt']
            nowplaying_list.append(nowplaying_dict)
    return nowplaying_list


# Crawl one page of short comments for a given movie id
def getCommentsById(movieId, pageNum):
    eachCommentStr = ''
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return False
    requrl = 'https://movie.douban.com/subject/' + movieId + '/comments' + '?' + 'start=' + str(start) + '&limit=20'
    print(requrl)
    resp = urllib2.urlopen(requrl)
    html_data = resp.read()
    soup = bs(html_data, 'html.parser')
    comment_div_lits = soup.find_all('div', class_='comment')
    for item in comment_div_lits:
        if item.find_all('p')[0].string is not None:
            eachCommentStr += item.find_all('p')[0].string
    return eachCommentStr.strip()


def main():
    # Fetch the first 10 pages of comments for the first movie in the list
    commentStr = ''
    NowPlayingMovie_list = getNowPlayingMovie_list()
    for i in range(10):
        num = i + 1
        commentList_temp = getCommentsById(NowPlayingMovie_list[0]['id'], num)
        commentStr += commentList_temp.strip()

    # Strip punctuation and whitespace
    cleaned_comments = re.sub("[\s+\.\!\/_,$%^*(+\"\')]+|[+——()?【】《》<>,“”!,...。?、~@#¥%……&*()]+", "", commentStr)
    print(cleaned_comments)

    # Chinese word segmentation with jieba
    segment = jieba.lcut(cleaned_comments)
    words_df = pd.DataFrame({'segment': segment})

    # Remove stop words (quoting=3 disables quoting entirely)
    stopwords = pd.read_csv(r"D:\pycode\stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
    words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
    print(words_df)

    # Count word frequencies (the column 计数 means "count")
    words_stat = words_df.groupby(by=['segment'])['segment'].agg({"计数": numpy.size})
    words_stat = words_stat.reset_index().sort_values(by=["计数"], ascending=False)

    # Render the word cloud from the 1000 most frequent words
    wordcloud = WordCloud(font_path=r"D:\pycode\simhei.ttf", background_color="white", max_font_size=80)
    word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
    word_frequence_list = []
    for key in word_frequence:
        temp = (key, word_frequence[key])
        word_frequence_list.append(temp)
    wordcloud = wordcloud.fit_words(dict(word_frequence_list))

    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


main()

Posted on 2017-09-05 17:49

Reposted from: https://www.cnblogs.com/cheman/p/7479872.html
