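❤️ What are the top bloggers studying? Analyzing top CSDN bloggers' favorites folders with a Python crawler: learn alongside them and you could be the next one ❤️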

Yang_River, published 2021-09-06 15:02 / 1302 reads

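Preface

The computer industry moves fast. Go a few days without learning and it can feel as if you have been left behind, so for programmers the most important thing is to keep up with the industry and pick up new technologies. Often, though, we don't know what to learn: a new technology that never gets widely adopted is too niche to help much with study or work. That is when we want to know what the experts are learning, because following them makes detours much less likely. So let's see what the top bloggers on CSDN have been saving to their favorites, and follow in their footsteps!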

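Program description

The program crawls CSDN for the public favorites folders of the top-ranked bloggers on the site, writes them to a csv file, analyzes the collected data for trends in what these experts are learning, and presents the result visually.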

Data crawling

We use the requests library to request the pages, and BeautifulSoup4 together with the json library to parse the responses.

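Getting the CSDN overall author ranking

First, we need the bloggers on the CSDN ranking and some information about them. Because the data is loaded dynamically (for more on dynamic loading, see the blog post 《渣男，你为什么有这么多小姐姐的照片？因为我Python爬虫学的好啊❤️！》), we open the browser's developer tools, where the requested JSON data can be found in the Network tab: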

Look at the request URLs:

https://blog.csdn.net/phoenix/web/blog/all-rank?page=0&pageSize=20
https://blog.csdn.net/phoenix/web/blog/all-rank?page=1&pageSize=20
...

Each request for JSON data returns 20 entries, so to get the top 100 bloggers we build the requests as follows:

    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    for i in range(5):
        url = url_rank_pattern.format(i)
        response = requests.get(url=url, headers=headers)
        # declare the page encoding
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

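Once the JSON data has been received, parse it with the json module (the re module would also work; pick whichever you prefer) to get the user information. For our purposes only the userName is needed, so only userName is parsed here; other fields can be extracted as required:

    userNames = []
    information = json.loads(str(soup))
    for j in information["data"]["allRankListItem"]:
        # get the user id
        userNames.append(j["userName"])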
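Since the endpoint already returns JSON, the body can also be decoded directly without passing it through BeautifulSoup. The following is a minimal sketch, assuming the response keeps the `data` / `allRankListItem` / `userName` layout used above. `build_rank_urls` and `extract_usernames` are hypothetical helper names, and the sample payload is made up for illustration:

```python
import json

RANK_URL = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"


def build_rank_urls(pages=5):
    """Return the URLs for the first `pages` ranking pages (20 users each)."""
    return [RANK_URL.format(i) for i in range(pages)]


def extract_usernames(payload):
    """Pull every userName out of one decoded ranking response."""
    return [item["userName"] for item in payload["data"]["allRankListItem"]]


# Made-up payload with the same shape as the response described above.
sample = json.loads(
    '{"data": {"allRankListItem": [{"userName": "user_a"}, {"userName": "user_b"}]}}'
)
print(extract_usernames(sample))
```

For a live request, `response.json()` from requests returns the decoded payload, so `extract_usernames(response.json())` could stand in for the BeautifulSoup step.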

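Getting the list of favorites folders

With the bloggers' userName values in hand, we look at a profile page to see how the list of favorites folders is requested. This article uses my own profile as the example (a little self-promotion). The analysis is similar to the previous step: switch to the "Favorites" tab on the profile page and again use the Network tab of the developer tools: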

Look at the URL that requests the list of favorites folders:

https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername=LOVEmy134611

Here the userName from the previous step comes into play: replacing the value of blogUsername returns the favorites folders of each blogger on the list. Likewise, when a blogger has more than 20 folders, changing the page value retrieves the rest:

    collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page=1&size=20&noMore=false&blogUsername={}"
    for userName in userNames:
        url = collections.format(userName)
        response = requests.get(url=url, headers=headers)
        # declare the page encoding
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

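Once the JSON data has been received, parse it with the json module to get the folder information. For our purposes only the folder id is needed, so only id is parsed here; other fields can be extracted as required (for example, the number of followers, to find the most popular folders):

    file_id_list = []
    information = json.loads(str(soup))
    # total number of favorites folders
    collection_number = information["data"]["total"]
    # collect the folder ids
    for j in information["data"]["list"]:
        file_id_list.append(j["id"])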
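The `total` field tells us how many pages to request when a blogger has more than 20 folders. A small sketch of that arithmetic follows; `page_count` and `page_numbers` are hypothetical helper names, and the page size of 20 is taken from the `size=20` parameter in the URL above:

```python
import math


def page_count(total, page_size=20):
    """Number of requests needed to list `total` items, `page_size` per request."""
    return max(1, math.ceil(total / page_size))


def page_numbers(total, page_size=20):
    """1-based page numbers to request, matching the `page` query parameter."""
    return list(range(1, page_count(total, page_size) + 1))


print(page_numbers(45))
```

The first page always has to be requested in order to learn `total`, which is why the count never drops below 1.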

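You may ask: CSDN currently has both a new and an old profile page, so are the requests the same? The answer is no. In the browser, the old version uses a different endpoint, but the new-style request still works for those profiles, so there is no need to distinguish between the two. The same holds when fetching the favorites themselves.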

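Getting the favorites

Finally, click the expand button on a favorites folder to see its contents, and again analyze the traffic in the Network tab of the developer tools: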

Look at the URL that requests a folder's contents:

https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername=LOVEmy134611&folderId=9406232&page=1&pageSize=200

The userName and folder id obtained earlier are enough to build the request for the items saved in a folder:

    file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page=1&pageSize=200"
    for file_id in file_id_list:
        url = file_url.format(userName, file_id)
        response = requests.get(url=url, headers=headers)
        # declare the page encoding
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

Finally, parse the result with the re module:

    user = user_dict[userName]
    user = preprocess(user)
    # title
    title_list = analysis(r'"title":"(.*?)",', str(soup))
    # link
    url_list = analysis(r'"url":"(.*?)"', str(soup))
    # author
    nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
    # date saved
    date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
    for i in range(len(title_list)):
        title = preprocess(title_list[i])
        url = preprocess(url_list[i])
        nickname = preprocess(nickname_list[i])
        date = preprocess(date_list[i])
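Regular expressions work here, but because the response is JSON, the same four fields can also be read from the decoded data, which keeps each title paired with its own link and date. This is only a sketch: the field names come from the patterns above, while the location of the entries under `data` / `list` is an assumption (the article only confirms that key for the folder-list response). `parse_items` is a hypothetical helper and the sample payload is made up:

```python
FIELDS = ("title", "url", "nickname", "dateTime")


def parse_items(payload):
    """Return (title, url, nickname, dateTime) tuples from one decoded response.

    Assumption: the saved items sit under payload["data"]["list"].
    """
    rows = []
    for item in (payload.get("data") or {}).get("list") or []:
        row = tuple(str(item.get(key) or "") for key in FIELDS)
        if all(row):  # skip entries with a missing field, as the full script does
            rows.append(row)
    return rows


# Made-up payload for illustration only.
sample = {
    "data": {
        "total": 2,
        "list": [
            {"title": "A, B", "url": "https://example.com/a",
             "nickname": "n1", "dateTime": "2021-09-01"},
            {"title": "", "url": "https://example.com/b",
             "nickname": "n2", "dateTime": "2021-09-02"},
        ],
    }
}
print(parse_items(sample))
```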

Complete crawler code

    import time
    import requests
    from bs4 import BeautifulSoup
    import os
    import json
    import re
    import csv

    if not os.path.exists("col_infor.csv"):
        # create the csv file that stores the data
        file = open("col_infor.csv", "w", encoding="utf-8-sig", newline="")
        csv_head = csv.writer(file)
        # header row (the "anthor" spelling is what the analysis script reads back)
        header = ["userName", "title", "url", "anthor", "date"]
        csv_head.writerow(header)
        file.close()

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def preprocess(string):
        # rows are written by hand below, so commas inside a field are replaced
        return string.replace(",", " ")

    url_rank_pattern = "https://blog.csdn.net/phoenix/web/blog/all-rank?page={}&pageSize=20"
    userNames = []
    user_dict = {}
    for i in range(5):
        url = url_rank_pattern.format(i)
        response = requests.get(url=url, headers=headers)
        # declare the page encoding
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        information = json.loads(str(soup))
        for j in information["data"]["allRankListItem"]:
            # get the user id and map it to the nickname
            userNames.append(j["userName"])
            user_dict[j["userName"]] = j["nickName"]

    def get_col_list(page, userName):
        collections = "https://blog.csdn.net/community/home-api/v1/get-favorites-created-list?page={}&size=20&noMore=false&blogUsername={}"
        url = collections.format(page, userName)
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        information = json.loads(str(soup))
        return information

    def analysis(item, results):
        pattern = re.compile(item, re.I | re.M)
        result_list = pattern.findall(results)
        return result_list

    def get_col(userName, file_id, col_page):
        file_url = "https://blog.csdn.net/community/home-api/v1/get-favorites-item-list?blogUsername={}&folderId={}&page={}&pageSize=200"
        url = file_url.format(userName, file_id, col_page)
        response = requests.get(url=url, headers=headers)
        response.encoding = "utf-8"
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # decode the response so the caller can read data.total
        information = json.loads(str(soup))
        user = user_dict[userName]
        user = preprocess(user)
        # title
        title_list = analysis(r'"title":"(.*?)",', str(soup))
        # link
        url_list = analysis(r'"url":"(.*?)"', str(soup))
        # author
        nickname_list = analysis(r'"nickname":"(.*?)",', str(soup))
        # date saved
        date_list = analysis(r'"dateTime":"(.*?)",', str(soup))
        for i in range(len(title_list)):
            title = preprocess(title_list[i])
            url = preprocess(url_list[i])
            nickname = preprocess(nickname_list[i])
            date = preprocess(date_list[i])
            if title and url and nickname and date:
                with open("col_infor.csv", "a+", encoding="utf-8-sig") as f:
                    f.write(user + "," + title + "," + url + "," + nickname + "," + date + "\n")
        return information

    for userName in userNames:
        page = 1
        file_id_list = []
        information = get_col_list(page, userName)
        # total number of favorites folders
        collection_number = information["data"]["total"]
        # collect the folder ids
        for j in information["data"]["list"]:
            file_id_list.append(j["id"])
        while collection_number > 20:
            page = page + 1
            collection_number = collection_number - 20
            information = get_col_list(page, userName)
            # collect the folder ids on the following pages
            for j in information["data"]["list"]:
                file_id_list.append(j["id"])
        # fetch the items in each folder
        for file_id in file_id_list:
            col_page = 1
            information = get_col(userName, file_id, col_page)
            number_col = information["data"]["total"]
            while number_col > 200:
                col_page = col_page + 1
                number_col = number_col - 200
                get_col(userName, file_id, col_page)
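The script builds each csv row by hand and relies on preprocess to strip commas out of every field. A possible alternative, shown here only as a sketch, is to let the standard csv module quote such fields so the titles are stored unchanged. `append_rows` is a hypothetical helper, and the row below is made-up sample data written to an in-memory buffer rather than to col_infor.csv:

```python
import csv
import io


def append_rows(file_obj, rows):
    """Write rows with csv.writer, which quotes fields that contain commas."""
    writer = csv.writer(file_obj)
    writer.writerows(rows)


buffer = io.StringIO()
append_rows(
    buffer,
    [("user_a", "Python, step by step", "https://example.com/1", "author_a", "2021-09-01")],
)
print(buffer.getvalue())
```

With a real file, opening it with `newline=""` (as the header-writing code in the script already does) avoids blank lines between rows on Windows.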

Crawl results

A sample of the crawled results:

Data analysis and visualization

Finally, we use the wordcloud library to draw a word cloud of what the bloggers have saved.

    from os import path
    from PIL import Image
    import jieba
    from wordcloud import WordCloud, STOPWORDS
    import pandas as pd
    import numpy as np

    df = pd.read_csv("col_infor.csv", encoding="utf-8-sig", usecols=["userName", "title", "url", "anthor", "date"])
    # join all titles into one string, skipping empty ones
    place_array = df["title"].dropna().astype(str).values
    place_list = "，".join(place_array)

    # directory of the current file
    d = path.dirname(__file__)
    # overwrite rather than append, so re-running does not duplicate the text
    with open(path.join(d, "text.txt"), "w", encoding="utf-8") as f:
        f.write(place_list)

    # read the whole text
    file = open(path.join(d, "text.txt"), encoding="utf-8").read()

    # word segmentation
    # stop words
    stopwords = ["的", "与", "和", "建议", "收藏", "使用", "了", "实现", "我", "中", "你", "在", "之"]
    text_split = jieba.cut(file)  # segmentation result with stop words still included (a generator)
    # segmentation result with stop words removed (a list)
    text_split_no = []
    for word in text_split:
        if word not in stopwords:
            text_split_no.append(word)
    text = " ".join(text_split_no)

    # background image used as the mask
    picture_mask = np.array(Image.open(path.join(d, "path.jpg")))
    stopwords = set(STOPWORDS)
    stopwords.add("said")
    wc = WordCloud(
        # font path; the font must contain Chinese glyphs
        font_path=r"C:/Windows/Fonts/simsun.ttc",
        # font_path=r"/usr/share/fonts/wps-office/simsun.ttc",
        background_color="white",
        max_words=2000,
        mask=picture_mask,
        stopwords=stopwords)
    # generate the word cloud
    wc.generate(text)
    # save the image
    wc.to_file(path.join(d, "result.jpg"))
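A word cloud shows relative weight but not the actual numbers. If exact counts are wanted, the segmented words can also be tallied with the standard library. The sketch below assumes the tokens have already been produced (for example by jieba.cut, as above); `top_words` is a hypothetical helper and the token list is made up:

```python
from collections import Counter


def top_words(tokens, stopwords, n=10):
    """Count tokens, ignoring stop words and whitespace-only tokens."""
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return counts.most_common(n)


# Made-up tokens standing in for the segmented titles.
tokens = ["Python", "的", "爬虫", "Python", "，", "算法", "爬虫", "Python"]
print(top_words(tokens, stopwords={"的", "，"}, n=2))
```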

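Copyright of this article belongs to the author; please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite this article's address: https://www.ucloud.cn/yun/119310.html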
