
Analyzing Lagou Job Postings with Spark (Part 2): Getting the Data

caiyongji / 3084 reads


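What data do we want?

The data we want is public data that can be obtained easily. A complete dataset would of course be ideal, but one is usually hard to come by through reasonably legitimate means. For the real-time job postings this series studies, the postings from the most recent month are enough.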

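How do we get the data?

A crawler would work too, as a fallback. However, I noticed that Lagou updates its own data through ajax requests, which makes bulk retrieval simpler. There are many ways to fetch data from an ajax endpoint; here I demonstrate the one I find simplest and most general: using curl to simulate the ajax request.

Note that all of the steps below were performed in Google Chrome on a Mac; some operations may differ slightly in other browsers. The last step gives the extracted, general-purpose curl script. If you do not care much about the individual steps, you can use that script directly.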

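1. Find the target city and target position, then sort by "newest". Reference link: http://www.lagou.com/jobs/list_iOS?px=new&city=北京#order

2. Two-finger click or right-click the page, choose "Inspect" from the context menu to open the browser's developer tools, and switch to the Network -> XHR tab.

3. Press cmd + R to reload the page. The XHR requests sent by the page are now captured. Find the request that starts with http://www.lagou.com/jobs/pos..., two-finger click or right-click it, and choose Copy as cURL.

This curl command is very long. For this analysis, the key part is the trailing pn=1&kd=iOS, which stand for the page number and the position keyword respectively. Setting them dynamically lets you fetch more data for more positions; the rest of this article covers that in detail.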

curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=1&kd=iOS" --compressed

4. Paste the curl command from the previous step into a terminal and press Enter to see the output.

{"success":true,"requestId":null,"msg":null,"resubmitToken":null,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15,"locationInfo":{"city":"北京","district":null,"queryByGisCode":false,"businessZone":null,"locationCode":null},"queryAnalysisInfo":{"positionName":"ios","companyName":null,"usefulCompany":false,"industryName":null},"strategyProperty":{"name":"dm-csearch-newSimScorer","id":1},"result":[{"companyId":129801,"companyShortName":"言之有物科技","createTime":"2016-08-30 19:28:12","positionId":1857486,"positionAdvantage":"一线公司,技术驱动,免费三餐,超期望回报","salary":"25k-50k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"iOS高级研发工程师/Lead","companyLogo":"i/image/M00/43/4E/CgqKkVeDGsuAXz0gAAA4XeGAAHQ390.png","financeStage":"成长型(A轮)","industryField":"移动互联网,电子商务","jobNature":"全职","approve":1,"companySize":"15-50人","district":null,"companyLabelList":["股票期权","扁平管理","美女多","领导好"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:28发布","gradeDescription":null,"companyFullName":"北京言之有物科技有限公司","businessZones":null,"imState":"today","lastLogin":1472556472000,"publisherId":5092848,"explain":null,"plus":null,"pcShow":0},{"companyId":133,"companyShortName":"猎豹移动","createTime":"2016-08-30 19:09:34","positionId":2151896,"positionAdvantage":"明星产品 超赞年终奖 靠谱领导","salary":"15k-30k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/39/70/CgYXBlWo3nqABJTsAADJ3hn5gmE062.jpg","financeStage":"上市公司","industryField":"移动互联网,信息安全","jobNature":"全职","approve":1,"companySize":"500-2000人","district":"朝阳区","companyLabelList":["带薪年假","美女前台","超赞年终奖","一公里工作圈"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:09发布","gradeDescription":null,"companyFullName":"北京金山网络科技有限公司","businessZones":["姚家园","十里堡","高碑店"],"imState":"today","lastLogin":1472555392000,"publisherId":129969,"explain":null,"plus":null,"pcShow":0},{"companyId":107608,"companyShortName":"MUM计算机","createTime":"2016-08-30 
19:03:24","positionId":1963945,"positionAdvantage":"帮助程序员赴美做IT,享受高薪高品质生活","salary":"10k-20k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"IOS程序员赴美项目推广员","companyLogo":"i/image/M00/00/C2/CgqKkVZVHmSAWPtRAASUg0iUVuI932.jpg","financeStage":"初创型(不需要融资)","industryField":"教育","jobNature":"全职","approve":0,"companySize":"少于15人","district":"昌平区","companyLabelList":["赴美工作","美元薪水","告别996","技术前沿"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:03发布","gradeDescription":null,"companyFullName":"北京玛赫西计算机教育咨询有限公司","businessZones":null,"imState":"disabled","lastLogin":1472558059000,"publisherId":5179699,"explain":null,"plus":null,"pcShow":0},{"companyId":67576,"companyShortName":"车满满","createTime":"2016-08-30 18:47:30","positionId":2307877,"positionAdvantage":"期权","salary":"20k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS高级开发工程师","companyLogo":"i/image/M00/01/47/Cgp3O1ZmYACABBpPAAGzVR5S-Ps906.png","financeStage":"成长型(A轮)","industryField":"移动互联网","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["股票期权","技能培训","弹性工作","定期体检"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:47发布","gradeDescription":null,"companyFullName":"车满满(北京)信息技术有限公司","businessZones":["建外大街","CBD","国贸"],"imState":"today","lastLogin":1472566873000,"publisherId":2116322,"explain":null,"plus":null,"pcShow":0},{"companyId":1575,"companyShortName":"百度","createTime":"2016-08-30 18:30:05","positionId":2307765,"positionAdvantage":"BAT 
薪酬福利好","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS移动开发","companyLogo":"image1/M00/00/06/CgYXBlTUWAWAOBXrAABGHHFb0q8748.jpg","financeStage":"上市公司","industryField":"移动互联网,数据服务","jobNature":"全职","approve":1,"companySize":"2000人以上","district":null,"companyLabelList":["股票期权","弹性工作","五险一金","免费班车"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:30发布","gradeDescription":null,"companyFullName":"百度在线网络技术(北京)有限公司","businessZones":null,"imState":"disabled","lastLogin":1472553001000,"publisherId":5705515,"explain":null,"plus":null,"pcShow":0},{"companyId":13321,"companyShortName":"FunPlus 趣加游戏","createTime":"2016-08-30 18:26:28","positionId":2240276,"positionAdvantage":"国际一线团队,无限的成长空间,任你发挥","salary":"18k-36k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS 视频处理工程师/高级工程师","companyLogo":"image1/M00/00/1A/Cgo8PFTUWFWAKE5aAABwJ1mgAYw423.png","financeStage":"成长型(B轮)","industryField":"游戏","jobNature":"全职","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":["绩效奖金","股票期权","专项奖金","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:26发布","gradeDescription":null,"companyFullName":"北京趣加科技有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472552889000,"publisherId":285309,"explain":null,"plus":null,"pcShow":0},{"companyId":15111,"companyShortName":"联拓天际","createTime":"2016-08-30 
18:22:12","positionId":2307696,"positionAdvantage":"与其在别处仰望,不如在这里并肩","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/00/1D/Cgo8PFTUWGGAZQdjAADRNZVO9fc470.jpg","financeStage":"成熟型(不需要融资)","industryField":"电子商务","jobNature":"全职","approve":1,"companySize":"500-2000人","district":null,"companyLabelList":["五险一金","午餐补助","定期体检","技能培训"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:22发布","gradeDescription":null,"companyFullName":"北京联拓天际电子商务有限公司","businessZones":null,"imState":"today","lastLogin":1472552392000,"publisherId":1595082,"explain":null,"plus":null,"pcShow":0},{"companyId":119049,"companyShortName":"优久科技","createTime":"2016-08-30 18:15:29","positionId":1853231,"positionAdvantage":"良好的工作环境、成长平台和工作伙伴","salary":"10k-18k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"i/image/M00/16/74/CgqKkVbvnVuAeC-YAAA_YSPyb5A166.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"少于15人","district":"海淀区","companyLabelList":["交通补助","通讯津贴","午餐补助"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:15发布","gradeDescription":null,"companyFullName":"北京优久科技有限责任公司","businessZones":["中关村","知春路","人民大学"],"imState":"today","lastLogin":1472552013000,"publisherId":4427723,"explain":null,"plus":null,"pcShow":0},{"companyId":41878,"companyShortName":"商询科技","createTime":"2016-08-30 
18:14:06","positionId":2278393,"positionAdvantage":"微软创业团队,工程师文化!","salary":"10k-15k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS开发","companyLogo":"i/image/M00/24/22/Cgp3O1cZmpWAGslpAAA9MdgVNWU645.jpg","financeStage":"成长型(A轮)","industryField":"企业服务,数据服务","jobNature":"全职","approve":1,"companySize":"15-50人","district":"朝阳区","companyLabelList":["股票期权","人脉资源","办公环境好","国际化团队"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:14发布","gradeDescription":null,"companyFullName":"北京商询科技有限公司","businessZones":["姚家园"],"imState":"today","lastLogin":1472554153000,"publisherId":803257,"explain":null,"plus":null,"pcShow":0},{"companyId":5832,"companyShortName":"新浪微博","createTime":"2016-08-30 18:02:30","positionId":254885,"positionAdvantage":"亿级别DAU,微博重点项目组","salary":"20k-40k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"新浪微博iOS客户端研发工程师","companyLogo":"image1/M00/00/0D/CgYXBlTUWCCAdkhOAABNgyvZQag818.jpg","financeStage":"上市公司","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"2000人以上","district":"海淀区","companyLabelList":["年底双薪","专项奖金","股票期权","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:02发布","gradeDescription":null,"companyFullName":"微梦创科网络科技(中国)有限公司","businessZones":["西北旺","马连洼","上地"],"imState":"disabled","lastLogin":1472556144000,"publisherId":561302,"explain":null,"plus":null,"pcShow":0},{"companyId":48321,"companyShortName":"合广众","createTime":"2016-08-30 
18:00:40","positionId":2263615,"positionAdvantage":"老板nice","salary":"10k-20k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"i/image/M00/01/D6/CgqKkVZ496GAYypzAAAKATKLXuY379.png","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"50-150人","district":"海淀区","companyLabelList":["节日礼物","带薪年假","绩效奖金","岗位晋升"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:00发布","gradeDescription":null,"companyFullName":"北京合广众文化发展有限公司","businessZones":["八里庄","定慧寺","四季青"],"imState":"today","lastLogin":1472550077000,"publisherId":3608518,"explain":null,"plus":null,"pcShow":0},{"companyId":38239,"companyShortName":"Keep","createTime":"2016-08-30 17:52:25","positionId":2076872,"positionAdvantage":"福利健全、北京工作居住证、C轮","salary":"25k-35k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"image1/M00/0A/40/CgYXBlTun9KASqKdAAAs36QVurU409.png","financeStage":"成熟型(C轮)","industryField":"社交网络,文化娱乐","jobNature":"全职","approve":1,"companySize":"150-500人","district":null,"companyLabelList":["节日礼物","年度旅游","定期体检","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京卡路里科技有限公司","businessZones":null,"imState":"today","lastLogin":1472550738000,"publisherId":3425178,"explain":null,"plus":null,"pcShow":0},{"companyId":179,"companyShortName":"她理财","createTime":"2016-08-30 17:52:02","positionId":982402,"positionAdvantage":"五险一金 绩效奖金 年底15薪 
带薪年假","salary":"15k-25k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"高级iOS开发工程师","companyLogo":"image1/M00/0C/F2/CgYXBlT2mG2AOPevAAB_09mD2Ko247.png","financeStage":"成长型(A轮)","industryField":"电子商务,金融","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["年底双薪","节日礼物","技能培训","绩效奖金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京新工场投资顾问有限公司","businessZones":["大望路","华贸","百子湾"],"imState":"today","lastLogin":1472557005000,"publisherId":97147,"explain":null,"plus":null,"pcShow":0},{"companyId":11053,"companyShortName":"中科三方","createTime":"2016-08-30 17:33:13","positionId":2307276,"positionAdvantage":"留用机会,户口指标","salary":"2k-4k","score":0,"workYear":"应届毕业生","education":"本科","city":"北京","positionName":"iOS实习生","companyLogo":"image1/M00/00/16/CgYXBlTUWEWAXnWbAACvz96W4qA927.jpg","financeStage":"成长型(不需要融资)","industryField":"移动互联网","jobNature":"实习","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":null,"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:33发布","gradeDescription":null,"companyFullName":"北京中科三方网络技术有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472549621000,"publisherId":141237,"explain":null,"plus":null,"pcShow":0},{"companyId":116183,"companyShortName":"情非得已","createTime":"2016-08-30 
17:28:11","positionId":1786957,"positionAdvantage":"五险一金、无限小吃、Mac办公、定期体检","salary":"8k-15k","score":0,"workYear":"1-3年","education":"不限","city":"北京","positionName":"android&iOS测试工程师","companyLogo":"i/image/M00/1C/58/CgqKkVcB1QyAJM2-AAA4t6tVzs8439.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网,企业服务","jobNature":"全职","approve":0,"companySize":"15-50人","district":"朝阳区","companyLabelList":["定期体检","年度旅游","领导好","扁平管理"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:28发布","gradeDescription":null,"companyFullName":"情非得已(北京)科技有限公司","businessZones":["建外大街","国贸","CBD"],"imState":"today","lastLogin":1472553855000,"publisherId":4170237,"explain":null,"plus":null,"pcShow":0}]}},"code":0}

As you can see, this matches exactly the data actually shown on the first page of the site.
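
For a quick look at the response without a JSON parser, a rough sketch like the following can pull single numeric fields out of the compact JSON. The field names totalCount and pageSize come from the response above; the helper name json_int and the trimmed sample string are made up for illustration. This is plain text matching, not real JSON parsing, so a proper parser would be the safer choice for anything beyond a sanity check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first integer value stored under a given key.
# It reads compact JSON on stdin and uses text matching only (no JSON parser).
json_int() {
  local key="$1"
  grep -o "\"$key\":[0-9]*" | head -n 1 | cut -d: -f2
}

# A trimmed stand-in for the response shown above (assumption: same field names).
sample='{"success":true,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15}},"code":0}'

printf '%s\n' "$sample" | json_int totalCount   # prints 974
printf '%s\n' "$sample" | json_int pageSize     # prints 15
```

With a real response the same helper could read from curl's output or from a saved file instead of the sample string.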

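How do we save the data to a file?

Saving the curl result straight to a file makes further processing easier. The way to do it is the redirection operator >. The following command writes the curl result to the file 1.json instead of printing it to the console: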

curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=1&kd=iOS" --compressed > 1.json
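
Because the redirect creates 1.json even when the request fails, it may be worth checking a saved file before using it. The sketch below assumes that a good response contains the text "success":true, as in the output shown earlier; the function name page_ok and the temporary files are hypothetical and exist only so the example runs offline.

```shell
#!/usr/bin/env bash
# Hypothetical check: a saved page counts as usable if the file is non-empty
# and contains the "success":true marker seen in the sample response.
page_ok() {
  local file="$1"
  [ -s "$file" ] && grep -q '"success":true' "$file"
}

# Offline demonstration with stand-in files instead of real downloads.
demo_dir=$(mktemp -d)
printf '%s' '{"success":true,"code":0}' > "$demo_dir/good.json"
printf '%s' '<html>blocked</html>'      > "$demo_dir/bad.json"
: > "$demo_dir/empty.json"

for f in "$demo_dir"/*.json; do
  if page_ok "$f"; then
    echo "ok:    ${f##*/}"
  else
    echo "retry: ${f##*/}"
  fi
done
```

Only good.json is reported as ok here; the other two would be candidates for a retry.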
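
How do we get data for other positions?

This needs slightly more shell syntax. In short, a for ... in loop iterates over a given set of positions, changes the value of the kd field at the end of the earlier curl command, and writes each result to a file named after the position. Note that the quotes after --data must be double quotes rather than single quotes; otherwise the variable is not expanded. The complete code follows; add to the list of positions as needed: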

for kd in "Java" "PHP" "C" "C++" "Android" "iOS"
do 
curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=1&kd=$kd" --compressed > $kd.json 
done  
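
One caveat with keywords such as "C++": the request body is sent as application/x-www-form-urlencoded, and in that encoding a literal + stands for a space, so the keyword may not reach the server as intended. Whether Lagou's server behaves this way is an assumption that was not verified here. curl's --data-urlencode option is one way to handle it; the sketch below shows the same idea in plain bash. The names urlencode and build_payload are hypothetical, and only ASCII keywords are considered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode an ASCII string for a form body.
# Letters, digits and . ~ _ - pass through; every other character becomes %XX.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      *) out+=$(printf '%%%02X' "'$c") ;;   # "'c" makes printf use the character code
    esac
  done
  printf '%s' "$out"
}

# Hypothetical helper: build the POST body used after --data in the loop above.
build_payload() {
  local pn="$1" kd="$2"
  printf 'first=true&pn=%s&kd=%s' "$pn" "$(urlencode "$kd")"
}

build_payload 1 "C++"; echo   # prints first=true&pn=1&kd=C%2B%2B
build_payload 1 "iOS"; echo   # prints first=true&pn=1&kd=iOS
```

In the loop, the result could then be passed as --data "$(build_payload 1 "$kd")".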
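
How do we fetch in bulk?

The curl script currently fetches a single page per run. To fetch multiple pages, add another for loop. From what I observed, Lagou serves valid data for at most roughly 100 pages, so the loop runs from 1 to 100 and saves each result as $kd\_$pn.json (the backslash keeps the shell from reading $kd_ as a variable name):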

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json
done  
done
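
Instead of always requesting 100 pages, the page count could be estimated from the totalCount and pageSize fields of the first response. The sketch below is only that arithmetic: a ceiling division, capped at the roughly 100 pages observed above. The function name page_count is hypothetical, and the numbers 974 and 15 are taken from the sample response.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of pages needed for `total` results at `size`
# results per page, capped at `cap` pages (default 100, the observed limit).
page_count() {
  local total="$1" size="$2" cap="${3:-100}"
  local pages=$(( (total + size - 1) / size ))   # integer ceiling of total/size
  if (( pages > cap )); then
    pages=$cap
  fi
  printf '%s\n' "$pages"
}

page_count 974 15    # prints 65
page_count 5000 15   # prints 100 (capped)
```

The outer loop could then run up to $(page_count ...) rather than a fixed 100, which avoids requesting pages that are known to be empty.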
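
How do we speed things up?

If you ran the script above, you will have noticed that it seems rather slow. That is because the curl requests run synchronously: each download must finish before the next line runs. Appending & runs the requests in the background so that several are in flight at once, which speeds things up. One more thing to note: a single machine can only open a limited number of curl connections at the same time, so to avoid unnecessary interruptions I added a very short sleep. The improved code follows:

Note: this code may get your IP blocked by Lagou. Unless you are in a real hurry, use it with caution; of course, you could also rotate through several IPs.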

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json &
sleep 0.02
done  
done

Note: if the script hangs, press ctrl + c to exit; if it keeps getting interrupted abnormally, try increasing the value after sleep.
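
An alternative to tuning the sleep value is to cap the number of background jobs explicitly and call the shell's wait builtin after each batch. This is a different technique from the sleep-based script above, shown only as a sketch. fetch_one is a hypothetical stand-in for the long curl command; here it just writes a marker file so the example runs without network access, and the page and keyword lists are shortened.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the curl command: writes a marker file only.
fetch_one() {
  local kd="$1" pn="$2" dir="$3"
  printf '%s %s\n' "$kd" "$pn" > "$dir/${kd}_${pn}.json"
}

# Start jobs in the background, but never more than max_jobs per batch.
run_batched() {
  local dir="$1" max_jobs="$2" running=0 pn kd
  for (( pn = 1; pn <= 3; pn++ )); do
    for kd in "Java" "iOS"; do
      fetch_one "$kd" "$pn" "$dir" &
      running=$(( running + 1 ))
      if (( running >= max_jobs )); then
        wait          # block until every job in this batch has finished
        running=0
      fi
    done
  done
  wait                # also wait for the final, possibly partial, batch
}

out_dir=$(mktemp -d)
run_batched "$out_dir" 4
ls "$out_dir"         # lists the 6 marker files (3 pages x 2 keywords)
```

The final wait also guarantees that all files are complete when the script returns, which the sleep-based version does not.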

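A more complete script

Here the data goes into a separate jobs directory to keep things organized. The complete data can be downloaded from the GitHub project for this series: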

mkdir -p jobs
for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl "http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false" -H "Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c" -H "Origin: http://www.lagou.com" -H "X-Anit-Forge-Code: 0" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.8,en;q=0.6" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded; charset=UTF-8" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Cache-Control: max-age=0" -H "X-Requested-With: XMLHttpRequest" -H "Connection: keep-alive" -H "X-Anit-Forge-Token: None" -H "Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC" --data "first=true&pn=$pn&kd=$kd" --compressed > jobs/$kd\_$pn.json &
sleep 0.02
done  
done

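You may also find that some positions do not have 100 pages of valid data. Does that data need extra handling? Of course not. A basic feature of big-data tools such as Spark is a reasonable degree of tolerance for faulty records in a dataset. A few abnormal records generally do not affect the import itself; after importing, you can analyze the data directly. That is for later; subsequent articles in this series cover it separately.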


GitHub repository for this series: https://github.com/ios122/spark_lagou


When reposting, please cite this article's address: https://www.ucloud.cn/yun/8215.html
