
Crawl Facebook user basic information and photos

arashicage


Preface

After I finished the Twitter crawler, the company asked me to crawl Facebook as well, and my heart immediately sank: these social networks all have solid anti-crawling mechanisms. But since the task came down from above, I could only grit my teeth and get on with it. By capturing the traffic I found that the data POSTed when logging in through m.facebook.com is much simpler than on facebook.com, so I wrote a Scrapy crawler for Facebook around the mobile site.

Simulating the login
from scrapy import Spider
from scrapy.http import Request, FormRequest


class FacebookLogin(Spider):
    # Base spider: logs in to m.facebook.com, then hands control to after_login().
    download_delay = 0.5

    usr = "××××"  # your username / email / phone number
    pwd = "××××"  # account password

    def start_requests(self):
        return [Request("https://m.facebook.com/", callback=self.parse)]

    def parse(self, response):
        # Fill in and submit the login form found on the mobile home page.
        return FormRequest.from_response(response,
                                         formdata={
                                             "email": self.usr,
                                             "pass": self.pwd
                                         }, callback=self.remember_browser)

    def remember_browser(self, response):
        # A "remember browser" checkpoint may follow the login:
        # if re.search(r"(checkpoint)", response.url):
        # Use "save_device" instead of "dont_save" to save the device.
        return FormRequest.from_response(response,
                                         formdata={"name_action_selected": "dont_save"},
                                         callback=self.after_login)

    def after_login(self, response):
        pass
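
FacebookLogin deliberately leaves after_login() empty: it only takes care of authentication. The two spiders below inherit from it and implement after_login() as their real entry point, so the login flow never has to be repeated.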

Note: to be on the safe side, add a mobile USER_AGENT in the settings file, so that requests keep hitting the simpler mobile version of the site.
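
A minimal sketch of what that could look like in settings.py; the UA string below is only an example of a mobile identifier, and DOWNLOAD_DELAY / COOKIES_ENABLED merely restate behaviour the spiders already rely on:

# settings.py -- a minimal sketch, the exact UA string is just an example
USER_AGENT = ("Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) "
              "AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E277 Safari/602.1")

DOWNLOAD_DELAY = 0.5    # stay slow; the spiders also set download_delay themselves
COOKIES_ENABLED = True  # the logged-in session lives in cookies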

Crawling users' basic information
# -*- coding: UTF-8 -*-
import re
from urlparse import urljoin

from scrapy import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector

from facebook_login import FacebookLogin


class FacebookItems(Item):
    id = Field()
    url = Field()
    name = Field()
    work = Field()
    education = Field()
    family = Field()
    skills = Field()
    address = Field()
    contact_info = Field()
    basic_info = Field()
    bio = Field()
    quote = Field()
    nicknames = Field()
    relationship = Field()
    image_urls = Field()

class FacebookProfile(FacebookLogin):
    download_delay = 2
    name = "fb"
    links = None
    start_ids = [
        "plok74122", "bear.black.12", "tabaco.wang", "chaolin.chang.q", "ahsien.liu",
        "kaiwen.cheng.100", "liang.kevin.92", "bingheng.tsai.9", "psppupu",
        "cscgbakery", "hc.shiao.l", "asusisbad", "benjamin", "franklin",
        # "RobertScoble"
        # "https://m.facebook.com/tabaco.wang?v=info", "https://m.facebook.com/RobertScoble?v=info"
    ]

    def after_login(self, response):
        for id in self.start_ids:
            url = "https://m.facebook.com/%s?v=info" % id
            yield Request(url, callback=self.parse_profile, meta={"id": id})

    def parse_profile(self, response):
        item = FacebookItems()

        item["id"] = response.meta["id"]
        item["url"] = response.url
        item["name"] = "".join(response.css("#root strong *::text").extract())

        item["work"] = self.parse_info_has_image(response, response.css("#work"))
        item["education"] = self.parse_info_has_image(response, response.css("#education"))
        item["family"] = self.parse_info_has_image(response, response.css("#family"))

        item["address"] = self.parse_info_has_table(response.css("#living"))
        item["contact_info"] = self.parse_info_has_table(response.css("#contact-info"))
        item["basic_info"] = self.parse_info_has_table(response.css("#basic-info"))
        item["nicknames"] = self.parse_info_has_table(response.css("#nicknames"))

        item["skills"] = self.parse_info_text_only(response.css("#skills"))
        item["bio"] = self.parse_info_text_only(response.css("#bio"))
        item["quote"] = self.parse_info_text_only(response.css("#quote"))
        item["relationship"] = self.parse_info_text_only(response.css("#relationship"))

        yield item


    def parse_info_has_image(self, response, css_path):
        info_list = []
        for div in css_path.xpath("div/div[2]/div"):
            url = urljoin(response.url, "".join(div.css("div > a::attr(href)").extract()))
            title = "".join(div.css("div").xpath("span | h3").xpath("a/text()").extract())
            info = "
".join(div.css("div").xpath("span | h3").xpath("text()").extract())
            if url and title and info:
                info_list.append({"url": url, "title": title, "info": info})
        return info_list

    def parse_info_has_table(self, css_path):
        info_dict = {}
        for div in css_path.xpath("div/div[2]/div"):
            key = "".join(div.css("td:first-child div").xpath("span | span/span[1]").xpath("text()").extract())
            value = "".join(div.css("td:last-child").xpath("div//text()").extract()).strip()
            if key and value:
                if key in info_dict:
                    info_dict[key] += ", %s" % value
                else:
                    info_dict[key] = value
        return info_dict

    def parse_info_text_only(self, css_path):
        text = css_path.xpath("div/div[2]//text()").extract()
        text = [t.strip() for t in text]
        text = [t for t in text if re.search(r"\w+", t) and t != "Edit"]
        return "\n".join(text)
Crawling all of a user's photos

Although photos are shown on https://m.facebook.com/%s?v=info, the real image URLs only become available after several more requests. Following the principle of keeping each spider's job as small as possible, photo crawling was written as a separate spider, shown below:

# -*- coding: UTF-8 -*-
import hashlib
import sys

from scrapy import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector

from facebook_login import FacebookLogin

reload(sys)
sys.setdefaultencoding("utf-8")


class FacebookPhotoItems(Item):
    url = Field()
    id = Field()
    photo_links = Field()
    md5 = Field()


class CrawlPhoto(FacebookLogin):
    name = "fbphoto"
    timeline_photo = None
    id = None
    links = []
    start_ids = [
        "plok74122", "bear.black.12", "tabaco.wang", "chaolin.chang.q",
        # "ashien.liu",
        "liang.kevin.92","qia.chen",
        "bingheng.tsai.9", "psppupu",
        "cscgbakery", "hc.shiao.l", "asusisbad", "benjamin", "franklin",
        # "RobertScoble"
    ]

    def after_login(self, response):
        for url in self.start_ids:
            yield Request("https://m.facebook.com/%s/photos" % url, callback=self.parse_item, meta={"id": url})

    def parse_item(self, response):
        # Every <span> on the /photos page is a candidate link to an album
        # (the interesting one is the "Timeline Photos" album).
        urls = response.xpath("//span").extract()
        next_page = None
        try:
            next_page = response.xpath('//div[@class="co"]/a/@href').extract()[0].strip()
        except IndexError:
            pass
        for i in urls:
            try:
                self.timeline_photo = Selector(text=i).xpath("//span/a/@href").extract()[0]
                if self.timeline_photo is not None:
                    yield Request("https://m.facebook.com/%s" % self.timeline_photo, callback=self.parse_photos, meta=response.meta)
            except IndexError:
                continue
        if next_page:
            print "-----------------------next image page -----------------------------------------"
            yield Request("https://m.facebook.com/%s" % next_page, callback=self.parse_item, meta=response.meta)
    def parse_photos(self, response):
        urls = response.xpath('//a[@class="bw bx"]/@href').extract()
        for i in urls:
            yield Request("https://m.facebook.com/%s" % i, callback=self.process_photo_url, meta=response.meta)
        # The mobile album shows 12 thumbnails per page; a full page means there
        # is probably a "See More Photos" link pointing to the next page.
        if len(urls) == 12:
            next_page = response.xpath('//div[@id="m_more_item"]/a/@href').extract()[0]
            yield Request("https://m.facebook.com/%s" % next_page, callback=self.parse_photos, meta=response.meta)

    def process_photo_url(self, response):
        # The photo page embeds the full-size image in a centered <div>.
        item = FacebookPhotoItems()
        item["url"] = response.url
        item["id"] = response.meta["id"]
        photo_url = response.xpath('//div[@style="text-align:center;"]/img/@src').extract()[0]
        item["photo_links"] = photo_url
        item["md5"] = self.getstr_md5(item["photo_links"]) + ".jpg"
        yield item

    def writefile(self, content):
        # Debug helper: dump a response body to disk for inspection.
        with open("temp2.html", "w") as f:
            f.write(content)
            f.write("\n")

    def getstr_md5(self, input):
        # Hash the photo URL to get a stable, unique file name.
        if input is None:
            input = ""
        md = hashlib.md5()
        md.update(input)
        return md.hexdigest()
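
Like the profile spider, this one is started with scrapy crawl fbphoto. Each FacebookPhotoItems record carries the direct image URL plus an md5-based file name, which the download pipeline below uses when writing to disk.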

Since my Python skills are fairly self-taught, I have not yet found a good way to fold the photo-link crawling into the basic-information spider; if anyone knows how, please share.
Image downloading does not use Scrapy's ImagesPipeline but the wget command instead, for the same reason: my Python is still too weak...
Below is the image-download pipeline I wrote myself:

import os


class MyOwenImageDownload(object):
    def process_item(self, item, spider):
        # Profile items carry many fields, photo items only four, so anything
        # with more than six fields is passed through without downloading.
        if len(item) > 6:
            pass
        else:
            folder = "image/" + item["id"]
            if not os.path.exists(folder):
                os.makedirs(folder)
            cmd = 'wget "%s" -O %s -P %s --timeout=10 -q' % (item["photo_links"], folder + "/" + item["md5"], folder)
            os.system(cmd)
        return item
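
For the pipeline to run it has to be registered in the project settings. The snippet below is a sketch under the assumption that the class lives in a module called pipelines.py (the module path is not shown in the original post); the commented lines show how Scrapy's built-in ImagesPipeline could replace the wget approach, since it downloads everything listed in an item's image_urls field.

# settings.py -- a sketch; the module path "pipelines" is an assumption
ITEM_PIPELINES = {
    "pipelines.MyOwenImageDownload": 300,
    # Alternative: let Scrapy handle the downloads itself. The item would then
    # need a list field named image_urls (FacebookItems already defines one).
    # "scrapy.pipelines.images.ImagesPipeline": 1,
}
# IMAGES_STORE = "image"  # required only if ImagesPipeline is enabled
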
Conclusion

At this point, the basic structure of the whole crawler is complete... source code link

In the end, we will remember not the words of our enemies but the silence of our friends
