Learning the Python Scrapy Crawler Framework

Posted by harriszh on 2019-07-31 11:00 / 1694 views

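Summary: The engine controls the data flow between all components of the system and triggers events when certain actions occur. The downloader fetches pages and hands them to the engine, which passes them on to the spiders. Downloader middlewares are hooks between the engine and the downloader that process what passes between the two.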

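Scrapy is an application framework, written in Python, for crawling websites and extracting structured data.

I. Overview of the Scrapy framework

Scrapy can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.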

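II. Architecture

Scrapy's architecture consists of the components described below and the data flow between them (drawn as green arrows in the official architecture diagram). Each component is introduced briefly, followed by a description of the data flow.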

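1. Components

Scrapy Engine

The engine controls the data flow between all components of the system and triggers events when certain actions occur. See the Data flow section below for details.

Scheduler

The scheduler receives requests from the engine and enqueues them, so that it can hand them back when the engine asks for them later.

Downloader

The downloader fetches pages and hands them to the engine, which then passes them on to the spiders.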

Spiders

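Spiders are classes written by Scrapy users to parse responses and extract items (the scraped data) or additional URLs to follow. Each spider handles one specific website (or a few). For more, see the Spiders topic in the Scrapy documentation.

Item Pipeline

The Item Pipeline processes the items extracted by the spiders. Typical tasks are cleansing, validation, and persistence (for example, storing the item in a database). For more, see the Item Pipeline topic in the Scrapy documentation.

Downloader middlewares

Downloader middlewares are specific hooks that sit between the engine and the downloader. They process requests on their way from the engine to the downloader and responses on their way from the downloader back to the engine. They provide a convenient mechanism for extending Scrapy by plugging in custom code. For more, see the Downloader Middleware topic in the Scrapy documentation.

Spider middlewares

Spider middlewares are specific hooks that sit between the engine and the spiders. They process the spiders' input (responses) and output (items and requests). They provide a convenient mechanism for extending Scrapy by plugging in custom code. For more, see the Spider Middleware topic in the Scrapy documentation.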
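
As a rough illustration of the two directions a downloader middleware sees, here is a minimal sketch. The class name StatsDownloaderMiddleware and the X-Crawled-By header are made up for this example, and the request, response, and spider objects are simple stand-ins so the code runs without Scrapy installed. The method names process_request and process_response are the hooks Scrapy calls on downloader middlewares.

```python
from collections import Counter
from types import SimpleNamespace


class StatsDownloaderMiddleware:
    """Hypothetical middleware: tags outgoing requests, counts response statuses."""

    def __init__(self):
        self.status_counts = Counter()

    def process_request(self, request, spider):
        # Request direction (engine -> downloader): add a header if it is missing.
        request.headers.setdefault("X-Crawled-By", spider.name)
        return None  # None tells Scrapy to continue handling this request

    def process_response(self, request, response, spider):
        # Response direction (downloader -> engine): record the status code.
        self.status_counts[response.status] += 1
        return response  # pass the response on unchanged


# Stand-in objects (assumptions for this sketch, not Scrapy classes).
spider = SimpleNamespace(name="demo")
request = SimpleNamespace(url="http://example.com/", headers={})
response = SimpleNamespace(status=200)

mw = StatsDownloaderMiddleware()
mw.process_request(request, spider)
returned = mw.process_response(request, response, spider)
```

In a real project such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.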

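2. Data flow

The data flow in Scrapy is controlled by the execution engine and goes like this:

1. The engine opens a website (open a domain), locates the spider that handles it, and asks that spider for the first URL(s) to crawl.

2. The engine gets the first URLs to crawl from the spider and schedules them in the scheduler as Requests.

3. The engine asks the scheduler for the next URL to crawl.

4. The scheduler returns the next URL to the engine, which forwards it to the downloader through the downloader middlewares (request direction).

5. Once the page has been downloaded, the downloader generates a Response for it and sends it to the engine through the downloader middlewares (response direction).

6. The engine receives the Response from the downloader and sends it to the spider through the spider middlewares (input direction).

7. The spider processes the Response and returns the scraped Items and any new Requests (to follow) to the engine.

8. The engine sends the scraped Items to the Item Pipeline and the new Requests to the scheduler.

9. The process repeats (from step 2) until the scheduler has no more requests, and the engine closes the website.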
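
The loop described above can be modelled in a few lines of plain Python. This is only a toy model under simplifying assumptions (a single thread, no middlewares, a dictionary standing in for the website), not how Scrapy itself is implemented. The names run_engine, SITE, and fake_parse are invented for the sketch.

```python
from collections import deque


def run_engine(start_urls, download, parse, pipeline):
    """Toy data flow: scheduler queue -> downloader -> spider -> pipeline."""
    scheduler = deque(start_urls)   # the first URLs are scheduled
    seen = set(start_urls)
    while scheduler:                # stop once the scheduler is empty
        url = scheduler.popleft()   # the engine asks for the next URL
        response = download(url)    # the downloader produces a response
        for result in parse(url, response):  # the spider yields results
            if isinstance(result, dict):
                pipeline(result)    # items go to the item pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)     # new requests go back to the scheduler


# A fake three-page site: each URL maps to (title, outgoing links).
SITE = {
    "/a": ("Page A", ["/b", "/c"]),
    "/b": ("Page B", ["/c"]),
    "/c": ("Page C", []),
}


def fake_parse(url, response):
    title, links = response
    yield {"url": url, "title": title}  # one item per page
    yield from links                    # plus the links to follow


items = []
run_engine(["/a"], SITE.get, fake_parse, items.append)
```

After the run, items holds one dictionary per page, and each page was downloaded once even though "/c" is linked twice.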

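3. Event-driven networking

Scrapy is written on top of Twisted, an event-driven networking framework. For the sake of concurrency, Scrapy is therefore implemented with non-blocking (that is, asynchronous) code.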
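
To give a feel for what non-blocking means, the sketch below uses Python's standard asyncio module rather than Twisted, so it illustrates the idea only and not Scrapy's actual code. The fetch and crawl functions are hypothetical, and asyncio.sleep stands in for waiting on the network.

```python
import asyncio


async def fetch(url, delay):
    # asyncio.sleep simulates network latency without blocking the thread.
    await asyncio.sleep(delay)
    return f"response from {url}"


async def crawl(urls):
    # All simulated downloads wait concurrently on a single thread.
    return await asyncio.gather(*(fetch(u, 0.05) for u in urls))


results = asyncio.run(crawl(["/a", "/b", "/c"]))
```

The three waits overlap instead of running one after another, which is the property that lets a single-threaded crawler keep many requests in flight.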


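III. Building a crawler in four steps

Create the project (scrapy startproject xxx): set up a new crawler project

Define the target (edit items.py): declare the data you want to scrape

Write the spider (spiders/xxspider.py): write the spider that crawls the pages

Store the results (pipelines.py): design a pipeline that stores the scraped data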

IV. Installing the framework

Here we install with conda:

conda install scrapy

Or install with pip:

pip install scrapy

Check the installation:

➜  spider scrapy -h
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

1. Create the project

➜  spider scrapy startproject SF
New Scrapy project "SF", using template directory "/Users/kaiyiwang/anaconda2/lib/python2.7/site-packages/scrapy/templates/project", created in:
    /Users/kaiyiwang/Code/python/spider/SF

You can start your first spider with:
    cd SF
    scrapy genspider example example.com
➜  spider

The tree command shows the project structure:

➜  SF tree
.
├── SF
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2. Generate a spider template in the spiders directory

➜  spiders scrapy genspider sf "https://segmentfault.com"
Created spider "sf" using template "basic" in module:
  SF.spiders.sf
➜  spiders

This generates the spider file sf.py. With the parse logic filled in, it looks like this:

# -*- coding: utf-8 -*-
import scrapy
from SF.items import SfItem  # SfItem must declare a "title" field in SF/items.py


class SfSpider(scrapy.Spider):
    name = "sf"
    # allowed_domains takes bare domain names, without the URL scheme
    allowed_domains = ["segmentfault.com"]
    start_urls = ["https://segmentfault.com/"]

    def parse(self, response):
        # Select every <h2 class="title"> node on the page
        node_list = response.xpath('//h2[@class="title"]')

        for node in node_list:
            # Create an item object to hold the extracted data
            item = SfItem()
            # .extract() converts the selected nodes into a list of Unicode strings
            title = node.xpath("./a/text()").extract()

            item["title"] = title[0]

            # Hand the item to the item pipeline, then come back and
            # continue with the rest of the loop
            yield item
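
The difference between yield and return that the spider relies on can be checked with plain Python, outside of Scrapy. The two functions below are illustrative only and use an ordinary list of titles in place of a response.

```python
def parse_with_yield(titles):
    for title in titles:
        # Execution pauses here, hands one item to the caller, then resumes.
        yield {"title": title}


def parse_with_return(titles):
    for title in titles:
        # The function exits on the first iteration; later titles are lost.
        return {"title": title}


titles = ["first", "second", "third"]
yielded = list(parse_with_yield(titles))
returned = parse_with_return(titles)
```

The generator version produces one item per title, while the return version stops after the first one. This is why parse() yields items instead of returning them.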

Commands:

# Run the spider's contract checks; sf is the spider name
➜  scrapy check sf

# Run the spider
➜  scrapy crawl sf

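3. Item pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, whose components process it in the order in which they are defined.

Each item pipeline component is a Python class that implements a simple method and decides, for example, whether an item is dropped or stored. Typical uses of an item pipeline are:

Validating scraped data (checking that items contain certain fields, such as a name field)

Checking for duplicates (and dropping them)

Saving the scraped results to a file or a database (data persistence)

Writing an item pipeline
Writing an item pipeline is simple: a pipeline component is a standalone Python class that must implement the process_item() method. The method either returns the item or raises DropItem to discard it.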

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item["price"]:
            if item["price_excludes_vat"]:
                item["price"] = item["price"] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
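
The duplicate check mentioned in the list of typical uses could look roughly like the sketch below. It assumes every item has an "id" field, which is an assumption of this example and not something the article's items define. The fallback DropItem class exists only so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Hypothetical pipeline that drops items whose "id" was already seen."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item["id"] in self.seen_ids:
            raise DropItem("Duplicate item found: %r" % item)
        self.seen_ids.add(item["id"])
        return item


# Drive the pipeline by hand with plain dicts standing in for items.
pipeline = DuplicatesPipeline()
kept = []
for item in [{"id": 1}, {"id": 2}, {"id": 1}]:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass  # a dropped item is not passed to later pipeline components
```

Only the first occurrence of each id survives, so kept ends up with two items.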

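4. Selectors

When scraping web pages, the most common task is extracting data from the HTML source.
Selector has four basic methods, of which xpath() is the most commonly used:

xpath(): takes an XPath expression and returns a selector list of all nodes matching it.

extract(): serializes the matched nodes to Unicode strings and returns them as a list.

css(): takes a CSS expression and returns a selector list of all nodes matching it; the syntax is the same as in BeautifulSoup4.

re(): extracts data using the given regular expression and returns a list of Unicode strings.

Scrapy has its own mechanism for extracting data. It is called selectors because they "select" certain parts of the HTML document, specified by XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with specific HTML elements.

Scrapy selectors are built on the lxml library, which means they are very similar to it in speed and parsing accuracy.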

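Examples of XPath expressions:

/html/head/title: selects the title element inside the head of the HTML document
/html/head/title/text(): selects the text of the <title> element mentioned above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"</pre>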
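<p>To try these expressions without a Scrapy project, the sketch below evaluates close equivalents with Python's standard xml.etree.ElementTree module. This is a stand-in for Scrapy selectors, not the same API: ElementTree supports only a subset of XPath, paths are relative to the root element, and text is read from the .text attribute instead of text(). The sample document is invented for the example.</p>

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <head><title>Example page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>a</td><td>b</td></tr></table>
  </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# Equivalent of /html/head/title/text()
title_text = root.find("head/title").text

# Equivalent of //td: every <td> element anywhere in the document
cells = [td.text for td in root.findall(".//td")]

# Equivalent of //div[@class="mine"]
mine = [div.text for div in root.findall(".//div[@class='mine']")]
```

<p>In a spider, the same selections would be written as response.xpath(...) calls followed by .extract().</p>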
<b>V. Scraping job postings</b>
<b>1. Scraping Tencent job postings</b>
<p>URL to scrape: http://hr.tencent.com/positio...</p>
<b>1.1 Create the project</b>
<pre>> scrapy startproject Tencent

You can start your first spider with:
    cd Tencent
    scrapy genspider example example.com</pre>
<p><img src="https://segmentfault.com/img/bVZA6N?w=452&h=200" alt="Screenshot"></p>
<p>The page elements to scrape:</p>
<p><img src="https://segmentfault.com/img/bVZA6V?w=845&h=572" alt="Page elements to scrape"></p>
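<p>We need to scrape the following fields:<br>Position name: positionName<br>Position link: positionLink<br>Position category: positionType<br>Number of openings: positionNumber<br>Work location: workLocation<br>Publication date: publishTime</p>
<p>Define the fields to scrape in <b>items.py</b>:</p>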
<pre># -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# Field definitions
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Position name
    positionName = scrapy.Field()

    # Link to the position's detail page
    positionLink = scrapy.Field()

    # Position category
    positionType = scrapy.Field()

    # Number of openings
    positionNumber = scrapy.Field()

    # Work location
    workLocation = scrapy.Field()

    # Publication date
    publishTime = scrapy.Field()
</pre>
<b>1.2 Write the spider</b>
<p>Generate it with the command:</p>
<pre>➜  Tencent scrapy genspider tencent "tencent.com"
Created spider "tencent" using template "basic" in module:
  Tencent.spiders.tencent</pre>
<p>The generated spider is at <b>spiders/tencent.py</b> under the current directory:</p>
<pre>➜  Tencent tree
.
├── __init__.py
├── __init__.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
    ├── __init__.py
    ├── __init__.pyc
    └── tencent.py</pre>
<p>Here is the generated initial file <b>tencent.py</b>:</p>
<pre># -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    start_urls = ["http://tencent.com/"]

    def parse(self, response):
        pass
</pre>
<p>Modify the initial <b>tencent.py</b>:</p>
<pre># -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem

class TencentSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    baseURL = "http://hr.tencent.com/position.php?&start="
    offset = 0  # pagination offset
    start_urls = [baseURL + str(offset)]

    def parse(self, response):

        # Select the table rows that hold the postings
        # node_list = response.xpath('//tr[@class="even"] or //tr[@class="odd"]')
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')

        for node in node_list:
            item = TencentItem()   # item class defined in items.py

            # Take the first element [0] of each extracted list and encode the
            # Unicode string as UTF-8 (Python 2.7, as used here; on Python 3 omit .encode)
            item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item["positionLink"] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")         # href attribute
            item["positionType"] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            item["positionNumber"] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item["workLocation"] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item["publishTime"] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # Hand the item to the pipeline
            yield item

        # Keep requesting the next page until the offset reaches 2000
        # (the offset advances by 10 per page)
        if self.offset < 2000:
            self.offset += 10
            # str() is needed: concatenating str and int raises TypeError
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
</pre>
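<p>The URL arithmetic used for paging can be checked on its own. The helper below is a hypothetical sketch (page_urls is not part of the article's spider) that builds the same sequence of listing URLs and shows why the offset has to be converted with str().</p>

```python
def page_urls(base_url, step, limit):
    """Yield one listing URL per page, for offsets 0, step, 2*step, ... up to limit."""
    for offset in range(0, limit + 1, step):
        # str() is required: base_url + offset would raise TypeError.
        yield base_url + str(offset)


BASE_URL = "http://hr.tencent.com/position.php?&start="
urls = list(page_urls(BASE_URL, step=10, limit=30))
```

<p>With step=10 and limit=30 this yields four URLs, ending in start=0, start=10, start=20, and start=30.</p>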
<p>Write the pipeline file <b>pipelines.py</b>:</p>
<pre># -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w")

    # Every item passes through this pipeline
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii = False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()

</pre>
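<p>Because each record is written with a trailing comma, the output file as a whole is not a valid JSON document. One possible alternative, sketched here for Python 3, is to write one JSON object per line (the JSON Lines convention). The class name JsonLinesPipeline and the file name tencent.jl are made up for this example. open_spider and close_spider are the hooks Scrapy calls when a spider starts and finishes.</p>

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of TencentPipeline: one JSON object per line."""

    def __init__(self, path="tencent.jl"):
        self.path = path

    def open_spider(self, spider):
        # Called when the spider starts.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.f.close()


# Exercise the pipeline by hand with plain dicts standing in for items.
path = os.path.join(tempfile.mkdtemp(), "tencent.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"positionName": "测试工程师", "workLocation": "北京"}, spider=None)
pipeline.process_item({"positionName": "PHP developer", "workLocation": "上海"}, spider=None)
pipeline.close_spider(spider=None)

# Each line can be parsed independently when reading the file back.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```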
<p>Once the pipeline is written, enable it in <b>settings.py</b>:</p>
<pre># Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
}</pre>
<p>Run it:</p>
<pre>> scrapy crawl tencent

File "/Users/kaiyiwang/Code/python/spider/Tencent/Tencent/spiders/tencent.py", line 21, in parse
    item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
IndexError: list index out of range</pre>
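<p>The node selection was the problem. In XPath, <b>or</b> is a boolean operator; the union of two node sets is written with <b>|</b>:</p>
<pre>        # Select the table rows that hold the postings
        # node_list = response.xpath('//tr[@class="even"] or //tr[@class="odd"]')
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
</pre>
<p>Then run the command again:</p>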
<pre>> scrapy crawl tencent</pre>
<p>The resulting file <b>tencent.json</b>:</p>
<pre>{"positionName": "23673-财经运营中心热点运营组编辑", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32718&keywords=&tid=0&lid=0", "positionType": "内容编辑类", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "MIG03-腾讯地图高级算法评测工程师（北京）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=30276&keywords=&tid=0&lid=0", "positionType": "技术类", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "MIG10-微回收渠道产品运营经理（深圳）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32720&keywords=&tid=0&lid=0", "positionType": "产品/项目类", "workLocation": "深圳", "positionNumber": "1"},
{"positionName": "MIG03-iOS测试开发工程师（北京）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32715&keywords=&tid=0&lid=0", "positionType": "技术类", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "19332-高级PHP开发工程师（上海）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=31967&keywords=&tid=0&lid=0", "positionType": "技术类", "workLocation": "上海", "positionNumber": "2"}</pre>
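<p>A file in this shape, with one object per line and a comma after each, can still be loaded by wrapping its contents in a JSON array. The sketch below uses a short inline sample in place of the real file, and load_records is a hypothetical helper name.</p>

```python
import json

# Two records in the format written by the pipeline ("{...}," per line).
raw = (
    '{"positionName": "A", "positionNumber": "1"},\n'
    '{"positionName": "B", "positionNumber": "2"},\n'
)


def load_records(text):
    """Parse comma-terminated JSON objects by wrapping them in a JSON array."""
    body = text.strip().rstrip(",")  # drop the comma after the last record
    return json.loads("[" + body + "]")


records = load_records(raw)
total_openings = sum(int(r["positionNumber"]) for r in records)
```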
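<b>1.3 Crawling by following the next page</b>
<p>Above we fetched each page using a hard-coded limit, but the data changes every day, so the number of pages to crawl should not be hard-coded. How do we tell when all the data has been crawled? It is simple: follow the <b>next page</b> link, and when there is no next page, everything has been crawled.</p>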
<p><img src="https://segmentfault.com/img/bVZBru?w=837&h=258" alt="Screenshot"></p>
<p>Using <b>next page</b>, let us look at what the last page looks like:</p>
<p><img src="https://segmentfault.com/img/bVZBr3?w=752&h=270" alt="Pagination on the last page"></p>
<p>On the last page the next-page button is greyed out and its link carries the <b>class="noactive"</b> attribute. We can use this to detect the last page.</p>
<pre>        # Hard-coded limit: request pages until the offset reaches 100
        # if self.offset < 100:
        #     self.offset += 10
        #     url = self.baseURL + str(self.offset)
        #     yield scrapy.Request(url, callback=self.parse)

        # Follow the next-page link instead
        if len(response.xpath('//a[@class="noactive" and @id="next"]')) == 0:
            url = response.xpath('//a[@id="next"]/@href').extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)</pre>
<p>The revised <b>tencent.py</b>:</p>
<pre># -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem

class TencentSpider(scrapy.Spider):
    # Spider name
    name = "tencent"
    # Domains the spider is allowed to crawl
    allowed_domains = ["tencent.com"]
    # Base URL to which the offset is appended
    baseURL = "http://hr.tencent.com/position.php?&start="
    # Pagination offset appended to baseURL
    offset = 0

    # URLs requested when the spider starts
    start_urls = [baseURL + str(offset)]

    # Handles each response
    def parse(self, response):

        # Select the rows that hold the postings
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')

        for node in node_list:

            # Build an item object to hold the data
            item = TencentItem()

            # Debug output: the raw extracted list
            print(node.xpath("./td[1]/a/text()").extract())

            # Take the first element [0] of each extracted list and encode the Unicode string as UTF-8
            item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item["positionLink"] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")         # href attribute

            # positionType may be empty, so check before indexing
            if len(node.xpath("./td[2]/text()")):
                item["positionType"] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            else:
                item["positionType"] = ""

            item["positionNumber"] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item["workLocation"] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item["publishTime"] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # yield hands the item to the pipeline and then resumes here;
            # return would exit the whole function
            yield item

        # Option 1: build the URL by concatenation. Use this when the page has no
        # link to follow and the next URL can only be constructed.
        # if self.offset < 100:
        #     self.offset += 10
        #     url = self.baseURL + str(self.offset)
        #     yield scrapy.Request(url, callback=self.parse)

        # Option 2: take the next link from the response and request it, until no
        # link is left (follow the next-page link)
        if len(response.xpath('//a[@class="noactive" and @id="next"]')) == 0:
            url = response.xpath('//a[@id="next"]/@href').extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)
</pre>
<p>OK, by following the next-page link we have crawled all of the job postings.</p>
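<p>Concatenating the host and the href works as long as the href is a relative path. A somewhat more robust way to build the next URL is to resolve the href against the current page URL, which is what urljoin from the standard library does. Scrapy responses also offer a urljoin method for the same purpose. In the sketch below the href values are assumed for illustration and were not taken from the live site.</p>

```python
from urllib.parse import urljoin

page_url = "http://hr.tencent.com/position.php?&start=10"

# A relative href (assumed shape of the "next" link) is resolved against the page URL.
relative = urljoin(page_url, "position.php?&start=20")

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, "http://hr.tencent.com/position.php?&start=30")
```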
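<b>1.4 Summary</b>
<p>Steps for building a crawler:</p>

<p>1. Create the project: scrapy startproject XXX</p>
<p>2. Generate the spider: scrapy genspider xxx "http://www.xxx.com"</p>
<p>3. Edit items.py to declare the data to extract</p>
<p>4. Edit <b>spiders/xxx.py</b>, the spider file, to handle requests and responses <strong>and to extract the data (yield item)</strong>
</p>
<p>5. Edit <b>pipelines.py</b>, the pipeline file, to process the items returned by the spider, for example persisting them locally to a file or to a database table.</p>
<p>6. Edit <b>settings.py</b> to enable the pipeline component in <b>ITEM_PIPELINES</b> and to adjust other related settings</p>
<p>7. Run the spider: <b>scrapy crawl xxx</b>
</p>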

<p>Sites being crawled often impose restrictions, so it helps to send request headers. Scrapy provides a convenient place to configure them: in <b>settings.py</b> we can enable:</p>
<pre>
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Tencent (+http://www.yourdomain.com)"
# Or send a browser user agent instead (this later assignment overrides the one above)
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/62.0.3202.94 Safari/537.36")


# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
   "Accept-Language": "en",
}</pre>
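<p>Scrapy is best suited to crawling static pages, where its performance is excellent. If all you need is dynamically loaded JSON data, it is more than the job requires.</p>
<hr>
<p>Related articles:</p>
<p>Scrapy入门教程 (Getting started with Scrapy)</p>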
                 </div>
            
                     <div class="mt-64 tags-seach" >
                 <div class="tags-info">
                                                                                                                    
                         <a style="width:120px;" title="GPU云服务器" href="https://www.ucloud.cn/site/product/gpu.html">GPU云服务器</a>
                                             
                         <a style="width:120px;" title="云服务器" href="https://www.ucloud.cn/site/active/kuaijiesale.html?ytag=seo">云服务器</a>
                                                                                                                                                 
                                      
                     
                    
                                                                                               <a style="width:120px;" title="python爬虫框架scrapy" href="https://www.ucloud.cn/yun/tag/pythonpachongkuangjiascrapy/">python爬虫框架scrapy</a>
                                                                                                           <a style="width:120px;" title="爬虫框架scrapy" href="https://www.ucloud.cn/yun/tag/pachongkuangjiascrapy/">爬虫框架scrapy</a>
                                                                                                           <a style="width:120px;" title="scrapy框架编写爬虫" href="https://www.ucloud.cn/yun/tag/scrapykuangjiabianxiepachong/">scrapy框架编写爬虫</a>
                                                                                                           <a style="width:120px;" title="python爬虫scrapy" href="https://www.ucloud.cn/yun/tag/pythonpachongscrapy/">python爬虫scrapy</a>
                                                         
                 </div>
               
              </div>
             
               <div class="entry-copyright mb-30">
                   <p class="mb-15"> 文章版权归作者所有，未经允许请勿转载,若此文章存在违规行为，您可以联系管理员删除。</p>
                 
                   <p>转载请注明本文地址：https://www.ucloud.cn/yun/44467.html</p>
               </div>
                      
               <ul class="pre-next-page">
                 
                                  <li class="ellipsis"><a class="hpf" href="https://www.ucloud.cn/yun/44466.html">上一篇：Python 面向对象编程指南 读书笔记</a></li>  
                                                
                                       <li class="ellipsis"><a class="hpf" href="https://www.ucloud.cn/yun/44468.html">下一篇：python读excel写入mysql小工具</a></li>
                                  </ul>
              </div>
              <div class="about_topicone-mid">
                <h3 class="top-com-title mb-0"><span data-id="0">相关文章</span></h3>
                <ul class="com_white-left-mid atricle-list-box">
                             
                                                                    <li>
                                                <div class="atricle-list-right">
                          <h2 class="ellipsis2"><a class="hpf" href="https://www.ucloud.cn/yun/44625.html"><b><em>Python</em><em>爬虫</em>之<em>Scrapy</em><em>学习</em>（基础篇）</b></a></h2>
                                                     <p class="ellipsis2 good">摘要：下载器下载器负责获取页面数据并提供给引擎，而后提供给。下载器中间件下载器中间件是在引擎及下载器之间的特定钩子，处理传递给引擎的。一旦页面下载完毕，下载器生成一个该页面的，并将其通过下载中间件返回方向发送给引擎。

作者：xiaoyu微信公众号：Python数据科学知乎：Python数据分析师

在爬虫的路上，学习scrapy是一个必不可少的环节。也许有好多朋友此时此刻也正在接触并学习sc...</p>
                                                   
                          <div class="com_white-left-info">
                                <div class="com_white-left-infol">
                                    <a href="https://www.ucloud.cn/yun/u-1360.html"><img src="https://www.ucloud.cn/yun/data/avatar/000/00/13/small_000001360.jpg" alt=""><span class="layui-hide64">pkhope</span></a>
                                    <time datetime="">2019-07-31 11:05</time>
                                    <span><i class="fa fa-commenting"></i>评论0</span> 
                                    <span><i class="fa fa-star"></i>收藏0</span> 
                                </div>
                          </div>
                      </div>
                    </li> 
                                                                                       <li>
                                                <div class="atricle-list-right">
                          <h2 class="ellipsis2"><a class="hpf" href="https://www.ucloud.cn/yun/38430.html"><b>零基础如何学<em>爬虫</em>技术</b></a></h2>
                                                     <p class="ellipsis2 good">摘要：楚江数据是专业的互联网数据技术服务，现整理出零基础如何学爬虫技术以供学习，。本文来源知乎作者路人甲链接楚江数据提供网站数据采集和爬虫软件定制开发服务，服务范围涵盖社交网络电子商务分类信息学术研究等。

楚江数据是专业的互联网数据技术服务，现整理出零基础如何学爬虫技术以供学习，http://www.chujiangdata.com。
第一：Python爬虫学习系列教程（来源于某博主：htt...</p>
                                                   
                          <div class="com_white-left-info">
                                <div class="com_white-left-infol">
                                    <a href="https://www.ucloud.cn/yun/u-128.html"><img src="https://www.ucloud.cn/yun/data/avatar/000/00/01/small_000000128.jpg" alt=""><span class="layui-hide64">KunMinX</span></a>
                                    <time datetime="">2019-07-25 11:29</time>
                                    <span><i class="fa fa-commenting"></i>评论0</span> 
                                    <span><i class="fa fa-star"></i>收藏0</span> 
                                </div>
                          </div>
                      </div>
                    </li> 
                                                                                       <li>
                                                <div class="atricle-list-right">
                          <h2 class="ellipsis2"><a class="hpf" href="https://www.ucloud.cn/yun/43405.html"><b><em>Python</em><em>爬虫</em><em>框架</em><em>scrapy</em>入门指引</b></a></h2>
                                                     <p class="ellipsis2 good">摘要：想爬点数据来玩玩，我想最方便的工具就是了。这框架把采集需要用到的功能全部封装好了，只要写写采集规则其他的就交给框架去处理，非常方便，没有之一，不接受反驳。首先，大概看下这门语言。如果文档看不懂的话，推荐看看这个教程爬虫教程

想爬点数据来玩玩， 我想最方便的工具就是Python scrapy了。 这框架把采集需要用到的功能全部封装好了，只要写写采集规则,其他的就交给框架去处理，非常方便，...</p>
                                                   
                          <div class="com_white-left-info">
                                <div class="com_white-left-infol">
                                    <a href="https://www.ucloud.cn/yun/u-61.html"><img src="https://www.ucloud.cn/yun/data/avatar/000/00/00/small_000000061.jpg" alt=""><span class="layui-hide64">孙淑建</span></a>
                                    <time datetime="">2019-07-31 10:11</time>
                                    <span><i class="fa fa-commenting"></i>评论0</span> 
                                    <span><i class="fa fa-star"></i>收藏0</span> 
                                </div>
                          </div>
                      </div>
                    </li> 
                                                                                       <li>
                                                <div class="atricle-list-right">
                          <h2 class="ellipsis2"><a class="hpf" href="https://www.ucloud.cn/yun/41386.html"><b><em>Scrapy</em> <em>框架</em>入门简介</b></a></h2>
                                                     <p class="ellipsis2 good">摘要：解析的方法，每个初始完成下载后将被调用，调用的时候传入从每一个传回的对象来作为唯一参数，主要作用如下负责解析返回的网页数据，提取结构化数据生成生成需要下一页的请求。

Scrapy 框架
Scrapy是用纯Python实现一个为了爬取网站数据、提取结构性数据而编写的应用框架，用途非常广泛。
框架的力量，用户只需要定制开发几个模块就可以轻松的实现一个爬虫，用来抓取网页内容以及各种图片，非常...</p>
                                                   
                          <div class="com_white-left-info">
                                <div class="com_white-left-infol">
                                    <a href="https://www.ucloud.cn/yun/u-1504.html"><img src="https://www.ucloud.cn/yun/data/avatar/000/00/15/small_000001504.jpg" alt=""><span class="layui-hide64">Coding01</span></a>
                                    <time datetime="">2019-07-30 15:39</time>
                                    <span><i class="fa fa-commenting"></i>评论0</span> 
                                    <span><i class="fa fa-star"></i>收藏0</span> 
                                </div>
                          </div>
                      </div>
                    </li> 
                                                                                                           
                </ul>
              </div>
              
               <div class="topicone-box-wangeditor">
                  
                  <h3 class="top-com-title mb-64"><span>发表评论</span></h3>
                   <div class="xcp-publish-main flex_box_zd">
                                      
                      <div class="unlogin-pinglun-box">
                        <a href="javascript:login()" class="grad">登陆后可评论</a>
                      </div>                   </div>
               </div>
              <div class="site-box-content">
                <div class="site-content-title">
                  <h3 class="top-com-title mb-64"><span>0条评论</span></h3>   
                </div> 
                      <div class="pages"></ul></div>
              </div>
           </div>
           <div class="layui-col-md4 layui-col-lg3 com_white-right site-wrap-right">
              <div class=""> 
                <div class="com_layuiright-box user-msgbox">
                    <a href="https://www.ucloud.cn/yun/u-227.html"><img src="https://www.ucloud.cn/yun/data/avatar/000/00/02/small_000000227.jpg" alt=""></a>
                    <h3><a href="https://www.ucloud.cn/yun/u-227.html" rel="nofollow">harriszh</a></h3>
                    <h6>男<span>|</span>高级讲师</h6>
                    <div class="flex_box_zd user-msgbox-atten">
                     
                                                                      <a href="javascript:attentto_user(227)" id="attenttouser_227" class="grad follow-btn notfollow attention">我要关注</a>
      
                                                                                        <a href="javascript:login()" title="发私信" >我要私信</a>
                     
                                            
                    </div>
                    <div class="user-msgbox-list flex_box_zd">
                          <h3 class="hpf">TA的文章</h3>
                          <a href="https://www.ucloud.cn/yun/ut-227.html" class="box_hxjz">阅读更多</a>
                    </div>
                      <ul class="user-msgbox-ul">
                                                  <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/116822.html">BUI Webapp用于项目中的一点小心得</a></h3>
                            <p>阅读 1751<span>·</span>2019-08-30 15:54</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/110130.html">前端面试题总结——综合问题(持续更新中)</a></h3>
                            <p>阅读 3402<span>·</span>2019-08-26 17:15</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/109592.html">在浏览器调起本地应用的方法</a></h3>
                            <p>阅读 3599<span>·</span>2019-08-26 13:49</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/109211.html">leetcode 链表相关题目解析</a></h3>
                            <p>阅读 2623<span>·</span>2019-08-26 13:38</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/108092.html">【刷算法】丑数</a></h3>
                            <p>阅读 2357<span>·</span>2019-08-26 12:08</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/106603.html">webstorm预览html配置localhost为本机ip地址</a></h3>
                            <p>阅读 3212<span>·</span>2019-08-26 10:41</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/106130.html">篮球即时比分api接口调用示例代码</a></h3>
                            <p>阅读 1415<span>·</span>2019-08-26 10:24</p></li>
                                                       <li><h3 class="ellipsis"><a href="https://www.ucloud.cn/yun/105498.html">Webpack包教不包会</a></h3>
                            <p>阅读 3428<span>·</span>2019-08-23 18:35</p></li>
                                                
                      </ul>
                </div>

                   <!-- 文章详情右侧广告-->
              
  <div class="com_layuiright-box">
                  <h6 class="top-com-title"><span>最新活动</span></h6> 
           
         <div class="com_adbox">
                    <div class="layui-carousel" id="right-item">
                      <div carousel-item>
                                                                                                                       <div>
                          <a href="https://www.ucloud.cn/site/active/kuaijiesale.html?ytag=seo"  rel="nofollow">
                            <img src="https://www.ucloud.cn/yun/data/attach/240625/2rTjEHmi.png" alt="云服务器">                                 
                          </a>
                        </div>
                                                <div>
                          <a href="https://www.ucloud.cn/site/product/gpu.html"  rel="nofollow">
                            <img src="https://www.ucloud.cn/yun/data/attach/240807/7NjZjdrd.png" alt="GPU云服务器">                                 
                          </a>
                        </div>
                                                                   
                    
                        
                      </div>
                    </div>
                      
                    </div>                    <!-- banner结束 -->
              
<div class="adhtml">

</div>
                <script>
                $(function(){
                    $.ajax({
                        type: "GET",
                                url:"https://www.ucloud.cn/yun/ad/getad/1.html",
                                cache: false,
                                success: function(text){
                                  $(".adhtml").html(text);
                                }
                        });
                    })
                </script>                </div>              </div>
           </div>
        </div>
      </div> 
    </section>
    <!-- wap拉出按钮 -->
     <div class="site-tree-mobile layui-hide">
      <i class="layui-icon layui-icon-spread-left"></i>
    </div>
    <!-- wap遮罩层 -->
    <div class="site-mobile-shade"></div>
    
</body>
<script src="https://www.ucloud.cn/yun/static/theme/ukd/js/common.js"></script>
<script type="text/javascript">
$(".site-seo-depict *,.site-content-answer-body *,.site-body-depict *").css("max-width","100%");
</script>
</html>

Learning the Python Scrapy Crawler Framework