Deploying Scrapy + Scrapyd + ScrapydWeb with Docker

defcon, published 2019-06-28 16:57 / 2242 reads

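Summary: For remote access to Scrapyd, set "bind_address = 0.0.0.0" in the Scrapyd configuration file and restart Scrapyd. If everything runs in the same container, localhost can be used directly; this article demonstrates the case of separate containers or hosts.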

To begin, here are the official definitions of the software used in this article.
Scrapy

An open source and collaborative framework for extracting the data you
need from websites. In a fast, simple, yet extensible way.

Scrapyd

Scrapy comes with a built-in service, called “Scrapyd”, which allows
you to deploy (aka. upload) your projects and control their spiders
using a JSON web service.

Scrapydweb

A full-featured web UI for Scrapyd cluster management,
with Scrapy log analysis & visualization supported.

Docker

Docker Container: A container is a standard unit of software that packages up code and
all its dependencies so the application runs quickly and reliably from
one computing environment to another. A Docker container image is a
lightweight, standalone, executable package of software that includes
everything needed to run an application: code, runtime, system tools,
system libraries and settings.

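The system as a whole does not depend on Docker. What Docker gives us is a standardized runtime environment, which lowers operations costs and offers a fast, uniform approach for distributed deployment later on. Likewise, Scrapyd + ScrapydWeb serve only to provide a UI for observation and testing.

Scrapy, Scrapyd, and ScrapydWeb could also be split into three separate images, but for ease of explanation a single Docker image is used here.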

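A Scrapy project can be deployed to Scrapyd either with the command-line tool scrapyd-deploy or from the deploy console in the ScrapydWeb admin UI. In both cases the Scrapyd service must already be running and listening (port 6800 by default).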
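Before deploying, it can help to confirm that Scrapyd is actually listening. The following is a minimal sketch and not part of the original setup: it builds the URL of Scrapyd's listprojects.json endpoint and interprets the JSON reply. The helper names (`scrapyd_url`, `projects_from_reply`, `list_projects`) and the default host are assumptions made for illustration.

```python
import json
from urllib.request import urlopen


def scrapyd_url(endpoint, host="127.0.0.1", port=6800):
    """Build the URL of a Scrapyd JSON endpoint (6800 is Scrapyd's default port)."""
    return f"http://{host}:{port}/{endpoint}"


def projects_from_reply(body):
    """Return the project names from a listprojects.json reply, or raise on error."""
    reply = json.loads(body)
    if reply.get("status") != "ok":
        raise RuntimeError(f"Scrapyd returned an error: {reply}")
    return reply.get("projects", [])


def list_projects(host="127.0.0.1", port=6800, timeout=5):
    """Query a running Scrapyd instance; raises URLError if nothing is listening."""
    url = scrapyd_url("listprojects.json", host, port)
    with urlopen(url, timeout=timeout) as resp:
        return projects_from_reply(resp.read().decode("utf-8"))
```

Calling `list_projects()` inside the container should return an empty list on a fresh Scrapyd instance and raise a connection error when the service has not been started.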

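The Scrapyd service can run on the internal network only. ScrapydWeb reaches the servers listed in SCRAPYD_SERVERS through their internal addresses, so only ScrapydWeb itself needs to expose its listening port (5000 by default) to the outside.

The Dockerfile is adapted from aciobanu/scrapy.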

FROM alpine:latest

RUN echo "https://mirror.tuna.tsinghua.edu.cn/alpine/latest-stable/main/" > /etc/apk/repositories

#RUN apk update && apk upgrade 

RUN apk -U add \
    gcc \
    bash \
    bash-doc \
    bash-completion \
    libffi-dev \
    libxml2-dev \
    libxslt-dev \
    libevent-dev \
    musl-dev \
    openssl-dev \
    python-dev \
    py-imaging \
    py-pip \
    redis \
    curl ca-certificates \
    && update-ca-certificates \
    && rm -rf /var/cache/apk/*

RUN pip install --upgrade pip \
    && pip install Scrapy

RUN pip install scrapyd \
    && pip install scrapyd-client \
    && pip install scrapydweb

RUN pip install fake_useragent \
    && pip install scrapy_proxies \
    && pip install sqlalchemy \
    && pip install mongoengine \
    && pip install redis

WORKDIR /runtime/app

EXPOSE 5000 

COPY launch.sh /runtime/launch.sh
RUN chmod +x /runtime/launch.sh

# Uncomment the line below once testing succeeds
# ENTRYPOINT ["/runtime/launch.sh"]

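If Scrapy, Scrapyd, and ScrapydWeb are split into three separate images, split the service startup section below accordingly and let the containers communicate through the link option at container startup.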

#!/bin/sh

# kill any existing scrapyd process; ignore the error if none is running
kill -9 $(pidof scrapyd) 2>/dev/null || true

# enter the directory containing the configuration file and launch scrapyd in the background
cd /runtime/app/scrapyd && nohup /usr/bin/scrapyd > ./scrapyd.log 2>&1 &

cd /runtime/app/scrapydweb && /usr/bin/scrapydweb
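The script above starts Scrapyd in the background and launches ScrapydWeb immediately afterwards. If ScrapydWeb should only start once Scrapyd accepts connections, a small readiness check could be added. The sketch below is an optional addition, not something the original script does; `wait_for_port` and its parameters are hypothetical names.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable): pause briefly, then retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # 6800 is Scrapyd's default port, as used throughout this article.
    ready = wait_for_port("127.0.0.1", 6800)
    print("Scrapyd is up" if ready else "Scrapyd did not start in time")
```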

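The directory structure of /runtime/app is as follows.

Root directory (/usr/local/src/scrapy-d-web [actual directory on the host] : /runtime/app [directory inside the container])

   Dockerfile - after editing it, run [docker build -t scrapy-d-web:v1 .] to build the image. I first built on an Alibaba Cloud instance with 1 CPU and 1 GB of memory, but lxml kept failing; after upgrading to 2 GB of memory the build succeeded.
   scrapyd - holds the Scrapyd configuration file and other directories
   scrapydweb - holds the ScrapydWeb configuration file
   knowsmore - project directory 1, created with scrapy startproject
   pxn - project directory 2, created with scrapy startproject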

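Now let's start each service by hand and go through them step by step. First, start the container and enter a shell.

# Create a user-defined network. Skip this step if the containers are not split, because the services
# then listen on localhost. Once split, fixed IP addresses are needed for the Scrapyd + ScrapydWeb
# configuration below.
docker network create --subnet=192.168.0.0/16 mynetwork
# Set the network address and container name, and set up the directory and port mappings.
docker run -it --rm --net mynetwork --ip 192.168.1.100 --name scrapyd -p 5000:5000 -v /usr/local/src/scrapy-d-web/:/runtime/app scrapy-d-web:v1 /bin/sh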
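The fixed address passed to --ip has to fall inside the subnet given to docker network create. A quick sanity check of the two values used above, written as a minimal sketch with Python's standard ipaddress module (`ip_in_subnet` is a hypothetical helper name):

```python
import ipaddress


def ip_in_subnet(ip, subnet):
    """Return True if the address belongs to the given CIDR subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


# The container address and subnet from the docker commands above.
print(ip_in_subnet("192.168.1.100", "192.168.0.0/16"))  # True
# An address outside the subnet, for comparison.
print(ip_in_subnet("10.0.0.5", "192.168.0.0/16"))       # False
```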

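Go to the directory containing scrapyd.conf (/runtime/app/scrapyd). Here I use the scrapyd.conf in the current directory; for the order in which Scrapyd configuration files take effect, see the official Scrapyd documentation. The official example configuration file is shown below.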

[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   = 
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
# Left at 127.0.0.1 rather than 0.0.0.0 because external access is not needed
bind_address = 127.0.0.1
# If you change this port, remember to update the ScrapydWeb configuration too
http_port   = 6800
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
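The two settings that matter most for ScrapydWeb are bind_address and http_port. As a rough sketch (not part of Scrapyd itself), they can be read with Python's standard configparser to see which address ScrapydWeb should target and whether Scrapyd is reachable from other containers. `SAMPLE_CONF` and `scrapyd_endpoint` are hypothetical names, and the fallback values are assumed from the example configuration above.

```python
import configparser

# A trimmed-down scrapyd.conf used only for this illustration.
SAMPLE_CONF = """
[scrapyd]
bind_address = 127.0.0.1
http_port    = 6800
"""


def scrapyd_endpoint(conf_text):
    """Return (bind_address, port, reachable_from_other_hosts) from scrapyd.conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["scrapyd"]
    bind = section.get("bind_address", fallback="127.0.0.1")
    port = section.getint("http_port", fallback=6800)
    # A loopback bind address only accepts connections from the same host or container.
    reachable = bind not in ("127.0.0.1", "localhost", "::1")
    return bind, port, reachable


print(scrapyd_endpoint(SAMPLE_CONF))  # ('127.0.0.1', 6800, False)
```

With "bind_address = 0.0.0.0" the third element becomes True, which is the setting needed when ScrapydWeb runs in a different container or on a different host.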

Open another terminal, attach to the same Docker container, go to the directory containing the ScrapydWeb configuration file (/runtime/app/scrapydweb), and start ScrapydWeb.

docker exec -it scrapyd /bin/bash

For full details on the ScrapydWeb project, see its GitHub page. Part of my configuration is shown below.

############################## ScrapydWeb #####################################
# Setting SCRAPYDWEB_BIND to "0.0.0.0" or IP-OF-CURRENT-HOST would make
# ScrapydWeb server visible externally, otherwise, set it to "127.0.0.1".
# The default is "0.0.0.0".
SCRAPYDWEB_BIND = "0.0.0.0"
# Accept connections on the specified port, the default is 5000.
SCRAPYDWEB_PORT = 5000

# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = "user"
PASSWORD = "pass"


############################## Scrapy #########################################
# ScrapydWeb is able to locate projects in the SCRAPY_PROJECTS_DIR,
# so that you can simply select a project to deploy, instead of eggifying it in advance.
# e.g., "C:/Users/username/myprojects/" or "/home/username/myprojects/"
SCRAPY_PROJECTS_DIR = "/runtime/app/"


############################## Scrapyd ########################################
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed
# and started on all of your hosts.
# Note that for remote access, you have to manually set "bind_address = 0.0.0.0"
# in the configuration file of Scrapyd and restart Scrapyd to make it visible externally.
# Check out "https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file" for more info.

# - the string format: username:password@ip:port#group
#   - The default port would be 6800 if not provided,
#   - Both basic auth and group are optional.
#   - e.g., "127.0.0.1" or "username:password@192.168.123.123:6801#group"
# - the tuple format: (username, password, ip, port, group)
#   - When the username, password, or group is too complicated (e.g., contains ":@#"),
#   - or if ScrapydWeb fails to parse the string format passed in,
#   - it's recommended to pass in a tuple of 5 elements.
#   - e.g., ("", "", "127.0.0.1", "", "") or ("username", "password", "192.168.123.123", "6801", "group")
SCRAPYD_SERVERS = [
    "192.168.1.100:6800",  # Use localhost if everything runs in the same container; this shows the case of separate containers or hosts
    # "username:password@localhost:6801#group",
    # ("username", "password", "localhost", "6801", "group"),
]

# If the IP part of a Scrapyd server is added as "127.0.0.1" in the SCRAPYD_SERVERS above,
# ScrapydWeb would try to read Scrapy logs directly from disk, instead of making a request
# to the Scrapyd server.
# Check out this link to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
# e.g., "C:/Users/username/logs/" or "/home/username/logs/"
SCRAPYD_LOGS_DIR = "/runtime/app/scrapyd/logs/"
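To make the relationship between the string format and the 5-element tuple format of SCRAPYD_SERVERS concrete, here is a minimal sketch that converts one into the other. It is illustrative only and is not ScrapydWeb's own parser; `parse_server` is a hypothetical helper, and it deliberately does not handle credentials that contain ":", "@" or "#" (the case for which the tuple format is recommended).

```python
def parse_server(entry):
    """Split 'username:password@ip:port#group' into a 5-tuple of strings.

    Auth and group are optional; the port defaults to 6800 when omitted.
    """
    # Everything after '#' is the optional group name.
    entry, _, group = entry.partition("#")
    # Everything before the last '@' is the optional 'username:password' part.
    auth, sep, address = entry.rpartition("@")
    username, _, password = auth.partition(":") if sep else ("", "", "")
    ip, _, port = address.partition(":")
    return (username, password, ip, port or "6800", group)


print(parse_server("192.168.1.100:6800"))
# ('', '', '192.168.1.100', '6800', '')
print(parse_server("username:password@192.168.123.123:6801#group"))
# ('username', 'password', '192.168.123.123', '6801', 'group')
```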

Then open http://[YOUR IP ADDRESS]:5000 in a browser.
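Because ENABLE_AUTH is set to True in the configuration above, the web UI expects HTTP basic auth credentials. The sketch below shows how such a request could be prepared from a script using only the standard library. `basic_auth_header` and `build_ui_request` are hypothetical helper names, the host is a placeholder, and the request is built but not sent.

```python
import base64
from urllib.request import Request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_ui_request(host, username, password, port=5000):
    """Prepare (but do not send) a request to the ScrapydWeb UI with credentials."""
    req = Request(f"http://{host}:{port}/")
    req.add_header("Authorization", basic_auth_header(username, password))
    return req


# USERNAME and PASSWORD as set in the sample configuration.
req = build_ui_request("127.0.0.1", "user", "pass")
print(req.full_url)                     # http://127.0.0.1:5000/
print(req.get_header("Authorization"))  # Basic dXNlcjpwYXNz
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual call once ScrapydWeb is running.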


When reposting, please credit the original address: https://www.ucloud.cn/yun/27664.html
