Crawling multiple URLs with Puppeteer

focusj, published 2019-08-23 17:06 / 2797 views

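Summary: Basic usage of Puppeteer to collect image data from a page, the order in which the calls must be made (launch, open a page, register the listener, navigate, close), and how to crawl all images across an array of URLs and return their real width and height. The approach loosely follows answers on Stack Overflow.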

Basic usage

"use strict";
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let imgArr = [];
  page.on("domcontentloaded", async () => {
    imgArr = await page.$$eval("img", img => {
      const arr = [];
      // $$eval passes in a collection of elements, so iterate over it again
      for (let i = 0; i < img.length; i++) {
        const obj = {
          width: img[i].width,
          naturalWidth: img[i].naturalWidth,
          height: img[i].height,
          naturalHeight: img[i].naturalHeight,
          isStandard: !((img[i].width * 10 <= img[i].naturalWidth || img[i].height * 10 <= img[i].naturalHeight)),
          url: img[i].src,
          level: 3,
          imageUrl: img[i].src,
          describeUrl: "",
          summary: `A ${img[i].naturalWidth}x${img[i].naturalHeight} source image was loaded to display a ${img[i].width}x${img[i].height} image`,
        };
        if (obj.width && obj.height) {
          arr.push(obj);
        }
      }
      return arr;
    });
  });
  await page.goto("https://www.npmjs.com/package/puppeteer", { waitUntil: "networkidle0" });
  await browser.close();
  console.log("imgArr: ", imgArr);
})();
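
The isStandard expression in the code above is dense, so here is a minimal sketch of the same check as a standalone function. The helper name `isStandard` as a function and the `factor` parameter are assumptions for illustration; the original hard-codes the factor 10 inline.

```javascript
"use strict";

// Hypothetical helper mirroring the isStandard expression above: an image is
// "standard" unless its source is at least `factor` times larger than the
// displayed size in either dimension.
function isStandard({ width, height, naturalWidth, naturalHeight }, factor = 10) {
  return !(width * factor <= naturalWidth || height * factor <= naturalHeight);
}

// Displayed at 100x100 from a 200x200 source: only 2x larger, so standard.
console.log(isStandard({ width: 100, height: 100, naturalWidth: 200, naturalHeight: 200 })); // true
// Displayed at 20x20 from a 400x300 source: 20x wider, so not standard.
console.log(isStandard({ width: 20, height: 20, naturalWidth: 400, naturalHeight: 300 })); // false
```

Because the function takes plain numbers, it can be checked in Node without launching a browser.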

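The order of the calls must not change:

await puppeteer.launch() starts the browser

await browser.newPage() opens a page

page.on registers the event listener

await page.goto navigates to the page

await browser.close() closes the browser

If the order changes, in particular if page.on() is called only after page.goto() has finished, the listener is registered too late and will not receive the event.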
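
Why the order matters can be shown without a browser. The sketch below uses Node's built-in EventEmitter with a hypothetical `FakePage` class as a simplified stand-in for a page; it is not Puppeteer's actual implementation, but it shows the general rule that an event emitted before a listener is attached is not replayed.

```javascript
"use strict";
const { EventEmitter } = require("events");

// Hypothetical stand-in for a page: emits "domcontentloaded" once when
// goto() is called. A simplified model, not Puppeteer's real Page class.
class FakePage extends EventEmitter {
  goto(url) {
    this.emit("domcontentloaded", url);
  }
}

// Returns how many events the listener saw for a given call order.
function countEvents(registerFirst) {
  const page = new FakePage();
  let seen = 0;
  const listener = () => { seen += 1; };
  if (registerFirst) {
    page.on("domcontentloaded", listener); // listener exists before navigation
    page.goto("https://example.com");
  } else {
    page.goto("https://example.com");      // event fires with nobody listening
    page.on("domcontentloaded", listener); // too late: past events are not replayed
  }
  return seen;
}

console.log(countEvents(true));  // 1
console.log(countEvents(false)); // 0
```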

Usage with multiple URLs

Crawl all images on every URL in an array and return their real width and height.

/* eslint-disable no-undef */
"use strict";
const puppeteer = require("puppeteer");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  let arr = [];
  const html = [ "https://www.npmjs.com/package/puppeteer", "https://www.iconfont.cn/search/index?searchType=icon&q=test" ];
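  for (let i = 0; i < html.length; i++) {
    // Navigate sequentially, reusing the same page for every URL
    await page.goto(html[i], { waitUntil: "domcontentloaded" });
    // Throws a timeout error if no <img> appears within 3 seconds
    await page.waitForSelector("img", { timeout: 3000 });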
    // eslint-disable-next-line no-loop-func
    const doms = await page.evaluate(() => {
      const arr = [ ...document.querySelectorAll("img") ];
      return arr.map(v => {
        return {
          naturalWidth: v.naturalWidth,
          naturalHeight: v.naturalHeight,
          width: v.width,
          height: v.height,
        };
      });
    });
    arr = [ ...arr, ...doms ];
  }
  await browser.close();
  console.log("arr: ", arr);
})();
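
In the loop above, a single URL that fails to load or has no img element within the timeout rejects the promise and stops the whole crawl. The following is a minimal sketch of one way to keep going, assuming a hypothetical helper `crawlImageSizes` that receives an already-opened page; the helper name, its `timeout` parameter, and the `{ images, failed }` return shape are illustrative and not part of the original post.

```javascript
"use strict";

// Hypothetical helper (not from the original post): visits each URL in turn
// with an already-opened Puppeteer page and collects image dimensions.
// A URL that fails (navigation error, or no <img> within the timeout) is
// recorded in `failed` instead of aborting the whole loop.
async function crawlImageSizes(page, urls, timeout = 3000) {
  const images = [];
  const failed = [];
  for (const url of urls) {
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      await page.waitForSelector("img", { timeout });
      // $$eval runs the callback in the page with an array of matched elements
      const sizes = await page.$$eval("img", imgs =>
        imgs.map(v => ({
          naturalWidth: v.naturalWidth,
          naturalHeight: v.naturalHeight,
          width: v.width,
          height: v.height,
        }))
      );
      // Tag each record with the URL it came from
      images.push(...sizes.map(size => ({ url, ...size })));
    } catch (err) {
      failed.push({ url, message: err.message });
    }
  }
  return { images, failed };
}
```

It would be called between `browser.newPage()` and `browser.close()`, for example `const { images, failed } = await crawlImageSizes(page, html);`. Taking the page as an argument also lets the function be exercised with a stub object in tests.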

This approach loosely follows these Stack Overflow answers:

Crawling multiple URL in a loop using puppeteer

Looping through a set of urls in Puppeteer

Puppeteer - Proper way to loop through multiple URLs
