NodeJs Crawler Framework: Spider (GeoffZhu's blog)

gz-spider

A NodeJs crawler framework based on Puppeteer and Axios (source repository)

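Why a crawler framework is needed

A crawler framework simplifies development, provides a uniform convention, and improves efficiency. A good crawler framework draws on multithreading, multiprocessing, distributed execution, IP pools, and similar capabilities to help developers quickly build maintainable, industrial-grade crawlers that stay useful over the long term.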

Features

Configurable proxy
Task retry support
Puppeteer support
Friendly to asynchronous queue services
Friendly to multiple processes

Installation

npm i gz-spider --save

Usage

const spider = require('gz-spider');

// Each crawler is a function that must be registered through setProcesser
spider.setProcesser({
  ['getGoogleSearchResult']: async (fetcher, params) => {
    // fetcher.page is the raw puppeteer page and can be used directly to open pages
    let resp = await fetcher.axios.get(`https://www.google.com/search?q=${encodeURIComponent(params)}`);

    // throw 'Retry' to retry this processer
    // throw 'ChangeProxy' to retry this processer with a new proxy
    // throw 'Fail' to finish this processer immediately with a failure message

    if (resp.status === 200) {
      // Data processing start (placeholder: replace with real parsing)
      let result = resp.data + 1;
      // Data processing end
      return result;
    } else {
      throw 'Retry';
    }
  }
});

// Start crawling; params is handed to the processer as its second argument
const params = 'gz-spider'; // example search keyword
spider.getData('getGoogleSearchResult', params).then(result => {
  console.log(result);
});
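
The comment in the example above notes that fetcher.page is the raw Puppeteer page. As a minimal sketch of a processer that uses it instead of fetcher.axios, something like the following should work. The processer name getPageTitle and the params.url field are hypothetical and chosen only for illustration; page.goto and page.title are standard Puppeteer calls.

```javascript
// Hypothetical processer: the name `getPageTitle` and the `params.url` field
// are assumptions for illustration, not part of gz-spider.
const processers = {
  getPageTitle: async (fetcher, params) => {
    // fetcher.page is the raw Puppeteer page described above.
    await fetcher.page.goto(params.url, { waitUntil: 'domcontentloaded' });
    const title = await fetcher.page.title();

    // An empty title is treated here as a failed load, so ask for a retry.
    if (!title) throw 'Retry';

    return { url: params.url, title };
  }
};

// Registration would then follow the same pattern as above:
// spider.setProcesser(processers);
```

Because the processer only touches the fetcher it is given, it can be exercised with a stub page object before being run against a real browser.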

Configuration

The framework consists of three parts: fetcher, strategy, and processer.

Fetcher

spider.setFetcher({
  axiosTimeout: 5000,
  proxyTimeout: 180 * 1000,
  proxy() {
    // May return a Promise, so the proxy configuration can be pulled from a remote source
    return {
      host: '127.0.0.1',
      port: '9000'
    }
  }
});

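axiosTimeout: [Number] timeout for each crawler request, in milliseconds
proxyTimeout: [Number] interval after which the proxy IP is refreshed, in milliseconds; intended for proxy IPs that expire, in which case the proxy function is run again to obtain a new proxy IP
proxy: [Object | Function] when proxy is a [Function] it may be asynchronous, so the proxy configuration can be pulled from a remote source
- proxy.host [String]
- proxy.port [String]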
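
To illustrate how proxy and proxyTimeout could interact, here is a minimal sketch of a resolver that accepts either an object or a (possibly asynchronous) function and runs the function again once the cached proxy is older than proxyTimeout. It is not gz-spider's actual implementation; createProxyResolver and the injectable now clock are hypothetical names introduced for this example.

```javascript
// Sketch only: `createProxyResolver` and `now` are hypothetical helpers,
// not part of gz-spider's API.
function createProxyResolver({ proxy, proxyTimeout }, now = Date.now) {
  let cached = null;
  let fetchedAt = 0;

  return async function getProxy() {
    // A plain object is used as-is and never refreshed.
    if (typeof proxy !== 'function') return proxy;

    const expired = cached === null || now() - fetchedAt >= proxyTimeout;
    if (expired) {
      // `await` handles both plain return values and Promises.
      cached = await proxy();
      fetchedAt = now();
    }
    return cached;
  };
}

// Usage with the same values as the configuration above.
const getProxy = createProxyResolver({
  proxyTimeout: 180 * 1000,
  proxy: async () => ({ host: '127.0.0.1', port: '9000' })
});
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting for real time to pass.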

Strategy

spider.setStrategy({
  retryTimes: 2
});

retryTimes: [Number] maximum number of retries
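
As a rough illustration of how retryTimes might combine with the 'Retry', 'ChangeProxy', and 'Fail' signals thrown by a processer, consider the sketch below. It is not the framework's actual code: runWithRetry and onChangeProxy are hypothetical names, and it assumes that retryTimes counts retries after the first attempt.

```javascript
// Sketch only: `runWithRetry` and `onChangeProxy` are hypothetical names.
// Assumption: retryTimes counts retries after the first attempt.
async function runWithRetry(processer, { retryTimes = 2, onChangeProxy = async () => {} } = {}) {
  let lastSignal;
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    try {
      return await processer(attempt);
    } catch (signal) {
      lastSignal = signal;
      if (signal === 'Retry') continue;
      if (signal === 'ChangeProxy') {
        // Let the caller swap the proxy before the next attempt.
        await onChangeProxy();
        continue;
      }
      // 'Fail' and any unexpected error end the task immediately.
      throw signal;
    }
  }
  // Retries exhausted: surface the last signal to the caller.
  throw lastSignal;
}
```

With retryTimes set to 2, a processer that keeps throwing 'Retry' runs three times in total before the error reaches the caller.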

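Using with a task queue

Flow

Fetch a task -> `spider.getData(processerKey, processerIn)` -> complete the task and attach the processed data

Simulating a task queue with MySql

Create a spider-task table containing at least the columns 'id', 'status', 'processer_key', 'processer_input', 'processer_output'
Write an endpoint that pulls an unfinished task, for example GET /spider/task
Write an endpoint that completes a task, for example PUT /spider/task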
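
The three steps above can be tried out without a database. The following is a minimal in-memory stand-in for the spider-task table and for the logic behind the two endpoints. The names createTaskStore, takeTask, and completeTask, as well as the status values 'pending' and 'running', are assumptions for illustration; a real setup would keep the rows in MySql and expose the two functions over HTTP.

```javascript
// Sketch only: an in-memory stand-in for the spider-task table.
// `createTaskStore`, `takeTask`, `completeTask` and the status values
// 'pending' / 'running' are assumptions for illustration.
function createTaskStore(rows) {
  const tasks = rows.map((row, i) => ({
    id: i + 1,
    status: 'pending',
    processer_key: row.processer_key,
    processer_input: row.processer_input,
    processer_output: null
  }));

  return {
    // Logic behind GET /spider/task: hand out one unfinished task, or null.
    takeTask() {
      const task = tasks.find(t => t.status === 'pending');
      if (!task) return null;
      task.status = 'running';
      return {
        id: task.id,
        processerKey: task.processer_key,
        processerInput: task.processer_input
      };
    },
    // Logic behind PUT /spider/task: store the output and the final status.
    completeTask({ id, status, processerOutput }) {
      const task = tasks.find(t => t.id === id);
      if (!task) return false;
      task.status = status;
      task.processer_output = processerOutput;
      return true;
    }
  };
}
```

Marking a task as 'running' when it is handed out keeps two workers from picking up the same row, which matters once several processes poll the same queue.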

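const axios = require('axios');
const spider = require('gz-spider');

// await is not allowed at the top level of a CommonJS module, so wrap the loop
async function main() {
  while (true) {
    // Fetch a task
    let resp = await axios.get('http://127.0.0.1:8080/spider/task');

    // Stop when no unfinished task is left
    if (!resp.data.task) break;

    let { id, processerKey, processerInput } = resp.data.task;
    let processerOutput = await spider.getData(processerKey, processerInput);

    // Complete the task and attach the processed data
    await axios.put('http://127.0.0.1:8080/spider/task', {
      id, processerOutput,
      status: 'success'
    });
  }
}

main().catch(console.error);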

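Some thoughts on crawlers

The way a crawler runs means it can never be stable over the long term or truly real-time. The design of a crawler framework therefore centers on the asynchronous task queue. In engineering terms, a crawler framework provides an efficient data-processing pipeline and can be adapted to many kinds of task queues.

gz-spider has three components: fetcher, strategy, and processer.
fetcher: the fetcher, which bundles the commonly used http client and puppeteer and can be attached to various kinds of proxies.
strategy: the strategy center, which configures the policies applied after a crawl fails.
processer: turns the raw data into the target data; this is the part that users of the framework write.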

License

MIT
