Web Scraping with Python: Crawling Links and Following Them

Following Links Between Pages

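In web scraping, a crawler that only works on a single page is of little use. The data you want is usually spread across many pages, so the crawler has to collect the links on each page and follow them.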

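Following links means collecting new links from the page currently being scraped and passing them back into the scraping routine, which then extracts information from the new pages.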
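The collect-and-follow step can be sketched with the standard library alone. This is a minimal sketch, not the code used later in this article: the `LinkCollector` class, the `collect_links` helper, the sample HTML, and the example.org URL are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_links(base_url, html):
    """Return the absolute URLs of all links found in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    # urljoin resolves relative links such as /wiki/Python against the page URL
    return [urljoin(base_url, href) for href in parser.hrefs]


# Sample page content, made up for this example
sample_html = '<p><a href="/wiki/Python">Python</a> and <a href="https://example.org/">other</a></p>'
links = collect_links("http://en.wikipedia.org/wiki/Main_Page", sample_html)
print(links)
```

Each URL in `links` could then be handed back to the same routine, which is the "follow" half of the loop.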

An Example

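Take Wikipedia as an example. Starting from the home page, we print each page's h1 title, the first p paragraph inside the element with id mw-content-text, and the href of the a link inside the span under the element with id ca-edit. The code is as follows: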

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()

def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # print(bsObj)
    try:
        print(bsObj.h1.get_text())
        print(bsObj.find(id="mw-content-text").find_all("p")[0])
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")

    for link in bsObj.find_all("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print("----------------\n"+newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")

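The program starts at http://en.wikipedia.org, finds every link that begins with /wiki/, and appends it to http://en.wikipedia.org to form the next URL to fetch. Links that have already been seen are skipped. Because getLinks calls itself for every new page, the crawl is recursive and, given the size of Wikipedia, will not finish on its own in any reasonable time. In practice it is likely to hit Python's default recursion limit (1000 nested calls) and stop with a RecursionError long before that.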
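Since every new page adds another level of recursion, the same "visit each page once" idea can also be written with an explicit queue. The following is a sketch under stated assumptions, not this article's method: `crawl`, `fetch_links`, `max_pages`, and the `FAKE_SITE` dictionary are hypothetical names, the traversal is breadth-first rather than depth-first, and the link lookup is a plain dictionary so the example runs without network access. In a real crawler `fetch_links` would download and parse the page.

```python
from collections import deque


def crawl(start, fetch_links, max_pages=100):
    """Visit pages breadth-first without recursion and return them in visit order.

    fetch_links is any callable that maps a page path to the links found on it.
    """
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:  # skip pages that were already discovered
                seen.add(link)
                queue.append(link)
    return visited


# A tiny made-up link graph standing in for real pages
FAKE_SITE = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}

order = crawl("/wiki/A", lambda page: FAKE_SITE.get(page, []))
print(order)
```

The `max_pages` cap gives the crawl a definite end, which the recursive version lacks.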

Six Degrees of Separation

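Six degrees of separation is a conjecture about social networks: any stranger can be reached through a chain of at most six acquaintances, that is, no more than six people stand between you and that stranger.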

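We can apply this conjecture to a crawler: starting from one person's biography page, we can crawl our way to another person's. Again using Wikipedia as the example, we start from Kevin_Bacon and crawl to other articles: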

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

random.seed(datetime.datetime.now().timestamp())  # seed with a float; Python 3.11+ rejects datetime objects

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

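The program picks one link at random from the links found on the current page, then crawls that link, and repeats until it reaches a page with no matching links.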
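The regular expression passed to the link search keeps only paths that start with /wiki/ and contain no colon, which filters out namespace pages such as Category: or File: pages. A small check of that pattern follows. The sample paths are made up and `is_article_link` is a hypothetical helper name.

```python
import re

# Same pattern as in getLinks: starts with /wiki/ and contains no colon afterwards
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")


def is_article_link(href):
    """Return True if href looks like a link to an ordinary article page."""
    return ARTICLE_LINK.search(href) is not None


samples = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_actors",
    "/wiki/File:Example.jpg",
    "//www.example.org/page",
]
for href in samples:
    print(href, is_article_link(href))
```

Only the first sample passes the filter. The other three are rejected because they contain a colon or do not start with /wiki/.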

scrapy

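When writing a web crawler you end up repeating the same simple operations: finding all the links on a page, telling internal links from external ones, and moving on to new pages. This kind of repetitive work can be handed to a third-party tool, and scrapy is one such tool.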

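scrapy used to be a picky tool that supported only Python 2.7, and that narrow support made it unusable for many people. The good news is that scrapy now runs on Python 3.x as well. Run $pip3 install scrapy to install it. If you are on Python 2.x, be sure to use the pip command instead.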

Using scrapy requires creating a new project (here we demonstrate how to get the title of a web page):

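$scrapy startproject wikiSpider creates a new project
Create the file ArticleSpider.py in the spiders directory (any file name will do)
Define the class Article in items.py
scrapy crawl article runs the program (this command invokes the spider by its name, article, which is neither the class name nor the file name but the value set by name = "article" in ArticleSpider)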

items.py should be defined as follows:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

# import scrapy
from scrapy import Item, Field


# class WikispiderItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass


class Article(Item):
    # define the fields for your item here like: # name = scrapy.Field()
    title = Field()

Each scrapy Item object represents one page of the site. You can define whatever fields you need (for example url, content, or header image), but here I only collect the title field of each page.

Put the following program in ArticleSpider.py:

#! /usr/local/bin/python3
# encoding:utf-8

from scrapy import Spider
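# The import below is surprisingly tricky when the spider is run with: scrapy crawl article
# from wikiSpider.wikiSpider.items import Article satisfies the IDE but fails when run from the terminal
# from wikiSpider.items import Article works from the terminal but the IDE flags it as an error
# so a relative import is recommended
from ..items import Article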
# from wikiSpider.wikiSpider.items import Article


class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: "+title)
        item['title'] = title
        return item

Note the comments in this file about how to import the item class.

If everything is correct, the output in the terminal should be:

Title is: Main Page
Title is: Python (programming language)