Basic Usage of the Scrapy Crawling Framework

ClownF | Original | 2025/5/1 | About 4 minutes

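My job requires writing a large number of crawlers, so I started looking for a crawling framework that would fit, and came across Scrapy.

Scrapy is one of the most popular crawling frameworks for Python.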

Installation

# Install with pip
pip install scrapy

# Verify the installation
scrapy version

Creating a Project

# Create a new project
scrapy startproject myspider

# Project layout
myspider/
├── scrapy.cfg
└── myspider/
    ├── __init__.py
    ├── items.py          # Data model definitions
    ├── middlewares.py    # Middlewares
    ├── pipelines.py      # Item pipelines: where the scraped data goes
    ├── settings.py       # Settings
    └── spiders/          # Spider directory
        └── __init__.py

Creating a Spider

cd myspider
# Generate a basic spider
scrapy genspider example example.com

The generated spider file, spiders/example.py:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Parse the page here
        pass

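Most websites today use anti-crawling measures, load content dynamically, or need JavaScript to run, so it is worth installing selenium and undetected_chromedriver as well.

selenium is a browser automation tool used to simulate a real user. On a server, I recommend installing a desktop environment.

undetected_chromedriver is used to get around anti-crawling detection.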

import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Optional: add custom arguments
options = uc.ChromeOptions()
# options.add_argument("--headless=new")   # Enable this for headless mode on a server
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

# Start undetected Chrome
driver = uc.Chrome(options=options)

try:
    # Open the page
    driver.get("https://www.baidu.com")

    # Find the search box and type a query
    search_box = driver.find_element(By.ID, "kw")
    search_box.send_keys("selenium undetected_chromedriver example")
    search_box.send_keys(Keys.ENTER)

    # Print the page title
    print(driver.title)
finally:
    # Always close the browser, even if a step above fails
    driver.quit()

Core Concepts

1. Spider

The Spider is the core of a crawler: it defines how pages are fetched and parsed.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract every quote
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Follow the link to the next page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
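
response.follow accepts relative links and resolves them against the URL of the current page. The resolution can be illustrated with the standard library alone. This is a small sketch, and the two URL values are assumed examples of what the spider would see:

```python
from urllib.parse import urljoin

# URL of the page being parsed (assumed example)
page_url = "https://quotes.toscrape.com/page/1/"
# A typical relative href taken from the "next" link (assumed example)
next_href = "/page/2/"

# Resolve the relative link against the current page
next_url = urljoin(page_url, next_href)
print(next_url)  # https://quotes.toscrape.com/page/2/
```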

2. Selector

Scrapy supports two kinds of selectors, CSS and XPath:

# CSS selectors
response.css("div.content")
response.css("a::attr(href)").get()      # Get an attribute
response.css("span::text").get()          # Get text
response.css("div.item::text").getall()   # Get all matches

# XPath selectors
response.xpath("//div[@class='content']")
response.xpath("//a/@href").get()
response.xpath("//span/text()").get()

3. Item

Define the data structure in items.py:

import scrapy

class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    author = scrapy.Field()
    publish_date = scrapy.Field()
    url = scrapy.Field()

Use it in a Spider:

from myspider.items import ArticleItem

def parse(self, response):
    item = ArticleItem()
    item["title"] = response.css("h1::text").get()
    item["content"] = response.css("div.content::text").get()
    item["url"] = response.url
    yield item

4. Pipeline

Process the scraped data in pipelines.py, for example by writing it to disk or storing it in a database:

import json
import pymysql

# Write to disk (one JSON object per line)
class FilePipeline:
    def open_spider(self, spider):
        self.file = open("items.json", "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

# Store in a database
class MySQLPipeline:
    def open_spider(self, spider):
        self.db = pymysql.connect(
            host='localhost',
            user='root',
            password='123456',
            database='scrapy_db',
            charset='utf8mb4'
        )
        self.cur = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        sql = """
        INSERT INTO mytable (title, url, content)
        VALUES (%s, %s, %s)
        """
        self.cur.execute(sql, (
            item.get("title"),
            item.get("url"),
            item.get("content")
        ))
        self.db.commit()
        return item
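
The MySQL pipeline above commits once per item. When the volume is large, buffering rows and writing them in batches may reduce round trips. The sketch below shows the idea with sqlite3 from the standard library standing in for MySQL so that it runs anywhere. The class name, the batch size, and the in-memory database are all assumptions. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3


class BatchedSQLitePipeline:
    """Hypothetical pipeline: buffers rows and inserts them in batches."""

    batch_size = 100  # assumed value; tune it for your workload

    def open_spider(self, spider):
        # In-memory database for illustration; swap in a real connection as needed
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS mytable (title TEXT, url TEXT, content TEXT)"
        )
        self.buffer = []

    def close_spider(self, spider):
        self._flush()  # write whatever is still buffered
        self.db.close()

    def process_item(self, item, spider):
        self.buffer.append((item.get("title"), item.get("url"), item.get("content")))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def _flush(self):
        if not self.buffer:
            return
        self.db.executemany(
            "INSERT INTO mytable (title, url, content) VALUES (?, ?, ?)",
            self.buffer,
        )
        self.db.commit()
        self.buffer.clear()
```

The trade-off is that rows still in the buffer are lost if the process is killed before close_spider runs.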

Enable them in settings.py:

ITEM_PIPELINES = {
    "myspider.pipelines.FilePipeline": 300,
    "myspider.pipelines.MySQLPipeline": 400,
}
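
The integer values decide the order in which items pass through the pipelines: lower numbers run first, and values are customarily chosen in the 0 to 1000 range. The ordering can be illustrated in plain Python. This is only a sketch of the rule, not Scrapy's internal code:

```python
# Same pipelines as above, deliberately listed out of order
ITEM_PIPELINES = {
    "myspider.pipelines.MySQLPipeline": 400,
    "myspider.pipelines.FilePipeline": 300,
}

# Sort by the priority value: lower numbers run first
run_order = [
    path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
# ['myspider.pipelines.FilePipeline', 'myspider.pipelines.MySQLPipeline']
```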

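Common Settings

The options I use most often in settings.py:

# Download delay (seconds). Needed when crawling sensitive sites (such as government sites); don't crawl them until they go down.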
DOWNLOAD_DELAY = 1

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 16

# User-Agent
# Can be ignored when pages are fetched with selenium + undetected_chromedriver
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

# Disable cookies
COOKIES_ENABLED = False

# Default request headers
# Can be ignored when pages are fetched with selenium + undetected_chromedriver
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Log level
LOG_LEVEL = "INFO"
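
If a fixed delay is too blunt, Scrapy's built-in AutoThrottle extension can adjust the delay based on how quickly the site responds. The fragment below is an optional addition to settings.py. The setting names are Scrapy's own, but every value is an illustrative assumption rather than a recommendation:

```python
# settings.py (optional additions; values are illustrative)

# Let Scrapy adapt the delay to the observed response latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1            # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30             # upper bound when the site is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per remote site

# Cap parallel requests to any single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```

Separately, with the default RANDOMIZE_DOWNLOAD_DELAY = True, Scrapy waits a random time between 0.5 and 1.5 times DOWNLOAD_DELAY, so the DOWNLOAD_DELAY = 1 above means roughly 0.5 to 1.5 seconds between requests to the same site.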

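Running a Spider

# Run the spider
scrapy crawl quotes

# Save the data to a file
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv

# Run from Python code
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myspider.spiders.quotes import QuotesSpider

# Load settings.py so that pipelines and other project settings apply
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # Blocks until the crawl finishes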

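Some Thoughts on Scrapy

My first crawling project had a few hundred spiders. I gave each spider its own Scrapy project, which left GitLab with several hundred projects that were very hard to maintain.

Changing a database password or an ES password meant editing hundreds of files.

Later I worked it out: use one Scrapy project per business project, and put all of that project's spiders in the same Scrapy project.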

# Project layout
myspider/
├── scrapy.cfg
└── myspider/
    ├── __init__.py
    ├── items.py          # Data model definitions
    ├── middlewares.py    # Middlewares
    ├── pipelines.py      # Item pipelines: where the scraped data goes
    ├── settings.py       # Settings
    └── spiders/          # Spider directory
        ├── __init__.py
        ├── quotes_spider.py   # Spider 1
        ├── quotes2_spider.py  # Spider 2
        ├── news/
        │   └── news_spider.py # News spider
        └── social/
            └── social_spider.py # Social media spider
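
With every spider in one project, credentials can also live in one place. One possible approach is to read them from environment variables in settings.py, so that rotating a password touches no code at all. This is a sketch: the function name, the CRAWLER_DB_* variable names, and the DB_SETTINGS setting are all hypothetical.

```python
import os


def load_db_settings(env=os.environ):
    """Build connection settings from environment variables (names are hypothetical)."""
    return {
        "host": env.get("CRAWLER_DB_HOST", "localhost"),
        "user": env.get("CRAWLER_DB_USER", "root"),
        "password": env.get("CRAWLER_DB_PASSWORD", ""),
        "database": env.get("CRAWLER_DB_NAME", "scrapy_db"),
        "charset": "utf8mb4",
    }


# Scrapy only loads upper-case names from settings.py, so this becomes a setting
DB_SETTINGS = load_db_settings()
```

A pipeline could then connect with pymysql.connect(**spider.settings.getdict("DB_SETTINGS")) instead of hard-coding the password.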

If news and social write to different database tables or ES indexes, the pipeline can pick the target like this:

# Resolve the index name: prefer the spider's index_type attribute, otherwise infer it from the module path
if hasattr(spider, 'index_type'):
    index_type = spider.index_type
else:
    # Infer from the spider's module path: spiders.news -> news, spiders.social -> social
    module_path = spider.__module__
    if '.news.' in module_path or module_path.endswith('.news'):
        index_type = 'news'
    elif '.social.' in module_path or module_path.endswith('.social'):
        index_type = 'social'
    else:
        index_type = 'news'  # Default
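
As more spider packages are added, the if/elif chain grows. A roughly equivalent helper can take the package names as a parameter instead. This is a sketch: the function name resolve_index_type and the default list of known names are assumptions based on the directory layout above.

```python
def resolve_index_type(spider, known=("news", "social"), default="news"):
    """Pick the index name for a spider (hypothetical helper)."""
    # An explicit index_type attribute on the spider always wins
    explicit = getattr(spider, "index_type", None)
    if explicit:
        return explicit

    # Otherwise look for a known package name in the module path,
    # e.g. "myspider.spiders.social.social_spider" -> "social"
    parts = spider.__module__.split(".")
    for name in known:
        if name in parts:
            return name
    return default
```

Inside process_item this would be called as index_type = resolve_index_type(spider).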

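Summary

Scrapy is a very good crawling framework. In real crawling work, though, the framework matters less than getting past anti-crawling measures.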