Data Flow in the Scrapy Framework
The documentation describes it like this: the data flow in Scrapy is controlled by the execution engine, and goes as follows.
- The Engine gets the initial Requests to crawl from the Spider.
- The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
- The Scheduler returns the next Requests to the Engine.
- The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
- Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
- The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
- The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
- The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
- The process repeats (from step 1) until there are no more requests from the Scheduler.
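To make the Spider step concrete, here is a minimal sketch of a spider whose parse() callback returns both scraped items and follow-up Requests to the Engine. The spider name, start URL, and CSS selectors are placeholders chosen for illustration; they are not part of the quoted documentation.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Illustrative example: name, URL, and selectors are placeholders.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Scraped items are returned to the Engine and sent on to the Item Pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # New Requests are returned to the Engine, which hands them to the Scheduler.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```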
But how does this actually show up in a program? When a project runs, the console already prints log messages. To see the flow more directly, you can print your own messages in the function that corresponds to each step. For example, in your own item pipeline you could do this:
```python
class MyPipeline(object):
    # Print a marker in each pipeline hook to see when Scrapy calls it.

    @classmethod
    def from_crawler(cls, crawler):
        print('------------ from_crawler --------------')
        return cls()

    def __init__(self):
        print('------------ __init__ --------------')

    def open_spider(self, spider):
        print('------------ open_spider --------------')

    def process_item(self, item, spider):
        print('------------ process_item --------------')
        return item

    def close_spider(self, spider):
        print('------------ close_spider --------------')
```
After the project runs, you can see the order in which they are called:
```
------------ from_crawler --------------
------------ __init__ --------------
------------ open_spider --------------
------------ process_item --------------
------------ close_spider --------------
```
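The same trick works for the middleware hooks named in the data-flow description. Below is a minimal sketch, assuming a project package called myproject: a downloader middleware and a spider middleware that print a marker in process_request(), process_response(), process_spider_input(), and process_spider_output(), plus the settings needed to enable them and the pipeline above. The class names, module paths, and priority numbers are illustrative, not from the original post.

```python
# middlewares.py (illustrative class names)
class PrintDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Called as the Engine sends a Request to the Downloader.
        print('------------ process_request --------------')
        return None  # let the request continue to the Downloader

    def process_response(self, request, response, spider):
        # Called as the downloaded Response travels back to the Engine.
        print('------------ process_response --------------')
        return response


class PrintSpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # Called as the Engine hands the Response to the Spider.
        print('------------ process_spider_input --------------')
        return None

    def process_spider_output(self, response, result, spider):
        # Called on whatever the Spider yields (items and new Requests).
        print('------------ process_spider_output --------------')
        for item_or_request in result:
            yield item_or_request
```

```python
# settings.py (illustrative module paths and priorities)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PrintDownloaderMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.PrintSpiderMiddleware': 543,
}
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
```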
Understanding how the framework processes things is very helpful for writing efficient code.