Scrapy - Shell

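Description

The Scrapy shell lets you try out scraping code interactively, without having to run a spider. Its main purpose is to test extraction code, in particular XPath or CSS expressions. It also lets you specify the web page you want to scrape data from.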

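Configuring the Shell

The shell can be configured by installing the IPython console, a powerful interactive shell for interactive computing that provides auto-completion, colorized output, and similar features.

If you are working on a Unix platform, installing IPython is recommended. If IPython is not available, you can use bpython instead.

You can choose the shell either by setting the environment variable SCRAPY_PYTHON_SHELL or by defining it in the scrapy.cfg file as follows: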

[settings]
shell = bpython
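
The two configuration sources above can be illustrated with a small sketch. It is not Scrapy's own loading code: the helper name choose_shell, the "python" default, and the rule that the environment variable wins over the file are assumptions made for illustration only.

```python
import configparser

# The scrapy.cfg fragment from above, kept as a string so the sketch is self-contained.
CFG_TEXT = """
[settings]
shell = bpython
"""


def choose_shell(environ, cfg_text, default="python"):
    """Hypothetical helper: pick a shell name from the environment or scrapy.cfg text."""
    # Assumption: the environment variable takes precedence when set and non-empty.
    from_env = environ.get("SCRAPY_PYTHON_SHELL")
    if from_env:
        return from_env
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Fall back to the [settings] shell option, then to the default.
    return parser.get("settings", "shell", fallback=default)


print(choose_shell({}, CFG_TEXT))
print(choose_shell({"SCRAPY_PYTHON_SHELL": "ipython"}, CFG_TEXT))
```

Passing a plain dictionary instead of os.environ keeps the sketch easy to try without changing your real environment.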

Launching the Shell

The Scrapy shell can be launched with the following command:

scrapy shell <url>

Here, url is the URL of the page you want to scrape.
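
URLs that contain characters such as ? or & usually need quoting on a POSIX command line, otherwise the shell interprets them. The sketch below builds the command string with Python's shlex module to show when quoting is applied. The helper name shell_command and the example.com URL are hypothetical; --nolog is the logging switch used in the session example of this tutorial.

```python
import shlex


def shell_command(url, nolog=True):
    """Build a `scrapy shell` command line for a URL as one safely quoted string."""
    args = ["scrapy", "shell", url]
    if nolog:
        args.append("--nolog")  # suppress log output
    # shlex.join quotes each argument only when it contains unsafe characters.
    return shlex.join(args)


print(shell_command("http://scrapy.org"))
print(shell_command("http://example.com/search?q=scrapy&page=2"))
```

The first command is printed without quotes, while the second URL is wrapped in single quotes because of the ? and & characters. The quoting rules are those of POSIX shells; other command interpreters may differ.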

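Using the Shell

The shell provides some additional shortcuts and Scrapy objects, as described in the tables below.

Available Shortcuts

The shell provides the following shortcuts:

Sr.No.	Shortcut & Description
1	shelp() Prints help that lists the available objects and shortcuts.
2	fetch(request_or_url) Fetches a response from the given request or URL and updates the related shell objects accordingly.
3	view(response) Opens the given response in your local browser for inspection. So that external links display correctly, it adds a base tag to the response body.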
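
To make the base tag remark concrete, here is a minimal sketch of what adding such a tag to a response body can look like. It is an illustration only and not Scrapy's actual implementation; the helper name add_base_tag is hypothetical.

```python
import re


def add_base_tag(html, url):
    """Hypothetical helper: insert <base href="url"> right after the opening <head> tag."""
    # Leave pages alone that already declare a base URL.
    if re.search(r"<base\s", html, flags=re.IGNORECASE):
        return html
    base = f'<base href="{url}">'
    # Match <head> or <head attr=...>, but not <header>; change only the first match.
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda match: match.group(1) + base,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


page = '<html><head><title>Demo</title></head><body><a href="/about">About</a></body></html>'
print(add_base_tag(page, "http://scrapy.org"))
```

With the base tag in place, a browser that opens the saved copy resolves the relative link /about against http://scrapy.org rather than against the local file path.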

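Available Scrapy Objects

The shell provides the following Scrapy objects:

Sr.No.	Object & Description
1	crawler The current Crawler object.
2	spider The spider known to handle the current URL, or a new default Spider object if no spider is found for that URL.
3	request The Request object of the last fetched page.
4	response The Response object of the last fetched page.
5	settings The current Scrapy settings.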

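Example Shell Session

Let us scrape the scrapy.org site first and then fetch reddit.com from the same session, as described below.

Before going further, launch the shell with the following command: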

scrapy shell 'http://scrapy.org' --nolog

After fetching the URL, Scrapy lists the available objects and shortcuts:

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x2bfd650>
[s]   spider     <Spider 'default' at 0x20c6f50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Next, start working with the objects, as shown below:

>>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 https://www.reddit.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>>> request = request.replace(method="POST")
>>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler
...

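Invoking the Shell from Spiders to Inspect Responses

You can inspect a response from inside a spider at the point where it is processed, for example to check that the response reaching a callback is the one you expect.

For instance: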

import scrapy


class SpiderDemo(scrapy.Spider):
    name = "spiderdemo"
    start_urls = [
        "http://mysite.com",
        "http://mysite1.org",
        "http://mysite2.net",
    ]

    def parse(self, response):
        # Open the shell only for the one response we want to inspect
        if ".net" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

As the code above shows, you can invoke the shell from a spider to inspect a response by calling the following function:

scrapy.shell.inspect_response

Now run the spider. You will see output like the following:

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200)  (referer: None) 
[s] Available Scrapy objects: 
[s]   crawler     
...  
>>> response.url
'http://mysite2.net'

You can then check whether your extraction code works on this response:

>>> response.xpath('//div[@class = "val"]')

The output is:

[]

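The empty list means the expression matched nothing in this response. To see what the page actually contains, open the response in your browser:

>>> view(response)

The call returns: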

True
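
One common reason for an empty result like the one above is that an attribute comparison in XPath is an exact string match. The sketch below shows the effect using Python's standard xml.etree.ElementTree module as a stand-in for Scrapy selectors; it supports only a subset of XPath, and the sample page is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the div carries two classes, "val" and "big".
PAGE = """
<html>
  <body>
    <div class="val big">42</div>
    <div class="other">7</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Exact attribute match: the attribute value is "val big", not "val", so nothing matches.
exact = root.findall(".//div[@class='val']")

# Token match done in Python: split the attribute on whitespace and look for "val".
by_token = [div for div in root.iter("div") if "val" in div.get("class", "").split()]

print(exact)
print([div.text for div in by_token])
```

In the Scrapy shell itself, a CSS selector such as response.css('div.val') matches by class token, which avoids this pitfall when an element carries several classes.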
