Web Scraping with Python and Scrapy
Scrapy is one of the best frameworks for building crawlers. It is a popular web scraping and crawling framework whose high-level features make it easy to scrape websites.
Installation
Installing Scrapy on Windows is easy: we can use either pip or conda (if you have Anaconda). Scrapy runs on both Python 2 and Python 3.
pip install Scrapy
or
conda install -c conda-forge scrapy
If Scrapy installed correctly, the scrapy command will now be available in the terminal.
C:\Users\rajesh>scrapy
Scrapy 1.6.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command.
Starting a Project
Now that Scrapy is installed, we can run the **startproject** command to generate the default structure for our first Scrapy project.
To do this, open a terminal, navigate to the directory where you want to store the Scrapy project, and run **scrapy startproject <project name>**. Below I use scrapy_example as the project name.
C:\Users\rajesh>scrapy startproject scrapy_example
New Scrapy project 'scrapy_example', using template directory 'c:\python\python361\lib\site-packages\scrapy\templates\project', created in:
    C:\Users\rajesh\scrapy_example

You can start your first spider with:
    cd scrapy_example
    scrapy genspider example example.com

C:\Users\rajesh>cd scrapy_example

C:\Users\rajesh\scrapy_example>tree /F
Folder PATH listing
Volume serial number is 8CD6-8D39
C:.
│   scrapy.cfg
│
└───scrapy_example
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    │
    ├───spiders
    │   │   __init__.py
    │   │
    │   └───__pycache__
    └───__pycache__
Another approach is to launch the Scrapy shell and do the scraping from there, as shown below:
In [18]: fetch("https://www.wsj.com/india")
2019-02-04 22:38:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wsj.com/india> (referer: None)
The Scrapy crawler returns a "response" object containing the downloaded information. Let's check what the crawler fetched:
In [19]: view(response)
Out[19]: True
The page will open in your default browser, and you should see something like this:
Great: this looks quite similar to our web page, so the crawler has successfully downloaded the entire page.
Now let's see what our response contains:
In [22]: print(response.text)
<!DOCTYPE html>
<html data-region="asia,india" data-protocol="https" data-reactid=".2316x0ul96e" data-react-checksum="851122071">
  <head data-reactid=".2316x0ul96e.0">
    <title data-reactid=".2316x0ul96e.0.0">The Wall Street Journal & Breaking News, Business, Financial and Economic News, World News and Video</title>
    <meta http-equiv="X-UA-Compatible" content="IE=edge" data-reactid=".2316x0ul96e.0.1"/>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" data-reactid=".2316x0ul96e.0.2"/>
    <meta name="viewport" content="initial-scale=1.0001, minimum-scale=1.0001, maximum-scale=1.0001, user-scalable=no" data-reactid=".2316x0ul96e.0.3"/>
    <meta name="description" content="WSJ online coverage of breaking news and current headlines from the US and around the world. Top stories, photos, videos, detailed analysis and in-depth reporting." data-reactid=".2316x0ul96e.0.4"/>
    <meta name="keywords" content="News, breaking news, latest news, US news, headlines, world news, business, finances, politics, WSJ, WSJ news, WSJ.com, Wall Street Journal" data-reactid=".2316x0ul96e.0.5"/>
    <meta name="page.site" content="wsj" data-reactid=".2316x0ul96e.0.7"/>
    <meta name="page.site.product" content="WSJ" data-reactid=".2316x0ul96e.0.8"/>
    <meta name="stack.name" content="dj01:vir:prod-sections" data-reactid=".2316x0ul96e.0.9"/>
    <meta name="referrer" content="always" data-reactid=".2316x0ul96e.0.a"/>
    <link rel="canonical" href="https://www.wsj.com/india/" data-reactid=".2316x0ul96e.0.b"/>
    <meta property="og:url" content="https://www.wsj.com/india/" data-reactid=".2316x0ul96e.0.c:$0"/>
    <meta property="og:title" content="The Wall Street Journal & Breaking News, Business, Financial and Economic News, World News and Video" data-reactid=".2316x0ul96e.0.c:$1"/>
    <meta property="og:description" content="WSJ online coverage of breaking news and current headlines from the US and around the world. Top stories, photos, videos, detailed analysis and in-depth reporting." data-reactid=".2316x0ul96e.0.c:$2"/>
    <meta property="og:type" content="website" data-reactid=".2316x0ul96e.0.c:$3"/>
    <meta property="og:site_name" content="The Wall Street Journal" data-reactid=".2316x0ul96e.0.c:$4"/>
    <meta property="og:image" content="https://s.wsj.net/img/meta/wsj-social-share.png" data-reactid=".2316x0ul96e.0.c:$5"/>
    <meta name="twitter:site" content="@wsj" data-reactid=".2316x0ul96e.0.c:$6"/>
    <meta name="twitter:app:name:iphone" content="The Wall Street Journal" data-reactid=".2316x0ul96e.0.c:$7"/>
    <meta name="twitter:app:name:googleplay" content="The Wall Street Journal" data-reactid=" "/>
    ...& so much more.
Let's try to extract some useful information from this page:
Extracting headlines from the page:
Scrapy provides a way to extract information from HTML using CSS selectors, such as classes and IDs. To find the CSS selector for any headline, simply right-click it and choose "Inspect", as shown below:
This will open the developer tools in your browser window:
As you can see, the CSS class "wsj-headline-link" is applied to every anchor (<a>) tag that contains a headline. With this information, we will try to find all the headlines in the rest of the response object:
response.css() is a function that extracts content matching the CSS selector passed to it (such as the anchor tag above). Let's look at some more examples of the response.css function.
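To see what a selector like `.wsj-headline-link::text` is doing conceptually, here is a hypothetical standard-library equivalent that collects the text of every `<a>` tag carrying a given class. Scrapy does this far more robustly under the hood; the class name and HTML fragment below are made up for illustration:

```python
from html.parser import HTMLParser


class HeadlineExtractor(HTMLParser):
    """Collect the text of <a> tags whose class list contains `target_class`."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._inside = False   # are we currently inside a matching <a>?
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get('class', '').split()
        if tag == 'a' and self.target_class in classes:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.headlines.append(data)


# A made-up fragment in the same shape as the WSJ markup above
html = ('<div><a class="wsj-headline-link" href="/a1">Headline One</a>'
        '<a class="other" href="/a2">Not a headline</a></div>')
parser = HeadlineExtractor('wsj-headline-link')
parser.feed(html)
print(parser.headlines)  # ['Headline One']
```

This mirrors `response.css(".wsj-headline-link::text").extract()`, which is the far shorter route shown next.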
In [24]: response.css(".wsj-headline-link::text").extract_first()
Out[24]: 'China Fears Loom Over Stocks After January Surge'
and
In [25]: response.css(".wsj-headline-link").extract_first()
Out[25]: '<a class="wsj-headline-link" href="https://www.wsj.com/articles/china-fears-loom-over-stocks-after-january-surge-11549276200" data-reactid=".2316x0ul96e.1.1.5.1.0.3.3.0.0.0:$0.1.0">China Fears Loom Over Stocks After January Surge</a>'
To get all the links on the page:
links = response.css('a::attr(href)').extract()
Output
['https://www.google.com/intl/en_us/chrome/browser/desktop/index.html',
 'https://support.apple.com/downloads/',
 'https://www.mozilla.org/en-US/firefox/new/',
 'https://windows.microsoft.com/en-us/internet-explorer/download-ie',
 'https://www.barrons.com',
 'http://bigcharts.marketwatch.com',
 'https://www.wsj.com/public/page/wsj-x-marketing.html',
 'https://www.dowjones.com/',
 'https://global.factiva.com/factivalogin/login.asp?productname=global',
 'https://www.fnlondon.com/',
 'https://www.mansionglobal.com/',
 'https://www.marketwatch.com',
 'https://newsplus.wsj.com',
 'https://privatemarkets.dowjones.com',
 'https://djlogin.dowjones.com/login.asp?productname=rnc',
 'https://www.wsj.com/conferences',
 'https://www.wsj.com/pro/centralbanking',
 'https://www.wsj.com/video/',
 'https://www.wsj.com',
 'http://www.bigdecisions.com/',
 'https://www.businessspectator.com.au/',
 'https://www.checkout51.com/?utm_source=wsj&utm_medium=digitalhousead&utm_campaign=wsjspotlight',
 'https://www.harpercollins.com/',
 'https://housing.com/',
 'https://www.makaan.com/',
 'https://nypost.com/',
 'https://www.newsamerica.com/',
 'https://www.proptiger.com',
 'https://www.rea-group.com/',
 ...
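Once the list of links has been extracted, the standard library's `urllib.parse` can post-process it, for example to count how many links point at each host. The URLs below are a small subset of the output above:

```python
from collections import Counter
from urllib.parse import urlparse

# A subset of the links extracted with response.css('a::attr(href)').extract()
links = [
    'https://www.barrons.com',
    'https://www.wsj.com/public/page/wsj-x-marketing.html',
    'https://www.wsj.com/conferences',
    'https://www.wsj.com/video/',
    'https://www.marketwatch.com',
]

# Group the links by hostname and count them
hosts = Counter(urlparse(link).netloc for link in links)
print(hosts['www.wsj.com'])  # 3
```

This kind of post-processing is handy for deciding which domains a crawler should follow next.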
To get the comment counts on the WSJ (Wall Street Journal) page:
In [38]: response.css(".wsj-comment-count::text").extract()
Out[38]: ['71', '59']
This was just an introduction to web scraping with Scrapy; there is much more we can do with it.