Difference Between Beautiful Soup and Scrapy Crawler


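Beautiful Soup and Scrapy are both used for web scraping in Python. The two tools serve the same use case but differ in what they do. Web scraping is useful for collecting and analyzing data in fields such as research, marketing, and business intelligence. This article looks at the differences between Beautiful Soup and Scrapy and how each is applied to web scraping.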

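Feature | Beautiful Soup | Scrapy
--- | --- | ---
Parsing | Parses HTML and XML documents. | Combines crawling and parsing to extract data from websites.
Ease of use | An easy-to-use library. | A more complex framework that requires solid programming skills.
Concurrency | Only parses documents and offers no concurrency of its own; a simple script built on it typically handles one page at a time. | Supports concurrent requests and can scrape many pages at once, which makes it faster and more efficient for large scraping projects.
Middleware | Provides no middleware system. | Provides a middleware system that lets developers customize the spider's behavior at different stages of the scraping process.
Data storage | Has no built-in data storage support; developers must handle storage themselves. | Has built-in support for exporting data in formats such as CSV, JSON, and XML; databases such as MySQL and MongoDB can be integrated through item pipelines.
Robustness | Less robust and fault-tolerant than Scrapy. | More robust, with built-in error handling such as retrying failed requests, handling timeouts, and filtering out error responses such as 404 and 403 so the spider does not process them.
Community | Has a community, though not as large or active as Scrapy's. | Has a large, active community of developers and contributors who keep improving and updating the framework.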
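The data storage row can be made concrete with a small sketch. Because Beautiful Soup only parses, saving the results is left to the caller. The snippet below serializes a list of already-extracted records to CSV and JSON using only the standard library. The `records` list, its field names, and the helper names `to_csv_text` and `to_json_text` are hypothetical and chosen purely for illustration.

```python
import csv
import io
import json

# Hypothetical records, standing in for data extracted with Beautiful Soup.
records = [
    {"text": "Example quote one", "author": "Author A"},
    {"text": "Example quote two", "author": "Author B"},
]


def to_csv_text(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["text", "author"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_text(rows):
    """Serialize a list of dicts to a JSON array."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


csv_text = to_csv_text(records)
json_text = to_json_text(records)
print(csv_text)
print(json_text)
```

In a real script the two strings would be written to files; with Scrapy, the equivalent step is normally configured rather than hand-written.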

Beautiful Soup

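Beautiful Soup is an open-source Python library for parsing HTML and XML pages. Parsing an HTML page makes it possible to extract data from it. The library provides functions for searching an HTML document for specific tags, links, and other attributes. When the data to be scraped sits on a single page, Beautiful Soup is usually a good choice.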
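As a minimal sketch of the search functions mentioned above, the snippet below parses a small inline HTML string, so it needs no network access. The HTML content, the `nav` class, and the `title` id are made up for illustration; `find`, `find_all`, and `select` are standard Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# A small hypothetical HTML document, used so the example runs offline.
html = """
<html><body>
  <h1 id="title">Sample page</h1>
  <a class="nav" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
  <a href="https://example.com/external">External</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the first <h1> tag with a given id and read its text.
title = soup.find("h1", id="title").get_text()

# Find all <a> tags that carry a specific class.
nav_links = [a["href"] for a in soup.find_all("a", class_="nav")]

# Use a CSS selector to match links whose href starts with "https://".
external = [a["href"] for a in soup.select('a[href^="https://"]')]

print(title)
print(nav_links)
print(external)
```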

Example

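The following example uses Beautiful Soup together with the requests library to print every link found on a web page. It imports requests and Beautiful Soup, sends a GET request to the page's URL, and parses the returned HTML with Beautiful Soup. Once the page is parsed, Beautiful Soup's search methods can locate all the links on it.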

import requests
from bs4 import BeautifulSoup

# Make a request to the webpage (the timeout keeps the call from hanging forever)
url = 'https://tutorialspoint.tw/index.htm'
response = requests.get(url, timeout=10)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print each link's href (None when an <a> tag has no href attribute)
for link in links:
    print(link.get('href'))

Output

https://tutorialspoint.tw/index.htm
https://tutorialspoint.tw/codingground.htm
https://tutorialspoint.tw/about/about_careers.htm
https://tutorialspoint.tw/whiteboard.htm
https://tutorialspoint.tw/online_dev_tools.htm
https://tutorialspoint.tw/business/index.asp
https://tutorialspoint.tw/market/teach_with_us.jsp
https://#/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.linkedin.com/authwall?trk=bf&trkInfo=AQEkqX2eckF__gAAAX-wMwEYvrsjBVbEtWQd4pgEdVSzkL22Nik1KEpY_ECWLKDGc41z8IOZWr2Bb0fvJplT60NPBtSw87J6QCpc7wD4qQ3iU13n6xJtBxME5o05Wmpg5JPm5YY=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Ftutorialspoint
index.htm
None
https://tutorialspoint.tw/categories/development
https://tutorialspoint.tw/categories/it_and_software
https://tutorialspoint.tw/categories/data_science_and_ai_ml
https://tutorialspoint.tw/categories/cyber_security
https://tutorialspoint.tw/categories/marketing
https://tutorialspoint.tw/categories/office_productivity
https://tutorialspoint.tw/categories/business
https://tutorialspoint.tw/categories/lifestyle
https://tutorialspoint.tw/latest/prime-packs
https://tutorialspoint.tw/market/index.asp
https://tutorialspoint.tw/latest/ebooks
https://tutorialspoint.tw/tutorialslibrary.htm
https://tutorialspoint.tw/articles/index.php
https://tutorialspoint.tw/market/login.asp
https://tutorialspoint.tw/latest/prime-packs
https://tutorialspoint.tw/market/index.asp
https://tutorialspoint.tw/latest/ebooks
https://tutorialspoint.tw/tutorialslibrary.htm
https://tutorialspoint.tw/articles/index.php
https://tutorialspoint.tw/codingground.htm
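The output above contains a relative link (`index.htm`) and a `None` entry for an anchor tag without an `href`. One possible way to clean such results, sketched here with the standard library's `urljoin`, is to drop missing values and resolve relative links against the page URL. The helper name `normalize_links` and the short `hrefs` sample are hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://tutorialspoint.tw/index.htm"

# A few href values of the kinds seen in the output:
# absolute, relative, and missing.
hrefs = [
    "https://tutorialspoint.tw/codingground.htm",
    "index.htm",
    None,
]


def normalize_links(base, raw_hrefs):
    """Drop missing hrefs and resolve relative ones against the base URL."""
    return [urljoin(base, href) for href in raw_hrefs if href]


links = normalize_links(base_url, hrefs)
print(links)
```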

Scrapy

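Scrapy is a Python framework for web crawling and web scraping. It is typically used for large-scale data scraping projects, and it provides features for extracting, storing, and processing data. When complex data has to be scraped from many pages, Scrapy is usually the better choice.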

Example

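The following example uses Scrapy to scrape data from multiple pages of a quotes website. It defines a Scrapy spider that requests the first page of the site, parses it and extracts the data, and then follows the "next page" link, repeating until there are no more pages to scrape.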

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract the text, author, and tags of every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "next" link, if there is one, and parse that page too
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


# Create a Scrapy process
process = CrawlerProcess(get_project_settings())

# Schedule the spider
process.crawl(QuotesSpider)

# Run the crawl; this call blocks until the spider finishes, and each
# scraped item is shown in the log output at DEBUG level
process.start()

Output

2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/9/> (referer: http://quotes.toscrape.com/page/8/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Anyone who has never made a mistake has never tried anything new.”', 'author': 'Albert Einstein', 'tags': ['mistakes']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen', 'tags': ['humor', 'love', 'romantic', 'women']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”', 'author': 'J.K. Rowling', 'tags': ['integrity']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”', 'author': 'Jane Austen', 'tags': ['books', 'library', 'reading']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”', 'author': 'Jane Austen', 'tags': ['elizabeth-bennet', 'jane-austen']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Some day you will be old enough to start reading fairy tales again.”', 'author': 'C.S. Lewis', 'tags': ['age', 'fairytales', 'growing-up']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”', 'author': 'C.S. Lewis', 'tags': ['god']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”', 'author': 'Mark Twain', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“A lie can travel half way around the world while the truth is putting on its shoes.”', 'author': 'Mark Twain', 'tags': ['misattributed-mark-twain', 'truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”', 'author': 'C.S. Lewis', 'tags': ['christianity', 'faith', 'religion', 'sun']}
2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”', 'author': 'J.K. Rowling', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”", 'author': 'Jimi Hendrix', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“To die will be an awfully big adventure.”', 'author': 'J.M. Barrie', 'tags': ['adventure', 'love']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“It takes courage to grow up and become who you really are.”', 'author': 'E.E. Cummings', 'tags': ['courage']}     
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“But better to get hurt by the truth than comforted with a lie.”', 'author': 'Khaled Hosseini', 'tags': ['life']}  
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”', 'author': 'Harper Lee', 'tags': ['better-life-empathy']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”', 'author': "Madeleine L'Engle", 'tags': ['books', 'children', 'difficult', 'grown-ups', 'write', 'writers', 'writing']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“Never tell the truth to people who are not worthy of it.”', 'author': 'Mark Twain', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“A person's a person, no matter how small.”", 'author': 'Dr. Seuss', 'tags': ['inspirational']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2023-04-17 00:53:00 [scrapy.core.engine] INFO: Closing spider (finished)
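Concurrency, retries, timeouts, and data export in Scrapy are mostly controlled through settings rather than code. The dictionary below is a sketch of how such settings might look. The setting names are real Scrapy settings, but the values and the `quotes.json` file name are assumptions chosen for illustration, not recommendations.

```python
# Illustrative Scrapy settings; the values are assumptions, not recommendations.
custom_settings = {
    "CONCURRENT_REQUESTS": 8,   # maximum number of requests in flight at once
    "RETRY_ENABLED": True,      # retry requests that fail
    "RETRY_TIMES": 3,           # retries per request after the first attempt
    "DOWNLOAD_TIMEOUT": 30,     # seconds to wait before a download times out
    # Export every scraped item to a JSON file (hypothetical file name).
    "FEEDS": {"quotes.json": {"format": "json", "encoding": "utf8"}},
}

# Usage (requires Scrapy and network access, so it is left commented out):
# from scrapy.crawler import CrawlerProcess
# process = CrawlerProcess(settings=custom_settings)
# process.crawl(QuotesSpider)
# process.start()

print(custom_settings["FEEDS"])
```

Passing a dictionary like this to `CrawlerProcess` in place of `get_project_settings()` would write the scraped quotes to a file in addition to logging them.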

Conclusion

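This article discussed the differences between Beautiful Soup and Scrapy in Python. Both are used for web scraping, but they differ in what they do. Beautiful Soup suits cases where data is scraped from a single page, while Scrapy suits cases where large amounts of data are scraped from many pages.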

Updated on: July 6, 2023
