Python Web Scraping Tools


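In computer science, web scraping is the process of extracting data from websites. The technique turns unstructured data on the web into structured data.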
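
As a rough illustration of what "unstructured to structured" means, the sketch below turns a small HTML fragment into a list of dictionaries using only the standard library. The `LinkCollector` class and the `sample_html` fragment are hypothetical names made up for this example; a real scraper would download the page first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect every <a> tag as a {"text": ..., "href": ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []      # structured output: one dict per link
        self._current = None   # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"text": "", "href": dict(attrs).get("href")}

    def handle_data(self, data):
        # Text between <a> and </a> belongs to the current link.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


# Hypothetical page fragment, used here instead of a live download.
sample_html = """
<ul>
  <li><a href="/python/index.htm">Python</a></li>
  <li><a href="/java8/index.htm">Java 8</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.records)
```

Each record can then be written to a CSV file or a database, which is usually the point of scraping a page.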

The most commonly used web scraping tools in Python 3 include:

  • urllib (urllib2 in Python 2)
  • Requests
  • BeautifulSoup
  • Lxml
  • Selenium
  • MechanicalSoup

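**urllib** (urllib2 in Python 2) - This package ships with Python, so nothing needs to be installed. Its urllib.request module is used to fetch URLs: the urlopen() function retrieves a URL over different protocols (FTP, HTTP, and so on).

Example code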

from urllib.request import urlopen
my_html = urlopen("https://tutorialspoint.tw/")
print(my_html.read())

Output

b'<!DOCTYPE html>\r\n
<!--[if IE 8]>
<html class="ie ie8">
<![endif]-->
\r\n<!--[if IE 9]>
<html class="ie ie9">
<![endif]-->\r\n<!--[if gt IE 9]><!-->
\r\n<html lang="en-US">
<!--<![endif]-->
\r\n<head>\r\n<!-- Basic -->
\r\n<meta charset="utf-8">
\r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title>
\r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/>
\r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n
<meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n
<link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n
<script src="/questions/js/jquery.min.js">
</script>
\r\n<script src="/questions/js/fontawesome.js">
</script>\r\n
<script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js">
</script>\r\n
</head>\r\n
<body>\r\n
<!-- Start of Body Content --> \r\n
<div class="mui-appbar-home">\r\n
<div class="mui-container">\r\n
<div class="tp-primary-header mui-top-home">\r\n
<a href="https://tutorialspoint.tw/index.htm" target="_blank" title="TutorialsPoint - Home">
<i class="fa fa-home">
</i><span>Home</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-qa">\r\n
<a href="https://tutorialspoint.tw/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i>
<span>
Q/A</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tools">\r\n
<a href="https://tutorialspoint.tw/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools">
<i class="fa fa-cogs"></i><span>Tools</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-coding-ground">\r\n
<a href="https://tutorialspoint.tw/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal">
<i class="fa fa-code">
</i>
<span>
Coding Ground </span>
</a> \r\n
</div>\r\n
<div class="tp-primary-header mui-top-current-affairs">\r\n
<a href="https://tutorialspoint.tw/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe">
</i><span>Current Affairs</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-upsc">\r\n
<a href="https://tutorialspoint.tw/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-tutors">\r\n
<a href="https://tutorialspoint.tw/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect">
<i class="fa fa-user">
</i>
<span>Online Tutors</span>
</a>\r\n
</div>\r\n
<div class="tp-primary-header mui-top-examples">\r\n
….
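
The output above starts with b' because urlopen() returns raw bytes. The following sketch shows one way to decode them into text, using the charset announced by the response. To keep it runnable without a network connection it fetches a data: URL, which is one more scheme urlopen() understands; `sample_url` and its content are assumptions made for this example.

```python
from urllib.parse import quote
from urllib.request import urlopen

# A data: URL embeds the document in the URL itself, so no network is needed.
# The HTML fragment is a made-up placeholder.
sample_url = "data:text/html;charset=utf-8," + quote("<title>Hello scraper</title>")

# The with-block closes the response when the body has been read.
with urlopen(sample_url, timeout=10) as response:
    raw_bytes = response.read()
    # Fall back to UTF-8 when the response does not announce a charset.
    charset = response.headers.get_content_charset() or "utf-8"

page_text = raw_bytes.decode(charset)
print(type(raw_bytes).__name__)
print(page_text)
```

The same pattern should apply to an http:// or https:// URL; only the value of `sample_url` changes.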

**Requests** - This module does not ship with Python; install it by running the command below at the command prompt. **Requests** sends HTTP/1.1 requests.

pip install requests

Example

import requests
# get URL
my_req = requests.get('https://tutorialspoint.tw/')
print(my_req.encoding)
print(my_req.status_code)
print(my_req.elapsed)
print(my_req.url)
print(my_req.history)
print(my_req.headers['Content-Type'])

Output

UTF-8
200
0:00:00.205727
https://tutorialspoint.tw/
[]
text/html; charset=UTF-8
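
To see what Requests would put on the wire before anything is sent, a request can be built and prepared without contacting the server. This is a minimal sketch; the query parameter and the User-Agent string are hypothetical values chosen only for illustration.

```python
import requests

# Hypothetical query parameter and User-Agent, chosen only for illustration.
request = requests.Request(
    "GET",
    "https://tutorialspoint.tw/",
    params={"q": "python"},
    headers={"User-Agent": "my-scraper/0.1"},
)

# prepare() encodes the parameters into the URL and finalizes the headers,
# but does not open a connection.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])
```

Passing `prepared` to `requests.Session().send(prepared, timeout=10)` would then perform the actual request.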

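**BeautifulSoup** - This is a parsing library that can work with different parsers. Python's standard library supplies html.parser, which BeautifulSoup can use without installing anything extra. It builds a parse tree that is used to extract data from HTML pages.

To install this module, run the following command at the command prompt: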

pip install beautifulsoup4

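Example

from bs4 import BeautifulSoup
# importing requests
import requests
# get URL
my_req = requests.get("https://tutorialspoint.tw/")
my_data = my_req.text
# name the parser explicitly; html.parser comes with the standard library
my_soup = BeautifulSoup(my_data, "html.parser")
# print the target of every link on the page
for my_link in my_soup.find_all('a'):
    print(my_link.get('href'))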

Output

https://tutorialspoint.tw/index.htm
https://tutorialspoint.tw/questions/index.php
https://tutorialspoint.tw/online_dev_tools.htm
https://tutorialspoint.tw/codingground.htm
https://tutorialspoint.tw/current_affairs/index.htm
https://tutorialspoint.tw/upsc_ias_exams.htm
https://tutorialspoint.tw/tutor_connect/index.php
https://tutorialspoint.tw/programming_examples/
https://tutorialspoint.tw/whiteboard.htm
https://tutorialspoint.tw/netmeeting.php
https://tutorialspoint.tw/articles/
https://tutorialspoint.tw/index.htm
https://tutorialspoint.tw/tutorialslibrary.htm
https://tutorialspoint.tw/videotutorials/index.htm
https://store.tutorialspoint.com
https://tutorialspoint.tw/html_online_training/index.asp
https://tutorialspoint.tw/css_online_training/index.asp
https://tutorialspoint.tw/3d_animation_online_training/index.asp
https://tutorialspoint.tw/swift_4_online_training/index.asp
https://tutorialspoint.tw/blockchain_online_training/index.asp
https://tutorialspoint.tw/reactjs_online_training/index.asp
https://tutorialspoint.tw/tutorialslibrary.htm
https://tutorialspoint.tw/computer_fundamentals/index.htm
https://tutorialspoint.tw/compiler_design/index.htm
https://tutorialspoint.tw/operating_system/index.htm
https://tutorialspoint.tw/data_structures_algorithms/index.htm
https://tutorialspoint.tw/dbms/index.htm
https://tutorialspoint.tw/data_communication_computer_network/index.htm
https://tutorialspoint.tw/academic_tutorials.htm
https://tutorialspoint.tw/html/index.htm
https://tutorialspoint.tw/css/index.htm
https://tutorialspoint.tw/javascript/index.htm
https://tutorialspoint.tw/php/index.htm
https://tutorialspoint.tw/angular4/index.htm
https://tutorialspoint.tw/mysql/index.htm
https://tutorialspoint.tw/web_development_tutorials.htm
https://tutorialspoint.tw/cprogramming/index.htm
https://tutorialspoint.tw/cplusplus/index.htm
https://tutorialspoint.tw/java8/index.htm
https://tutorialspoint.tw/python/index.htm
https://tutorialspoint.tw/scala/index.htm
https://tutorialspoint.tw/csharp/index.htm
https://tutorialspoint.tw/computer_programming_tutorials.htm
https://tutorialspoint.tw/java8/index.htm
https://tutorialspoint.tw/jdbc/index.htm
https://tutorialspoint.tw/servlets/index.htm
https://tutorialspoint.tw/spring/index.htm
https://tutorialspoint.tw/hibernate/index.htm
https://tutorialspoint.tw/swing/index.htm
https://tutorialspoint.tw/java_technology_tutorials.htm
https://tutorialspoint.tw/android/index.htm
https://tutorialspoint.tw/swift/index.htm
https://tutorialspoint.tw/ios/index.htm
https://tutorialspoint.tw/kotlin/index.htm
https://tutorialspoint.tw/react_native/index.htm
https://tutorialspoint.tw/xamarin/index.htm
https://tutorialspoint.tw/mobile_development_tutorials.htm
https://tutorialspoint.tw/mongodb/index.htm
https://tutorialspoint.tw/plsql/index.htm
https://tutorialspoint.tw/sql/index.htm
https://tutorialspoint.tw/db2/index.htm
https://tutorialspoint.tw/mysql/index.htm
https://tutorialspoint.tw/memcached/index.htm
https://tutorialspoint.tw/database_tutorials.htm
https://tutorialspoint.tw/asp.net/index.htm
https://tutorialspoint.tw/entity_framework/index.htm
https://tutorialspoint.tw/vb.net/index.htm
https://tutorialspoint.tw/ms_project/index.htm
https://tutorialspoint.tw/excel/index.htm
https://tutorialspoint.tw/word/index.htm
https://tutorialspoint.tw/microsoft_technologies_tutorials.htm
https://tutorialspoint.tw/big_data_analytics/index.htm
https://tutorialspoint.tw/hadoop/index.htm
https://tutorialspoint.tw/sas/index.htm
https://tutorialspoint.tw/qlikview/index.htm
https://tutorialspoint.tw/power_bi/index.htm
https://tutorialspoint.tw/tableau/index.htm
https://tutorialspoint.tw/big_data_tutorials.htm
https://tutorialspoint.tw/tutorialslibrary.htm
https://tutorialspoint.tw/codingground.htm
https://tutorialspoint.tw/coding_platform_for_websites.htm
https://tutorialspoint.tw/developers_best_practices/index.htm
https://tutorialspoint.tw/effective_resume_writing.htm
https://tutorialspoint.tw/computer_glossary.htm
https://tutorialspoint.tw/computer_whoiswho.htm
https://tutorialspoint.tw/questions_and_answers.htm
https://tutorialspoint.tw/multi_language_tutorials.htm
https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8
https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer
http://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09
/about/index.htm
/about/about_team.htm
/about/about_careers.htm
/about/about_privacy.htm
/about/about_terms_of_use.htm
https://tutorialspoint.tw/articles/
https://tutorialspoint.tw/online_dev_tools.htm
https://tutorialspoint.tw/free_web_graphics.htm
https://tutorialspoint.tw/online_file_conversion.htm
https://tutorialspoint.tw/shared-tutorials.php
https://tutorialspoint.tw/netmeeting.php
https://tutorialspoint.tw/free_online_whiteboard.htm
https://tutorialspoint.tw
https://#/tutorialspointindia
https://plus.google.com/u/0/+tutorialspoint
http://www.twitter.com/tutorialspoint
http://www.linkedin.com/company/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://tutorialspoint.tw/index.htm
/about/about_privacy.htm#cookies
/about/faq.htm
/about/about_helping.htm
/about/contact_us.htm
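
Several entries in the output above, such as /about/index.htm, are relative links. Before following them, a scraper usually resolves them against the address of the page they came from. The sketch below does this with the standard library's urljoin(); the `hrefs` list is a hand-picked sample of the output, plus a None entry, which is what `get('href')` returns for an anchor without an href attribute.

```python
from urllib.parse import urljoin

# Address of the page the links were scraped from.
base_url = "https://tutorialspoint.tw/"

# Hand-picked sample of scraped values; None stands for an <a> without href.
hrefs = [
    "https://tutorialspoint.tw/articles/",
    "/about/index.htm",
    "/about/faq.htm",
    None,
]

# Skip missing values; urljoin leaves absolute URLs unchanged.
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```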

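**Lxml** - This is a high-performance, production-grade library for parsing HTML and XML. It is the one to use when both quality and maximum speed matter. It ships several modules, such as lxml.etree and lxml.html, that can be used to extract data from websites.

To install it, run the following at the command prompt: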

pip install lxml

Example

from lxml import etree
my_root_elem = etree.Element('html')
etree.SubElement(my_root_elem, 'head')
etree.SubElement(my_root_elem, 'title')
etree.SubElement(my_root_elem, 'body')
print(etree.tostring(my_root_elem, pretty_print = True).decode("utf-8"))

Output

<html>
  <head/>
  <title/>
  <body/>
</html>
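
The example above builds a tree from scratch. For scraping, the more common direction is to parse existing HTML and query it. Here is a minimal sketch using lxml.html and XPath; the `sample_html` string is a hypothetical fragment standing in for a downloaded page.

```python
from lxml import html

# Hypothetical fragment standing in for a downloaded page.
sample_html = """
<html><body>
  <a href="/python/index.htm">Python</a>
  <a href="/java8/index.htm">Java 8</a>
</body></html>
"""

# Parse the string into an element tree.
tree = html.fromstring(sample_html)

# XPath: the href attribute of every <a> element in the document.
hrefs = tree.xpath("//a/@href")
# The visible text of the same elements.
texts = [a.text_content() for a in tree.xpath("//a")]

print(hrefs)
print(texts)
```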

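**Selenium** - This is a browser automation tool, also known as a web driver. On some websites the content only appears after an interaction, such as clicking a button or scrolling the page, or after a short wait; Selenium is needed in those cases because it drives a real browser.

Install Selenium with the following command: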

pip install selenium

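Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# path to the ChromeDriver binary; adjust it for your machine
my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
# Selenium 4 takes the driver path through a Service object;
# the executable_path argument was deprecated and later removed
my_browser = webdriver.Chrome(service=Service(executable_path=my_path_to_chromedriver))
my_url = 'https://tutorialspoint.tw/'
my_browser.get(my_url)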

Output

tutorialspoint

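**MechanicalSoup** - This is another Python library for automating interaction with websites. It stores and sends cookies automatically, follows redirects, and can follow links and submit forms. It does not execute JavaScript.

To install it, use the following command: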

pip install MechanicalSoup

Example

import mechanicalsoup
my_browser = mechanicalsoup.StatefulBrowser()
my_value = my_browser.open("https://tutorialspoint.tw/")
print(my_value)
my_val = my_browser.get_url()
print(my_val)
my_va = my_browser.follow_link("forms")
print(my_va)
my_value1 = my_browser.get_url()
print(my_value1)

Updated on: June 26, 2020
