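Python Web Scraping Tools
In computer science, web scraping is the process of extracting data from websites. It turns unstructured data on the web into structured data.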
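As a rough illustration of turning unstructured markup into structured data, the sketch below reads a small HTML table into a list of dictionaries using only the standard library's `html.parser`. The sample markup, the `TableParser` class, and the `table_to_records` helper are hypothetical names made up for this example; a real page would normally be downloaded first with one of the tools listed below.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Language</th><th>Year</th></tr>
  <tr><td>Python</td><td>1991</td></tr>
  <tr><td>Java</td><td>1995</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def table_to_records(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = table_to_records(SAMPLE_HTML)
print(records)
```

Each dictionary corresponds to one table row, which is the kind of structured form that can be written to CSV or a database.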
The most commonly used web scraping tools in Python 3 are:
- urllib
- Requests
- BeautifulSoup
- Lxml
- Selenium
- MechanicalSoup
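**urllib** - This package ships with Python; in Python 3 it takes the place of Python 2's urllib2. Its urllib.request module is used to fetch URLs. The urlopen() function retrieves a URL over different protocols (FTP, HTTP, and so on).
Example code

    from urllib.request import urlopen

    # Open the URL and print the raw response body (a bytes object)
    my_html = urlopen("https://tutorialspoint.tw/")
    print(my_html.read())

Output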
b'<!DOCTYPE html>\r\n <!--[if IE 8]> <html class="ie ie8"> <![endif]--> \r\n<!--[if IE 9]> <html class="ie ie9"> <![endif]-->\r\n<!--[if gt IE 9]><!--> \r\n<html lang="en-US"> <!--<![endif]--> \r\n<head>\r\n<!-- Basic --> \r\n<meta charset="utf-8"> \r\n<title>Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Apache Commons Collections</title> \r\n<meta name="Description" content="Parallax Scrolling, Java Cryptography, YAML, Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Current Affairs 2018, Intellij Idea, Apache Commons Collections, Java 9, GSON, TestLink, Inter Process Communication (IPC), Logo, PySpark, Google Tag Manager, Free IFSC Code, SAP Workflow"/> \r\n<meta name="Keywords" content="Python Data Science, Java i18n, GitLab, TestRail, VersionOne, DBUtils, Common CLI, Seaborn, Ansible, LOLCODE, Gson, TestLink, Inter Process Communication (IPC), Logo"/>\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=yes">\r\n<link href="https://cdn.muicss.com/mui-0.9.39/extra/mui-rem.min.css" rel="stylesheet" type="text/css" />\r\n <link rel="stylesheet" href="/questions/css/home.css?v=3" /> \r\n <script src="/questions/js/jquery.min.js"> </script> \r\n<script src="/questions/js/fontawesome.js"> </script>\r\n <script src="https://cdn.muicss.com/mui-0.9.39/js/mui.min.js"> </script>\r\n </head>\r\n <body>\r\n <!-- Start of Body Content --> \r\n <div class="mui-appbar-home">\r\n <div class="mui-container">\r\n <div class="tp-primary-header mui-top-home">\r\n <a href="https://tutorialspoint.tw/index.htm" target="_blank" title="TutorialsPoint - Home"> <i class="fa fa-home"> </i><span>Home</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-qa">\r\n <a href="https://tutorialspoint.tw/questions/index.php" target="_blank" title="Questions & Answers - The Best Technical Questions and Answers - TutorialsPoint"><i class="fa fa-location-arrow"></i> <span> Q/A</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tools">\r\n <a href="https://tutorialspoint.tw/online_dev_tools.htm" target="_blank" title="Tools - Online Development and Testing Tools"> <i class="fa fa-cogs"></i><span>Tools</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-coding-ground">\r\n <a href="https://tutorialspoint.tw/codingground.htm" target="_blank" title="Coding Ground - Free Online IDE and Terminal"> <i class="fa fa-code"> </i> <span> Coding Ground </span> </a> \r\n </div>\r\n <div class="tp-primary-header mui-top-current-affairs">\r\n <a href="https://tutorialspoint.tw/current_affairs/index.htm" target="_blank" title="Current Affairs - 2016, 2017 and 2018 | General Knowledge for Competitive Exams"><i class="fa fa-globe"> </i><span>Current Affairs</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-upsc">\r\n <a href="https://tutorialspoint.tw/upsc_ias_exams.htm" target="_blank" title="UPSC IAS Exams Notes - TutorialsPoint"><i class="fa fa-user-tie"></i><span>UPSC Notes</span></a>\r\n </div>\r\n <div class="tp-primary-header mui-top-tutors">\r\n <a href="https://tutorialspoint.tw/tutor_connect/index.php" target="_blank" title="Top Online Tutors - Tutor Connect"> <i class="fa fa-user"> </i> <span>Online Tutors</span> </a>\r\n </div>\r\n <div class="tp-primary-header mui-top-examples">\r\n ….
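The `b'...'` prefix in the output above shows that `read()` returns bytes, not text. The following is a minimal sketch of decoding the body into a string. It also uses a `file://` URL to show that `urlopen()` is not limited to HTTP, which lets the example run without network access. The `fetch_text` helper, the temporary file, and the UTF-8 fallback are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body as str rather than bytes."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, as in the output above
        # HTTP responses usually name a charset in the Content-Type header;
        # other schemes may not, so fall back to an assumed default.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# Write a tiny page to a temporary file and fetch it through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<title>Hello, scraper</title>", encoding="utf-8")
    text = fetch_text(page.as_uri())

print(text)
```

The same helper can be pointed at an `https://` address; only the URL changes.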
**Requests** - This module is not preinstalled; install it by running the following command at the command prompt. **Requests** sends HTTP/1.1 requests.

    pip install requests

Example

    import requests

    # Fetch the URL
    my_req = requests.get('https://tutorialspoint.tw/')
    # Inspect a few attributes of the response
    print(my_req.encoding)
    print(my_req.status_code)
    print(my_req.elapsed)
    print(my_req.url)
    print(my_req.history)
    print(my_req.headers['Content-Type'])

Output

    UTF-8
    200
    0:00:00.205727
    https://tutorialspoint.tw/
    []
    text/html; charset=UTF-8

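**BeautifulSoup** - This is a parsing library that can work with different parsers. Python's standard library supplies html.parser, which BeautifulSoup can use without installing anything else. It builds a parse tree that is used to extract data from HTML pages.
To install this module, enter the following command at the command prompt:

    pip install beautifulsoup4

Example

    from bs4 import BeautifulSoup
    import requests

    # Fetch the page and take its text
    my_req = requests.get("https://tutorialspoint.tw/")
    my_data = my_req.text
    # Name the parser explicitly; otherwise BeautifulSoup picks one and warns
    my_soup = BeautifulSoup(my_data, "html.parser")
    # Print the href attribute of every <a> tag
    for my_link in my_soup.find_all('a'):
        print(my_link.get('href'))

Output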
https://tutorialspoint.tw/index.htm https://tutorialspoint.tw/questions/index.php https://tutorialspoint.tw/online_dev_tools.htm https://tutorialspoint.tw/codingground.htm https://tutorialspoint.tw/current_affairs/index.htm https://tutorialspoint.tw/upsc_ias_exams.htm https://tutorialspoint.tw/tutor_connect/index.php https://tutorialspoint.tw/programming_examples/ https://tutorialspoint.tw/whiteboard.htm https://tutorialspoint.tw/netmeeting.php https://tutorialspoint.tw/articles/ https://tutorialspoint.tw/index.htm https://tutorialspoint.tw/tutorialslibrary.htm https://tutorialspoint.tw/videotutorials/index.htm https://store.tutorialspoint.com https://tutorialspoint.tw/html_online_training/index.asp https://tutorialspoint.tw/css_online_training/index.asp https://tutorialspoint.tw/3d_animation_online_training/index.asp https://tutorialspoint.tw/swift_4_online_training/index.asp https://tutorialspoint.tw/blockchain_online_training/index.asp https://tutorialspoint.tw/reactjs_online_training/index.asp https://tutorialspoint.tw/tutorialslibrary.htm https://tutorialspoint.tw/computer_fundamentals/index.htm https://tutorialspoint.tw/compiler_design/index.htm https://tutorialspoint.tw/operating_system/index.htm https://tutorialspoint.tw/data_structures_algorithms/index.htm https://tutorialspoint.tw/dbms/index.htm https://tutorialspoint.tw/data_communication_computer_network/index.htm https://tutorialspoint.tw/academic_tutorials.htm https://tutorialspoint.tw/html/index.htm https://tutorialspoint.tw/css/index.htm https://tutorialspoint.tw/javascript/index.htm https://tutorialspoint.tw/php/index.htm https://tutorialspoint.tw/angular4/index.htm https://tutorialspoint.tw/mysql/index.htm https://tutorialspoint.tw/web_development_tutorials.htm https://tutorialspoint.tw/cprogramming/index.htm https://tutorialspoint.tw/cplusplus/index.htm https://tutorialspoint.tw/java8/index.htm https://tutorialspoint.tw/python/index.htm https://tutorialspoint.tw/scala/index.htm 
https://tutorialspoint.tw/csharp/index.htm https://tutorialspoint.tw/computer_programming_tutorials.htm https://tutorialspoint.tw/java8/index.htm https://tutorialspoint.tw/jdbc/index.htm https://tutorialspoint.tw/servlets/index.htm https://tutorialspoint.tw/spring/index.htm https://tutorialspoint.tw/hibernate/index.htm https://tutorialspoint.tw/swing/index.htm https://tutorialspoint.tw/java_technology_tutorials.htm https://tutorialspoint.tw/android/index.htm https://tutorialspoint.tw/swift/index.htm https://tutorialspoint.tw/ios/index.htm https://tutorialspoint.tw/kotlin/index.htm https://tutorialspoint.tw/react_native/index.htm https://tutorialspoint.tw/xamarin/index.htm https://tutorialspoint.tw/mobile_development_tutorials.htm https://tutorialspoint.tw/mongodb/index.htm https://tutorialspoint.tw/plsql/index.htm https://tutorialspoint.tw/sql/index.htm https://tutorialspoint.tw/db2/index.htm https://tutorialspoint.tw/mysql/index.htm https://tutorialspoint.tw/memcached/index.htm https://tutorialspoint.tw/database_tutorials.htm https://tutorialspoint.tw/asp.net/index.htm https://tutorialspoint.tw/entity_framework/index.htm https://tutorialspoint.tw/vb.net/index.htm https://tutorialspoint.tw/ms_project/index.htm https://tutorialspoint.tw/excel/index.htm https://tutorialspoint.tw/word/index.htm https://tutorialspoint.tw/microsoft_technologies_tutorials.htm https://tutorialspoint.tw/big_data_analytics/index.htm https://tutorialspoint.tw/hadoop/index.htm https://tutorialspoint.tw/sas/index.htm https://tutorialspoint.tw/qlikview/index.htm https://tutorialspoint.tw/power_bi/index.htm https://tutorialspoint.tw/tableau/index.htm https://tutorialspoint.tw/big_data_tutorials.htm https://tutorialspoint.tw/tutorialslibrary.htm https://tutorialspoint.tw/codingground.htm https://tutorialspoint.tw/coding_platform_for_websites.htm https://tutorialspoint.tw/developers_best_practices/index.htm https://tutorialspoint.tw/effective_resume_writing.htm 
https://tutorialspoint.tw/computer_glossary.htm https://tutorialspoint.tw/computer_whoiswho.htm https://tutorialspoint.tw/questions_and_answers.htm https://tutorialspoint.tw/multi_language_tutorials.htm https://itunes.apple.com/us/app/tutorials-point/id914891263?ls=1&mt=8 https://play.google.com/store/apps/details?id=com.tutorialspoint.onlineviewer http://www.windowsphone.com/s?appid=91249671-7184-4ad6-8a5f-d11847946b09 /about/index.htm /about/about_team.htm /about/about_careers.htm /about/about_privacy.htm /about/about_terms_of_use.htm https://tutorialspoint.tw/articles/ https://tutorialspoint.tw/online_dev_tools.htm https://tutorialspoint.tw/free_web_graphics.htm https://tutorialspoint.tw/online_file_conversion.htm https://tutorialspoint.tw/shared-tutorials.php https://tutorialspoint.tw/netmeeting.php https://tutorialspoint.tw/free_online_whiteboard.htm https://tutorialspoint.tw https://#/tutorialspointindia https://plus.google.com/u/0/+tutorialspoint http://www.twitter.com/tutorialspoint http://www.linkedin.com/company/tutorialspoint https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg https://tutorialspoint.tw/index.htm /about/about_privacy.htm#cookies /about/faq.htm /about/about_helping.htm /about/contact_us.htm
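The output above mixes absolute URLs with site-relative ones such as `/about/index.htm`, and the same address can appear more than once. The sketch below shows one possible way to normalise such a list with `urllib.parse.urljoin`. The `absolute_links` helper and the shortened `hrefs` list are assumptions made for this illustration; the sample values are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://tutorialspoint.tw/"

# A few href values copied from the output above.
hrefs = [
    "https://tutorialspoint.tw/index.htm",
    "/about/index.htm",
    "/about/about_privacy.htm#cookies",
    "https://tutorialspoint.tw/index.htm",  # duplicate entry
    None,  # my_link.get('href') returns None for <a> tags without an href
]


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping blanks and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


links = absolute_links(BASE_URL, hrefs)
for link in links:
    print(link)
```

Resolving against the URL the page was fetched from keeps relative links usable for follow-up requests.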
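**Lxml** - This is a high-performance, production-grade library for parsing HTML and XML. Use it when you need both high quality and maximum speed. It provides several modules for extracting data from websites.
To install it, enter the following at the command prompt:

    pip install lxml

Example

    from lxml import etree

    # Build a small HTML tree element by element
    my_root_elem = etree.Element('html')
    etree.SubElement(my_root_elem, 'head')
    etree.SubElement(my_root_elem, 'title')
    etree.SubElement(my_root_elem, 'body')
    # Serialize the tree with indentation and print it
    print(etree.tostring(my_root_elem, pretty_print=True).decode("utf-8"))

Output

    <html>
      <head/>
      <title/>
      <body/>
    </html>
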
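**Selenium** - This is a browser automation tool, also known as WebDriver. On some websites the content only appears after an interaction such as clicking a button or scrolling the page, often after a short wait; Selenium is the tool for those cases.
Install Selenium with the following command:

    pip install selenium

Example

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Path to a locally downloaded chromedriver binary
    my_path_to_chromedriver = '/Users/Admin/Desktop/chromedriver'
    # Selenium 4 takes the driver path through a Service object;
    # the older executable_path argument to Chrome() has been removed
    my_service = Service(executable_path=my_path_to_chromedriver)
    my_browser = webdriver.Chrome(service=my_service)
    my_url = 'https://tutorialspoint.tw/'
    my_browser.get(my_url)

Output
Running the script opens a Chrome window and loads the page at my_url.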
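**MechanicalSoup** - This is another Python library for automating interaction with websites. It stores and sends cookies automatically, follows redirects, and can follow links and submit forms. It does not execute JavaScript.
To install it, use the following command:

    pip install MechanicalSoup

Example

    import mechanicalsoup

    my_browser = mechanicalsoup.StatefulBrowser()
    # Open the page and print the response object
    my_value = my_browser.open("https://tutorialspoint.tw/")
    print(my_value)
    # Print the URL of the current page
    my_val = my_browser.get_url()
    print(my_val)
    # Follow a link on the page whose URL matches "forms"
    my_va = my_browser.follow_link("forms")
    print(my_va)
    # Print the URL again after following the link
    my_value1 = my_browser.get_url()
    print(my_value1)
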