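Implementing Web Scraping in Python with BeautifulSoup
BeautifulSoup is a class in Python's bs4 module. Its basic purpose is to parse HTML or XML documents.
Installing bs4 (BeautifulSoup for short)
BeautifulSoup is easy to install with pip. Just run the following command at the command line. The bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4, the package that contains the actual library.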
pip install bs4
Run the command above in your terminal and you will see output similar to the following:
C:\Users\rajesh>pip install bs4
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in c:\python\python361\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
To verify that BeautifulSoup was installed successfully on your machine, start Python in the same terminal and import the class:
C:\Users\rajesh>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>
The import returns without an error, so the installation succeeded.
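Beyond a bare import, a slightly stronger check is to print the installed version and parse a tiny snippet. The following is a minimal sketch; the HTML string is made up for illustration and is not part of the original tutorial.

```python
import bs4
from bs4 import BeautifulSoup

# Report which version of the library was picked up
print("bs4 version:", bs4.__version__)

# Parse a small, hypothetical snippet with the parser bundled with Python
soup = BeautifulSoup("<p>Hello, <b>world</b></p>", "html.parser")

# get_text() concatenates the text of all nested tags
text = soup.get_text()
print(text)
```

If both lines print without an exception, the library is importable and the built-in html.parser backend works.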
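Example 1
Finding all links in an HTML document
Now suppose we have an HTML document and want to collect every link it references. First, we store the document as a string, as shown below: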
html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>
<a href='www.finance.google.com'></a>'''
Next, we create a soup object by passing the html_doc variable to the BeautifulSoup constructor.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
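Now that we have the soup object, we can apply the methods of the BeautifulSoup class to it. For example, we can find every tag of a given name in html_doc and read the value of any of its attributes.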
for tag in soup.find_all('a'):
    print(tag.get('href'))
The code above loops over every <a> tag in the html_doc string and reads its href attribute, which yields all the links in the document.
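To make the attribute access more concrete, here is a small sketch under the assumption of a two-tag sample string (hypothetical, not from the tutorial). It shows that tag.attrs holds all attributes as a dictionary, that tag.get returns None when an attribute is missing, and that passing href=True to find_all skips tags without an href.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one anchor with an href, one without
sample = "<a href='www.python.org' title='Python'>Python</a><a id='top'>no link</a>"
soup = BeautifulSoup(sample, "html.parser")

all_attrs = []
hrefs = []
for tag in soup.find_all("a"):
    all_attrs.append(tag.attrs)     # dict of every attribute on the tag
    hrefs.append(tag.get("href"))   # None when the attribute is absent

# href=True keeps only tags that actually carry an href attribute
only_links = [tag["href"] for tag in soup.find_all("a", href=True)]

print(all_attrs)
print(hrefs)
print(only_links)
```

Using tag.get rather than tag["href"] avoids a KeyError on anchors that have no href, which is common on real pages.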
Here is the complete code for extracting all links from the html_doc string.
from bs4 import BeautifulSoup

html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the href attribute of every <a> tag
for tag in soup.find_all('a'):
    print(tag.get('href'))
Result
www.Tutorialspoint.com
www.nseindia.com
www.codesdope.com
www.google.com
www.facebook.com
www.wikipedia.org
www.twitter.com
www.microsoft.com
www.github.com
www.nytimes.com
www.youtube.com
www.reddit.com
www.python.org
www.stackoverflow.com
www.amazon.com
www.rediff.com
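Example 2
Print all links on a website whose URL contains a specific string (for example: python).
The program below fetches a website and prints every URL on it whose link contains "python".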
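from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Fetch the page and read the raw HTML
html = urlopen("https://python.club.tw")
content = html.read()

# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(content, 'html.parser')

# Visit only <a> tags that have an href attribute
for a in soup.find_all('a', href=True):
    if re.search('python', a['href']):
        print("Python URL:", a['href'])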
Result
Python URL: https://docs.python.club.tw
Python URL: https://pypi.python.org/
Python URL: https://#/pythonlang?fref=ts
Python URL: http://brochure.getpython.info/
Python URL: https://docs.python.club.tw/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.club.tw/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: https://python.club.tw/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://python.club.tw/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: //docs.python.club.tw/3/tutorial/controlflow.html#defining-functions
Python URL: //docs.python.club.tw/3/tutorial/introduction.html#lists
Python URL: https://docs.python.club.tw/3/tutorial/introduction.html#using-python-as-a-calculator
Python URL: //docs.python.club.tw/3/tutorial/
Python URL: //docs.python.club.tw/3/tutorial/controlflow.html
Python URL: /downloads/release/python-373/
Python URL: https://docs.python.club.tw
Python URL: //jobs.python.org
Python URL: http://blog.python.org
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/Joo0vg55HKo/python-373-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/N5tvkDIQ47g/python-3410-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/n0mOibtx6_A/python-3.html
Python URL: /events/python-events/805/
Python URL: /events/python-events/817/
Python URL: /events/python-user-group/814/
Python URL: /events/python-events/789/
Python URL: /events/python-events/831/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: http://wiki.python.org/moin/TkInter
Python URL: http://www.wxpython.org/
Python URL: http://ipython.org
Python URL: #python-network
Python URL: http://brochure.getpython.info/
Python URL: https://docs.python.club.tw/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.club.tw/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: https://python.club.tw/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://python.club.tw/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: https://devguide.python.org/
Python URL: https://bugs.python.org/
Python URL: https://mail.python.org/mailman/listinfo/python-dev
Python URL: #python-network
Python URL: https://github.com/python/pythondotorg/issues
Python URL: https://status.python.org/
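The result above mixes absolute URLs, site-relative paths such as /events/python-events, protocol-relative links starting with //, and several duplicates. One possible way to tidy this up, sketched here offline, is to let find_all do the filtering with a compiled regular expression, resolve each href against the page address with urllib.parse.urljoin, and drop repeats. The base URL https://www.example.org/ and the sample HTML are assumptions chosen for illustration; they are not the site used in the tutorial.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page address and content, standing in for a fetched page
base_url = "https://www.example.org/"
sample_html = """
<a href="https://wiki.python.org/moin/">Wiki</a>
<a href="/events/python-events/">Events</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="/events/python-events/">Events again</a>
<a href="https://example.com/about">About</a>
<a>No href</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A compiled pattern as the href filter matches via pattern.search();
# tags without an href never match
matches = soup.find_all("a", href=re.compile("python"))

# Resolve relative and protocol-relative links against the page address
resolved = [urljoin(base_url, a["href"]) for a in matches]

# dict.fromkeys removes duplicates while keeping first-seen order
unique_links = list(dict.fromkeys(resolved))

for link in unique_links:
    print("Python URL:", link)
```

Whether to resolve and deduplicate depends on what the links are for; a crawler usually needs absolute, unique URLs, while a quick inspection of a page may not.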