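Scraping LinkedIn Data Using Selenium and Beautiful Soup in Python

Thanks to its rich ecosystem of libraries and tools, Python has become one of the most popular programming languages for web scraping. Selenium and Beautiful Soup are two such libraries. Combined, they cover both halves of the job: Selenium drives a real browser, and Beautiful Soup parses the HTML the browser returns. This tutorial uses the pair to scrape data from LinkedIn.

In this article, we walk through automating web interactions with Selenium and parsing HTML content with Beautiful Soup. Together, these tools let us scrape data from LinkedIn, the world's largest professional networking platform. We will learn how to log in to LinkedIn, navigate its pages, extract information from user profiles, and handle pagination and scrolling. So, let's get started.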

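Installing Python and the Required Libraries (Selenium, Beautiful Soup, etc.)

To begin scraping LinkedIn, we first need to set up the environment on our machine. Start by making sure Python is installed.

With Python installed, we can move on to the required libraries. This tutorial uses two: Selenium, a tool for automating web browser interactions, and Beautiful Soup, a library for parsing HTML content. Both can be installed with pip, Python's package manager, which usually ships with Python.

Open a command prompt or terminal and run the following commands: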

pip install selenium
pip install beautifulsoup4

These commands download and install the packages on your system. Installation may take a little while, so please be patient.
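To confirm that both packages landed in the active environment, a quick check along the following lines can help. This is a minimal sketch that uses only the standard library; the helper name `installed_version` is made up for illustration, and the distribution names are the ones passed to `pip install` above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string of a pip distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names as passed to "pip install" above
for dist in ("selenium", "beautifulsoup4"):
    found = installed_version(dist)
    print(f"{dist}: {found if found else 'not installed'}")
```

If either line reports "not installed", the pip command probably ran against a different Python interpreter than the one executing the script.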

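Configuring the Web Driver (e.g., ChromeDriver)

To automate browser interactions with Selenium, we need to configure a web driver. A web driver is a browser-specific program that lets Selenium control that browser. In this tutorial we use ChromeDriver, the web driver for Google Chrome.

To configure ChromeDriver, download the release that matches your installed Chrome version from the ChromeDriver downloads page (https://sites.google.com/a/chromium.org/chromedriver/downloads). Make sure to pick the build for your operating system as well (e.g., Windows, macOS, Linux). Selenium 4.6 and later can also locate or download a matching driver automatically through Selenium Manager, so the manual download may be optional on recent versions.

After downloading the ChromeDriver executable, place it in a directory of your choice. A location that is easy to reach and easy to reference from a Python script works best.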
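Before handing a driver path to Selenium, it can be useful to check that the executable is actually where the script expects it. The sketch below is one possible way to do that with the standard library; `find_chromedriver` is a hypothetical helper, and `/path/to/chromedriver` is the same placeholder path used throughout this tutorial.

```python
import shutil
from pathlib import Path


def find_chromedriver(explicit_path=None):
    """Return a usable chromedriver path as a string, or None if none is found.

    An explicit path wins when it points at an existing file; otherwise
    the directories on PATH are searched.
    """
    if explicit_path is not None:
        candidate = Path(explicit_path).expanduser()
        if candidate.is_file():
            return str(candidate)
    # shutil.which returns None when the executable is not on PATH
    return shutil.which("chromedriver")


# Placeholder path from this tutorial; replace it with the real download location
driver_path = find_chromedriver("/path/to/chromedriver")
print(driver_path or "chromedriver not found; check the download location")
```

The returned string can then be passed to Selenium when creating the driver, which keeps path problems separate from browser start-up errors.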

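Logging In to LinkedIn

Before we can automate the LinkedIn login with Selenium, we need to identify the HTML elements that make up the login form. To open the browser's inspection tool in Chrome, right-click the login form (or any element on the page) and choose "Inspect" from the context menu. This opens the developer tools panel.

The developer tools panel shows the page's HTML source. Hovering over or clicking elements in the HTML highlights the corresponding part of the page. Find the username/email and password input fields and the login button, and note their HTML attributes, such as `id`, `class`, or `name`, because the Python script will use these attributes to locate the elements.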
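Once the attributes are written down, they can be combined into a CSS selector, which Selenium accepts through `By.CSS_SELECTOR` and Beautiful Soup accepts through `select_one()`. The helper below is only an illustrative sketch: `css_selector` is a hypothetical name, the class and name values in the calls are invented examples, and it assumes the attribute values are plain identifiers that need no escaping.

```python
def css_selector(tag="", element_id=None, classes=None, name=None):
    """Build a CSS selector such as 'input#username' from attributes noted in the inspector.

    `classes` is the raw value of the class attribute, e.g. "btn primary".
    """
    selector = tag
    if element_id:
        selector += f"#{element_id}"
    if classes:
        # Each whitespace-separated class becomes its own ".class" segment
        selector += "".join(f".{cls}" for cls in classes.split())
    if name:
        selector += f'[name="{name}"]'
    return selector or "*"


print(css_selector("input", element_id="username"))   # input#username
print(css_selector("button", classes="btn primary"))  # button.btn.primary (invented classes)
print(css_selector("input", name="email"))            # input[name="email"] (invented name)
```

A selector built from several classes matches elements that carry all of them in any order, which tends to be more forgiving than comparing the full class string.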

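In our case, the username field has the id "username" and the password field has the id "password". With the login elements identified, we can automate the login with Selenium. We start by creating a web driver instance that uses ChromeDriver, which opens a Chrome browser window controlled by Selenium.

Next, we instruct Selenium to find the username/email and password input fields by their unique attributes. In Selenium 4 this is done with `find_element()` and a `By` locator, for example `find_element(By.ID, ...)`, `find_element(By.NAME, ...)`, or `find_element(By.CLASS_NAME, ...)`; the older `find_element_by_id()`-style helpers were removed in Selenium 4.3. Once the elements are found, the `send_keys()` method simulates user input to type the username/email and password.

Finally, we locate the login button with `find_element()` and click it with `click()`. This simulates a click on the button and triggers the LinkedIn login.

Example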

# Import the necessary libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Create an instance of the Chrome web driver
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Navigate to the LinkedIn login page
driver.get('https://www.linkedin.com/login')

# Locate the username/email and password input fields by their id attributes
username_field = driver.find_element(By.ID, 'username')
password_field = driver.find_element(By.ID, 'password')

# Enter the username/email and password (replace the placeholders with your own)
username_field.send_keys('your_username')
password_field.send_keys('your_password')

# Find and click the login button
login_button = driver.find_element(By.XPATH, "//button[@type='submit']")
login_button.click()

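Running the code above opens a browser instance and logs in to LinkedIn with the supplied credentials. In the next part of the article, we look at how to navigate LinkedIn pages with Selenium and extract data from profiles.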
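Typing real credentials into the script makes it easy to leak them when the file is shared. One common alternative is to read them from environment variables. The sketch below assumes two variable names, `LINKEDIN_USERNAME` and `LINKEDIN_PASSWORD`, which are chosen here purely for illustration; `load_credentials` is likewise a hypothetical helper.

```python
import os


def load_credentials(env=None):
    """Read login details from environment variables instead of hard-coding them.

    LINKEDIN_USERNAME and LINKEDIN_PASSWORD are names chosen for this sketch.
    """
    env = os.environ if env is None else env
    username = env.get("LINKEDIN_USERNAME")
    password = env.get("LINKEDIN_PASSWORD")
    pairs = (("LINKEDIN_USERNAME", username), ("LINKEDIN_PASSWORD", password))
    missing = [key for key, value in pairs if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return username, password


# Demonstration with a plain dict standing in for the real environment
demo_env = {"LINKEDIN_USERNAME": "demo_user", "LINKEDIN_PASSWORD": "demo_pass"}
print(load_credentials(demo_env))
```

In the login script, the two returned values would be passed to `send_keys()` in place of the `'your_username'` and `'your_password'` placeholders.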

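Navigating LinkedIn Pages

A profile page contains various sections, such as name, headline, summary, experience, education, and more. By inspecting the page's HTML, we can identify the elements that hold the information we want.

For example, to scrape data from a profile, we can use Selenium to load the page and Beautiful Soup to extract the data from its HTML.

Here is a sample code snippet that demonstrates how to extract profile information from a single LinkedIn profile:

Example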

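from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Create an instance of the Chrome web driver
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Visit a LinkedIn profile
# (a fresh browser session is not signed in; run the login steps from the earlier example first)
profile_url = 'https://www.linkedin.com/in/princeyadav05/'
driver.get(profile_url)

# Parse the rendered page and extract profile information
# (the class names below are the ones used in this tutorial; check them against the live page)
soup = BeautifulSoup(driver.page_source, 'html.parser')
name = soup.find('li', class_='inline t-24 t-black t-normal break-words').text.strip()
headline = soup.find('h2', class_='mt1 t-18 t-black t-normal break-words').text.strip()
summary = soup.find('section', class_='pv-about-section').find('div', class_='pv-about-section__summary-text').text.strip()

# Print the extracted information
print("Name:", name)
print("Headline:", headline)
print("Summary:", summary)

# Close the browser when done
driver.quit()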

Output

Name: Prince Yadav
Headline: Senior Software Developer at Tata AIG General Insurance Company Limited
Summary: Experienced software engineer with a passion for building scalable and efficient solutions using Python and related technologies.
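One thing to watch for in the extraction step: `soup.find()` returns `None` when no element matches, so a chained `.text` raises `AttributeError` as soon as a profile lacks a section or the page uses different class names. A small guard can turn that into a default value instead. The helper below is a sketch; `text_or_default` is a hypothetical name, and it works on any object that offers Beautiful Soup's `get_text()` method.

```python
def text_or_default(node, default=""):
    """Return the stripped text of a Beautiful Soup node, or `default` when the node is None."""
    if node is None:
        return default
    # get_text() gathers all text inside the node; strip() trims the outer whitespace
    return node.get_text().strip()


# Usage with the soup object from the example above:
# name = text_or_default(
#     soup.find('li', class_='inline t-24 t-black t-normal break-words'),
#     default='N/A',
# )
```

With this guard, a missing summary yields the default string rather than stopping the script partway through.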

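Now that we know how to scrape a single LinkedIn profile with Selenium and Beautiful Soup, let's see how to do the same for multiple profiles.

To scrape multiple profiles, we can automate the cycle of visiting each profile page, extracting the data, and storing it for further analysis.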
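When the list of profile URLs comes from several sources, the same profile can appear more than once, for instance with and without a trailing slash. The sketch below shows one way to clean the list before visiting anything. It assumes that URLs differing only in trailing slash, query string, or fragment point to the same profile; the helper names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_profile_url(url):
    """Lower-case scheme and host, drop query and fragment, and remove a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def unique_profile_urls(urls):
    """Return normalized URLs in first-seen order with duplicates removed."""
    seen = set()
    result = []
    for url in urls:
        normalized = normalize_profile_url(url)
        if normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


urls = [
    "https://www.linkedin.com/in/princeyadav05/",
    "https://www.linkedin.com/in/princeyadav05",
    "https://www.linkedin.com/in/mukullatiyan",
]
print(unique_profile_urls(urls))
# ['https://www.linkedin.com/in/princeyadav05', 'https://www.linkedin.com/in/mukullatiyan']
```

Deduplicating up front avoids loading the same page twice and keeps the stored results free of repeated rows.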

Here is a sample script that demonstrates how to scrape profile information from multiple profiles:

Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import csv

# Create an instance of the Chrome web driver
# (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# List of profile URLs to scrape
profile_urls = [
    'https://www.linkedin.com/in/princeyadav05',
    'https://www.linkedin.com/in/mukullatiyan',
]

# Open a CSV file for writing the extracted data
with open('profiles.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Headline', 'Summary'])

    # Visit each profile URL and extract profile information
    for profile_url in profile_urls:
        driver.get(profile_url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        name = soup.find('li', class_='inline t-24 t-black t-normal break-words').text.strip()
        headline = soup.find('h2', class_='mt1 t-18 t-black t-normal break-words').text.strip()
        summary = soup.find('section', class_='pv-about-section').find('div', class_='pv-about-section__summary-text').text.strip()

        # Store the extracted information as one CSV row
        writer.writerow([name, headline, summary])

        # Print the extracted information
        print("Name:", name)
        print("Headline:", headline)
        print("Summary:", summary)
        print()

# Close the browser when done
driver.quit()

Output

Name: Prince Yadav
Headline: Software Engineer | Python Enthusiast
Summary: Experienced software engineer with a passion for building scalable and efficient solutions using Python and related technologies.

Name: Mukul Latiyan
Headline: Data Scientist | Machine Learning Engineer
Summary: Data scientist and machine learning engineer experienced in developing and deploying predictive models for solving complex business problems.

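As the output above shows, we have scraped multiple LinkedIn profiles in a single run using Selenium and Beautiful Soup in Python. The script visits each profile URL in turn, extracts the required profile information, and prints it to the console.

Conclusion

In this tutorial, we explored how to scrape LinkedIn profiles using Selenium and Beautiful Soup in Python. By combining the two libraries, we were able to automate web interactions, parse HTML content, and extract useful information from LinkedIn pages. We learned how to log in to LinkedIn, navigate to profiles, and extract data such as the name, headline, and summary. The code examples demonstrate each step of the process, which should make it easier for beginners to follow.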

Prince Yadav

Updated on: July 26, 2023
