Llama Data Preparation

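Good data preparation is essential for training any high-performing language model, including Llama. It involves collecting and cleaning data, formatting it for Llama, and applying preprocessing tools. Tools such as NLTK, spaCy, and Hugging Face tokenizers work together to prepare data for Llama's training pipeline. Understanding these preprocessing stages helps you improve the performance of a Llama model.

Data preparation is considered one of the most critical stages in machine learning, especially when working with large language models. This chapter discusses how to prepare data for use with Llama and covers the following topics.

Data collection and cleaning
Formatting data for Llama
Tools used in data preprocessing

Together, these steps ensure that the data is thoroughly cleaned and properly structured, which helps optimize Llama's training pipeline.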

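Data Collection and Cleaning

Data Collection

The most critical requirement for training a model such as Llama is high-quality, diverse data. In other words, the main text sources used to train a language model are excerpts from many kinds of text, including books, articles, blog posts, social media content, forums, and other publicly available text data.

Scraping Website Text Data with Python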

import requests
from bs4 import BeautifulSoup
# URL to fetch data from
url = 'https://tutorialspoint.tw/Llama/index.htm'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'html.parser')
# Now, extract text data
text_data = soup.get_text()
# Now, save data to the file
with open('raw_data.txt', 'w', encoding='utf-8') as file:
    file.write(text_data)

Output

After the script runs, the scraped text is saved to a file named raw_data.txt. This raw text is then cleaned in the next step.
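Depending on the parser and library version, text pulled from a page can still contain script or stylesheet contents. The following is a minimal, dependency-free sketch of the same extraction step that explicitly skips those elements. It uses Python's built-in html.parser instead of BeautifulSoup, and the names VisibleTextExtractor, extract_visible_text, and sample_html are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up HTML used only to exercise the sketch (no network access needed)
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Llama</h1><script>var x = 1;</script>"
    "<p>Data preparation matters.</p></body></html>"
)
print(extract_visible_text(sample_html))
```

For real pages, the string returned by requests in the script above could be passed to extract_visible_text in place of sample_html.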

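Data Cleaning

Raw data is full of noise, including HTML tags, special characters, and irrelevant content, so it must be cleaned before it is given to Llama. Data cleaning may include:

Removing HTML tags
Removing special characters
Normalizing letter case
Tokenization
Removing stop words

Example: Preprocessing Text Data with Python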

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

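import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stop-word lists
# Note: recent NLTK releases may also require nltk.download('punkt_tab')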

# Load raw data
with open('raw_data.txt', 'r', encoding='utf-8') as file:
    text_data = file.read()

# Clean HTML tags
clean_data = re.sub(r'<.*?>', '', text_data)

# Clean special characters
clean_data = re.sub(r'[^A-Za-z0-9\s]', '', clean_data)

# Split text into tokens
tokens = word_tokenize(clean_data)

stop_words = set(stopwords.words('english'))

# Filter out stop words from tokens
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]

# Save cleaned data
with open('cleaned_data.txt', 'w', encoding='utf-8') as file:
    file.write(' '.join(filtered_tokens))

print("Data cleaned and saved to cleaned_data.txt")

Output

Data cleaned and saved to cleaned_data.txt

The cleaned data is saved to cleaned_data.txt. The file now contains tokenized, cleaned text that is ready for further formatting and preprocessing for Llama.
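The cleaning steps listed earlier include case normalization, which the NLTK script does not apply to the saved tokens. As a rough sketch of all five steps in one place, the function below uses only the standard library. The name clean_text and the tiny STOP_WORDS set are hypothetical and chosen for illustration; a real pipeline would more likely use NLTK's full stop-word list and tokenizer as shown above.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the NLTK example uses the library's full English list instead.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "with"}


def clean_text(raw):
    """Apply the five cleaning steps in order and return a list of tokens."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)                # 1. remove HTML tags
    no_special = re.sub(r"[^A-Za-z0-9\s]", " ", no_tags)  # 2. remove special characters
    lowered = no_special.lower()                          # 3. normalize case
    tokens = lowered.split()                              # 4. tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]     # 5. drop stop words


print(clean_text("<p>The Llama model is trained with CLEAN data!</p>"))
# prints ['llama', 'model', 'trained', 'clean', 'data']
```

Whether lowercasing is desirable depends on the downstream tokenizer, so treat step 3 as optional rather than required.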

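Preprocessing Data for Use with Llama

Llama needs structured input data for training. The data should be tokenized, and it can also be converted to a format such as JSON or CSV, depending on the training setup that will consume it.

Text Tokenization

Tokenization splits sentences into smaller parts, usually words or subwords, so that Llama can process them. You can use prebuilt libraries, including Hugging Face's tokenizer library.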

from transformers import LlamaTokenizer

token = "your_token"  # placeholder: replace with your Hugging Face access token
# Sample sentence
text = "Llama is an innovative language model."

#Load Llama tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

#Tokenize
encoded_input = tokenizer(text)

print("Original Text:", text)
print("Tokenized Output:", encoded_input)

Output

Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 
   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
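In the output above, attention_mask is all ones because a single sentence was encoded and nothing was padded. When sequences of different lengths are batched together, the shorter ones are usually padded, and the mask marks the padding positions with 0. The pure-Python sketch below illustrates that idea. The helper pad_batch and the value pad_id=0 are hypothetical choices for illustration, not part of the Hugging Face API, and the padding ID a real tokenizer uses may differ.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token IDs to a common length and build masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# The first sequence is the encoding shown in the output above; the second
# is a made-up shorter sequence used only to force padding.
batch = pad_batch([
    [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889],
    [1, 365, 29880, 3304, 29889],
])
print(batch["attention_mask"][1])
# prints [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```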

Converting Data to JSON Format

JSON is a convenient format for Llama data because it stores text in a structured way.

import json
    
# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}
# Save data as JSON
with open('formatted_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(data, json_file, indent=4)
    
print("Data formatted and saved to formatted_data.json")

Output

Data formatted and saved to formatted_data.json

The program writes a file named formatted_data.json that contains the formatted text data in JSON format.
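The example above stores a single record. For a dataset with many records, one common layout (an assumption here, not something this tutorial prescribes) is JSON Lines, where each line holds one JSON object. The sketch below writes and reads such a file. The second record and the file name formatted_data.jsonl are made up for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import json
import os
import tempfile

records = [
    {"id": "1", "text": "Llama is a powerful language model for AI research."},
    {"id": "2", "text": "Clean data improves training."},  # made-up second record
]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "formatted_data.jsonl")

    # One JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the records back, skipping any blank lines
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
# prints True
```

Because each line is parsed independently, a file in this layout can be read record by record without loading the whole dataset into memory.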

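Data Preprocessing Tools

Tools for data cleaning, tokenization, and formatting are available for Llama. The most commonly used ones are Python libraries, text-processing frameworks, and command-line tools. Below is a list of tools widely used in data preparation for Llama.

1. NLTK (Natural Language Toolkit)

NLTK is one of the best-known libraries for natural language processing. Its features include cleaning, tokenizing, and stemming text data.

Example: Removing Stop Words with NLTK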

import nltk
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')

# Test Data
text = "This is a simple sentence with stopwords."
 
# Tokenization
words = nltk.word_tokenize(text)

# Stopwords
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
filtered_text = [w for w in words if w.lower() not in stop_words]
print("Original Text:", text)
print("Filtered Text:", filtered_text)

Output

Original Text: This is a simple sentence with stopwords.
Filtered Text: ['simple', 'sentence', 'stopwords', '.']
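Stemming is mentioned above as an NLTK feature but is not shown. As a brief sketch, the snippet below uses NLTK's PorterStemmer, which needs no extra downloads, to reduce a few words to their stems. The word list is made up for illustration, and whether stemming is useful at all depends on the task, since subword tokenizers such as Llama's usually operate on unstemmed text.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Made-up sample words for illustration
words = ["running", "models", "cleaning"]

# Reduce each word to its stem
stems = [stemmer.stem(w) for w in words]

print(stems)
# prints ['run', 'model', 'clean']
```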

2. spaCy

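spaCy is another advanced library designed for data preprocessing. It is fast and efficient, and it is built for real-world NLP applications.

Example: Tokenizing with spaCy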

import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Sample sentence
text = "Llama is an innovative language model."

# Process the text
doc = nlp(text)

# Tokenize
tokens = [token.text for token in doc]

print("Tokens:", tokens)

Output

Tokens: ['Llama', 'is', 'an', 'innovative', 'language', 'model', '.']

3. Hugging Face Tokenizers

Hugging Face provides high-performance tokenizers that are used when training language models in general, not only Llama.

Example: Using a Hugging Face Tokenizer

from transformers import AutoTokenizer
token = "your_token"
# Sample sentence
text = "Llama is an innovative language model."

#Load Llama tokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token=token)

#Tokenize
encoded_input = tokenizer(text)
print("Original Text:", text)
print("Tokenized Output:", encoded_input)

Output

Original Text: Llama is an innovative language model.
Tokenized Output: {'input_ids': [1, 365, 29880, 3304, 338, 385, 24233, 1230, 4086, 1904, 29889], 
   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

4. Pandas for Data Formatting

Pandas is used when working with structured data. You can use Pandas to format data as CSV or JSON before passing it to Llama.

import pandas as pd

# Data structure
data = {
    "id": "1",
    "text": "Llama is a powerful language model for AI research."
}

# Create a one-row DataFrame from a list containing the dictionary
df = pd.DataFrame([data])

# Save DataFrame to CSV
df.to_csv('formatted_data.csv', index=False)

print("Data saved to formatted_data.csv")

Output

Data saved to formatted_data.csv

The formatted text data is saved in the CSV file formatted_data.csv.
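The section notes that Pandas can also produce JSON, which the CSV example does not show. A minimal sketch of that path follows, using DataFrame.to_json with orient="records" so that each row becomes one JSON object in an array. The second row is made up for illustration, and the result is kept as a string rather than written to disk.

```python
import json

import pandas as pd

records = [
    {"id": "1", "text": "Llama is a powerful language model for AI research."},
    {"id": "2", "text": "Clean data improves training."},  # made-up second row
]

df = pd.DataFrame(records)

# With no path argument, to_json returns the JSON document as a string;
# orient="records" produces an array with one object per row
json_text = df.to_json(orient="records")

# Parse the string back to confirm that the rows survive the round trip
parsed = json.loads(json_text)
print(parsed == records)
# prints True
```

Passing a file path as the first argument to to_json would write the same content to disk instead of returning it.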
