Training Llama from Scratch

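Training Llama from scratch is resource-intensive but worthwhile. With a well-prepared training dataset and correctly configured training parameters, running the training loop can produce a language model reliable enough for many NLP tasks. Success depends on appropriate preprocessing, parameter tuning, and optimization throughout training.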

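Unlike many other GPT-style models, Llama is openly released. Training it from scratch takes substantial compute and careful preparation. This chapter walks through that process, from preparing the training dataset to configuring the training parameters and running the training itself.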

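Llama is designed to support almost any NLP application, including but not limited to text generation, translation, and summarization. Training a large language model from scratch involves three key steps:

Preparing the training dataset
Choosing suitable training parameters
Managing the training process and making sure the optimization is applied correctly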

Each step is explained with code snippets and a note on what the output means.

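Preparing Your Training Dataset

The most important first step in training any LLM is to give it a high-quality, diverse, and broad dataset. Llama needs a massive amount of text to capture the richness of human language.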

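Collecting Data

Training Llama requires a very large dataset containing varied text samples from many domains. Datasets commonly used for training LLMs include Common Crawl, Wikipedia, BooksCorpus, and OpenWebText.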

Example: Downloading a Dataset

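import os

import requests

# Create a directory for datasets
os.makedirs("datasets", exist_ok=True)

# URL to dataset (placeholder; replace with a real download link)
url = "https://example.com/openwebtext.zip"
output = "datasets/openwebtext.zip"

# Stream the download so a large archive is never held fully in memory
with requests.get(url, stream=True, timeout=60) as response:
    response.raise_for_status()  # Stop on HTTP errors instead of saving an error page
    with open(output, "wb") as file:
        for chunk in response.iter_content(chunk_size=1 << 20):
            file.write(chunk)
print(f"Dataset downloaded and saved at {output}")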

Output

Dataset downloaded and saved at datasets/openwebtext.zip
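
The downloaded archive still has to be unpacked before its text can be preprocessed. Here is a minimal sketch using only the standard library, assuming the archive contains plain `.txt` files encoded as UTF-8. The helper name `extract_text` and the output path in the usage note are hypothetical and not part of the original tutorial.

```python
import zipfile


def extract_text(zip_path, out_path):
    """Concatenate every .txt member of a zip archive into one text file.

    Hypothetical helper: assumes the members are UTF-8 plain text.
    Returns the number of members written.
    """
    count = 0
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(archive.namelist()):
            if not name.endswith(".txt"):
                continue  # Skip directories and non-text members
            text = archive.read(name).decode("utf-8", errors="replace")
            out.write(text.rstrip("\n") + "\n")
            count += 1
    return count
```

A call such as `extract_text("datasets/openwebtext.zip", "raw_data.txt")` would then produce a single raw text file for the preprocessing step.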

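After downloading the dataset, you need to preprocess the text before training. Preprocessing typically involves tokenization, lowercasing, removing special characters, and arranging the data into the structure the training code expects.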
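
The cleaning steps mentioned above can be sketched with the standard library and applied before tokenization. This is an illustrative helper, not the tutorial's own pipeline: the function name `clean_text` and the choice of which characters count as "special" are assumptions. Lowercasing is left optional because subword tokenizers such as Llama's are case-sensitive.

```python
import re
import unicodedata


def clean_text(text, lowercase=False):
    """Normalize Unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Remove control characters except newline and tab
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs, then trim each line
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.split("\n"))
    if lowercase:
        text = text.lower()
    return text


print(clean_text("Hello\x00   World!\n  Second  line  "))
```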

Example: Preprocessing the Dataset

import json

from transformers import LlamaTokenizer

# Hugging Face access token for the gated Llama 2 repository (placeholder value)
token = "your_token"

# Load pre-trained tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", token=token)

# Load raw text
with open('/content/raw_data.txt', 'r', encoding='utf-8') as file:
    raw_text = file.read()

# Tokenize the text
tokens = tokenizer.encode(raw_text, add_special_tokens=True)

# Save the token ids to a file as a JSON list
with open('/tokenized_text.txt', 'w') as token_file:
    json.dump(tokens, token_file)

print("Text tokenized and saved as tokens.")

Output

Text tokenized and saved as tokens.

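Setting the Model Training Parameters

Next, we set the training parameters. These determine how your model learns from the dataset, so they directly affect its performance.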

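Key Training Parameters

Batch size - the number of samples processed before the model weights are updated.
Learning rate - how far the model parameters move in response to the loss gradient at each update.
Epochs - the number of times the model passes over the entire dataset.
Optimizer - the algorithm that adjusts the weights to minimize the loss function.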
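
Batch size and epochs together determine how many weight updates a run performs. Here is a small worked sketch. The dataset size of 10,000 samples is an assumed figure chosen only for illustration, and the function name is hypothetical.

```python
import math


def total_training_steps(num_samples, batch_size, epochs):
    """Number of optimizer updates when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # Last batch may be partial
    return steps_per_epoch * epochs


# Assumed example: 10,000 samples, batch size 8, 3 epochs
print(total_training_steps(10_000, 8, 3))  # 1250 steps per epoch * 3 = 3750
```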

To train Llama, you can use AdamW as the optimizer together with a learning rate scheduler that includes a warmup phase.
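
To show what a warmup scheduler does to the learning rate, here is a plain-Python sketch of linear warmup followed by linear decay. It mirrors the general shape of such schedules rather than reproducing any library's exact implementation. The function name and the step counts are assumptions for illustration.

```python
def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Learning rate with linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup period
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed values: base rate 5e-5, 200 warmup steps, 1,000 total steps
for step in (0, 100, 200, 600, 1000):
    print(step, lr_at_step(step, 5e-5, 200, 1000))
```

With a schedule of this shape, the total step count has to exceed the warmup length. If it is smaller, the rate falls to zero as soon as warmup ends.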

Example: Training Parameter Configuration

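import torch
from torch.optim import AdamW
from transformers import LlamaForCausalLM, get_linear_schedule_with_warmup

# Hugging Face access token for the gated Llama 2 repository (placeholder value)
token = "your_token"

# Load the model. Note that from_pretrained loads Meta's pretrained weights;
# to start from random weights instead, build the model from a LlamaConfig.
model = LlamaForCausalLM.from_pretrained('meta-llama/Llama-2-7b-chat-hf', token=token)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Training parameters
epochs = 3
batch_size = 8
learning_rate = 5e-5
warmup_steps = 200

# The scheduler counts optimizer steps, not epochs. The DataLoader is only
# built later, so steps_per_epoch is a placeholder: set it to len(train_loader).
steps_per_epoch = 1000
total_steps = epochs * steps_per_epoch

# Set the optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
print("Training parameters set.")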

Output

Training parameters set.

Batch Data Loader

Training requires the data in batches. PyTorch's DataLoader makes this straightforward.

import json

import torch
from torch.utils.data import DataLoader, Dataset

# Custom dataset class: splits the token stream into fixed-length blocks
class TextDataset(Dataset):
    def __init__(self, tokenized_text, block_size):
        self.data = tokenized_text
        self.block_size = block_size

    def __len__(self):
        return len(self.data) // self.block_size

    def __getitem__(self, idx):
        start = idx * self.block_size
        return torch.tensor(self.data[start : start + self.block_size], dtype=torch.long)

# The token file holds a plain list of integers, so parse it as JSON instead of calling eval
with open("/tokenized_text.txt", 'r') as f:
    tokens = json.load(f)

# DataLoader definition (each block holds batch_size tokens, as in the original chunking)
train_data = TextDataset(tokens, block_size=batch_size)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

print(f"DataLoader created with batch size {batch_size}.")

Output

DataLoader created with batch size 8.
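
To make the batching concrete, the following plain-Python sketch shows how a flat token list can be cut into fixed-length blocks and then grouped into batches. It stands in for what a Dataset and a DataLoader do together. The helper names and the toy token values are hypothetical.

```python
def make_blocks(tokens, block_size):
    """Split a flat token list into full blocks, dropping the remainder."""
    usable = len(tokens) - len(tokens) % block_size
    return [tokens[i:i + block_size] for i in range(0, usable, block_size)]


def make_batches(blocks, batch_size):
    """Group blocks into batches; the last batch may be smaller."""
    return [blocks[i:i + batch_size] for i in range(0, len(blocks), batch_size)]


tokens = list(range(22))            # Toy stand-in for real token ids
blocks = make_blocks(tokens, 4)     # 5 blocks of 4 tokens; 2 tokens dropped
batches = make_batches(blocks, 2)   # 3 batches holding 2, 2, and 1 blocks
print(len(blocks), [len(b) for b in batches])
```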

Now that the training setup and the data loading pipeline are in place, it is time to move on to the actual training phase.

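Training the Model

All of this preparation comes together in the training loop. Training simply feeds the dataset to the model in batches and uses the loss function to update the model's parameters.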

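Running the Training Loop

This is the stage where the preparation is put into practice. The data is fed to the model batch by batch, and after each batch the model's parameters are updated according to the loss.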

import torch
import tqdm

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.train()  # Switch to training mode (the call parentheses are required)
    total_loss = 0.0
    for batch in tqdm.tqdm(train_loader):
        # Convert each sequence to a tensor and right-pad to the longest one
        batch = [torch.as_tensor(sub_batch, device=device) for sub_batch in batch]
        max_len = max(len(seq) for seq in batch)
        padded_batch = torch.zeros((len(batch), max_len), dtype=torch.long, device=device)
        for i, seq in enumerate(batch):
            padded_batch[i, :len(seq)] = seq

        # Forward pass: for causal language modeling the inputs double as the labels
        outputs = model(padded_batch, labels=padded_batch)
        loss = outputs.loss

        # Backward pass
        optimizer.zero_grad()  # Reset gradients.
        loss.backward()  # Calculate gradients.
        optimizer.step()  # Update model parameters.
        scheduler.step()  # Update learning rate.

        total_loss += loss.item()  # Accumulate loss.

    # total_loss is the sum of the per-batch losses over the epoch
    print(f"Epoch {epoch + 1} completed. Loss: {total_loss:.4f}")

Output

Epoch 1 completed. Loss: 424.4011
Epoch 2 completed. Loss: 343.4245
Epoch 3 completed. Loss: 328.7054
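
The values printed above are sums of per-batch losses, so they grow with the number of batches. Dividing by the batch count gives a mean loss that is easier to compare across runs, and exponentiating a mean cross-entropy loss gives perplexity. A small sketch follows. The batch count of 100 is an assumed figure, since the tutorial does not state how many batches each epoch contained, and the function name is hypothetical.

```python
import math


def summarize_epoch(total_loss, num_batches):
    """Return (mean loss per batch, perplexity) from a summed epoch loss."""
    mean_loss = total_loss / num_batches
    return mean_loss, math.exp(mean_loss)


# 424.4011 is the epoch 1 total from the output; 100 batches is an assumption
mean_loss, perplexity = summarize_epoch(424.4011, 100)
print(f"mean loss {mean_loss:.4f}, perplexity {perplexity:.2f}")
```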

Saving the Model

Once training is complete, save the model; otherwise you would have to retrain it every time you want to use it.

# Save the trained model
model.save_pretrained('trained_Llama_model')
print("Model saved successfully.")

Output

Model saved successfully.

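We have now trained the model and saved it. The saved model can be used to predict the next tokens of new text, which the following chapters cover in detail.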
