Biopython - 序列 I/O 操作

Biopython 提供了一個模組 Bio.SeqIO，分別用於從檔案（任何流）讀取和寫入序列。它支援生物資訊學中幾乎所有可用的檔案格式。大多數軟體針對不同的檔案格式提供不同的方法。但是，Biopython 有意識地遵循單一方法，透過其 SeqRecord 物件向用戶呈現解析後的序列資料。

讓我們在下一節中進一步瞭解 SeqRecord。

SeqRecord

Bio.SeqRecord 模組提供 SeqRecord 來儲存序列的元資訊以及序列資料本身，如下所示：

seq - 它是實際的序列。
id - 它是給定序列的主識別符號。預設型別為字串。
name - 它是序列的名稱。預設型別為字串。
description - 它顯示關於序列的人類可讀資訊。
annotations - 它是關於序列的其他資訊的字典。

SeqRecord 可以按如下所示匯入

from Bio.SeqRecord import SeqRecord

讓我們在接下來的章節中瞭解使用實際序列檔案解析序列檔案的細微差別。

解析序列檔案格式

本節說明如何解析兩種最流行的序列檔案格式，FASTA 和 GenBank。

FASTA

FASTA 是用於儲存序列資料的最基本的檔案格式。最初，FASTA 是一種用於 DNA 和蛋白質序列比對的軟體包，在生物資訊學的早期發展過程中開發，主要用於搜尋序列相似性。

Biopython 提供了一個示例 FASTA 檔案，可以在 https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta 訪問。

下載並將此檔案儲存到您的 Biopython 示例目錄中，命名為“orchid.fasta”。

Bio.SeqIO 模組提供 parse() 方法來處理序列檔案，可以按如下方式匯入：

from Bio.SeqIO import parse

parse() 方法包含兩個引數，第一個是檔案控制代碼，第二個是檔案格式。

>>> file = open('path/to/biopython/sample/orchid.fasta') 
>>> for record in parse(file, "fasta"): 
...    print(record.id) 
... 
gi|2765658|emb|Z78533.1|CIZ78533 
gi|2765657|emb|Z78532.1|CCZ78532 
.......... 
.......... 
gi|2765565|emb|Z78440.1|PPZ78440 
gi|2765564|emb|Z78439.1|PBZ78439 
>>>

這裡，parse() 方法返回一個可迭代物件，每次迭代都返回 SeqRecord。作為可迭代物件，它提供了許多複雜且簡單的方法，讓我們看看其中的一些功能。

next()

next() 方法返回可迭代物件中可用的下一個專案，我們可以使用它來獲取第一個序列，如下所示：

>>> first_seq_record = next(SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta')) 
>>> first_seq_record.id 'gi|2765658|emb|Z78533.1|CIZ78533' 
>>> first_seq_record.name 'gi|2765658|emb|Z78533.1|CIZ78533' 
>>> first_seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()) 
>>> first_seq_record.description 'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' 
>>> first_seq_record.annotations 
{} 
>>>

這裡，seq_record.annotations 為空，因為 FASTA 格式不支援序列註釋。

列表推導式

我們可以使用列表推導式將可迭代物件轉換為列表，如下所示

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> all_seq = [seq_record for seq_record in seq_iter] >>> len(all_seq) 
94 
>>>

這裡，我們使用了 len 方法來獲取總數。我們可以獲取長度最大的序列，如下所示：

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> max_seq = max(len(seq_record.seq) for seq_record in seq_iter) 
>>> max_seq 
789 
>>>

我們也可以使用以下程式碼過濾序列：

>>> seq_iter = SeqIO.parse(open('path/to/biopython/sample/orchid.fasta'),'fasta') 
>>> seq_under_600 = [seq_record for seq_record in seq_iter if len(seq_record.seq) < 600] 
>>> for seq in seq_under_600: 
...    print(seq.id) 
... 
gi|2765606|emb|Z78481.1|PIZ78481 
gi|2765605|emb|Z78480.1|PGZ78480 
gi|2765601|emb|Z78476.1|PGZ78476 
gi|2765595|emb|Z78470.1|PPZ78470 
gi|2765594|emb|Z78469.1|PHZ78469 
gi|2765564|emb|Z78439.1|PBZ78439 
>>>

將 SqlRecord 物件（解析後的資料）集合寫入檔案就像呼叫 SeqIO.write 方法一樣簡單，如下所示：

file = open("converted.fasta", "w) 
SeqIO.write(seq_record, file, "fasta")

此方法可以有效地用於轉換如下所示的格式：

file = open("converted.gbk", "w) 
SeqIO.write(seq_record, file, "genbank")

GenBank

它是一種更豐富的基因序列格式，包含各種註釋的欄位。Biopython 提供了一個示例 GenBank 檔案，可以在 https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.fasta 訪問。

下載並將檔案儲存到您的 Biopython 示例目錄中，命名為“orchid.gbk”

由於 Biopython 提供了一個單一函式 parse 來解析所有生物資訊學格式，因此解析 GenBank 格式就像在 parse 方法中更改格式選項一樣簡單。

相應的程式碼如下所示：

>>> from Bio import SeqIO 
>>> from Bio.SeqIO import parse 
>>> seq_record = next(parse(open('path/to/biopython/sample/orchid.gbk'),'genbank')) 
>>> seq_record.id 
'Z78533.1' 
>>> seq_record.name 
'Z78533' 
>>> seq_record.seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) 
>>> seq_record.description 
'C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA' 
>>> seq_record.annotations {
   'molecule_type': 'DNA', 
   'topology': 'linear', 
   'data_file_division': 'PLN', 
   'date': '30-NOV-2006', 
   'accessions': ['Z78533'], 
   'sequence_version': 1, 
   'gi': '2765658', 
   'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 
   'source': 'Cypripedium irapeanum', 
   'organism': 'Cypripedium irapeanum', 
   'taxonomy': [
      'Eukaryota', 
      'Viridiplantae', 
      'Streptophyta', 
      'Embryophyta', 
      'Tracheophyta', 
      'Spermatophyta', 
      'Magnoliophyta', 
      'Liliopsida', 
      'Asparagales', 
      'Orchidaceae', 
      'Cypripedioideae', 
      'Cypripedium'], 
   'references': [
      Reference(title = 'Phylogenetics of the slipper orchids (Cypripedioideae:
      Orchidaceae): nuclear rDNA ITS sequences', ...), 
      Reference(title = 'Direct Submission', ...)
   ]
}

列印頁面