Biopython - Entrez Database

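Entrez is an online search system provided by NCBI. It gives access to nearly all known molecular biology databases through an integrated global query that supports Boolean operators and field search. It returns results from all the databases, with information such as the number of hits in each database and records that link back to the originating database.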

Some of the popular databases that can be accessed through Entrez are listed below:

PubMed
PubMed Central
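Nucleotide (GenBank sequence database)
Protein (protein sequence database)
Genome (whole genome database)
Structure (three-dimensional macromolecular structures)
Taxonomy (organisms in GenBank)
SNP (single nucleotide polymorphisms)
UniGene (gene-oriented clusters of transcript sequences)
CDD (Conserved Protein Domain Database)
3D Domains (domains from Entrez Structure)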

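In addition to the databases above, Entrez provides many more databases on which a field search can be performed.

Biopython provides an Entrez-specific module, Bio.Entrez, to access the Entrez databases. Let us learn how to access Entrez using Biopython in this section: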

Database Connection Steps

To use the Entrez features, import the following module:

>>> from Bio import Entrez

Next, set your email address so that NCBI can identify who is connecting, as in the code given below:

>>> Entrez.email = '<youremail>'

Then, set the Entrez tool parameter, which names the calling program; by default, it is Biopython.

>>> Entrez.tool = 'Demoscript'
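The email and tool values travel with each request as query parameters. The following is a minimal sketch of that idea using only the standard library. The helper build_eutils_url is hypothetical and is not part of Bio.Entrez, and the base address and the lowercase default tool name are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, email, tool="biopython", **params):
    """Hypothetical helper: compose the request URL for one E-utility."""
    query = dict(params)      # utility-specific parameters such as db and term
    query["tool"] = tool      # identifies the calling program
    query["email"] = email    # identifies the person running it
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(query)


url = build_eutils_url("esearch", "you@example.org", tool="Demoscript",
                       db="pubmed", term="genome")
print(url)
```

Bio.Entrez does this bookkeeping itself once Entrez.email and Entrez.tool are set, so the helper is only meant to show where the two values end up.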

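Now, call the einfo function. Called with a db argument, it reports the index term counts, last update, and available links for that database; called without arguments, as shown below, it returns the list of available database names: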

>>> info = Entrez.einfo()

The einfo function returns a handle, and the response can be read through its read method as shown below:

>>> data = info.read() 
>>> print(data) 
<?xml version = "1.0" encoding = "UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" 
   "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd"> 
<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName> 
      <DbName>protein</DbName>
      <DbName>nuccore</DbName> 
      <DbName>ipg</DbName> 
      <DbName>nucleotide</DbName>
      <DbName>nucgss</DbName> 
      <DbName>nucest</DbName>
      <DbName>structure</DbName>
      <DbName>sparcle</DbName>
      <DbName>genome</DbName>
      <DbName>annotinfo</DbName>
      <DbName>assembly</DbName> 
      <DbName>bioproject</DbName>
      <DbName>biosample</DbName>
      <DbName>blastdbinfo</DbName>
      <DbName>books</DbName> 
      <DbName>cdd</DbName>
      <DbName>clinvar</DbName> 
      <DbName>clone</DbName> 
      <DbName>gap</DbName> 
      <DbName>gapplus</DbName> 
      <DbName>grasp</DbName> 
      <DbName>dbvar</DbName>
      <DbName>gene</DbName> 
      <DbName>gds</DbName> 
      <DbName>geoprofiles</DbName>
      <DbName>homologene</DbName> 
      <DbName>medgen</DbName> 
      <DbName>mesh</DbName>
      <DbName>ncbisearch</DbName> 
      <DbName>nlmcatalog</DbName>
      <DbName>omim</DbName>
      <DbName>orgtrack</DbName>
      <DbName>pmc</DbName>
      <DbName>popset</DbName>
      <DbName>probe</DbName>
      <DbName>proteinclusters</DbName>
      <DbName>pcassay</DbName>
      <DbName>biosystems</DbName> 
      <DbName>pccompound</DbName> 
      <DbName>pcsubstance</DbName> 
      <DbName>pubmedhealth</DbName> 
      <DbName>seqannot</DbName> 
      <DbName>snp</DbName> 
      <DbName>sra</DbName> 
      <DbName>taxonomy</DbName> 
      <DbName>biocollections</DbName> 
      <DbName>unigene</DbName>
      <DbName>gencoll</DbName> 
      <DbName>gtr</DbName>
   </DbList> 
</eInfoResult>

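The data is in XML format. To get it as a Python object instead, pass the handle returned by Entrez.einfo() directly to Entrez.read, without calling the handle's read method first: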

>>> info = Entrez.einfo() 
>>> record = Entrez.read(info)

Here, record is a dictionary with a single key, DbList, as shown below:

>>> list(record.keys())
['DbList']

Accessing the DbList key returns the list of database names, as shown below:

>>> record['DbList']
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 
   'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 
   'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 
   'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 
   'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 
   'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 
   'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 
   'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr'] 
>>>

Basically, the Entrez module parses the XML returned by the Entrez search system and provides it as Python dictionaries and lists.
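To make the XML-to-dictionary step concrete, here is a minimal sketch that converts a shortened einfo response into the same shape as the record above. It uses the standard library's xml.etree.ElementTree and a hypothetical helper, parse_db_list; it only illustrates the idea and is not how Bio.Entrez is implemented.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the einfo response (three databases only).
EINFO_XML = """<eInfoResult>
   <DbList>
      <DbName>pubmed</DbName>
      <DbName>protein</DbName>
      <DbName>nuccore</DbName>
   </DbList>
</eInfoResult>"""


def parse_db_list(xml_text):
    """Hypothetical helper: map an einfo-style document to {"DbList": [...]}."""
    root = ET.fromstring(xml_text)
    # Collect the text of every DbName element under DbList, in document order.
    names = [node.text.strip() for node in root.findall("./DbList/DbName")]
    return {"DbList": names}


record = parse_db_list(EINFO_XML)
print(record["DbList"])
```

Entrez.read handles the general case, including nested elements and attributes, so a hand-written parser like this one is rarely needed in practice.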

Searching the Database

To search any one of the Entrez databases, we can use the Bio.Entrez.esearch() function. It is used as shown below:

>>> info = Entrez.esearch(db = "pubmed", term = "genome")
>>> record = Entrez.read(info)
>>> print(record)
DictElement({u'Count': '1146113', u'RetMax': '20', u'IdList':
['30347444', '30347404', '30347317', '30347292', 
'30347286', '30347249', '30347194', '30347187', 
'30347172', '30347088', '30347075', '30346992', 
'30346990', '30346982', '30346980', '30346969', 
'30346962', '30346954', '30346941', '30346939'], 
u'TranslationStack': [DictElement({u'Count': 
'927819', u'Field': 'MeSH Terms', u'Term': '"genome"[MeSH Terms]', 
u'Explode': 'Y'}, attributes = {})
, DictElement({u'Count': '422712', u'Field': 
'All Fields', u'Term': '"genome"[All Fields]', u'Explode': 'N'}, attributes = {}), 
'OR', 'GROUP'], u'TranslationSet': [DictElement({u'To': '"genome"[MeSH Terms] 
OR "genome"[All Fields]', u'From': 'genome'}, attributes = {})], u'RetStart': '0', 
u'QueryTranslation': '"genome"[MeSH Terms] OR "genome"[All Fields]'}, 
attributes = {})
>>>

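If the search term cannot be found in the specified database, the returned record has a count of zero and carries warning and error information. In the example below, the term books is not found in the blastdbinfo database: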

>>> info = Entrez.esearch(db = "blastdbinfo",term = "books")
>>> record = Entrez.read(info) 
>>> print(record) 
DictElement({u'Count': '0', u'RetMax': '0', u'IdList': [], 
u'WarningList': DictElement({u'OutputMessage': ['No items found.'], 
   u'PhraseIgnored': [], u'QuotedPhraseNotFound': []}, attributes = {}), 
   u'ErrorList': DictElement({u'FieldNotFound': [], u'PhraseNotFound': 
      ['books']}, attributes = {}), u'TranslationSet': [], u'RetStart': '0', 
      u'QueryTranslation': '(books[All Fields])'}, attributes = {})
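Both kinds of result can be handled by the same code. The sketch below uses plain dictionaries as stand-ins for the two records printed above (with IdList shortened) and a hypothetical helper, summarize_search, to extract the hit count, the identifiers, and any phrases that were not found. It assumes that the keys shown in the outputs above are the ones present.

```python
def summarize_search(record):
    """Hypothetical helper: pull the most useful fields out of an esearch record."""
    errors = record.get("ErrorList", {})  # only present when something went wrong
    return {
        "count": int(record["Count"]),    # counts arrive as strings
        "ids": list(record["IdList"]),
        "not_found": list(errors.get("PhraseNotFound", [])),
    }


# Plain dicts standing in for the parsed records shown above.
hit = {"Count": "1146113", "RetMax": "20", "IdList": ["30347444", "30347404"]}
miss = {"Count": "0", "RetMax": "0", "IdList": [],
        "ErrorList": {"FieldNotFound": [], "PhraseNotFound": ["books"]}}

for rec in (hit, miss):
    print(summarize_search(rec))
```

Because the parsed record behaves like a dictionary, the same helper should accept the object returned by Entrez.read unchanged.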

If you want to search across databases, you can use Entrez.egquery. It is similar to Entrez.esearch, except that you specify only the keyword and skip the database parameter.

>>> info = Entrez.egquery(term = "entrez")
>>> record = Entrez.read(info)
>>> for row in record["eGQueryResult"]:
...    print(row["DbName"], row["Count"])
...
pubmed 458
pmc 12779
mesh 1
... 
... 
... 
biosample 7 
biocollections 0
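The per-database counts are returned as strings, so they need converting before they can be compared. The sketch below ranks databases by hit count, using a plain list of dictionaries that copies the rows printed above. The helper rank_databases is hypothetical, and it assumes that every Count value is a numeric string.

```python
# Stand-in for record["eGQueryResult"], copied from the rows printed above.
rows = [
    {"DbName": "pubmed", "Count": "458"},
    {"DbName": "pmc", "Count": "12779"},
    {"DbName": "mesh", "Count": "1"},
    {"DbName": "biosample", "Count": "7"},
    {"DbName": "biocollections", "Count": "0"},
]


def rank_databases(rows):
    """Hypothetical helper: (name, count) pairs with hits, largest count first."""
    counts = [(row["DbName"], int(row["Count"])) for row in rows]
    hits = [pair for pair in counts if pair[1] > 0]  # drop empty databases
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


ranked = rank_databases(rows)
for name, count in ranked:
    print(name, count)
```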

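Fetching Records

Entrez provides a special method, efetch, to download the full details of a record from Entrez, given its identifier. Consider the following simple example: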

>>> handle = Entrez.efetch(
   db = "nucleotide", id = "EU490707", rettype = "fasta", retmode = "text")

Now, we can simply read the record using SeqIO.read:

>>> from Bio import SeqIO
>>> record = SeqIO.read(handle, "fasta")
>>> record 
SeqRecord(seq = Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', 
SingleLetterAlphabet()), id = 'EU490707.1', name = 'EU490707.1', 
description = 'EU490707.1 
Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast', 
dbxrefs = [])
