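Python Digital Network Forensics-II

The previous chapter dealt with some of the concepts of network forensics using Python. In this chapter, let us look at network forensics with Python in more depth.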

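Web Page Preservation with Beautiful Soup

The World Wide Web (WWW) is a unique information resource. However, its legacy is at high risk because content is being lost at an alarming rate. A number of cultural heritage and academic institutions, non-profit organizations, and private businesses have explored the issues involved and contributed to the development of technical solutions for web archiving.

Web page preservation, or web archiving, is the process of gathering data from the World Wide Web, ensuring that the data is preserved in an archive, and making it available to future researchers, historians, and the public. Before proceeding further, let us discuss some important issues related to web page preservation: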

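Change in web resources - Web resources change every day, which makes preservation a challenge.
Large quantity of resources - Another issue is the sheer quantity of resources that need to be preserved.
Integrity - Web pages must be protected from unauthorized modification, deletion, or removal to preserve their integrity.
Dealing with multimedia data - Preserving web pages also means handling multimedia data, which can cause problems during processing.
Providing access - Besides preservation, issues such as providing access to web resources and dealing with ownership also need to be solved.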

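In this chapter, we will use a Python library named Beautiful Soup for web page preservation.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It cannot fetch a web page by itself; it needs an input (a document or the content of a URL) to create a soup object, so it is typically used together with urllib. You can learn more at www.crummy.com/software/BeautifulSoup/bs4/doc/.

Note that before using it, we must install the third-party library with the following command: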

pip install bs4

Alternatively, using the Anaconda package manager, we can install Beautiful Soup as follows:

conda install -c anaconda beautifulsoup4

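Python Script for Preserving Web Pages

The Python script for preserving web pages with the third-party library Beautiful Soup is discussed here:

First, import the required libraries as follows: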

from __future__ import print_function
import argparse

from bs4 import BeautifulSoup, SoupStrainer
from datetime import datetime

import hashlib
import logging
import os
import ssl
import sys
from urllib.request import urlopen

import urllib.error
logger = logging.getLogger(__name__)

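Note that this script takes two positional arguments, the URL to be preserved and the desired output directory, plus an optional -l argument for the log file path, as shown below: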

if __name__ == "__main__":
   parser = argparse.ArgumentParser('Web Page preservation')
   parser.add_argument("DOMAIN", help="Website Domain")
   parser.add_argument("OUTPUT_DIR", help="Preservation Output Directory")
   parser.add_argument("-l", help="Log file path",
   default=__file__[:-3] + ".log")
   args = parser.parse_args()
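To see how these arguments are parsed, the sketch below rebuilds the same parser and feeds it a sample argument list instead of the real command line. The build_parser() helper, the sample URL, the directory name, and the fixed default log name are illustrative assumptions and not part of the original script.

```python
import argparse

def build_parser():
    # Hypothetical helper mirroring the parser defined above.
    # The default log name is a placeholder so the example is self-contained.
    parser = argparse.ArgumentParser("Web Page preservation")
    parser.add_argument("DOMAIN", help="Website Domain")
    parser.add_argument("OUTPUT_DIR", help="Preservation Output Directory")
    parser.add_argument("-l", help="Log file path", default="preserve.log")
    return parser

# Passing a list to parse_args() bypasses sys.argv, which is handy for testing.
args = build_parser().parse_args(["https://example.com", "out_dir"])
print(args.DOMAIN, args.OUTPUT_DIR, args.l)
```

On the command line, the equivalent call would pass the URL first and the output directory second, with -l supplied only when a different log path is wanted.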

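Now, set up logging for the script by attaching a file handler and a stream handler, so that the acquisition process is recorded both in the log file and on stderr, as shown below: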

logger.setLevel(logging.DEBUG)
msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-10s""%(levelname)-8s %(message)s")
strhndl = logging.StreamHandler(sys.stderr)
strhndl.setFormatter(fmt=msg_fmt)
fhndl = logging.FileHandler(args.l, mode='a')
fhndl.setFormatter(fmt=msg_fmt)

logger.addHandler(strhndl)
logger.addHandler(fhndl)
logger.info("Starting BS Preservation")
logger.debug("Supplied arguments: {}".format(sys.argv[1:]))
logger.debug("System " + sys.platform)
logger.debug("Version " + sys.version)
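The effect of the format string is easier to see on a captured record. The following minimal sketch sends log output to an in-memory buffer instead of stderr or a file; the logger name, the fetch() function, and the omission of the %(asctime)s field (left out so the output is reproducible) are assumptions made for illustration.

```python
import io
import logging

demo_logger = logging.getLogger("preservation_demo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo output out of the root logger

# A StreamHandler accepts any file-like object, here an in-memory buffer.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter("%(funcName)-10s %(levelname)-8s %(message)s"))
demo_logger.addHandler(handler)

def fetch():
    # %(funcName)s records the name of the function that made the call.
    demo_logger.info("Starting BS Preservation")

fetch()
captured = buffer.getvalue()
print(captured, end="")
```

The -10s and -8s specifiers left-align and pad the function name and level so that messages line up in columns.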

Now, let us validate the desired output directory, creating it if it does not exist, and then call main():

if not os.path.exists(args.OUTPUT_DIR):
   os.makedirs(args.OUTPUT_DIR)
main(args.DOMAIN, args.OUTPUT_DIR)

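Now, we will define the main() function, which extracts the base name of the website by removing the unnecessary elements before the actual name and performs additional validation on the input URL, as follows: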

def main(website, output_dir):
   base_name = website.replace("https://", "").replace("http://", "").replace("www.", "")
   link_queue = set()
   
   if "http://" not in website and "https://" not in website:
      logger.error("Exiting preservation - invalid user input: {}".format(website))
      sys.exit(1)
   logger.info("Accessing {} webpage".format(website))
   context = ssl._create_unverified_context()
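The string handling in main() can be tried in isolation. In this sketch, base_name_of() and has_scheme() are hypothetical helper names that wrap the same expressions used above, and the URLs are sample values.

```python
def base_name_of(website):
    # Same chained replace() calls as in main().
    return (website.replace("https://", "")
                   .replace("http://", "")
                   .replace("www.", ""))

def has_scheme(website):
    # main() exits when neither scheme string appears in the input.
    return "http://" in website or "https://" in website

print(base_name_of("https://www.example.com"))
print(has_scheme("www.example.com"))
```

Because these are substring operations, replace("www.", "") removes that text wherever it occurs, and the scheme test only checks that the scheme appears somewhere in the string. A stricter variant could use str.startswith() instead.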

Now, we need to open a connection to the URL using the urlopen() method. Let us use a try-except block as follows:

try:
   index = urlopen(website, context=context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   logger.error("Exiting preservation - unable to access page: {}".format(website))
   sys.exit(2)
logger.debug("Successfully accessed {}".format(website))
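The block above catches only urllib.error.HTTPError. As a hedged variation, the sketch below catches urllib.error.URLError instead, which is the parent class of HTTPError and also covers failures such as unreachable hosts. The fetch_text() helper is a hypothetical name, and the data: URL is used only so the example runs without network access.

```python
import urllib.error
from urllib.request import urlopen

def fetch_text(url, context=None):
    """Return the decoded body of url, or None if it cannot be retrieved."""
    try:
        return urlopen(url, context=context).read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so both are handled here.
        print("Unable to access {}: {}".format(url, e))
        return None

# A data: URL keeps the demonstration offline.
body = fetch_text("data:text/plain;charset=utf-8,hello")
print(body)
```

In the script itself, the same call would receive the website URL and the SSL context created in main().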

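The next few lines of code call three functions, as described below:

write_output() writes the first web page to the output directory
find_links() identifies the links on this web page
recurse_pages() iterates through and discovers all the links on the website.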

write_output(website, index, output_dir)
link_queue = find_links(base_name, index, link_queue)
logger.info("Found {} initial links on webpage".format(len(link_queue)))
recurse_pages(website, link_queue, context, output_dir)
logger.info("Completed preservation of {}".format(website))

Now, let us define the write_output() method as follows:

def write_output(name, data, output_dir, counter=0):
   name = name.replace("http://", "").replace("https://", "").rstrip("//")
   directory = os.path.join(output_dir, os.path.dirname(name))
   
   if not os.path.exists(directory) and os.path.dirname(name) != "":
      os.makedirs(directory)

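We need to log some details about the web page, and then we log the hash of the data using the hash_data() method. After the file is written, hash_file() logs the hash of the output file, as follows: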

logger.debug("Writing {} to {}".format(name, output_dir))
logger.debug("Data Hash: {}".format(hash_data(data)))
path = os.path.join(output_dir, name)
path = path + "_" + str(counter)
with open(path, "w") as outfile:
   outfile.write(data)
logger.debug("Output File Hash: {}".format(hash_file(path)))
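To make the naming scheme concrete, the sketch below reproduces only the path computation from write_output(). The output_path() helper and the sample URLs are assumptions for illustration; nothing is written to disk.

```python
import os

def output_path(name, output_dir, counter=0):
    # Hypothetical helper reproducing the path logic of write_output():
    # strip the scheme and trailing slashes, then append "_<counter>".
    name = name.replace("http://", "").replace("https://", "").rstrip("/")
    return os.path.join(output_dir, name) + "_" + str(counter)

path = output_path("https://example.com/docs/index.html", "out", 3)
print(path)
```

The counter suffix distinguishes copies of a page written during different passes over the link queue.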

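Now, define the hash_data() method, which encodes the data as UTF-8 and generates its SHA-256 hash, together with hash_file(), which generates the SHA-256 hash of a file on disk, as follows: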

def hash_data(data):
   sha256 = hashlib.sha256()
   sha256.update(data.encode("utf-8"))
   return sha256.hexdigest()
def hash_file(file):
   sha256 = hashlib.sha256()
   with open(file, "rb") as in_file:
      sha256.update(in_file.read())
   return sha256.hexdigest()
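The two hashes are only comparable if the bytes on disk equal the UTF-8 encoding of the data. The following sketch checks this with a temporary file; the sample page content is made up, the chunked read is a variation on the single read() above, and passing encoding="utf-8" and newline="" to open() is an assumption that goes beyond the original write_output().

```python
import hashlib
import os
import tempfile

def hash_data(data):
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

def hash_file(path):
    sha256 = hashlib.sha256()
    with open(path, "rb") as in_file:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: in_file.read(65536), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

page = "<html>\n<body>caf\u00e9</body>\n</html>\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page_0")
    # Explicit encoding and newline="" keep the bytes on disk identical
    # to page.encode("utf-8") on every platform.
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(page)
    matches = hash_data(page) == hash_file(path)
print(matches)
```

Without those open() arguments, a platform default encoding or newline translation could make the output file hash differ from the data hash even though nothing was tampered with.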

Now, let us create a BeautifulSoup object from the web page data in the find_links() method, as follows:

def find_links(website, page, queue):
   for link in BeautifulSoup(page, "html.parser",parse_only = SoupStrainer("a", href = True)):
      if website in link.get("href"):
         if not os.path.basename(link.get("href")).startswith("#"):
            queue.add(link.get("href"))
   return queue
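The two conditions inside find_links() decide which href values enter the queue. This sketch applies the same tests to a plain list of strings, without any HTML parsing; should_queue() is a hypothetical helper and the href values are sample data.

```python
import os

def should_queue(base_name, href):
    # Same two tests that find_links() applies to each href value:
    # the link must mention the site, and its last path component
    # must not be a bare fragment such as "#team".
    return base_name in href and not os.path.basename(href).startswith("#")

hrefs = [
    "https://example.com/about",        # kept
    "https://example.com/about/#team",  # dropped: basename is "#team"
    "https://other.org/page",           # dropped: different site
    "/contact",                         # dropped: no site name in the href
]
queue = {h for h in hrefs if should_queue("example.com", h)}
print(queue)
```

One consequence worth noting is that relative links such as /contact are skipped, because they do not contain the site name.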

Now, we need to define the recurse_pages() method, passing it the website URL, the current link queue, the unverified SSL context, and the output directory, as follows:

def recurse_pages(website, queue, context, output_dir):
   processed = []
   counter = 0

   while True:
      counter += 1
      if len(processed) == len(queue):
         break
      for link in queue.copy():
         if link in processed:
            continue
         processed.append(link)
         try:
            page = urlopen(link, context=context).read().decode("utf-8")
         except urllib.error.HTTPError as e:
            msg = "Error accessing webpage: {}".format(link)
            logger.error(msg)
            continue

Now, write the output of each web page accessed to a file by passing the link name, page data, output directory, and the counter, as follows:

write_output(link, page, output_dir, counter)
queue = find_links(website, page, queue)
logger.info("Identified {} links throughout website".format(
   len(queue)))
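The loop ends once a full pass adds no unprocessed links to the queue. The sketch below imitates that behaviour on an in-memory site so it can run offline; the crawl() function, the site dictionary, and the paths in it are assumptions for illustration and stand in for urlopen() and find_links().

```python
def crawl(start_links, site):
    """site maps each URL to the set of links found on that page."""
    queue = set(start_links)
    processed = []
    passes = 0
    # Stop when every queued link has been processed.
    while len(processed) != len(queue):
        passes += 1
        # Iterate over a copy because the queue grows during the pass.
        for link in queue.copy():
            if link in processed:
                continue
            processed.append(link)
            queue |= site.get(link, set())
    return queue, passes

site = {
    "/a": {"/b"},
    "/b": {"/c"},
    "/c": {"/a"},  # a cycle: the set-based queue prevents revisiting
}
found, passes = crawl({"/a"}, site)
print(sorted(found), passes)
```

Using a set for the queue means a link discovered twice is stored once, which is what lets the length comparison serve as the stopping condition.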

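Now, when we run this script with the URL of the website, the output directory, and the path to the log file, we obtain preserved copies of the web pages, along with logged details such as their hashes, that can be used later.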

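Virus Hunting

Have you ever wondered how forensic analysts, security researchers, and incident responders tell useful software from malware? The answer lies in the question itself: without studying the malware that attackers produce so rapidly, it is nearly impossible for researchers and specialists to tell the two apart. In this section, let us discuss VirusShare, a resource that supports this task.

Understanding VirusShare

VirusShare is the largest privately owned collection of malware samples, providing security researchers, incident responders, and forensic analysts with samples of live malicious code. It contains over 30 million samples.

The benefit of VirusShare is its freely available list of malware hashes. Anybody can use these hashes to build a very comprehensive hash set and use it to identify potentially malicious files. Before using VirusShare, we suggest that you visit https://virusshare.com for more details.

Creating a Newline-Delimited Hash List from VirusShare Using Python

A hash list from VirusShare can be used by various forensic tools, such as X-Ways and EnCase. The script discussed below automates downloading the hash lists from VirusShare to create a newline-delimited hash list.

For this script, we need a third-party Python library named tqdm, which can be installed as follows: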

pip install tqdm

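Note that this script first reads the VirusShare hashes page and dynamically identifies the most recent hash list. It then initializes a progress bar and downloads the hash lists in the desired range.

First, import the following libraries: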

from __future__ import print_function

import argparse
import os
import ssl
import sys
import tqdm

from urllib.request import urlopen
import urllib.error

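This script takes one positional argument, the desired path for the hash set, and an optional --start argument giving the hash list number to begin from: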

if __name__ == '__main__':
   parser = argparse.ArgumentParser('Hash set from VirusShare')
   parser.add_argument("OUTPUT_HASH", help = "Output Hashset")
   parser.add_argument("--start", type = int, help = "Optional starting location")
   args = parser.parse_args()

Now, we perform standard input validation as follows:

directory = os.path.dirname(args.OUTPUT_HASH)
if directory and not os.path.exists(directory):
   os.makedirs(directory)
if args.start:
   main(args.OUTPUT_HASH, start=args.start)
else:
   main(args.OUTPUT_HASH)

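Now, we need to define the main() function with **kwargs as an argument; this creates a dictionary that we can consult to support the supplied keyword arguments, as shown below: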

def main(hashset, **kwargs):
   url = "https://virusshare.com/hashes.4n6"
   print("[+] Identifying hash set range from {}".format(url))
   context = ssl._create_unverified_context()
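As a small illustration of how **kwargs behaves, the sketch below reads an optional start value from the keyword dictionary. The resolve_start() function is a hypothetical name, and dict.get() is shown as one compact way to supply a default rather than as the script's own approach.

```python
def resolve_start(**kwargs):
    # kwargs is an ordinary dict holding whatever keyword arguments
    # the caller passed; get() returns 0 when "start" is absent.
    return kwargs.get("start", 0)

print(resolve_start())
print(resolve_start(start=25))
```

This is why the caller can invoke main() either with or without start and still get a defined starting point.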

Now, we need to open the VirusShare hashes page using the urllib.request.urlopen() method. We will use a try-except block as follows:

try:
   index = urlopen(url, context = context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   print("[-] Error accessing webpage - exiting..")
   sys.exit(1)

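Now, identify the latest hash list from the downloaded page. You can do this by finding the last occurrence of the HTML href tag pointing to a VirusShare hash list. It can be done with the following lines of code: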

tag = index.rfind(r'<a href="hashes/VirusShare_')
stop = int(index[tag + 27: tag + 27 + 5].lstrip("0"))

if "start" not in kwargs:
   start = 0
else:
   start = kwargs["start"]

if start < 0 or start > stop:
   print("[-] Supplied start argument must be greater than or equal "
         "to zero but less than the latest hash list, "
         "currently: {}".format(stop))
   sys.exit(2)
print("[+] Creating a hashset from hash lists {} to {}".format(start, stop))
hashes_downloaded = 0
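The offset 27 used when slicing is the length of the search string <a href="hashes/VirusShare_, so the slice picks up the five digits that follow it. The sketch below applies the same idea to a made-up fragment of markup; the sample HTML and the latest_list_number() helper are assumptions, and the layout of the live page may differ.

```python
PREFIX = '<a href="hashes/VirusShare_'

def latest_list_number(index_html):
    # rfind() returns the position of the LAST occurrence, or -1 if absent.
    tag = index_html.rfind(PREFIX)
    if tag == -1:
        raise ValueError("no hash list links found")
    begin = tag + len(PREFIX)
    # int() accepts leading zeros, so "00042" becomes 42.
    return int(index_html[begin:begin + 5])

sample = (
    '<a href="hashes/VirusShare_00000.md5">0</a>\n'
    '<a href="hashes/VirusShare_00001.md5">1</a>\n'
    '<a href="hashes/VirusShare_00042.md5">42</a>\n'
)
stop = latest_list_number(sample)
print(len(PREFIX), stop)
```

Checking for -1 guards against the case where the page layout changes and the search string is no longer found.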

Now, we will use the tqdm.trange() method to create a loop and progress bar as follows:

for x in tqdm.trange(start, stop + 1, unit_scale=True, desc="Progress"):
   url_hash = "https://virusshare.com/hashes/VirusShare_{}.md5".format(str(x).zfill(5))
   try:
      hashes = urlopen(url_hash, context=context).read().decode("utf-8")
      hashes_list = hashes.split("\n")
   except urllib.error.HTTPError as e:
      print("[-] Error accessing webpage for hash list {}"
            " - continuing..".format(x))
      continue
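Each hash list URL embeds the list number padded to five digits with zfill(). The sketch below builds a few such URLs without downloading anything; the hash_list_url() helper and the chosen range are illustrative assumptions.

```python
def hash_list_url(number):
    # zfill(5) left-pads the number with zeros, e.g. 8 -> "00008".
    return "https://virusshare.com/hashes/VirusShare_{}.md5".format(
        str(number).zfill(5))

urls = [hash_list_url(x) for x in range(8, 11)]
for url in urls:
    print(url)
```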

After the steps above succeed, we open the hash set text file in a+ mode to append to the bottom of the text file.

# Still inside the download loop: append the hashes from this list.
with open(hashset, "a+") as hashfile:
   for line in hashes_list:
      if not line.startswith("#") and line != "":
         hashes_downloaded += 1
         hashfile.write(line + '\n')
# After the download loop completes:
print("[+] Finished downloading {} hashes into {}".format(
   hashes_downloaded, hashset))
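The filtering step can be exercised on a small in-memory sample. In this sketch, the append_hashes() helper and the comment lines in the sample text are assumptions about what a downloaded list might look like; the two hash values are the well-known MD5 digests of the empty string and of "abc".

```python
import io

def append_hashes(hashes_text, hashfile):
    """Write non-comment, non-empty lines to hashfile; return the count."""
    written = 0
    for line in hashes_text.split("\n"):
        if not line.startswith("#") and line != "":
            hashfile.write(line + "\n")
            written += 1
    return written

sample = (
    "# sample header line\n"
    "# another comment\n"
    "d41d8cd98f00b204e9800998ecf8427e\n"
    "900150983cd24fb0d6963f7d28e17f72\n"
)
out = io.StringIO()  # stands in for the file opened in a+ mode
count = append_hashes(sample, out)
print(count)
print(out.getvalue(), end="")
```

Splitting on "\n" leaves an empty string after the final newline, which is why the empty-line test is needed alongside the comment test.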

After running the above script, you will have the latest hash list containing MD5 hash values in text format.
