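Working with PDF Files in Python
Python is a very versatile language, largely because it offers a large number of libraries for different needs. Portable Document Format (PDF) files are something nearly all of us use, and Python provides several ways to work with them. Here we will use the Python library PyPDF2.
PyPDF2 is a pure-Python PDF library that can split, merge, crop, and transform the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs as well as merge entire files together.
Because PyPDF2 can perform so many operations on PDFs, it works like a Swiss Army knife.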
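Getting Started
PyPDF2 is a third-party package rather than part of the Python standard library, so we need to install it first. The good news is that this is very simple: we can install it with pip. Just run the following command in your terminal: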
C:\Users\rajesh>pip install pypdf2
Collecting pypdf2
  Downloading https://files.pythonhosted.org/packages/b4/01/68fcc0d43daf4c6bdbc6b33cc3f77bda531c86b174cac56ef0ffdb96faab/PyPDF2-1.26.0.tar.gz (77kB)
    100% |████████████████████████████████| 81kB 83kB/s
Building wheels for collected packages: pypdf2
  Building wheel for pypdf2 (setup.py) ... done
  Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\53\84\19\35bc977c8bf5f0c23a8a011aa958acd4da4bbd7a229315c1b7
Successfully built pypdf2
Installing collected packages: pypdf2
Successfully installed pypdf2-1.26.0
To verify the installation, import PyPDF2 from the Python shell:
>>> import PyPDF2
>>>
The import returned without an error, so the installation succeeded.
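A slightly more informative check is to ask the packaging metadata which version is installed. This matters because the class and method names used in this article (PdfFileReader, getPage, and so on) are those of the 1.26.0 release installed above, and later major releases renamed them. The sketch below uses only the standard library (Python 3.8 or newer); the helper name installed_version is ours for illustration and is not part of PyPDF2.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report what is installed, or remind the reader how to install it.
found = installed_version("PyPDF2")
if found is None:
    print("PyPDF2 is not installed; run: pip install pypdf2")
else:
    print("PyPDF2 version:", found)
```

If the reported version is not 1.26.x, the examples in this article may need their class and method names adjusted.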
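Extracting Metadata
We can extract useful information from any PDF. For example, we can read the document's author, title, and subject, as well as the number of pages the file contains.
The following Python program uses the PyPDF2 package to extract this information from a PDF file.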
from PyPDF2 import PdfFileReader

def extract_pdfMeta(path):
    # Open the file in binary mode and keep it open while the reader is used
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Document information dictionary (author, title, ...) and page count
        info = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        print("Author: \t", info.author)
        print()
        print("Creator: \t", info.creator)
        print()
        print("Producer: \t", info.producer)
        print()
        print("Subject: \t", info.subject)
        print()
        print("title: \t", info.title)
        print()
        print("Number of Pages in pdf: \t", number_of_pages)

if __name__ == '__main__':
    path = 'DeepLearning.pdf'
    extract_pdfMeta(path)
Output
Author:          Nikhil Buduma,Nicholas Locascio

Creator:         AH CSS Formatter V6.2 MR4 for Linux64 : 6.2.6.18551 (2014/09/24 15:00JST)

Producer:        Antenna House PDF Output Library 6.2.609 (Linux64)

Subject:         None

title:   Fundamentals of Deep Learning

Number of Pages in pdf:          298
So, without opening the PDF in a viewer, we can obtain useful information about it.
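Printing is fine for a quick look, but a program usually wants the values back as data. The following is a minimal sketch of that variation, not the article's original code: metadata_to_dict and read_pdf_meta are hypothetical helper names, and the stand-in object at the bottom exists only so the snippet runs without a PDF file. The PyPDF2 calls are the same ones used in the program above.

```python
from types import SimpleNamespace

# Attribute names read by the program above.
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_to_dict(info, number_of_pages):
    """Copy the document-information attributes into a plain dict.

    `info` is any object exposing the attributes in FIELDS, such as the
    value returned by PdfFileReader.getDocumentInfo(). Missing attributes
    (or info being None) yield None for that field.
    """
    summary = {name: getattr(info, name, None) for name in FIELDS}
    summary["pages"] = number_of_pages
    return summary


def read_pdf_meta(path):
    """Open a PDF and return its metadata as a dict (needs PyPDF2 1.26.x)."""
    # Imported here so metadata_to_dict can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader

    with open(path, "rb") as f:
        pdf = PdfFileReader(f)
        return metadata_to_dict(pdf.getDocumentInfo(), pdf.getNumPages())


# Stand-in values for demonstration only; real ones come from read_pdf_meta(path).
sample = SimpleNamespace(author="Jane Doe", title="Example", subject=None)
print(metadata_to_dict(sample, 3))
```

With the file from this article, the call would be read_pdf_meta('DeepLearning.pdf').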
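Extracting Text from a PDF
We can also extract text from a PDF. Image extraction is a different matter: PyPDF2 1.26.0 provides no dedicated method for it.
Let's try extracting the text of a specific page from the PDF file used above. Note that getPage() takes a zero-based index, so getPage(50) returns the 51st page of the file.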
# Import PdfFileReader from PyPDF2
from PyPDF2 import PdfFileReader

def extract_pdfText(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Get the page at zero-based index 50 (the 51st page of the file)
        page = pdf.getPage(50)
        print(page)
        print('Page type: {}'.format(str(type(page))))
        # Extract the text of that page
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = 'DeepLearning.pdf'
    extract_pdfText(path)
Output
{'/Annots': IndirectObject(1421, 0),
 '/Contents': IndirectObject(179, 0),
 '/CropBox': [0, 0, 595.3, 841.9],
 '/Group': {'/CS': '/DeviceRGB', '/S': '/Transparency', '/Type': '/Group'},
 '/MediaBox': [0, 0, 504, 661.5],
 '/Parent': IndirectObject(4863, 0),
 '/Resources': IndirectObject(1423, 0),
 '/Rotate': 0,
 '/Type': '/Page'}
Page type: <class 'PyPDF2.pdf.PageObject'>
time. In inverted dropout, any neuron whose activation hasn†t been silenced has its output divided by p before the value is propagated to the next layer. With this fix, Eoutput=p⁄xp+1ƒ p⁄0= x, and we can avoid arbitrarily scaling neuronal output at test time. SummaryIn this chapter, we†ve learned all of the basics involved in training feed-forward neural networks. We†ve talked about gradient descent, the backpropagation algorithm, as well as various methods we can use to prevent overfitting. In the next chapter, we†ll put these lessons into practice when we use the TensorFlow library to efficiently implement our first neural networks. Then in Chapter 4 , we†ll return to the problem of optimizing objective functions for training neural networks and design algorithmsto significantly improve performance. These improvements will enable us to process much more data, which means we†ll be able to build more comprehensive models. Summary | 37
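Although we were able to get some text from the page, it is not very clean: apostrophes come out as daggers, the formula is scrambled, and some words run together. Unfortunately, PyPDF2's support for extracting text from PDFs is very limited.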
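Some of this noise can be reduced after extraction. The sketch below is one possible post-processing step, not a PyPDF2 feature: it assumes, based only on the sample output above, that the dagger character stands where an apostrophe belongs ("hasn†t", "we†ve"), and it collapses runs of whitespace. The name clean_extracted_text and the replacement table are ours; other PDFs will likely need a different table.

```python
import re

# Assumption drawn from the sample output: U+2020 (dagger) replaces the apostrophe.
REPLACEMENTS = {"\u2020": "'"}


def clean_extracted_text(text):
    """Apply the character replacements, then collapse runs of whitespace."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return re.sub(r"\s+", " ", text).strip()


# A fragment in the style of the extracted page, with an added line break.
raw = "any neuron whose activation hasn\u2020t been silenced\n has  its output divided by p"
print(clean_extracted_text(raw))
```

A step like this cannot restore information that extraction lost, such as the missing spaces in "SummaryIn" and "algorithmsto" or the structure of the formula.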
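Rotating a Specific Page of a PDF File
The following Python shell session rotates the first page of the PDF 90 degrees clockwise and writes that page to a new file.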
>>> import PyPDF2
>>> deeplearningFile = open('DeepLearning.pdf', 'rb')
>>> pdfReader = PyPDF2.PdfFileReader(deeplearningFile)
>>> page = pdfReader.getPage(0)
>>> page.rotateClockwise(90)
{'/Contents': [IndirectObject(4870, 0), IndirectObject(4871, 0), IndirectObject(4872, 0), IndirectObject(4873, 0), IndirectObject(4874, 0), IndirectObject(4875, 0), IndirectObject(4876, 0), IndirectObject(4877, 0)],
 '/CropBox': [0, 0, 595.3, 841.9],
 '/MediaBox': [0, 0, 504, 661.5],
 '/Parent': IndirectObject(4862, 0),
 '/Resources': IndirectObject(4889, 0),
 '/Rotate': 90,
 '/Type': '/Page'}
>>> pdfWriter = PyPDF2.PdfFileWriter()
>>> pdfWriter.addPage(page)
>>> resultPdfFile = open('rotatedPage.pdf', 'wb')
>>> pdfWriter.write(resultPdfFile)
>>> resultPdfFile.close()
>>> deeplearningFile.close()
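Output
The resulting file, rotatedPage.pdf, contains only the first page of DeepLearning.pdf, rotated 90 degrees clockwise.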
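The same steps can be wrapped in a reusable function. This is a sketch under stated assumptions rather than a definitive implementation: rotate_page and normalize_rotation are hypothetical names, the PyPDF2 calls are the 1.26.x ones from the session above, and the angle check reflects that pages are rotated in steps of 90 degrees. Context managers replace the explicit close() calls, and both files stay open until write() has finished, as they do in the session above.

```python
def normalize_rotation(degrees):
    """Return degrees reduced to 0, 90, 180 or 270; reject other angles."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return degrees % 360


def rotate_page(src_path, dst_path, page_index=0, degrees=90):
    """Write a one-page PDF holding the chosen page rotated clockwise."""
    # Imported here so the angle helper works without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    angle = normalize_rotation(degrees)
    # Keep the source open while writing, mirroring the shell session.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        page = PdfFileReader(src).getPage(page_index)  # zero-based index
        page.rotateClockwise(angle)
        writer = PdfFileWriter()
        writer.addPage(page)
        writer.write(dst)


# Example call, using the file names from the session above:
# rotate_page("DeepLearning.pdf", "rotatedPage.pdf", page_index=0, degrees=90)
```

Passing a negative angle such as -90 is normalized to 270, which amounts to a 90-degree counterclockwise turn.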