- TIKA 教程
- TIKA - 主頁
- TIKA - 概述
- TIKA - 架構
- TIKA - 環境
- TIKA - 引用 API
- TIKA - 檔案格式
- TIKA - 文件型別檢測
- TIKA - 內容提取
- TIKA - 元資料提取
- TIKA - 語言檢測
- TIKA - 圖形使用者介面
- TIKA 有用資源
- TIKA - 快速指南
- TIKA - 有用資源
- TIKA - 討論
TIKA - 提取 PDF
下面給出了從 PDF 提取內容和元資料的程式。
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class PdfParse {
public static void main(final String[] args) throws IOException,TikaException {
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("Example.pdf"));
ParseContext pcontext = new ParseContext();
//parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata,pcontext);
//getting the content of the document
System.out.println("Contents of the PDF :" + handler.toString());
//getting metadata of the document
System.out.println("Metadata of the PDF:");
String[] metadataNames = metadata.names();
for(String name : metadataNames) {
System.out.println(name+ " : " + metadata.get(name));
}
}
}
將以上程式碼儲存為 PdfParse.java,並使用以下命令從命令提示符處編譯它 −
javac PdfParse.java java PdfParse
下面給出了 example.pdf 的快照。
我們正在傳遞的 PDF 具有以下屬性 −
編譯程式後,您將獲得如下所示的輸出。
輸出 −
Contents of the PDF: Apache Tika is a framework for content type detection and content extraction which was designed by Apache software foundation. It detects and extracts metadata and structured text content from different types of documents such as spreadsheets, text documents, images or PDFs including audio or video input formats to certain extent. Metadata of the PDF: dcterms:modified : 2014-09-28T12:31:16Z meta:creation-date : 2014-09-28T12:31:16Z meta:save-date : 2014-09-28T12:31:16Z dc:creator : Krishna Kasyap pdf:PDFVersion : 1.5 Last-Modified : 2014-09-28T12:31:16Z Author : Krishna Kasyap dcterms:created : 2014-09-28T12:31:16Z date : 2014-09-28T12:31:16Z modified : 2014-09-28T12:31:16Z creator : Krishna Kasyap xmpTPg:NPages : 1 Creation-Date : 2014-09-28T12:31:16Z pdf:encrypted : false meta:author : Krishna Kasyap created : Sun Sep 28 05:31:16 PDT 2014 dc:format : application/pdf; version = 1.5 producer : Microsoft® Word 2013 Content-Type : application/pdf xmp:CreatorTool : Microsoft® Word 2013 Last-Save-Date : 2014-09-28T12:31:16Z
廣告