TIKA - Extracting Text Documents
The following program extracts the content and metadata from a text document:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TextParser {

   public static void main(final String[] args) throws IOException, SAXException, TikaException {

      // Objects that receive the extracted body text and the metadata
      BodyContentHandler handler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      ParseContext pcontext = new ParseContext();

      // Text document parser
      TXTParser txtParser = new TXTParser();

      // try-with-resources closes the input stream once parsing is done
      try (InputStream inputstream = new FileInputStream(new File("sample.txt"))) {
         txtParser.parse(inputstream, handler, metadata, pcontext);
      }

      System.out.println("Contents of the document:" + handler.toString());
      System.out.println("Metadata of the document:");
      String[] metadataNames = metadata.names();

      for (String name : metadataNames) {
         System.out.println(name + " : " + metadata.get(name));
      }
   }
}
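Save the above code as TextParser.java, then compile and run it from the command prompt with the following commands (the Apache Tika libraries must be on the classpath):
javac TextParser.java
java TextParser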
Given below is the snapshot of the sample.txt file:
The text document has the following properties:
If you execute the above program, it will give you the following output.
Output:
Contents of the document:

At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages. The endeavour started by Mohtashim, an AMU alumni, who is the founder and the managing director of Tutorials Point (I) Pvt. Ltd. He came up with the website tutorialspoint.com in year 2006 with the help of handpicked freelancers, with an array of tutorials for computer programming languages.

Metadata of the document:

Content-Encoding: windows-1252
Content-Type: text/plain; charset = windows-1252
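The metadata above reports a character set for the file, and that value matters when the raw bytes are turned into text. The following is a minimal sketch using only the Java standard library (it does not call Tika). It reads the charset parameter from a Content-Type value like the one printed above, then decodes a few bytes with it. The class name, the charsetOf helper, and the sample bytes are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFromContentType {

    // Hypothetical helper: returns the charset named in a Content-Type value,
    // or the given fallback when no charset parameter is present.
    // Assumes an unquoted charset name such as "windows-1252".
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String[] keyValue = part.split("=", 2);
            if (keyValue.length == 2 && keyValue[0].trim().equalsIgnoreCase("charset")) {
                return Charset.forName(keyValue[1].trim());
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Content-Type value as it appears in the metadata output above
        String contentType = "text/plain; charset = windows-1252";
        Charset detected = charsetOf(contentType, StandardCharsets.UTF_8);

        // Made-up sample bytes: 0x93 and 0x94 are curly quotes in windows-1252
        byte[] raw = {(byte) 0x93, 'T', 'i', 'k', 'a', (byte) 0x94};

        // Decoding with the reported charset yields the intended characters
        String decoded = new String(raw, detected);
        // Decoding the same bytes as UTF-8 yields replacement characters instead
        String misdecoded = new String(raw, StandardCharsets.UTF_8);

        System.out.println("Charset from metadata: " + detected.name());
        System.out.println("Decoded with that charset: " + decoded);
        System.out.println("Decoded as UTF-8: " + misdecoded);
    }
}
```

The same bytes produce different strings depending on the charset used, which is why the encoding reported in the metadata is worth checking before processing the extracted text further.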