Apache Pig - Installation



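This chapter explains how to download, install, and set up Apache Pig on your system.

Prerequisites

Hadoop and Java must be installed on your system before you install Apache Pig. If they are not, install them first by following the steps given at the link below: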

https://tutorialspoint.tw/hadoop/hadoop_enviornment_setup.htm
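
Before moving on, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not an official check: check_tool is a hypothetical helper name, and it assumes the java and hadoop launchers are expected to be on the PATH.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of Pig or Hadoop):
# it reports whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
        return 0
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Check both prerequisites; keep going even if one is missing.
for tool in java hadoop; do
    check_tool "$tool" || echo "Install $tool before installing Apache Pig."
done
```

If either tool is reported missing, complete the Hadoop and Java setup from the link above before continuing.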

Download Apache Pig

First, download the latest version of Apache Pig from the following website: https://pig.apache.org/

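Step 1

Open the home page of the Apache Pig website. Under the News section, click the release page link, as shown in the following snapshot.

[Screenshot: Home Page]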

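Step 2

After clicking the link, you will be redirected to the Apache Pig Releases page. In the Download section of this page, you will see two links: Pig 0.8 and later, and Pig 0.7 and before. Click the Pig 0.8 and later link; you will then be redirected to a page listing a set of mirrors.

[Screenshot: Apache Pig Releases]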

Step 3

Choose and click any one of these mirrors, as shown below.

[Screenshot: Click Mirrors]

Step 4

These mirrors take you to the Pig Releases page, which lists the available versions of Apache Pig. Click the latest version among them.

[Screenshot: Pig Release]

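Step 5

These folders contain the source and binary files of Apache Pig in various distributions. Download the tar files of the source and binary files of Apache Pig 0.15: pig-0.15.0-src.tar.gz and pig-0.15.0.tar.gz.

[Screenshot: Index]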
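
If you prefer the command line to clicking through mirrors, the download in Step 5 could be sketched as follows. The archive.apache.org URL layout used here is an assumption (it is not given above), and pig_tarball_url is a hypothetical helper; check the URL in a browser before relying on it.

```shell
#!/bin/sh
# Assumption: Apache keeps past releases under archive.apache.org/dist/pig/.
PIG_ARCHIVE="https://archive.apache.org/dist/pig"

# pig_tarball_url VERSION [src] : print the URL of the binary tarball,
# or of the source tarball when the second argument is "src".
# Hypothetical helper for illustration only.
pig_tarball_url() {
    version="$1"
    if [ "${2:-}" = "src" ]; then
        echo "$PIG_ARCHIVE/pig-$version/pig-$version-src.tar.gz"
    else
        echo "$PIG_ARCHIVE/pig-$version/pig-$version.tar.gz"
    fi
}

# Print the two URLs for the release used in this tutorial.
pig_tarball_url 0.15.0
pig_tarball_url 0.15.0 src

# To actually download (requires network access), uncomment:
# cd ~/Downloads && curl -fLO "$(pig_tarball_url 0.15.0)"
```

The download itself is left commented out so that the sketch can be run safely without network access.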

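Install Apache Pig

After downloading the Apache Pig software, install it in your Linux environment by following the steps given below.

Step 1

Create a directory named Pig in the same directory that holds the installation directories of Hadoop, Java, and other software. (In this tutorial, we created the Pig directory in the home directory of the user named Hadoop.)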

$ mkdir Pig

Step 2

Extract the downloaded tar files as shown below.

$ cd Downloads/ 
$ tar zxvf pig-0.15.0-src.tar.gz 
$ tar zxvf pig-0.15.0.tar.gz 

Step 3

Move the contents of the extracted pig-0.15.0-src directory to the Pig directory created earlier, as shown below.

$ mv pig-0.15.0-src/* /home/Hadoop/Pig/
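
Note that tar unpacks an archive into a directory named without the .tar.gz suffix. The sketch below rehearses Steps 2 and 3 in a scratch directory using a small stand-in archive (not the real Pig release), so the file names inside it are purely illustrative.

```shell
#!/bin/sh
# Rehearse extract-and-move in a throwaway directory.
# The archive built here is a stand-in, not the real Pig release.
work=$(mktemp -d)
mkdir -p "$work/Downloads/pig-0.15.0-src/bin" "$work/Pig"
echo "placeholder" > "$work/Downloads/pig-0.15.0-src/bin/pig"

cd "$work/Downloads" || exit 1
tar czf pig-0.15.0-src.tar.gz pig-0.15.0-src   # build the stand-in archive
rm -r pig-0.15.0-src                           # leave only the archive behind

tar zxf pig-0.15.0-src.tar.gz                  # Step 2: recreates pig-0.15.0-src/
mv pig-0.15.0-src/* "$work/Pig/"               # Step 3: move the contents

ls "$work/Pig"                                 # list what was moved
```

Running a rehearsal like this first makes it easier to spot a wrong source path before touching the real installation directory.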

Configure Apache Pig

After installing Apache Pig, we have to configure it. To do so, we need to edit two files: .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables:

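  • the PIG_HOME environment variable to the installation folder of Apache Pig;

  • the PATH environment variable to include Pig's bin folder;

  • the PIG_CLASSPATH environment variable to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml, and mapred-site.xml files).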

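# The shell does not allow spaces around "=" in an assignment.
export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
# Hadoop 1.x keeps its configuration in conf; on Hadoop 2.x and later use $HADOOP_HOME/etc/hadoop.
export PIG_CLASSPATH=$HADOOP_HOME/conf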
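
When these lines are added by a setup script, rerunning the script can easily append them twice. A minimal sketch of an idempotent append follows; append_once is a hypothetical helper, and the demo writes to a scratch file rather than your real ~/.bashrc.

```shell
#!/bin/sh
# append_once FILE LINE : add LINE to FILE unless an identical line exists.
# Hypothetical helper for illustration; not part of Pig.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the text literally, -q stays silent.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demo against a scratch file; point rc at "$HOME/.bashrc" to apply for real.
rc=$(mktemp)
for pass in 1 2; do   # run twice to show that no duplicates are written
    append_once "$rc" 'export PIG_HOME=/home/Hadoop/Pig'
    append_once "$rc" 'export PATH=$PATH:/home/Hadoop/Pig/bin'
    append_once "$rc" 'export PIG_CLASSPATH=$HADOOP_HOME/conf'
done
cat "$rc"
```

After editing the real file, run source ~/.bashrc (or open a new terminal) so that the variables take effect in the current session.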

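pig.properties file

The conf folder of Pig contains a file named pig.properties, in which you can set various parameters. To list the properties that can be set, run the following command:

$ pig -h properties

The following properties are supported: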

Logging:
       verbose=true|false; default is false. This property is the same as -v switch
       brief=true|false; default is false. This property is the same as -b switch
       debug=OFF|ERROR|WARN|INFO|DEBUG; default is INFO. This property is the same as -d switch
       aggregate.warning=true|false; default is true.
           If true, prints count of warnings of each type rather than logging each warning.

Performance tuning:
       pig.cachedbag.memusage=<mem fraction>; default is 0.2 (20% of all memory).
           Note that this memory is shared across all large bags used by the application.
       pig.skewedjoin.reduce.memusagea=<mem fraction>; default is 0.3 (30% of all memory).
           Specifies the fraction of heap available for the reducer to perform the join.
       pig.exec.nocombiner=true|false; default is false.
           Only disable combiner as a temporary workaround for problems.
       opt.multiquery=true|false; multiquery is on by default.
           Only disable multiquery as a temporary workaround for problems.
       opt.fetch=true|false; fetch is on by default.
           Scripts containing Filter, Foreach, Limit, Stream, and Union can be dumped without MR jobs.
       pig.tmpfilecompression=true|false; compression is off by default.
           Determines whether output of intermediate jobs is compressed.
       pig.tmpfilecompression.codec=lzo|gzip; default is gzip.
           Used in conjunction with pig.tmpfilecompression. Defines compression type.
       pig.noSplitCombination=true|false. Split combination is on by default.
           Determines if multiple small files are combined into a single map.
       pig.exec.mapPartAgg=true|false. Default is false.
           Determines if partial aggregation is done within map phase, before records are sent to combiner.
       pig.exec.mapPartAgg.minReduction=<min aggregation factor>. Default is 10.
           If the in-map partial aggregation does not reduce the output num records by this factor, it gets disabled.

Miscellaneous:
       exectype=mapreduce|tez|local; default is mapreduce. This property is the same as -x switch
       pig.additional.jars.uris=<comma seperated list of jars>. Used in place of register command.
       udf.import.list=<comma seperated list of imports>. Used to avoid package names in UDF.
       stop.on.failure=true|false; default is false. Set to true to terminate on the first error.
       pig.datetime.default.tz=<UTC time offset>. e.g. +08:00. Default is the default timezone of the host.
           Determines the timezone used to handle datetime datatype and UDFs.

Additionally, any Hadoop property can be specified.
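
As an illustration, a few of the properties listed above could be set in pig.properties as sketched below. The values chosen are examples only, the sketch writes to a scratch directory instead of Pig's real conf folder, and get_prop is a hypothetical helper for reading a value back.

```shell
#!/bin/sh
# Write an example pig.properties into a scratch directory.
# Keys come from the list above; the values are illustrative choices.
conf_dir=$(mktemp -d)
cat > "$conf_dir/pig.properties" <<'EOF'
# Run in local mode instead of the default mapreduce
exectype=local
# Stop at the first error instead of continuing
stop.on.failure=true
# Compress the output of intermediate jobs with gzip
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=gzip
EOF

# get_prop KEY : print the value assigned to KEY in the file
# (hypothetical helper; assumes simple key=value lines without spaces).
get_prop() {
    # Exact string comparison, so dots in KEY are not treated as wildcards.
    awk -F= -v key="$1" '$1 == key { print $2 }' "$conf_dir/pig.properties"
}

get_prop exectype
get_prop pig.tmpfilecompression.codec
```

To apply such settings for real, the same key=value lines would go into the pig.properties file in Pig's conf folder.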

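Verify the Installation

Verify the installation of Apache Pig by running the version command. If the installation succeeded, you will see the version of Apache Pig as shown below.

$ pig -version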
 
Apache Pig version 0.15.0 (r1682971)  
compiled Jun 01 2015, 11:44:35
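
In a setup script you may want the version number on its own. The sketch below extracts it from output shaped like the sample above; pig_version_from_output is a hypothetical helper, and it assumes the first line keeps the "Apache Pig version X.Y.Z" form.

```shell
#!/bin/sh
# pig_version_from_output : read `pig -version` output on stdin and print
# only the version number. Hypothetical helper; assumes the first line
# looks like "Apache Pig version 0.15.0 (r1682971)".
pig_version_from_output() {
    sed -n 's/^Apache Pig version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Demo with the sample output shown above.
printf '%s\n' 'Apache Pig version 0.15.0 (r1682971)' \
              'compiled Jun 01 2015, 11:44:35' | pig_version_from_output

# Against a real installation (only if pig is on the PATH):
if command -v pig >/dev/null 2>&1; then
    pig -version 2>/dev/null | pig_version_from_output
fi
```

If the real command prints nothing here, check that PATH includes Pig's bin folder as configured in .bashrc.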