GlusterFS Hadoop Plugin
=======================

INTRODUCTION
------------

This document describes how to use GlusterFS (http://www.gluster.org/) as a backing store with Hadoop.

REQUIREMENTS
------------

* Supported OS is GNU/Linux
* GlusterFS and Hadoop installed on all machines in the cluster
* Java Runtime Environment (JRE)
* Maven (needed if you are building the plugin from source)
* JDK (needed if you are building the plugin from source)

NOTE: The plugin relies on two *nix command-line utilities to function properly. They are:

* mount: Used to mount GlusterFS volumes.
* getfattr: Used to fetch Extended Attributes of a file.

Make sure both are installed on all hosts in the cluster and that their locations are in the $PATH
environment variable.

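A quick way to verify that both utilities are present and resolvable from $PATH is shown below (a
minimal sketch; the printed paths will vary by distribution):

# which mount getfattr
/bin/mount
/usr/bin/getfattr
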
INSTALLATION
------------

** NOTE: Example below is for Hadoop version 0.20.2 ($GLUSTER_HOME/hdfs/0.20.2) **

* Building the plugin from source [Maven (http://maven.apache.org/) and a JDK are required to build the plugin]

Change to the glusterfs-hadoop directory in the GlusterFS source tree and build the plugin.

# cd $GLUSTER_HOME/hdfs/0.20.2
# mvn package

On a successful build the plugin will be present in the `target` directory.
(NOTE: the version number is part of the plugin's file name)

# ls target/
classes  glusterfs-0.20.2-0.1.jar  maven-archiver  surefire-reports  test-classes
         ^^^^^^^^^^^^^^^^^^^^^^^^

Copy the plugin to the lib/ directory in your $HADOOP_HOME dir.

# cp target/glusterfs-0.20.2-0.1.jar $HADOOP_HOME/lib

Copy the sample configuration file that ships with this source (conf/core-site.xml) to the conf
directory in your $HADOOP_HOME dir.

# cp conf/core-site.xml $HADOOP_HOME/conf

* Installing the plugin from RPM

See the plugin documentation for installing from RPM.

CLUSTER INSTALLATION
--------------------

If it is tedious to do the above step(s) on all hosts in the cluster, use the build-and-deploy.py script to
build the plugin in one place and deploy it (along with the configuration file) on all other hosts.

The script should be run on the host that is the Hadoop master [Job Tracker].

* STEPS (You would have done Steps 1 and 2 anyway while deploying Hadoop)

1. Edit the conf/slaves file in your Hadoop distribution; one line for each slave.
2. Set up password-less ssh between the Hadoop master and the slave(s); one way to do this is
   sketched below, after the script options.
3. Edit conf/core-site.xml with all glusterfs-related configuration (see CONFIGURATION).
4. Run the following:

# cd $GLUSTER_HOME/hdfs/0.20.2/tools
# python ./build-and-deploy.py -b -d /path/to/hadoop/home -c

This will build the plugin and copy it (and the config file) to all slaves (mentioned in $HADOOP_HOME/conf/slaves).

Script options:
  -b : build the plugin
  -d : location of hadoop directory
  -c : deploy core-site.xml
  -m : deploy mapred-site.xml
  -h : deploy hadoop-env.sh

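For Step 2 above, one common way to set up password-less ssh from the master to the slaves is with
ssh-keygen and ssh-copy-id (a sketch; the user and host names are placeholders):

# ssh-keygen -t rsa            (accept the defaults; use an empty passphrase)
# ssh-copy-id hadoop@slave1    (repeat for every host listed in conf/slaves)
# ssh hadoop@slave1            (should now log in without prompting for a password)
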
CONFIGURATION
-------------

All plugin configuration is done in a single XML file (core-site.xml) with <name> and <value> tags in each
<property> block.

A brief explanation of the tunables and the values they accept (change them wherever needed) is given
below.

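Each tunable lives in its own <property> block; for example, the volume name tunable described below
would look like this:

<property>
  <name>fs.glusterfs.volname</name>
  <value>volume-dist-rep</value>
</property>
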
name: fs.glusterfs.impl
value: org.apache.hadoop.fs.glusterfs.GlusterFileSystem

The default FileSystem API to use (there is little reason to modify this).

name: fs.default.name
value: glusterfs://server:port

The default name that Hadoop uses to represent files as URIs (typically a server:port tuple). Use any host
in the cluster as the server and any port number. This option has to be in server:port format for Hadoop
to create file URIs, but it is not otherwise used by the plugin.

name: fs.glusterfs.volname
value: volume-dist-rep

The volume to mount.

name: fs.glusterfs.mount
value: /mnt/glusterfs

This is the directory that the plugin will use to mount (FUSE mount) the volume.

name: fs.glusterfs.server
value: 192.168.1.36, hackme.zugzug.org

To mount a volume the plugin needs to know the hostname or the IP address of a GlusterFS server in the
cluster. Mention it here.

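For reference, with the sample values above, the FUSE mount the plugin performs is roughly equivalent
to the following (a sketch; it requires the GlusterFS client to be installed):

# mount -t glusterfs 192.168.1.36:/volume-dist-rep /mnt/glusterfs
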
name: quick.slave.io
value: [On/Off], [Yes/No], [1/0]

NOTE: This option is not tested as of now.

This is a performance tunable. Hadoop schedules jobs to the hosts that hold the part of the file's data the
job needs. The job then does I/O on the file (via FUSE in the case of GlusterFS). When this option is set,
the plugin will try to do I/O directly from the backing filesystem (ext3, ext4 etc.) on which the file
resides, so read performance improves and jobs run faster.

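Putting it all together, a core-site.xml built from the sample values above might look like the following
(illustrative only; adapt the server, volume name, and mount point to your cluster, and remember that the
port in fs.default.name can be any number):

<configuration>
  <property>
    <name>fs.glusterfs.impl</name>
    <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>glusterfs://192.168.1.36:9000</value>
  </property>
  <property>
    <name>fs.glusterfs.volname</name>
    <value>volume-dist-rep</value>
  </property>
  <property>
    <name>fs.glusterfs.mount</name>
    <value>/mnt/glusterfs</value>
  </property>
  <property>
    <name>fs.glusterfs.server</name>
    <value>192.168.1.36</value>
  </property>
  <property>
    <name>quick.slave.io</name>
    <value>Off</value>
  </property>
</configuration>
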
USAGE
-----

Once configured, start the Hadoop Map/Reduce daemons:

# cd $HADOOP_HOME
# ./bin/start-mapred.sh

If the map/reduce job/task trackers are up, all I/O will be done to GlusterFS.

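To sanity-check the setup once the daemons are up, you can list the filesystem root and run one of the
stock example jobs (a sketch; the examples jar name varies with the Hadoop version, and input/output
are placeholder paths):

# ./bin/hadoop fs -ls /
# ./bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
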
FOR HACKERS
-----------

* Source Layout

** version specific: hdfs/<version> **
./src
./src/main
./src/main/java
./src/main/java/org
./src/main/java/org/apache
./src/main/java/org/apache/hadoop
./src/main/java/org/apache/hadoop/fs
./src/main/java/org/apache/hadoop/fs/glusterfs
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickClass.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSXattr.java          <--- Fetch/Parse Extended Attributes of a file
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEInputStream.java  <--- Input Stream (instantiated during open() calls; quick read from backing FS)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFSBrickRepl.java
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFUSEOutputStream.java <--- Output Stream (instantiated during creat() calls)
./src/main/java/org/apache/hadoop/fs/glusterfs/GlusterFileSystem.java       <--- Entry point for the plugin (extends Hadoop's FileSystem class)
./src/test
./src/test/java
./src/test/java/org
./src/test/java/org/apache
./src/test/java/org/apache/hadoop
./src/test/java/org/apache/hadoop/fs
./src/test/java/org/apache/hadoop/fs/glusterfs
./src/test/java/org/apache/hadoop/fs/glusterfs/AppTest.java <--- Your test cases go here (if any :-))
./tools/build-deploy-jar.py <--- Build and Deployment Script
./conf
./conf/core-site.xml        <--- Sample configuration file
./pom.xml                   <--- Build XML file (used by Maven)

** toplevel: hdfs/ **
./COPYING <--- License
./README  <--- This file