HDFSFileTransfer

HDFSFileTransfer is a tool written in 'bash' script for quickly transferring files into HDFS. It can copy files within a single physical machine as well as between two machines: that is, from the Linux file system of a machine running an HDFS cluster into another HDFS cluster. E.g. with two single-node Hadoop clusters installed on two different Linux machines, the script can copy files from one machine into the cluster on the other. HDFSFileTransfer is dual-licensed under the MIT license and the Boost Software License, and its source code is freely available.
Features
* Local Copy
The script runs on the same machine on which HDFS is installed. What the script does is to:
- Copy files from local file system
- Paste the files into HDFS
- Archive the copied files on the local file system under the archive folder
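The three steps above can be sketched roughly as follows. This is a minimal sketch only, not the project's actual code; SOURCE_DIR, HDFS_DIR and ARCHIVE_DIR are assumed, illustrative locations:

```shell
# Assumed, illustrative locations - adjust to your own setup.
SOURCE_DIR="/data/outgoing"
HDFS_DIR="/user/hduser/incoming"
ARCHIVE_DIR="/data/archive"

transfer_local() {
    local f
    for f in "$SOURCE_DIR"/*; do
        [ -f "$f" ] || continue                  # skip if nothing matched
        hdfs dfs -put "$f" "$HDFS_DIR"/ &&       # paste into HDFS
            mv -- "$f" "$ARCHIVE_DIR"/           # archive on success only
    done
}
```

Archiving only after a successful put means a failed transfer leaves the file in SOURCE_DIR to be retried on the next run.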
* Copy From One Machine To Another With Different HDFS Cluster
The script allows you to copy files from the Linux file system of a machine running an HDFS cluster into another HDFS cluster. It does not, however, copy directly from one HDFS cluster to another. E.g. you can have two single-node Hadoop clusters installed on two different Linux machines - in my particular situation I have done so using VMs. The script can then copy files from one Linux machine with Hadoop installed into the cluster on the other.
* Email errors encountered during the process
The process consists of a few validations. Should any validation fail, an error status is raised. The error process then:
- terminates the transfer process
- sends an email containing the error message to a user/group of people
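The log-email-terminate pattern could be sketched as below. This is an assumption about the shape of the handler, not the project's actual code; MAILTO and LOG_FILE are hypothetical settings, and the standard 'mail' command is assumed to be available:

```shell
# Hypothetical settings - adjust to your own environment.
MAILTO="hadoop-admins@example.com"
LOG_FILE="/var/log/hdfsfiletransfer.log"

fail() {
    local msg="$1"
    echo "$(date '+%F %T') ERROR: $msg" >> "$LOG_FILE"        # log the error
    echo "$msg" | mail -s "HDFSFileTransfer error" "$MAILTO"  # notify user/group
    exit 1                                                    # terminate the transfer
}
```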
Process Diagram
The process is broken down into two parts:
- initial validations - three checks performed at the very beginning, each time the script starts up
- a 'while' loop - all the copy and archive work - although it runs in a loop (so it can be run as a daemon), there is also an option to run the process once, each time you wish
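The two-part structure could look roughly like this. The check_* and transfer_local function names, along with RUN_ONCE and SLEEP_SECS, are assumed names for illustration, not the project's actual identifiers:

```shell
# Assumed configuration knobs: run a single pass, or loop daemon-style.
RUN_ONCE="${RUN_ONCE:-no}"
SLEEP_SECS="${SLEEP_SECS:-60}"

main() {
    # Part 1: initial validations, run once at startup.
    check_source_dirs
    check_datanode
    check_hdfs_dirs
    # Part 2: the copy/archive loop.
    while true; do
        transfer_local                      # copy, verify counts, archive
        [ "$RUN_ONCE" = "yes" ] && break    # single pass if requested
        sleep "$SLEEP_SECS"                 # otherwise behave like a daemon
    done
}
```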
Validations
The list of all validations carried out within the copying process:
* Local Folders - checks that all local folders the files are copied from (the SOURCE side) exist. If any of the folders does not exist, the error is written to a log file, an email is sent to the user and the process is terminated.
* Hadoop - DataNode - checks that the DataNode process is up and running, using the JPS command. If the process is down, the error is written to a log file, an email is sent to the user and the process is terminated.
* HDFS Folders - checks that all HDFS folders the files are copied into (the DESTINATION side) exist. If any of the folders does not exist, the error is written to a log file, an email is sent to the user and the process is terminated.
* Number Of Copied Files - carried out each and every time files are copied from SOURCE to DESTINATION. The process counts the files picked up from the SOURCE folder and the files pasted into the DESTINATION one, and compares the two. Should the numbers not match, the error is written to a log file, an email is sent to the user and the process is terminated.
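The four validations could be sketched as follows. This is an illustrative sketch under assumptions: `fail` stands in for the project's log-email-terminate handler, the function names are invented, and `jps` and `hdfs dfs -test -d` are the standard Hadoop-side tools for these checks:

```shell
fail() { echo "ERROR: $1" >&2; exit 1; }   # placeholder error handler

check_source_dirs() {                      # SOURCE folders must exist locally
    local d
    for d in "$@"; do
        [ -d "$d" ] || fail "missing local folder: $d"
    done
}

check_datanode() {                         # DataNode must appear in jps output
    jps | grep -q DataNode || fail "DataNode is not running"
}

check_hdfs_dirs() {                        # DESTINATION folders must exist in HDFS
    local d
    for d in "$@"; do
        hdfs dfs -test -d "$d" || fail "missing HDFS folder: $d"
    done
}

check_counts() {                           # files picked up vs. files landed
    [ "$1" -eq "$2" ] || fail "copied $2 of $1 files"
}
```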
Limitations
* Copy files with blank space(s) - Hadoop 2.4.0
Problem:
There is a problem when trying to copy files whose names contain spaces.
When doing the following:
hdfs dfs -put /home/testUser/New filename.txt /user/hduser/temp
This returns the following error:
put: unexpected URISyntaxException
Solution:
See the documentation of this project available on the SourceForge website.
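One common workaround (not necessarily the one the project documentation describes) is to rename such files before handing them to 'hdfs dfs -put', so the name no longer contains spaces. A minimal sketch, with invented helper names and illustrative paths:

```shell
sanitize_name() {
    # Replace every space in a file name with an underscore.
    printf '%s\n' "$1" | tr ' ' '_'
}

safe_put() {
    local src="$1" dest_dir="$2"
    local base clean
    base=$(basename "$src")
    clean=$(sanitize_name "$base")
    if [ "$base" != "$clean" ]; then
        mv -- "$src" "$(dirname "$src")/$clean"   # rename locally first
        src="$(dirname "$src")/$clean"
    fi
    hdfs dfs -put "$src" "$dest_dir"/             # now a space-free path
}
```

E.g. `safe_put "/home/testUser/New filename.txt" /user/hduser/temp` would upload the file as `New_filename.txt`.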
 