sparkle tech thoughts: February 2013

Saturday, February 23, 2013

Introduction to Git with Git commands

Recently I gained some experience working with git, and I thought I would share the most basic and most frequently used git commands. Before we begin, I will give a quick introduction to "git".

What is "git"?

Git is an open source distributed version control system designed to manage source code with speed and efficiency. The important thing to remember is that git is a distributed version control system, unlike old-fashioned centralized version control systems such as SVN. In a centralized VCS you have a server that holds all the source code, and clients, and the two roles are clearly distinct. If a client wants to modify the code, they have to check out the source from the server and commit back to the server. So the entire code base stays in a single data store. If you lose that data store then you are screwed!!!

Git, on the other hand, is distributed: you don't check out a version of a project to start working, you clone it to your local file system. This is far more efficient, and because almost everything is local it is very fast, you can work offline (you don't have to be online for most operations, so no network issues), and you can push and pull directly between peers. Every clone is also a backup, so everyone working on the project holds a backup of it.

Why use "git"?

I asked a friend why he thinks a project or company should adopt "git". His simple answer was fewer build breaks :D (which is of course a good enough answer), and that you can give restricted commit rights to team members, so no careless glitches :). But of course there is more to the story, so I have listed a few advantages of "git" that I could think of, and why projects should move to it:

It's very fast (since almost everything is local)

You won't lose your code (every clone is a backup of the project)

You can work offline and still perform the most common tasks, such as:

performing a diff,

viewing file history,

committing changes,

merging branches,

obtaining other revisions of a file,

switching branches

Which means you can work anywhere in the world, even when you are up in the sky !!!!

It's effectively immutable (it does not remove committed data) - "git" will not overwrite your history in place. It writes new history and gives you a pointer to it, so you can go back and you won't lose data.

Before I begin I should warn you: if you are used to an old-fashioned version control system like SVN, you are going to hate "git" at first. You will start hating everything about it and get tired of it very easily, because it is very different from centralized version control systems. Most of those systems are built on file-based delta storage and file-based operations. "Git", on the other hand, thinks about data as snapshots: it looks at the content (ignoring the file name), puts that content in its database as a key-value pair, and returns the key. That is, instead of storing commit points as file-based patches or changes, it stores a simple snapshot of what your project looked like when you committed. So the easiest way to get your hands on this cool stuff is to forget all you know about the centralized version control systems you have been using for the past years and start thinking differently, and it will blow your mind :)
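The key-value idea can be seen directly with git hash-object, which prints the key git derives from a piece of content. This is only a small sketch, assuming git is installed; the temporary directory and the file names are made up for illustration.

```shell
# Two files with different names but identical content (names are illustrative).
scratch=$(mktemp -d)
printf 'hello git\n' > "$scratch/first.txt"
printf 'hello git\n' > "$scratch/renamed-copy.txt"

# hash-object prints the key git computes from the content alone.
key_one=$(git hash-object "$scratch/first.txt")
key_two=$(git hash-object "$scratch/renamed-copy.txt")

echo "$key_one"
echo "$key_two"   # same key as above: the file name plays no part
```

Change one byte in either file and the two keys stop matching, which is how git notices that content has changed.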

Before I bore you with conceptual information on "git", here are the most commonly used "git" commands to get started with. I will continue explaining interesting stuff about "git" in my next blog posts.

init - This will initialize a brand new git repository in a project directory.

git init

clone - This will clone an exact copy of an existing project.

git clone http://git.stratos.com/amani.com/poo

add - Adding files

To add a single file:

 git add info.php

To add multiple files:

 git add info.php README.txt

To add all the files in the directory:

git add .

status - This tells you which files have been modified since the last time they were committed.

git status

commit - Commit changes to head (but not yet to the remote repository):

git commit -m "Committing my changes"

push - Send changes to the master branch of your remote repository

git push origin master
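To see init, add, status, commit and push working together, here is a rough end-to-end run you can try safely. It assumes git is installed; the two temporary directories, the placeholder identity and the file contents are invented for the demo, and a local bare repository stands in for a real remote server.

```shell
# Throwaway directories standing in for a real server and a real project.
remote=$(mktemp -d)
work=$(mktemp -d)
git init -q --bare "$remote"

cd "$work"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the branch is called master
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"
git remote add origin "$remote"

echo "<?php phpinfo();" > info.php
echo "demo project" > README.txt
git add info.php README.txt
git status --short                          # shows both files as staged
git commit -q -m "Committing my changes"
git push -q origin master

# After the push, the remote master and the local HEAD are the same commit.
pushed=$(git --git-dir="$remote" rev-parse refs/heads/master)
local_head=$(git rev-parse HEAD)
```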

blame - To check who screwed it up :)

git blame hello.java

And if you were the one who screwed it up :( ?

reset - Discard all uncommitted changes and go back to the last commit

git reset --hard HEAD

checkout - Undo local changes

If you mess up, you can replace the changes to a file in your working tree with the last content in HEAD. Changes already added to the index, as well as new files, will be kept.

git checkout -- myFile.txt
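The difference between the two undo commands is easier to see on a scratch repository. The following is a sketch only, assuming git is installed; the directory, the identity and the v1/v2/v3 contents are made up.

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Example User"      # placeholder identity for the demo
git config user.email "user@example.com"

echo v1 > myFile.txt
git add myFile.txt
git commit -q -m "first version"

echo v2 > myFile.txt
git add myFile.txt                        # v2 is now in the index
echo v3 > myFile.txt                      # v3 exists only in the working tree

git checkout -- myFile.txt                # drops v3, keeps the staged v2
after_checkout=$(cat myFile.txt)

git reset -q --hard HEAD                  # drops staged and unstaged changes
after_reset=$(cat myFile.txt)

echo "$after_checkout $after_reset"
```

So checkout only throws away what you have not staged yet, while reset --hard takes the file all the way back to the last commit.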

fetch - To drop all your local changes and commits instead, fetch the latest history from the server and point your local master branch at it.

git fetch origin

git reset --hard origin/master

grep - Search the working directory for isService():

git grep "isService()"
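One thing worth knowing is that git grep looks only at files git is tracking. Here is a small sketch of that behaviour, assuming git is installed; the scratch repository and both files are invented for the example.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
printf 'class Hello {\n  boolean isService() { return true; }\n}\n' > hello.java
printf 'todo: call isService() somewhere\n' > notes.txt
git add hello.java                     # notes.txt stays untracked on purpose

# -F treats the pattern as a fixed string, -n prints line numbers.
git grep -n -F "isService()"

# -c prints "file:count" for each tracked file with a match.
matches=$(git grep -c -F "isService()")
echo "$matches"
```

Only hello.java is reported, even though notes.txt contains the same text, because notes.txt was never added.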

That's it for now. To get a quick start on git, try GitHub. GitHub makes git easier, so you can try it out and get up to speed on git with GitHub. :)

Tuesday, February 19, 2013

How to configure a hadoop cluster

Prerequisites

Before installing Hadoop, you need to install a couple of pieces of software and create the hadoop user.

Java - Install Java in a location that all user groups can access.
Eg: /opt/java/jdk1.6.0_29

rsync - Install rsync using apt-get (it is used to copy the Hadoop distribution from the Name Node to all the other nodes)

Create hadoop user - Navigate to /home and create the user “hadoop”.

Log in as the user hadoop using the commands:

su - hadoop

bash

PS: The above steps need to be performed (and the software installed) on all the other Hadoop nodes as well.

Set up public key login from the master to the slave nodes

Create ssh key pairs for each user (ssh-keygen -t rsa -b 2048) and add the public key (*.pub) to the authorized_keys file on the master and slave nodes.

Key Exchange for Passphraseless SSH

1. We need passwordless / passphraseless SSH to communicate with the other Hadoop nodes in the cluster.

Try to SSH to another node:
   ssh hadoop@amani26.poohdedoo.com

2. Generate a key for the Name Node using the following command.

   ssh-keygen

This will generate an output similar to below.

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
82:07:9e:1e:60:37:28:03:a6:18:b3:b6:f1:e4:2f:ef hadoop@bam01

3. This will create a .ssh directory inside the ‘hadoop’ user account. Navigate to the .ssh directory. It will contain a file with the generated public key. Inspect the public key stored in the ‘id_rsa.pub’ file with the command:
cat id_rsa.pub

It will display the public keys …

4. This public key of the Name Node should be appended to the ‘authorized_keys’ file on the other Data Nodes. Execute the following command to copy the id_rsa.pub file to the other nodes.

   scp id_rsa.pub root@amani27.poohdedoo.com:/root

5. Log in to the second Hadoop node’s ‘hadoop’ user account. Try to SSH to another node from there.

   ssh hadoop@amani26.poohdedoo.com

This will create the .ssh directory in the hadoop account.

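6. Append the copied public key to the ‘authorized_keys’ file in the hadoop account of this Data node. Execute the following commands in the hadoop account's .ssh directory. Note the >>, which appends to the file; a single > would overwrite any keys already in it.

   cat /root/id_rsa.pub >> authorized_keys
   chown hadoop:hadoop authorized_keys
   chmod 600 authorized_keys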
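If you repeat this step on several nodes, it helps to make the append safe to run twice. Below is a rough sketch of that idea, not part of the original procedure: the append_key function, its arguments, the scratch directory and the fake key are all assumptions for illustration, and it leaves out the chown, which needs root.

```shell
# Append a public key to authorized_keys only if it is not already there.
# $1 is the copied .pub file, $2 the .ssh directory (both illustrative).
append_key() {
  pub_key_file=$1
  ssh_dir=$2
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$ssh_dir/authorized_keys"
  # -x matches whole lines and -F disables regex, so the key is compared literally.
  if ! grep -qxF -- "$(cat "$pub_key_file")" "$ssh_dir/authorized_keys"; then
    cat "$pub_key_file" >> "$ssh_dir/authorized_keys"
  fi
  chmod 600 "$ssh_dir/authorized_keys"
}

# Try it on a scratch directory with a fake key (not a real one).
scratch=$(mktemp -d)
echo "ssh-rsa AAAAB3FAKEKEY hadoop@bam01" > "$scratch/id_rsa.pub"
append_key "$scratch/id_rsa.pub" "$scratch/.ssh"
append_key "$scratch/id_rsa.pub" "$scratch/.ssh"   # second run adds nothing
key_count=$(grep -c '' "$scratch/.ssh/authorized_keys")
echo "$key_count"
```

On a real Data node the call would look like append_key /root/id_rsa.pub /home/hadoop/.ssh, followed by the chown from the step above.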

7. Now you can ssh to this Data node from the Master node configured earlier. Log in to the Master node. From the hadoop account, log in to the Data node with either of the following commands.

ssh -i id_rsa hadoop@amani27.poohdedoo.com
   or
ssh hadoop@amani27.poohdedoo.com

Setup Hadoop

Download and extract Hadoop (tar xvfz hadoop-x.x.x.tar.gz -C /mnt/)
Change the ownership of the extracted directory if necessary (chown -R user:user /mnt/hadoop-x.x.x)
[optional] If IPv6 is not used, disable it.

- add 'net.ipv6.conf.all.disable_ipv6 = 1' to /etc/sysctl.conf

Configure Hadoop

The configuration files are in $HADOOP_HOME/conf/

Set JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh (Add export JAVA_HOME=/path/to/javahome)

eg: export JAVA_HOME=/opt/java/jdk1.6.0_29

Edit the $HADOOP_HOME/conf/core-site.xml as follows:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop0.poohdedoo.com:9000</value>
</property>
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/hadoop_tmp</value>
</property>
</configuration>

Edit the $HADOOP_HOME/conf/hdfs-site.xml as follows:

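<configuration>
<property>
<name>dfs.replication</name>
</property>
<property>
<name>dfs.name.dir</name>
<value>/mnt/hadoop_data/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/mnt/hadoop_data/dfs/data</value>
</property>
</configuration>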

Edit the $HADOOP_HOME/conf/mapred-site.xml as follows

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop0.poohdedoo.com:9001</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mnt/hadoop_data/mapred/system</value>
</property>
</configuration>

Edit the $HADOOP_HOME/conf/hadoop-policy.xml

By default the value of 'security.job.submission.protocol.acl' is *. Change it to a user name or a user group.

<property>
<name>security.job.submission.protocol.acl</name>
<value>adminuser</value>
</property>

* Change the 'masters' and 'slaves' files (master node only; slave machines do not need this configuration)

- $HADOOP_HOME/conf/masters (the masters file contains the secondary namenode servers)

hadoop0.poohdedoo.com

- $HADOOP_HOME/conf/slaves (the slaves file contains the slave servers: datanodes and task trackers)

hadoop1.poohdedoo.com

hadoop2.poohdedoo.com
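Since rsync was installed to copy the Hadoop distribution from the Name Node to the other nodes, the slaves file can drive that copy. The following is only a sketch under some assumptions: the sync_to_slaves function is made up, Hadoop is assumed to live at the same path on every node, and the login is assumed to be the hadoop user. It prints the rsync commands rather than running them.

```shell
# Print one rsync command per host listed in a slaves file.
# Remove the "echo" to actually run the copy; -a preserves permissions and times.
sync_to_slaves() {
  hosts_file=$1
  hadoop_dir=$2
  # Read on descriptor 3 so a command inside the loop cannot swallow the host list.
  while IFS= read -r host <&3; do
    [ -n "$host" ] || continue          # skip blank lines
    echo rsync -a "$hadoop_dir/" "hadoop@$host:$hadoop_dir/"
  done 3< "$hosts_file"
}

# A scratch copy of the slaves file shown above.
demo_slaves=$(mktemp)
printf 'hadoop1.poohdedoo.com\n\nhadoop2.poohdedoo.com\n' > "$demo_slaves"
plan=$(sync_to_slaves "$demo_slaves" /mnt/hadoop-x.x.x)
echo "$plan"
```

On the master you would pass $HADOOP_HOME/conf/slaves and the real Hadoop directory instead of the scratch file and the placeholder path.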

Setting up hadoop cluster

Format the namenode before starting the cluster

$HADOOP_HOME/bin/hadoop namenode -format

Start the services

$HADOOP_HOME/bin/start-all.sh

It will start the namenode, jobtracker and secondarynamenode on the master node, and a datanode and tasktracker on each slave node. (To check the services, run $JAVA_HOME/bin/jps)

Stop the services

$HADOOP_HOME/bin/stop-all.sh

It will stop the namenode, jobtracker and secondarynamenode on the master node, and the datanode and tasktracker on each slave node. (To confirm they have stopped, run $JAVA_HOME/bin/jps)
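Reading jps output by eye gets tedious on more than a couple of nodes, so the check can be scripted. Here is a rough sketch: the missing_daemons function and the sample output are made up for illustration, and it assumes jps prints one "pid name" pair per line.

```shell
# Print every expected daemon that does not appear in the given jps output.
# $1 is the text printed by jps; the remaining arguments are daemon names.
missing_daemons() {
  jps_output=$1
  shift
  for daemon in "$@"; do
    # Compare the name column exactly, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

# Made-up sample of what jps might print on a master whose namenode is down.
sample='4821 JobTracker
4702 SecondaryNameNode
5013 Jps'
missing=$(missing_daemons "$sample" NameNode JobTracker SecondaryNameNode)
echo "$missing"
```

On a real master the first argument would be "$("$JAVA_HOME/bin/jps")", and on a slave the expected names would be DataNode and TaskTracker.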