1. OOZIE job failed:

Error message : ERROR is considered as FAILED for SLA

Cause 1 : Oozie is not able to find the hadoop namenode (master) or jobtracker machine. Suppose you are running oozie, hadoop-master and the jobtracker on one machine, while the datanode and tasktracker are running on another machine.

Your job.properties file contains the following lines:

  nameNode=hdfs://localhost:9000
  jobTracker=localhost:9001

In the above case, an FS action will work fine because no map-reduce operation is performed in an FS action. But if you run a map-reduce action, the tasktracker will look for hadoop-master on its own localhost, because we have used localhost:9000 in the job.properties file.

Solution : Use the IP address of the hadoop-namenode and jobtracker machine in the job.properties file instead of localhost.
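To catch this before submitting a job, you can scan job.properties for loopback addresses. The following is only a minimal Python sketch and not part of Oozie: the function name find_localhost_entries is hypothetical, and it assumes simple key=value lines like the ones shown above.

```python
from urllib.parse import urlparse

# Host names that only make sense on the machine that wrote the file.
LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}

def find_localhost_entries(properties_text):
    """Return {key: value} for every property whose host part is a loopback name."""
    flagged = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # Values look like hdfs://host:port or host:port; prefix "//" when
        # there is no scheme so that urlparse can extract the hostname.
        parsed = urlparse(value if "://" in value else "//" + value)
        if parsed.hostname in LOOPBACK_HOSTS:
            flagged[key] = value
    return flagged

sample = """
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
"""
print(find_localhost_entries(sample))
```

Both sample entries are flagged. After replacing localhost with the real address of the master machine, the function should return an empty dictionary.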

Cause 2 : Oozie is not able to find the MySQL server. Suppose I am using MySQL as the metastore for Hive, and the Hive hive-default.xml file has the following lines:

  <description>JDBC connect string for a JDBC metastore</description>

Solution : Use the IP address of the MySQL machine instead of localhost in the JDBC connect string.
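The same kind of check can be applied to a Hadoop-style XML config file. This is a rough Python sketch under stated assumptions: the helper name localhost_properties is hypothetical, and the sample property name javax.jdo.option.ConnectionURL and the database name hive_metastore are filled in for illustration only, since the excerpt above shows just the description line.

```python
import xml.etree.ElementTree as ET

def localhost_properties(config_xml):
    """Return (name, value) pairs whose value mentions a loopback host."""
    root = ET.fromstring(config_xml)
    hits = []
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        if "localhost" in value or "127.0.0.1" in value:
            hits.append((name, value))
    return hits

# Illustrative config; the property name and database name are assumptions.
sample = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hive_metastore</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
</configuration>"""

for name, value in localhost_properties(sample):
    print(name, "->", value)
```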

2. Zookeeper server not running:

Error message : Could not find my address: zk-serevr1 in list of ZooKeeper quorum servers

Causes : HBase tries to start a ZK server on some machine but that machine isn't able to find itself in the hbase.zookeeper.quorum configuration. This is a name lookup problem.

Solution : Use the hostname presented in the error message instead of the value you used (zk-server1). If you have a DNS server, you can set hbase.zookeeper.dns.interface and hbase.zookeeper.dns.nameserver in hbase-site.xml to make sure it resolves to the correct FQDN.
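The lookup problem comes down to a membership test: the name the machine reports for itself must appear in the comma-separated hbase.zookeeper.quorum value. Here is a small Python sketch of that comparison. It is not how HBase itself does the check; the function names are hypothetical and the quorum string is made up for the example.

```python
import socket

def names_for_this_host():
    """Names this machine knows itself by: short hostname and FQDN."""
    return {socket.gethostname(), socket.getfqdn()}

def missing_from_quorum(quorum_value, local_names):
    """Return True if none of the local names appears in the quorum list."""
    quorum = {host.strip() for host in quorum_value.split(",") if host.strip()}
    return quorum.isdisjoint(local_names)

# Hypothetical quorum setting; the host reports itself as zk-serevr1,
# which is the name from the error message above.
quorum = "zk-server1,zk-server2,zk-server3"
print(missing_from_quorum(quorum, {"zk-serevr1"}))
```

On a real machine you would pass names_for_this_host() as the second argument, to see which names the machine resolves for itself.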

3. Hadoop-datanode job failed or datanode not running:

Error message : java.io.IOException: File ../mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

Cause 1 : No datanode is running. Make sure at least one datanode is running.

Cause 2 : The namespaceID of the master and slave machines is not the same. If you see the error java.io.IOException: Incompatible namespaceIDs in the logs of a datanode, chances are you are affected by bug HADOOP-1212 (well, I’ve been affected by it at least).

Solution : If the namespaceIDs of the master and slave machines are not the same, replace the namespaceID on each slave machine with the master's namespaceID.

- dfs/name/current/VERSION file contains the namespaceID of the master machine
- dfs/data/current/VERSION file contains the namespaceID of the slave machine
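To compare the two IDs without reading the files by eye, a short script can pull namespaceID out of each VERSION file. The sketch below is in Python and assumes the files are plain key=value text. The function names and the sample IDs are made up for illustration. The name-directory file is read on the master and the data-directory file on the slave.

```python
def read_namespace_id(version_text):
    """Extract the namespaceID value from the text of a VERSION file."""
    for line in version_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "namespaceID":
            return value.strip()
    return None  # no namespaceID line found

def namespace_ids_match(name_version_text, data_version_text):
    """True only if both files carry the same, non-missing namespaceID."""
    master_id = read_namespace_id(name_version_text)
    slave_id = read_namespace_id(data_version_text)
    return master_id is not None and master_id == slave_id

# Invented sample contents of dfs/name/current/VERSION (master)
# and dfs/data/current/VERSION (slave).
master_version = "#comment line\nnamespaceID=123456789\ncTime=0\n"
slave_version = "#comment line\nnamespaceID=987654321\ncTime=0\n"

print(read_namespace_id(master_version), read_namespace_id(slave_version))
print(namespace_ids_match(master_version, slave_version))
```

In practice you would read the text with open(path).read() on each machine. If the IDs differ, edit the slave's file as described above while the datanode is stopped.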

Cause 3 : Datanode instance running out of space.

Solution : Free some space.
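To confirm that space really is the problem, check the free space on the volume that holds the datanode's data directory. A minimal Python sketch follows; the function name, the path "." and the 1 GiB threshold are placeholders, so substitute your own data directory and limit.

```python
import shutil

def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= min_free_bytes

ONE_GIB = 1024 ** 3
# "." stands in for the datanode data directory.
print(has_free_space(".", ONE_GIB))
```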

Cause 4 : You may also get this message because of permissions. The JobTracker may not be able to create jobtracker.info on startup.

4. Sqoop export command failed :

Error message : attempt_201101151840_1006_m_000001_0, Status : FAILED

  java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(AbstractList.java:350)
    at impressions_by_zip.__loadFromFields(impressions_by_zip.java:159)
    at impressions_by_zip.parse(impressions_by_zip.java:108)

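Cause : The given field separator is not valid, so the generated record class runs out of fields while parsing a line.

Solution : Specify the correct field delimiter in the sqoop export command (for example with --input-fields-terminated-by).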
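Before re-running the export, you can test a delimiter on a few lines of the input file. The Python sketch below counts the fields per line. The function name, the sample rows and the expected count of 3 are invented for illustration, and this is not what Sqoop does internally.

```python
def lines_with_wrong_field_count(lines, delimiter, expected_fields):
    """Return (line_number, field_count) for every line that does not split
    into exactly expected_fields fields with the given delimiter."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != expected_fields:
            bad.append((number, len(fields)))
    return bad

# Invented tab-separated rows with three columns each.
rows = ["94110\t1200\t35", "10001\t800\t12"]

# Wrong delimiter: each row stays one single field, so every row is flagged.
print(lines_with_wrong_field_count(rows, ",", 3))
# Correct delimiter: nothing is flagged.
print(lines_with_wrong_field_count(rows, "\t", 3))
```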

5. HBase regionserver not running :

Error message : 2012-01-02 13:48:49,973 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Master rejected startup because clock is out of sync

  org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server hadoop-datanode2,60020,1325492317440 has been rejected; Reported time is too far out of sync with master. Time difference of 206141ms > max allowed of 30000ms

Cause : The clocks of the regionservers are not in sync with the master machine.

Solution : Synchronize the clocks of the hbase master and regionserver machines (for example with NTP).
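The error message itself tells you how far off the clock is. If you are scanning many regionserver logs, the two numbers can be pulled out with a regular expression. This is a small Python sketch; the function name is hypothetical and the pattern is based only on the message quoted above.

```python
import re

# Matches the tail of the ClockOutOfSyncException message shown above.
SKEW_PATTERN = re.compile(r"Time difference of (\d+)ms > max allowed of (\d+)ms")

def parse_clock_skew(log_line):
    """Return (skew_ms, allowed_ms) from the log line, or None if absent."""
    match = SKEW_PATTERN.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

line = ("Reported time is too far out of sync with master. "
        "Time difference of 206141ms > max allowed of 30000ms")
skew_ms, allowed_ms = parse_clock_skew(line)
print(f"skew {skew_ms / 1000:.1f}s, limit {allowed_ms / 1000:.1f}s")
```

In the example from the log, the regionserver is about 206 seconds away from the master, while only 30 seconds are allowed.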

