Sqoop Import Examples:


Sqoop Import: imports data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS) and its subprojects (Hive, HBase).


Import data from a MySQL table into HBase:

Case 1: The table has a primary key, and all columns of the MySQL table are imported into the HBase table.
$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root \
  --password root --table tableName --hbase-table hbase_tableName \
  --column-family hbase_table_col1 --hbase-create-table


Case 2: The table has a primary key, and only a few columns of the MySQL table are imported into the HBase table.
$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root \
  --password root --table tableName --hbase-table hbase_tableName \
  --columns column1,column2 --column-family hbase_table_col1 \
  --hbase-create-table 
Note: The column list given in the --columns argument must include the primary key column.

Case 3: The table does not have a primary key, so one column is chosen as the HBase row key (--hbase-row-key). All columns of the MySQL table are imported into the HBase table.
$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root \
  --password root --table tableName --hbase-table hbase_tableName \
  --column-family hbase_table_col1 --hbase-row-key column1 \
  --hbase-create-table


Case 4: The table does not have a primary key, so one column is chosen as the HBase row key. Only a few columns of the MySQL table are imported into the HBase table.
$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root \
  --password root --table tableName --hbase-table hbase_tableName \
  --columns column1,column2 --column-family hbase_table_col \
  --hbase-row-key column1 --hbase-create-table 
Note: The column given in the --hbase-row-key argument must also appear in the --columns list. Otherwise the command executes successfully, but no records are inserted into HBase.

Note: The value of the primary key column, or of the column given in --hbase-row-key, becomes the HBase row key. If the MySQL table has no primary key, or the column given in --hbase-row-key does not contain unique values, some records are lost.

Example: Consider a MySQL table test_table with two columns, name and address. The table has no primary key or unique key column.
Records of test_table:
name    address
------  -------
abc     123
sqw     345
abc     125
sdf     1234
aql     23dw
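
For reference, here is a minimal sketch of how this example table could be created and populated in MySQL; the column types are assumptions, and the credentials and database name (root/root, db1) are the ones used in the commands above.

# Sketch: create and populate test_table (column types are assumed)
mysql -u root -proot db1 <<'SQL'
CREATE TABLE test_table (name VARCHAR(50), address VARCHAR(50));
INSERT INTO test_table (name, address) VALUES
  ('abc', '123'), ('sqw', '345'), ('abc', '125'),
  ('sdf', '1234'), ('aql', '23dw');
SQL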



Run the following command to import test_table data into HBase:
$ bin/sqoop import --connect jdbc:mysql://localhost/db1 --username root --password root \
  --table test_table --hbase-table hbase_test_table --column-family test_table_col1 \
  --hbase-row-key name --hbase-create-table


Only 4 records are visible in the HBase table instead of 5. In the example above, two rows have the same value 'abc' in the name column, and that column is used as the HBase row key. When the first record with name 'abc' arrives, it is inserted into the HBase table. When another record with the same name 'abc' arrives later, it overwrites the value written by the previous record.

The same problem occurs if the table has a composite primary key, because only one column from the composite key is used as the HBase row key.
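
One way to confirm this behaviour is to look at the table from the HBase shell after the import finishes; a minimal sketch (count and scan are standard HBase shell commands):

hbase shell <<'EOF'
count 'hbase_test_table'    # reports 4 rows instead of 5
scan 'hbase_test_table'     # the row key 'abc' appears only once; its address
                            # column holds whichever value was written last
EOF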

Import data from a MySQL table into Hive:

Case 1: Import a MySQL table into Hive when the table has a primary key.
$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName --hive-table tableName \
  --create-hive-table --hive-import --hive-home path/to/hive_home


Case 2: Import a MySQL table into Hive when the table does not have a primary key.
$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName --hive-table tableName \
  --create-hive-table --hive-import --hive-home path/to/hive_home --split-by column_name
or
$ bin/sqoop-import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName --hive-table tableName \
  --create-hive-table --hive-import --hive-home path/to/hive_home -m 1
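
To verify a Hive import, the resulting table can be queried from the Hive CLI; a minimal sketch using the same tableName placeholder as above:

$ hive -e "DESCRIBE tableName; SELECT COUNT(*) FROM tableName;"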


Import data from a MySQL table into HDFS:


Case 1: Import a MySQL table into HDFS when the table has a primary key.
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName --target-dir /user/ankit/tableName


Case 2: Import a MySQL table into HDFS when the table does not have a primary key.
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName --target-dir /user/ankit/tableName -m 1
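
To verify an HDFS import, list and sample the files Sqoop wrote to the target directory; a minimal sketch (the part-m-* file names depend on the number of mappers used):

$ hadoop fs -ls /user/ankit/tableName
$ hadoop fs -cat /user/ankit/tableName/part-m-00000 | head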


Sqoop Export Examples:


Sqoop Export: exports data from HDFS and its subprojects (Hive, HBase) back into an RDBMS.
Export a Hive table back to an RDBMS:

By default, Hive stores data using ^A (\001) as the field delimiter and \n as the row delimiter.
$ bin/sqoop export --connect jdbc:mysql://localhost/test_db --table tableName \
  --export-dir /user/hive/warehouse/tableName --username root --password password \
  -m 1 --input-fields-terminated-by '\001'

where '\001' is the octal representation of ^A.
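
To see the ^A delimiter before exporting, the Hive data files can be inspected directly; a minimal sketch assuming the default warehouse path used above (cat -v renders the non-printable \001 as ^A):

$ hadoop fs -cat /user/hive/warehouse/tableName/* | head -5 | cat -v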


Troubleshooting

Handling NULLs on export: if the following error occurs during an export, consider the --input-null-string and --input-null-non-string options.

Caused by: java.lang.RuntimeException: Can't parse input data: 'N'
	at optimizedCombination2013.__loadFromFields(optimizedCombination2013.java:684)
	at optimizedCombination2013.parse(optimizedCombination2013.java:552)
	at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
	... 10 more
Caused by: java.lang.NumberFormatException: For input string: "N"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
	at java.lang.Integer.parseInt(Integer.java:449)
	at java.lang.Integer.valueOf(Integer.java:554)
	at optimizedCombination2013.__loadFromFields(optimizedCombination2013.java:636)
	... 12 more

The background to this error: if you cat the table's data files, they have the following structure:

112680100001121636131000011216NNNNNNNN1
111216100001268036131000012680NNNNNNNN1
110000112161268036131121612680NNNNNNNN1

Those 'N' characters are the problem. If you run a SELECT on the table in Hive, the result looks like this:

1	12680	["10000","11216"]	3	61	3	10000	11216	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	1
1	11216	["10000","12680"]	3	61	3	10000	12680	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	1
1	10000	["11216","12680"]	3	61	3	11216	12680	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	1

As you can see, those fields are NULL values. The problem is that Sqoop does not know this, so it must be handled with Sqoop's --input-null-string and --input-null-non-string options.

#!/bin/bash

# ^A (\x01) is Hive's default field delimiter
hive_char=$( printf "\x01" )

sqoop export -P --connect "jdbc:sqlserver://server_ip;database=default;" \
  --username sa --table table_name -m 1 \
  --export-dir /apps/hive/warehouse/what_you_want \
  --input-fields-terminated-by "${hive_char}" \
  --lines-terminated-by '\n' \
  --update-key primary_column --update-mode allowinsert \
  --input-null-string "\\N" --input-null-non-string "\\N"


Error when importing data from an RDBMS (Oracle) into HDFS

$ sqoop import \
> --connect jdbc:oracle:thin:@//192.168.1.190:1521/ora10g \
> --username scott \
> --password tiger \
> --table EMP \
> --hive-import
13/07/29 16:23:00 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
13/07/29 16:23:00 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
13/07/29 16:23:00 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
13/07/29 16:23:00 INFO manager.SqlManager: Using default fetchSize of 1000
13/07/29 16:23:00 INFO tool.CodeGenTool: Beginning code generation
13/07/29 16:23:00 INFO manager.OracleManager: Time zone has been set to GMT
13/07/29 16:23:00 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM BULLETIN t WHERE 1=0
13/07/29 16:23:00 ERROR tool.ImportTool: Imported Failed: Attempted to generate class with no columns!
> Fix
Change the DB account name given to --username to uppercase.
  e.g. --username SCOTT
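
With that change, the command above becomes:

$ sqoop import \
  --connect jdbc:oracle:thin:@//192.168.1.190:1521/ora10g \
  --username SCOTT \
  --password tiger \
  --table EMP \
  --hive-import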

Error when importing data from an RDBMS (Oracle) into Hive

13/07/29 16:46:19 INFO hive.HiveImport: FAILED: Error in metadata: java.lang.RuntimeException:
Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
13/07/29 16:46:19 INFO hive.HiveImport: FAILED: Execution Error,
return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
13/07/29 16:46:19 ERROR tool.ImportTool: Encountered IOException running import job:
 java.io.IOException: Hive exited with status 1
        at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:364)
        at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:314)
        at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:226)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:415)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
> Procedure
1) Stop the RunJar process shown by jps (the process left over from the previous job)
2) Delete the data created on HDFS
- If the import is re-run without deleting the HDFS data, a message saying the data already exists is printed and the job stops.
  This is because the import first stages the data in the Hadoop user's home directory on HDFS and, once the job finishes, moves it into the Hive warehouse (running the DDL as part of that step).
  So on the second run the data already exists on HDFS, which triggers the error.

> Steps

1) Stop the RunJar process

[hadoop@master hive]$ jps
30520 JobTracker
28202 RunJar
30336 NameNode
31128 Jps

[hadoop@master hive]$ kill 28202
[hadoop@master hive]$ jps
30520 JobTracker
30336 NameNode
31155 Jps

2) Delete the data on HDFS

$ hadoop fs -rmr /user/hadoop/EMP
Moved to trash: hdfs://master.namenode:9000/user/hadoop/EMP

