How to Use Sqoop Import


Importing data from an RDBMS (MySQL) into HDFS

Full import (import all of a table's data)

Run the following command from the Sqoop installation directory:

bin/sqoop import \
--connect jdbc:mysql://xxxx:3306/sqoop_test \
--username root \
--password 061111 \
--table staff \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t"

--connect: the JDBC URL of the database to connect to (sqoop_test is the database name)
--username and --password: the username and password used to connect to that database
--table: the table whose data is to be imported
--target-dir: the HDFS directory to import into (if not specified, the default is /user/<username>/<table name>)
--delete-target-dir: if the directory already exists in HDFS, delete it first and then import the data into it
--num-mappers: the number of map tasks, 4 by default; it determines how many files end up in HDFS (the table's data is split across that many files)

Note: there must be a space between consecutive --xxx options.
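
To check the result, the target directory can be listed and a part file inspected; a quick sketch, using the /user/sqoop path and single mapper from the example above:

hdfs dfs -ls /user/sqoop
hdfs dfs -cat /user/sqoop/part-m-00000 | head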

Query import (partial import: import the result set of a SQL query)
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 2 \
--fields-terminated-by "\t" \
--query 'select name,sex from staff where id >=2 and $CONDITIONS' \
--split-by id

This imports into HDFS the result of running the SQL statement 'select name,sex from staff where id >=2 and $CONDITIONS'.

Notes:
1. If the --query value is wrapped in double quotes, $CONDITIONS must be escaped as \$CONDITIONS so the shell does not expand it as one of its own variables (see the sketch below).
2. The WHERE clause must contain $CONDITIONS; otherwise the import fails with: java.io.IOException: Query [select name,sex from staff where id >=2;] must contain '$CONDITIONS' in WHERE clause.
3. When more than one mapper is used, --split-by must be set so that Sqoop knows which column to split the work on.
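
For reference, a double-quoted variant of the same import might look like the following sketch; note the escaped \$CONDITIONS:

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 2 \
--fields-terminated-by "\t" \
--query "select name,sex from staff where id >=2 and \$CONDITIONS" \
--split-by id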

Column-selective import (import only specific columns)

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--columns id,sex \
--table staff

Use the --columns parameter to select which columns of the staff table to import. Note: when importing multiple columns, separate them with commas and do not add spaces; otherwise the option cannot be parsed and the command fails.

Conditional import (similar to the query import above)
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table staff \
--where "id>6"

Filter rows directly with the --where parameter.
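
--where can also be combined with --columns on the same --table import; a minimal sketch, reusing the connection settings above:

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--target-dir /user/sqoop \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--table staff \
--columns id,sex \
--where "id>6"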

Importing data from an RDBMS (MySQL) into Hive

Basic import

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--table staff \
-m 1 \
--hive-import \
--fields-terminated-by "\t" \
--hive-overwrite \
--hive-database dbname \
--hive-table staff_yzy

Here, --hive-import indicates that the data is imported into Hive, --hive-table names the Hive table to import into (the table is created if it does not exist), and --hive-overwrite overwrites the data already in the table. Without --hive-overwrite a new file is written alongside the existing data, which amounts to an append.
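
Once the job finishes, the result can be verified from Hive; a quick sketch, using the dbname and staff_yzy names from the example above:

hive -e "use dbname; select * from staff_yzy limit 5;"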

Incremental import (uses the --incremental parameter, which has two modes: append and lastmodified)

append mode:
bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--table staff \
-m 1 \
--fields-terminated-by "\t" \
--target-dir /user/hive/warehouse/staff666 \
--check-column id \
--incremental append \
--last-value 3

Here, --incremental append selects the append mode of incremental import, --check-column names the column Sqoop checks, and --last-value 3 means that only rows whose check-column value is greater than 3 are appended to the corresponding Hive table; the row with value 3 itself is not included.

Note: incremental import cannot be combined with the --hive-* parameters; this is not yet supported. You can only point it at a directory (the directory backing the Hive table), not directly at a Hive table, otherwise it fails with: Append mode for hive imports is not yet supported. Please remove the parameter --append-mode
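
To avoid tracking --last-value by hand, the incremental import can be saved as a Sqoop job, in which case the Sqoop metastore records the last imported value after every run. A minimal sketch reusing the command above (the job name staff_incr is arbitrary):

bin/sqoop job --create staff_incr -- import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--table staff \
-m 1 \
--fields-terminated-by "\t" \
--target-dir /user/hive/warehouse/staff666 \
--check-column id \
--incremental append \
--last-value 3

bin/sqoop job --exec staff_incr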

lastmodified mode

This mode requires the MySQL table to contain a column of type timestamp.

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--table staff_timestamp \
--check-column last_modified \
--incremental lastmodified \
--last-value "2017-09-28 22:20:38" \
-m 1 \
--append

This imports the rows modified after the last-value timestamp ("2017-09-28 22:20:38"), including rows stamped with exactly that time. To land the data in a Hive table, point --target-dir at the table's directory, e.g. /user/hive/warehouse/<table name>.

Note: with lastmodified you must state how the incremental data is applied, either --append or --merge-key. With --append the rows modified since last-value are simply appended to the table in Hive; with --merge-key the rows are merged on the given key column, so duplicates are collapsed and only one row per key (the newest record) is kept (see the sketch below).
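
A --merge-key variant of the command above might look like the following sketch; it assumes id is the primary key of staff_timestamp:

bin/sqoop import \
--connect jdbc:mysql://hadoop102:3306/sqoop_test \
--username root \
--password 061111 \
--table staff_timestamp \
--check-column last_modified \
--incremental lastmodified \
--last-value "2017-09-28 22:20:38" \
-m 1 \
--merge-key id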

Other useful parameters

--hive-drop-import-delims     drops \n, \r, and \01 from fields
Argument    Description
--append    Append data to an existing dataset in HDFS
--as-avrodatafile   Imports data to Avro Data Files
--as-sequencefile   Imports data to SequenceFiles
--as-textfile   Imports data as plain text (default)
--as-parquetfile    Imports data to Parquet Files
--boundary-query <statement>    Boundary query to use for creating splits
--columns <col,col,col…>    Columns to import from table
--delete-target-dir Delete the import target directory if it exists
--direct    Use direct connector if exists for the database
--fetch-size <n>    Number of entries to read from database at once.
--inline-lob-limit <n>  Set the maximum size for an inline LOB
-m,--num-mappers <n>    Use n map tasks to import in parallel
-e,--query <statement>  Import the results of statement.
--split-by <column-name>    Column of the table used to split work units. Cannot be used with --autoreset-to-one-mapper option.
--split-limit <n>   Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
--autoreset-to-one-mapper   Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with --split-by <col> option.
--table <table-name>    Table to read
--target-dir <dir>  HDFS destination dir
--temporary-rootdir <dir>   HDFS directory for temporary files created during import (overrides default "_sqoop")
--warehouse-dir <dir>   HDFS parent for table destination
--where <where clause>  WHERE clause to use during import
-z,--compress   Enable compression
--compression-codec <c> Use Hadoop codec (default gzip)
--null-string <null-string> The string to be written for a null value for string columns
--null-non-string <null-string> The string to be written for a null value for non-string columns


--hive-home <dir>   Override $HIVE_HOME
--hive-import   Import tables into Hive (Uses Hive’s default delimiters if none are set.)
--hive-overwrite    Overwrite existing data in the Hive table.
--create-hive-table If set, then the job will fail if the target hive table exists. By default this property is false.
--hive-table <table-name>   Sets the table name to use when importing to Hive.
--hive-drop-import-delims   Drops \n, \r, and \01 from string fields when importing to Hive.
--hive-delims-replacement   Replace \n, \r, and \01 from string fields with user defined string when importing to Hive.
--hive-partition-key    Name of the Hive field on which the partitions are sharded
--hive-partition-value <v>  String value that serves as the partition key for the data imported into Hive in this job.
--map-column-hive <map> Override default mapping from SQL type to Hive type for configured columns. If specify commas in this argument, use URL encoded keys and values, for example, use DECIMAL(1%2C%201) instead of DECIMAL(1, 1).
Reference script: importing from MySQL into Hive
#!/bin/bash 
##/*********************************************************************

#--------------------------------------------------common section BEGIN--------------------------
# Contents of the sanitized config file (sq_shell.config):
#   CONNECTURL=jdbc:mysql://***:3306
#   USERNAME=***
#   PASSWORD=***
#---------------------#
source /etc/profile
source /home/jflm/app/dfm/sh/param/sq_shell.config  # sanitized config file: JDBC URL, username and password
# nanosecond timestamp used as a random value to avoid target-dir collisions
randnum=`date +%s%N`

#mysql_database
database=machine

#mysql table 
mysql_table=tb_loan_member_info

#hive_table
hive_table=time.tb_loan_member_info    
    ## 1. Entry parameter: read the run-date argument; the default is today, consistent with the parameter passed by the current scheduler
    echo $#

    if [ $# = 0 ]; then
        p_partition_d=`date -d "-1 days" +%Y%m%d`
        date=`date -d -0days +%Y%m%d`
    fi
    if [ $# = 1 ]; then
        p_partition_d=`date -d "$1" +%Y%m%d`
        date=`date -d "+1 days $1" +%Y%m%d`
    fi

    echo "p_partition_d:${p_partition_d}" 
    echo "date:${date}"

#--------------------------------------------------common section END-------------------------------------------------#


#-----------------------------------------------------execution section START-----------------------------------------------------------------#
sqoop import \
--connect $CONNECTURL/$database \
--username $USERNAME \
--password $PASSWORD \
--query 'select eid,user_id,mobile_no,register_date,real_name
 from '$mysql_table' where 1=1 and $CONDITIONS' \
--target-dir /user/jflm/odl/$database/$mysql_table/$randnum \
--num-mappers 1 \
--hive-import \
--hive-table $hive_table \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
--lines-terminated-by '\n' \
--delete-target-dir \
--hive-overwrite \
--hive-partition-key pt_log_d \
--hive-partition-value $p_partition_d
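
The script takes an optional date argument; a hedged usage example (the file name sqoop_import_member.sh is hypothetical):

bash sqoop_import_member.sh              # no argument: partition on yesterday's date
bash sqoop_import_member.sh 2017-09-28   # explicit date: partition on 20170928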