If you do not fully understand how Hive/Impala stores your data, it might cost you badly.
I've learnt the hard way.
You double-checked with

select distinct(gender) from customers

that the gender column in your source RDBMS really contains only the values NULL, male, and female. However, when you ingest the table into Hive (maybe with Apache Sqoop) and run the same check there, you see that weird values have crept in:

> select distinct(gender) from customers;  -- run in Hive/Impala
+----------------+
| gender         |
+----------------+
| NULL           |
| CA 94304       |
| male           |
| Page Mill Road |
| female         |
+----------------+
You check with

select count(*) from customers

that the table in your source RDBMS has 156,010 rows. You ingest the table into Hive and BAM! All of a sudden there are 14 more customers.
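The same check, run on both sides, tells two different stories. The Hive figure below is not from the original post; it is simply the source count plus the 14 phantom rows:

-- in the source RDBMS
select count(*) from customers;   -- 156,010

-- in Hive/Impala, after ingestion
select count(*) from customers;   -- 156,024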
Maybe the business is doing great and you gained 14 customers just before you started the ingestion? Puzzled, you check the source table size again.
This one is by far the strangest. You already have the customers table ingested in Hive. Values in the gender column look fine. The table has the correct size of 156,010 rows. All is fine.
You do some data cleaning with SQL and copy the result into a new table as follows:
CREATE TABLE customers_superclean AS SELECT name, coalesce(gender, 'unknown') AS gender FROM customers;
You check the size of the new table. BAM! 25 new customers in there. Impossible!
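Again, a quick count makes the discrepancy concrete; the second figure below is not from the original post, it is simply 156,010 plus the 25 surplus rows:

select count(*) from customers;              -- 156,010
select count(*) from customers_superclean;   -- 156,035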
Neither Hive, Impala, nor Sqoop gave you any error or warning. You have no idea what's going on. Somewhere in the back of your head, you start questioning the whole Hadoop infrastructure. You feel like the cool stuff you do all day with the data has been compromised.
All these problems can occur if the ingested data contains characters that Hive uses to delimit fields and rows. Typically, these are newlines. For instance, let's assume that the
customers table in the source RDBMS contains the following data (notice the newline in the first street name):
+--------+-----------------------+
| gender | street                |
+--------+-----------------------+
| male   | Page Mill Road        |
|        | CA 94304              |
| female | Great America Parkway |
+--------+-----------------------+
If one naively ingests this data into a Hive table using the default settings (a text file with fields delimited by \001 and rows delimited by newlines), the data gets broken. Look at the corresponding delimited file in HDFS:
male\001Page Mill Road
CA 94304
female\001Great America Parkway
When Hive/Impala reads from this file, it finds three customers instead of two. The additional customer has the gender CA 94304 and no street specified. On top of that, the street field of the first customer is missing the postal code. This is exactly where the bogus gender values of symptom #1 and the phantom rows of symptom #2 come from.
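Rendered as a table, the rows Hive ends up with look like this (my own rendering of what the description above implies; the missing street field comes back as NULL):

+----------+-----------------------+
| gender   | street                |
+----------+-----------------------+
| male     | Page Mill Road        |
| CA 94304 | NULL                  |
| female   | Great America Parkway |
+----------+-----------------------+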
Assume that the Hive table called
customers uses Avro or Parquet for data storage and that the data contains newlines. Querying the
customers table directly via Hive or Impala works as expected. However, let's create a new table as follows:
CREATE TABLE customers_superclean AS SELECT gender, street FROM customers;
The problem with this command is that, by default, the new table is backed by a newline-delimited text file. An Avro/Parquet record which contains a newline will be split into two records in the new table. Symptom #3 is born.
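You can see this for yourself by asking Hive for the new table's definition. The listing below is a trimmed sketch of the typical output, not verbatim; the point is the text-based SerDe and input format that the CTAS statement silently picked:

> SHOW CREATE TABLE customers_superclean;
...
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
...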
If possible, use binary storage for your Hive tables, for instance Apache Avro or Apache Parquet. Since these formats do not use dedicated characters to split a file into records and fields, Hive/Impala can read data with special characters properly.
Also, Avro and Parquet make it possible to safely copy records from one Hive table to another. For instance, check out the following command, which copies data into a Parquet table:
CREATE TABLE customers_superclean STORED AS parquet AS SELECT gender, street FROM customers;
The Parquet storage will ensure that even if data in the customers table contains newlines or other delimiter characters, the data will be properly copied and interpreted in the new table.
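A quick way to gain confidence in such a copy is to compare the row counts of the source and the target; unlike in symptom #3, they should now agree:

select count(*) from customers;              -- 156,010
select count(*) from customers_superclean;   -- 156,010 as well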
When ingesting data into a delimited text file, you have to ensure that the file does not contain the characters that Hive uses to split data into rows and fields. In general, there are two options to achieve this: either strip or replace the offending characters during ingestion (Sqoop, for example, provides the --hive-drop-import-delims and --hive-delims-replacement options for exactly this), or choose delimiters that cannot occur in your data. The following table definition takes the second route, using the non-printable characters \001 and \002:
CREATE TABLE customers (gender STRING, street STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\002';

Be warned, though, that many Hive versions accept only '\n' as the line terminator, in which case stripping or replacing the delimiters during ingestion is the more reliable option.
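Whichever option you choose, it pays off to repeat the checks from the beginning of this post right after ingestion. A minimal sanity check, assuming the table names used throughout this post, might look like this:

-- should match the row count reported by the source RDBMS
select count(*) from customers;

-- should list only the values that exist in the source column
select distinct(gender) from customers;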