« Marcel Krčah  « All entries

How a newline can ruin your Hive

If you do not fully understand how Hive/Impala stores your data, it might cost you badly.

I've learnt the hard way.

Symptom #1: Weird values in ingested Hive table

You double-checked with select distinct(gender) from customers that the gender column in your source RDBMS really contains only values male, female and NULL. However, when you ingest the table into Hive (maybe with Apache Sqoop or Cornet) and run the check there, you see that weird values have creeped in:

> select distinct(gender) from customers;  // Run in Hive/Impala
+-----------------+
| gender          |
+-----------------+
| NULL            |
| CA 94304        |
| male            |
| Page Mill Road  |
| female          |
+-----------------+

Symptom #2: Inconsistent size of ingested Hive table

You check with select count(*) from customers that the table in your RDBMS table has 156,010 rows. You ingest the table into Hive and BAM! All of a sudden there are 14 more customers.

Maybe the business is doing great and you gained 14 customers before you started the ingestion? Wondering, you check the source table size again.

Still 156,010.

Symptom #3: Weird data when copied from another Hive table

This one is by far the most strange one.

You already have the customers table ingested in Hive. Values in the gender column look fine. The table has correct size of 156,010. All is fine.

You do some data cleaning with SQL and copy the result into a new table as follows:

CREATE TABLE customers_superclean
AS SELECT name, coalesce(gender, 'unknown') FROM customers;

You check the size of the new table. BAM! 25 new customers in there. Impossible!

No errors, no warnings

Neither Hive, Impala nor Sqoop gave you any error or warning. You have no idea what's going on. Somewhere at the back of your head, you start questioning the whole Hadoop infrastructure. You feel like the cool stuff you do all day with the data has been compromised.

Cause: Hive delimiters present in the data

All these problems can occur if the ingested data contains characters that Hive uses to delimit fields and rows. Typically, these are newlines. For instance, let's assume that the customers table in the source RDBMS contains the following data (notice the newline in the first street name):

gender street
male Page Mill Road \n CA 94304
female Great America Parkway

If one naively ingests this data into a Hive table using the default settings (a text file with rows delimited by \01 and fields delimited by newlines), the data gets broken. Look at the corresponding delimited file in HDFS:

male\01Page Mill Road
CA 94304
female\01Great America Parkway

When Hive/Impala reads from this file, it finds three customers instead of two. The additional customer is of gender CA 94304 and has no street specified. On top of that, the street field of the first customer misses the postal code.

Particularly interesting case: Copying binary data to text file

Assume that the Hive table called customers uses Avro or Parquet for data storage and that the data contains newlines. Querying the customers table directly via Hive or Impala works as expected. However, let's create a new table as follows:

CREATE TABLE customers_superclean
AS SELECT gender, street FROM CUSTOMERS;

The problem with this command is that the new table is backed by a newline-delimited text file. An Avro/Parquet record which contains a newline will be split into two records in the new table. Symptom #3 is born.

Solution 1: Use binary storage formats like Avro or Parquet

If possible, use binary storage for your Hive tables, for instance Apache Avro or Apache Parquet. Since these formats do not use dedicated characters to split a file into records and fields, Hive/Impala can read data with special characters properly.

Also, Avro and Parquet make it possible to safely copy records from one Hive table to another. For instance, checkout the following command which copies data to a Parquet table:

CREATE TABLE customers_superclean STORED AS parquet
AS SELECT gender, street FROM customers;

The Parquet storage will ensure that even if data in the customers table contains newlines or another delimiters, the data will be properly copied and interpreted in the new table.

Solution 2: Ensure the delimited text file does not contain Hive delimiters

When ingesting data into a delimited text file, you have to ensure that the file does not contain characters that Hive uses to split data into rows and fields. In general, there are two options to achieve this:

  • Remove Hive delimiters from the data before ingestion. If you use Sqoop, there are handy parameters which can do that for you. Checkout the Sqoop docs and look for --hive-drop-import-delims and --hive-delims-replacement parameters.
  • Use custom Hive delimiters that are not present in the data. Unused non-printable characters are good candidates, for instance \01 or \02. You can instruct Hive to use custom delimiters as follows:

    CREATE TABLE customers
    ROW FORMAT DELIMITED
    LINES TERMINATED BY '\002'
    FIELDS TERMINATED BY '\001';
    

Happy ingesting!