How a newline can ruin your Hive

Published on 05 May 2015, in #data-engineering

If you do not fully understand how Hive/Impala stores your data, it might cost you badly.

I've learnt the hard way.

Symptom #1: Weird values in ingested Hive table #

You double-checked with select distinct(gender) from customers that the gender column in your source RDBMS really contains only values male, female and NULL. However, when you ingest the table into Hive (maybe with Apache Sqoop or Cornet) and run the check there, you see that weird values have creeped in:

> select distinct(gender) from customers;  // Run in Hive/Impala
+-----------------+
| gender          |
+-----------------+
| NULL            |
| CA 94304        |
| male            |
| Page Mill Road  |
| female          |
+-----------------+

Symptom #2: Inconsistent size of ingested Hive table #

You check with select count(*) from customers that the table in your RDBMS table has 156,010 rows. You ingest the table into Hive and BAM! All of a sudden there are 14 more customers.

Maybe the business is doing great and you gained 14 customers before you started the ingestion? Wondering, you check the source table size again.

Still 156,010.

Symptom #3: Weird data when copied from another Hive table #

This one is by far the most strange one.

You already have the customers table ingested in Hive. Values in the gender column look fine. The table has correct size of 156,010. All is fine.

You do some data cleaning with SQL and copy the result into a new table as follows:

CREATE TABLE customers_superclean
AS SELECT name, coalesce(gender, 'unknown') FROM customers;

You check the size of the new table. BAM! 25 new customers in there. Impossible!

No errors, no warnings #

Neither Hive, Impala nor Sqoop gave you any error or warning. You have no idea what's going on. Somewhere at the back of your head, you start questioning the whole Hadoop infrastructure. You feel like the cool stuff you do all day with the data has been compromised.

Cause: Hive delimiters present in the data #

All these problems can occur if the ingested data contains characters that Hive uses to delimit fields and rows. Typically, these are newlines. For instance, let's assume that the customers table in the source RDBMS contains the following data (notice the newline in the first street name):

gender	street
male	Page Mill Road`\n`CA 94304
female	Great America Parkway

If one naively ingests this data into a Hive table using the default settings (a text file with rows delimited by \01 and fields delimited by newlines), the data gets broken. Look at the corresponding delimited file in HDFS:

male\01Page Mill Road
CA 94304
female\01Great America Parkway

When Hive/Impala reads from this file, it finds three customers instead of two. The additional customer is of gender CA 94304 and has no street specified. On top of that, the street field of the first customer misses the postal code.

Particularly interesting case: Copying binary data to text file #

Assume that the Hive table called customers uses Avro or Parquet for data storage and that the data contains newlines. Querying the customers table directly via Hive or Impala works as expected. However, let's create a new table as follows:

CREATE TABLE customers_superclean
AS SELECT gender, street FROM CUSTOMERS;

The problem with this command is that the new table is backed by a newline-delimited text file. An Avro/Parquet record which contains a newline will be split into two records in the new table. Symptom #3 is born.

Solution 1: Use binary storage formats like Avro or Parquet #

If possible, use binary storage for your Hive tables, for instance Apache Avro or Apache Parquet. Since these formats do not use dedicated characters to split a file into records and fields, Hive/Impala can read data with special characters properly.

Also, Avro and Parquet make it possible to safely copy records from one Hive table to another. For instance, checkout the following command which copies data to a Parquet table:

CREATE TABLE customers_superclean STORED AS parquet
AS SELECT gender, street FROM customers;

The Parquet storage will ensure that even if data in the customers table contains newlines or another delimiters, the data will be properly copied and interpreted in the new table.

Solution 2: Ensure the delimited text file does not contain Hive delimiters #

When ingesting data into a delimited text file, you have to ensure that the file does not contain characters that Hive uses to split data into rows and fields. In general, there are two options to achieve this:

Remove Hive delimiters from the data before ingestion. If you use Sqoop, there are handy parameters which can do that for you. Checkout the Sqoop docs and look for --hive-drop-import-delims and --hive-delims-replacement parameters.
Use custom Hive delimiters that are not present in the data. Unused non-printable characters are good candidates, for instance \01 or \02. You can instruct Hive to use custom delimiters as follows:

CREATE TABLE customers
ROW FORMAT DELIMITED
  LINES TERMINATED BY '\002'
  FIELDS TERMINATED BY '\001';

Happy ingesting!

← Previous post: How to ingest data from Azure DataMarket into Hadoop
→ Next post: Resist the quench for coding

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter