
How to ingest data from Azure DataMarket into Hadoop


For those of you who haven't encountered it yet, Azure DataMarket is quite an exciting platform by Microsoft which provides standardized access to plenty of interesting datasets. As of now, there are over 200 datasets covering a broad range of topics, including weather, demographics, automotive, agriculture and real estate. There might be a golden nugget for your business hidden in there, so I would definitely recommend checking the platform out.

Recently, we discovered such a golden nugget ourselves: details about every car sold in the Netherlands in the past 15 years. A pretty exciting dataset, considering that the company I work for operates in the Dutch market for electric vehicles. What is more, the data is free, updated daily and comes with a REST API. What more could we wish for?

To leverage the full potential of the dataset, we ingest it into Hive, so that we can query it with plain SQL.

Avro-backed external Hive tables FTW! #

Regarding the storage, we opted for an Avro-backed external Hive table. These types of tables rock: the data files simply stay in place on HDFS, the schema lives in a plain .avsc file that the table references, and the result is queryable from both Hive and Impala.

Ingestion: From OData XML to an Avro Hive table #

Here are the detailed steps we follow to get the data from DataMarket into Hive.

Step 1: Download the dataset as XML files, in a standardized OData format #

DataMarket publishes datasets as XML. Each XML file follows the same OData schema (check out the example XML with car data). DataMarket provides this data via a REST interface, which supports various query parameters. Using these parameters, you control which subset of the data you download.

Since the car dataset that we are interested in is about 15 GiB in size, filtering proved very useful for our purposes. With the $filter, $top and $inlinecount parameters and a bit of curl and Bash, it is straightforward to download the whole dataset into files cars_2000.xml till cars_2015.xml, where the number indicates the year in which a car was registered in the Netherlands.
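
To give an idea, here is a minimal sketch of such a download for a single year. The service URL and the RegistrationDate field are illustrative placeholders rather than the actual values, and the account key (which DataMarket expects via HTTP basic auth) is assumed to be in DATAMARKET_KEY:

$ curl --user ":${DATAMARKET_KEY}" \
       --output cars_2013.xml \
       "https://api.datamarket.azure.com/SomePublisher/DutchCars/v1/Cars?\$filter=year(RegistrationDate)%20eq%202013"

In practice we wrap this in a loop over the years 2000 to 2015 and page through larger result sets with $top, using $inlinecount=allpages to keep track of the total number of records.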

Step 2: Convert the dataset into Avro, using odata2avro utility #

To convert an XML file in OData schema to Avro, we created a Python command-line tool called odata2avro which does all the heavy lifting for you.

Just install the tool with pip install odata2avro and use it as follows:

$ odata2avro cars_2013.xml cars.avsc cars_2013.avro

This command reads the XML file cars_2013.xml and creates two files: the Avro schema cars.avsc and the Avro data file cars_2013.avro.
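
Converting all the yearly files from step 1 is then a one-liner (cars.avsc gets rewritten on every iteration, which is harmless here since all yearly files share the same schema):

$ for y in $(seq 2000 2015); do odata2avro "cars_${y}.xml" cars.avsc "cars_${y}.avro"; done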

Step 3: Upload the Avro schema and Avro files to HDFS #

In this case, let's create the /datamarket HDFS directory and upload the schema and the data to it:

$ hdfs dfs -mkdir -p /datamarket/cars
$ hdfs dfs -put cars_20*.avro /datamarket/cars
$ hdfs dfs -put cars.avsc /datamarket
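
A quick sanity check that both the schema and the data landed where we expect them:

$ hdfs dfs -ls -R /datamarket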

Step 4: Create an external Avro-backed Hive table #

To create an external Hive table whose schema is defined by /datamarket/cars.avsc and whose data lives in /datamarket/cars, use the following Hive statement with the AvroSerDe:

CREATE EXTERNAL TABLE cars
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/datamarket/cars'
TBLPROPERTIES ('avro.schema.url'='hdfs:///datamarket/cars.avsc');

Step 5: Query & profit! #

Congrats! The data is now accessible from both Hive and Impala:

$ impala-shell --query 'select * from cars;'
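
The same data can be queried from Hive as well, for example:

$ hive -e 'SELECT COUNT(*) FROM cars;'

One caveat: since the table was created through Hive, Impala might not see it until its metadata cache is refreshed:

$ impala-shell --query 'INVALIDATE METADATA cars;'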

Bonus step: Keeping data updated #

The car dataset is append-only, so for us it's pretty straightforward to keep the data in Hive updated. We run a daily job which downloads the newly added records using the $filter parameter, converts them to Avro with odata2avro and puts the resulting file into /datamarket/cars.
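
As a rough sketch of what such a job can look like (again with a placeholder service URL and field name, and the account key assumed in DATAMARKET_KEY):

$ SINCE=$(date -d yesterday +%Y-%m-%d)
$ curl --user ":${DATAMARKET_KEY}" --output "cars_${SINCE}.xml" \
      "https://api.datamarket.azure.com/SomePublisher/DutchCars/v1/Cars?\$filter=RegistrationDate%20ge%20datetime'${SINCE}T00:00:00'"
$ odata2avro "cars_${SINCE}.xml" cars.avsc "cars_${SINCE}.avro"
$ hdfs dfs -put "cars_${SINCE}.avro" /datamarket/cars

Since the table is external and points at the whole /datamarket/cars directory, Hive picks up the new file automatically; for Impala, a REFRESH cars is enough to make it visible.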

That's all!

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering. If you like what you read, sign up for my newsletter.