For those of you who haven't encountered it yet, Azure DataMarket is quite an exciting platform by Microsoft which provides standardized access to plenty of interesting datasets. As of now, there are over 200 datasets covering a broad range of topics, including weather, demographics, automotive, agriculture, and real estate. There might be a golden nugget for your business hidden in there, so I would definitely recommend going and checking the platform out.
Recently, we discovered such a golden nugget ourselves: details about every car sold in the Netherlands in the past 15 years. A pretty exciting dataset, considering that the company I work for operates in the Dutch electric vehicle market. What is more, the data is free, updated daily, and comes with a REST API. What more could we wish for?
To fully leverage the potential of the dataset, we ingest it into Hive. That way we can query it with SQL from both Hive and Impala, and whenever new data arrives we only need to run REFRESH <table-name>; to notify Impala about it.
Here are the detailed steps we follow to get that data from DataMarket into Hive.
DataMarket publishes datasets as XML. Each XML document follows the same OData schema (check out the example XML with car data). DataMarket provides this data via a REST interface, which supports various query parameters; using the parameters, you control the subset of data you want to download.
Since the car dataset that we are interested in is about 15 GiB, filtering proved very useful for our purposes. With the $filter and $inlinecount parameters and a bit of curl and Bash, it is straightforward to download the whole dataset into per-year files such as cars_2015.xml, where the number indicates the year in which a car was registered in the Netherlands.
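As a rough sketch of that download step: the service path and the RegistrationYear field below are placeholders rather than the real cars feed, and DataMarket authenticates with HTTP Basic auth using your account key. Real code would also need to follow the feed's paging, which is omitted here.

# Illustrative only: adjust the URL and field name to the actual dataset.
ACCOUNT_KEY='<your-account-key>'
for year in $(seq 2000 2015); do
  curl -s -u ":$ACCOUNT_KEY" \
    "https://api.datamarket.azure.com/<publisher>/Cars/v1/Cars?\$filter=RegistrationYear%20eq%20$year&\$inlinecount=allpages" \
    -o "cars_${year}.xml"
done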
To convert an XML file in the OData schema to Avro, we created a Python command-line tool called odata2avro, which does all the heavy lifting for you. Just install the tool with pip install odata2avro and use it as follows:
$ odata2avro cars_2013.xml cars.avsc cars_2013.avro
This command reads the XML file cars_2013.xml and creates two files:
- cars.avsc: an Avro schema, in JSON, describing the dataset;
- cars_2013.avro: a binary Avro file containing the dataset from the XML.
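To give a feel for what the generated schema looks like, here is a minimal hand-written example in the same spirit. The actual field names and types are dictated by the OData feed, so the ones below are invented:

{
  "type": "record",
  "name": "cars",
  "fields": [
    {"name": "LicensePlate", "type": ["null", "string"]},
    {"name": "Make", "type": ["null", "string"]},
    {"name": "RegistrationYear", "type": ["null", "int"]}
  ]
}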
Next, let's create the /datamarket directory in HDFS and upload the data there:
$ hdfs dfs -mkdir -p /datamarket/cars
$ hdfs dfs -put cars_20*.avro /datamarket/cars
$ hdfs dfs -put cars.avsc /datamarket
To create an external Hive table whose schema is read from /datamarket/cars.avsc and whose data lives in /datamarket/cars, use the following Hive statement with the AvroSerDe:
CREATE EXTERNAL TABLE cars
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/datamarket/cars'
TBLPROPERTIES ('avro.schema.url'='hdfs:///datamarket/cars.avsc');
Congrats! From now on, the data is accessible in both Hive and Impala:
$ impala-shell --query 'select * from cars;'
The car dataset is append-only, so for us it's pretty straightforward to keep the data in Hive updated. We run a daily job which downloads the newly added records from DataMarket, converts them to Avro, uploads the Avro files to HDFS, and finally runs REFRESH cars; so that Impala registers the new data.
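Put together, the daily job is essentially the pipeline above in miniature. Here is a hedged sketch: the feed URL and the RegistrationYear field are again placeholders, and re-fetching the whole running year each day is a simplification of the incremental download.

#!/bin/bash
# Illustrative sketch of the daily update job; adjust URL and field name.
set -euo pipefail

year=$(date +%Y)

# 1. Re-download the current year's records; the dataset is append-only,
#    so this picks up all new registrations.
curl -s -u ":$ACCOUNT_KEY" \
  "https://api.datamarket.azure.com/<publisher>/Cars/v1/Cars?\$filter=RegistrationYear%20eq%20$year&\$inlinecount=allpages" \
  -o "cars_${year}.xml"

# 2. Convert the XML to Avro.
odata2avro "cars_${year}.xml" cars.avsc "cars_${year}.avro"

# 3. Replace this year's Avro file in HDFS (-f overwrites the old one).
hdfs dfs -put -f "cars_${year}.avro" /datamarket/cars/

# 4. Tell Impala that new data has arrived.
impala-shell --query 'REFRESH cars;'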