Google Analytics 360 is probably the best tool for monitoring online activity and clickstream analytics. The data becomes even more valuable when it can be combined with other data within the company. In this post we show how to load it into Apache Hive, and how easy this is to do with Apache NiFi.
We will use data from shop.googlemerchandisestore.com/, the official shop for Google gadgets.
The shop uses Google Analytics 360, and its clickstream data is exported into BigQuery. Configuring such an export takes only a few clicks. The Google Merchandise Store data has been published by Google in BigQuery within the bigquery-public-data dataset.
In a previous post we showed how to fetch a BigQuery table into NiFi with our custom NiFi processor. In this scenario we can use the same processor and set its scheduling strategy to “CRON driven”, so that it runs once a day.
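With “CRON driven” scheduling, NiFi expects a Quartz-style cron expression (the first field is seconds). For example, to run the processor once a day at 6:00 AM (the hour is an arbitrary choice for illustration):

```
0 0 6 * * ?
```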
Source table – ga_sessions
Table ga_sessions is a template table with a suffix corresponding to the day the data was collected. This means each day can be queried as a separate table of the form ga_sessions_20170801. The table name with yesterday's date can be constructed with NiFi Expression Language date-manipulation functions such as now, minus, and format.
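Assuming the processor exposes the table name as a property supporting Expression Language, yesterday's table name can be built like this (minus takes milliseconds, and 86400000 ms is one day):

```
ga_sessions_${now():minus(86400000):format('yyyyMMdd')}
```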
Our custom GetBigQueryTableProcessor returns a flowfile with JSON data. We will convert it into the Avro file format, so that querying the data in Hive becomes more convenient. To do so, we need the Avro schema of the dataset. In many cases the schema can be written manually; however, ga_sessions contains hundreds of fields. This can be solved in the following way:
- We export the table into Google Cloud Storage by clicking the “Export Table” button in the BigQuery console.
- We would love to export it directly to Avro; however, this does not work with ga_sessions data, so we export in JSON format instead.
- We download the file from Google Storage to a local computer.
- Locally we can run pySpark and convert JSON to Avro:
df = spark.read.json("local_path_to_gs_data")
df.write.format("com.databricks.spark.avro").save("destination_path")  # needs the spark-avro package; on Spark 2.4+ use format("avro")
- Once we have an Avro file, the schema can be extracted with avro-tools:
java -jar avro-tools-<version>.jar getschema local-avro-file-name
Please let us know if there is an easier way to determine the Avro schema of BigQuery table data 😉 Anyway, this approach works and does not require creating the schema manually.
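Under the hood, avro-tools' getschema simply reads the schema JSON that every Avro object container file stores in its header metadata, under the key avro.schema. As a stdlib-only illustration of that format (the one-field ga_sessions schema below is invented for the example, not the real export schema), here is a sketch that builds a minimal header and parses the schema back out:

```python
import json

def zigzag_varint(n):
    """Encode an integer the way Avro encodes longs: zigzag, then varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def avro_bytes(b):
    """Avro bytes/string value: zigzag-varint length prefix + raw bytes."""
    return zigzag_varint(len(b)) + b

def read_long(data, pos):
    """Decode a zigzag varint starting at pos; return (value, new_pos)."""
    z, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos

def get_schema(data):
    """Extract the JSON schema stored in an Avro container file header."""
    assert data[:4] == b"Obj\x01", "not an Avro object container file"
    pos, meta = 4, {}
    while True:
        count, pos = read_long(data, pos)  # map block size; 0 terminates
        if count == 0:
            break
        for _ in range(count):
            n, pos = read_long(data, pos)
            key, pos = data[pos:pos + n], pos + n
            n, pos = read_long(data, pos)
            meta[key], pos = data[pos:pos + n], pos + n
    return json.loads(meta[b"avro.schema"])

# Build a minimal header for a hypothetical one-field ga_sessions record.
schema = {"type": "record", "name": "ga_sessions",
          "fields": [{"name": "fullVisitorId", "type": ["null", "string"]}]}
header = (b"Obj\x01"
          + zigzag_varint(2)  # metadata map: one block, two entries
          + avro_bytes(b"avro.schema") + avro_bytes(json.dumps(schema).encode())
          + avro_bytes(b"avro.codec") + avro_bytes(b"null")
          + zigzag_varint(0)  # end of map
          + b"\x00" * 16)     # sync marker

print(get_schema(header)["name"])  # ga_sessions
```

This is only a sketch of the container format; for real files, stick with avro-tools or the avro Python package.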
The schema we have fetched will be used twice:
- To configure the NiFi processor when converting JSON to Avro.
- When defining the Hive table schema.
For the latter case, it is reasonable to put the Avro schema on HDFS.
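Assuming the schema was saved locally as ga_sessions.avsc (a filename chosen here for illustration), it can be uploaded with:

```
hdfs dfs -put ga_sessions.avsc /hdfs-path-to-avro-schema
```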
Creating Hive Table
Creating a Hive table over Avro data with the schema can be as easy as this:
create external table google_merchandise_store.ga_sessions
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
location '/hdfs-target-data-location'
TBLPROPERTIES ('avro.schema.url'='hdfs://hdfs-path-to-avro-schema');
This will probably not work with default Hive settings, as some schema fields exceed the maximum type-name length. To make it work, we need to add the following property to the hive-site config file:
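The limit in question is the metastore's maximum type-name length (2000 characters by default). Assuming a recent Hive release, the hive-site.xml entry would look like this (verify the property name and pick a value large enough for your schema):

```xml
<property>
  <name>hive.metastore.max.typename.length</name>
  <value>153600</value>
</property>
```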
If you are using Ambari, this can be done by adding the property through Ambari > Hive > Configs > Advanced > Custom hive-site.
So far we are done with extracting data from BigQuery. We know the Avro schema of the Google Analytics dataset and have prepared the target Hive table. Only two pieces are missing:
- Converting JSON to Avro
- Storing the Avro flowfile on HDFS
Fortunately, both can be solved with standard NiFi processors. ConvertRecord handles the first one with the following properties:
The Avro schema can be passed to the JsonTreeReader as “Schema Text”:
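In text form, a typical configuration for this step looks roughly like the following (property names as they appear in stock NiFi; the exact values are our assumption):

```
ConvertRecord
  Record Reader -> JsonTreeReader
  Record Writer -> AvroRecordSetWriter

JsonTreeReader
  Schema Access Strategy -> Use 'Schema Text' Property
  Schema Text            -> <the ga_sessions Avro schema JSON>
```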
PutHDFS stores flowfiles on HDFS at a specified location. At the end, the data lands on HDFS and can be accessed in Apache Hive.
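As a quick smoke test, fields from the GA export schema can be queried directly; for example (assuming the standard fullVisitorId field from the BigQuery export schema):

```sql
SELECT COUNT(DISTINCT fullvisitorid) AS unique_visitors
FROM google_merchandise_store.ga_sessions;
```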