Facebook to BigQuery pipeline with NiFi

Analysing social media data can help you drive your business. In many cases the data is freely available; the only missing piece is a process for moving it into an ecosystem where it can be analysed. Once that is in place, you can mine the content of social media profiles, including those of your competitors. The easiest business case to start with is correlating social media activity with your business performance.

In this blog we focus on data pipelines, and this post is dedicated to fetching Facebook data. Once fetched, we will load it into BigQuery on Google Cloud Platform. We will get FCBayern messages and track Robert Lewandowski’s achievements via SQL. This can be a lot of fun, but it would have been much more fun if Robert had scored in the Champions League semi-final games against Real Madrid (2018).

The interesting thing is that the whole pipeline can be built in Apache NiFi with just a couple of drag-and-drops in its web UI. As a result, we can focus on business logic rather than technical details and develop the pipeline quickly.

Grab Facebook data into NiFi

We can use findmyfbid.com to determine the numeric Facebook ID of FCBayern, which is 822633781141031. We also create a Facebook application in order to obtain an access token, and use the Facebook Graph API Explorer to debug API calls:

Screen Shot 2018-04-22 at 22.26.47.png

Then Apache NiFi comes into action. To fetch the data we use the GetHTTP processor, which requests the URL:

https://graph.facebook.com/v2.12/822633781141031/posts?access_token=—our-token

The processor fetches JSON data from Facebook. To make it work, we need to set up an HTTPS connection and create an SSLContextService in NiFi. Facebook does not check the Certificate Authority of the client, so we can configure the SSLContextService with a keystore generated locally with keytool.
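When debugging, it can help to reproduce the request that GetHTTP issues outside NiFi. The snippet below is a minimal Python sketch of that, not part of the NiFi flow: the function names are our own, and the API version and page ID are simply the ones from the URL above. The access token is whatever your Facebook application gives you.

```python
import json
import urllib.parse
import urllib.request

# API version taken from the URL used in the GetHTTP processor
GRAPH_URL = "https://graph.facebook.com/v2.12"


def build_posts_url(page_id: str, access_token: str) -> str:
    """Build the /posts URL for a page (illustrative helper, not a NiFi API)."""
    query = urllib.parse.urlencode({"access_token": access_token})
    return f"{GRAPH_URL}/{page_id}/posts?{query}"


def fetch_posts(page_id: str, access_token: str) -> dict:
    """Request one page of posts and decode the JSON body."""
    url = build_posts_url(page_id, access_token)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling fetch_posts("822633781141031", token) with a valid token should return the same JSON document that the GetHTTP processor receives.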

Process JSON

A single JSON response contains multiple posts. We apply the SplitJson processor to split it into several records, so that each post becomes a single FlowFile and a single row in the target table. The JSONPath expression configured in the processor is simply “$.data”.
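To illustrate what the split does, here is a rough Python equivalent of applying “$.data” to the response. It is only a sketch of the idea, not how SplitJson is implemented; the function name and the sample posts are made up for the example.

```python
import json
from typing import List


def split_posts(response_body: str) -> List[str]:
    """Return one JSON string per element of the top-level "data" array."""
    payload = json.loads(response_body)
    return [json.dumps(post) for post in payload.get("data", [])]


# Invented sample response with the same overall shape: a "data" array of posts
sample = json.dumps({
    "data": [
        {"id": "1", "message": "first post"},
        {"id": "2", "message": "second post"},
    ],
    "paging": {},
})

for record in split_posts(sample):
    print(record)  # each record corresponds to one FlowFile after the split
```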

Screen Shot 2018-05-08 at 17.08.58.png

Screen Shot 2018-05-08 at 17.09.15.png

Save records into BigQuery

We will use Nifi-BigQuery-Bundle with its PutBigQueryProcessor. Once the bundle is installed, we only need to specify the following in the processor:

  • Service Account Credentials JSON
  • BigQuery Dataset
  • BigQuery Table

Screen Shot 2018-05-08 at 17.11.07.png
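For comparison, the same three settings are all that a hand-written loader would need. The sketch below uses the google-cloud-bigquery Python client to stream rows into an existing table. It is not a description of how PutBigQueryProcessor works internally, and the column names (id, message, created_time) are an assumption about the target table schema.

```python
from typing import Dict, List, Optional


def to_row(post: Dict) -> Dict[str, Optional[str]]:
    """Keep only the columns of the assumed table schema."""
    return {
        "id": post["id"],
        "message": post.get("message"),
        "created_time": post.get("created_time"),
    }


def load_posts(rows: List[Dict], credentials_path: str, dataset: str, table: str) -> None:
    """Stream rows into an existing BigQuery table."""
    # Imported here so the helper above works without the client library installed
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json(credentials_path)
    table_id = f"{client.project}.{dataset}.{table}"
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery rejected some rows: {errors}")
```

Here credentials_path, dataset and table play the role of the three processor properties listed above.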

Once we start the processors, the BigQuery table gets populated with data:

Screen Shot 2018-04-22 at 22.39.16.png

Summary

Now we can start our analysis, and Google Cloud Platform gives us plenty of great tools to do so. Creating the NiFi flow required some hints, but once it is ready, the next workflow can be done in a minute.

This would have been a perfect pipeline, but it is not… We fetch the same messages with each request and duplicate data in the target table. Moreover, we do not fetch the history: we start with the 100 latest posts.
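To make these two limitations concrete, here is a minimal Python sketch of what the flow is missing: remembering which post IDs were already loaded, and following the paging.next link that the Graph API returns to reach older posts. The function and parameter names are hypothetical, and fetch stands for any function that takes a URL and returns the decoded JSON response.

```python
from typing import Callable, Dict, Iterator, Optional, Set


def iter_new_posts(
    first_url: str,
    fetch: Callable[[str], Dict],
    seen_ids: Set[str],
) -> Iterator[Dict]:
    """Walk all result pages, yielding only posts whose id was not seen before."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        for post in page.get("data", []):
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                yield post
        # The URL of the next (older) page, absent on the last page
        url = page.get("paging", {}).get("next")
```

In a real pipeline seen_ids would have to be persisted between runs, which the simple flow above has no place for.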

If only we had a better processor customised to read the Facebook API… Feel free to read the next post about our GetFacebookProcessor.
