In the previous post we showed how social media data can be fetched from the Facebook platform into your analytical environment using Apache Nifi. Apache Nifi lets you build ready-to-go, drag-and-drop pipelines made of Nifi processors.
Apache Nifi provides a number of built-in processors. In many cases, though, they may not be enough for your use case, and you will need to write a custom processor to fit your needs. This is what happened to us, and it is why we developed GetFacebookProcessor.
The example from the previous post has two serious drawbacks:
- It does not fetch historical data.
- It duplicates data, because each request to the Facebook Graph API returns the same fixed batch of the 100 latest posts.
The Facebook Graph API provides paging to solve that issue. The JSON result with the post data contains links to further pages of posts.
When we run our processor for the first time, we can simply follow the “next” URL until we have fetched all the historical posts. We can implement this within GetFacebookProcessor.
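The paging loop described above can be sketched in a few lines. This is only an illustration in Python, not the code of GetFacebookProcessor: the function name `iter_all_posts`, the injected `fetch_page` callable, and the in-memory `FAKE_PAGES` are all hypothetical, and we assume each response keeps its posts under `data` and its links under `paging`.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_all_posts(first_url: str, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every post by following the "next" link until no link is left."""
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        # Assumption: the posts of one page sit under the "data" key.
        yield from page.get("data", [])
        # Assumption: the link to the following page sits under
        # "paging" -> "next" and is missing on the last page.
        url = page.get("paging", {}).get("next")


# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_PAGES = {
    "page1": {"data": [{"id": "1"}, {"id": "2"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"id": "3"}], "paging": {"previous": "page1"}},
}

all_ids = [post["id"] for post in iter_all_posts("page1", FAKE_PAGES.__getitem__)]
print(all_ids)
```

In a real run, `fetch_page` would perform the HTTP request and parse the JSON body. Injecting it keeps the loop itself easy to test.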
To fetch only the latest posts on subsequent runs, the processor needs to store the timestamp of the last fetched post. Fortunately, Apache Nifi provides a State Manager, which can store values in ZooKeeper.
Based on the stored state, we can pass the “since” parameter to the Facebook Graph API to grab only the posts that have not been fetched earlier.
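To make the idea concrete, here is a minimal Python sketch of the "remember the last timestamp, then ask only for newer posts" logic. It is not the processor's actual code: a plain dictionary stands in for the Nifi State Manager, and `BASE_URL`, `build_posts_url`, `newest_timestamp`, and the `last_seen` key are hypothetical names. We also assume that posts carry a `created_time` field in the format shown, and that “since” takes a Unix timestamp.

```python
from datetime import datetime
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Illustrative endpoint only; PROFILE_ID is a placeholder.
BASE_URL = "https://graph.facebook.com/PROFILE_ID/posts"
# Assumed timestamp format of the "created_time" field.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def build_posts_url(base_url: str, last_seen: Optional[int]) -> str:
    """Add a `since` parameter only when a previous run left some state."""
    if last_seen is None:
        return base_url
    return f"{base_url}?{urlencode({'since': last_seen})}"


def newest_timestamp(posts: List[Dict], previous: Optional[int]) -> Optional[int]:
    """Return the latest Unix timestamp among the posts and the stored value."""
    stamps = [
        int(datetime.strptime(post["created_time"], TIME_FORMAT).timestamp())
        for post in posts
    ]
    if previous is not None:
        stamps.append(previous)
    return max(stamps) if stamps else None


# A plain dict standing in for the processor state (hypothetical key name).
state: Dict[str, int] = {}

first_url = build_posts_url(BASE_URL, state.get("last_seen"))  # no state yet
fetched = [
    {"created_time": "2017-05-10T12:00:00+0000"},
    {"created_time": "2017-05-11T08:30:00+0000"},
]
newest = newest_timestamp(fetched, state.get("last_seen"))
if newest is not None:
    state["last_seen"] = newest

next_url = build_posts_url(BASE_URL, state.get("last_seen"))  # now has `since`
print(first_url)
print(next_url)
```

Whether “since” includes or excludes the boundary timestamp should be checked against the Graph API documentation. If it is inclusive, the last post would be fetched twice unless it is filtered out.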
Only four fields are required to configure GetFacebookProcessor.
We provide the Facebook access key and choose the content we want to fetch.
We also set the ID of the profile to fetch and the SSL Context. The SSL Context can be configured as in the previous post.
Our pipeline can be built from just two processors:
- GetFacebookProcessor to fetch data.
- PutBigQueryProcessor to save data into BigQuery.
It can be that easy because GetFacebookProcessor already splits the JSON data itself, so no separate splitting step is needed.
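As a rough illustration of what "splitting" means here, the Python sketch below turns one response body into one JSON record per post, which is the shape a row-oriented sink typically expects. The function name `split_posts` and the sample payload are hypothetical, and we assume the posts sit under the `data` key. The real processor may split differently.

```python
import json
from typing import List


def split_posts(response_body: str) -> List[str]:
    """Split one JSON response into one serialized JSON record per post."""
    payload = json.loads(response_body)
    # Assumption: the list of posts sits under the "data" key.
    return [json.dumps(post, sort_keys=True) for post in payload.get("data", [])]


# Hypothetical response body with two posts and a paging section.
sample_body = json.dumps(
    {
        "data": [
            {"id": "1", "message": "first post"},
            {"id": "2", "message": "second post"},
        ],
        "paging": {"next": "page2"},
    }
)

records = split_posts(sample_body)
for record in records:
    print(record)
```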
At the end of the day, we end up with 2294 FCBayern posts available in Google BigQuery for further analysis.
125 of them mention Robert Lewandowski – “The Goal Machine”.