Introduction

Suppose we want to analyze how people think about a specific topic, how they react to a certain incident, or what their opinions are on a particular issue.

For this, we can use various Machine Learning and Deep Learning algorithms. But before any analysis, we first need data: data about people, their opinions, their ideas, and so on. Where do we get that data?

Well, the answer is social networking platforms like Facebook, Twitter, and Instagram. These platforms are used by billions of people and have become part of our day-to-day lives. We spend a huge amount of time on social networks and share our personal moments there. In other words, we generate a lot of data on those platforms. Now imagine the amount of information those sites hold about us. If we can perform meaningful analysis on that data, we can understand how people think.

In data analysis, the first step is to aggregate data; without proper data, we cannot do any analysis. The sources of data, and the process of aggregating them, differ from case to case. Here, we want to aggregate data from various social networks, which means we need a social network aggregator.

Social network aggregation is the process of collecting data from various social networks, such as Facebook, Instagram, and Twitter, for analysis and other tasks. A social network aggregator, then, is a tool that carries out this process.

The Problem

One of our clients from Singapore wanted a system that automatically collects data from various online and offline platforms and performs advanced analysis on that data to extract meaningful information. The client wanted to enter specific keywords for each online platform, prompting the system to automatically collect posts related to those keywords. They wanted to target all major social networking platforms (Facebook, Instagram, YouTube, Twitter, Reddit), other websites such as Blogger.com, and offline media such as printed magazines and newspapers. They also wanted a scheduling system that runs at a specific interval of time, perhaps every hour or every minute, and collects data.

Solution

In this post, we’ll discuss how we collect data from the online platforms.
The first step in this project was to build a system that collects all the data and stores it. Almost all social networks provide public APIs for accessing their platforms and performing various tasks on them. We can use those APIs to collect data for our analysis.

Facebook provides a Graph API for accessing content on Facebook. Using the Graph API, we can programmatically query data, post new stories, manage ads, upload photos, and perform a wide variety of other tasks. We used Node.js for this project, a popular JavaScript runtime used by several large companies. For Node.js there is an NPM package, fbgraph, a simple library for accessing the Graph API, which we used to collect data from Facebook.
Instagram content is also accessed through the same Facebook Graph API, so fbgraph covered that platform as well.
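
As a rough illustration, a keyword-driven fetch through fbgraph might look like the sketch below. The page ID, access token, and field list are placeholders, and the exact endpoints and permissions depend on the Graph API version in use:

```js
const graph = require('fbgraph');

// The access token comes from the environment; never hard-code it.
graph.setAccessToken(process.env.FB_ACCESS_TOKEN);

// Fetch recent posts from a page feed (PAGE_ID is a placeholder).
graph.get('PAGE_ID/feed', { limit: 25, fields: 'message,created_time' }, (err, res) => {
  if (err) return console.error(err);
  console.log(res.data); // array of post objects with message and created_time
});
```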

Similarly, for Twitter, we used the Twitter API. With the Twitter Search API, we can find historical tweets and filter them by criteria such as keywords. Along with that, we used the Twitter Streaming API to collect tweets in real time. For both, we used an NPM package called twit, a Twitter API client for Node.js.
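
Here is a minimal sketch of both modes with twit; the credentials and keyword are placeholders, and the Search API's query options go well beyond what is shown:

```js
const Twit = require('twit');

const T = new Twit({
  consumer_key: process.env.TWITTER_CONSUMER_KEY,
  consumer_secret: process.env.TWITTER_CONSUMER_SECRET,
  access_token: process.env.TWITTER_ACCESS_TOKEN,
  access_token_secret: process.env.TWITTER_ACCESS_TOKEN_SECRET,
});

// Search API: fetch recent historical tweets matching a keyword.
T.get('search/tweets', { q: 'some keyword', count: 100 }, (err, data) => {
  if (err) return console.error(err);
  console.log(data.statuses.length, 'tweets found');
});

// Streaming API: receive matching tweets in real time.
const stream = T.stream('statuses/filter', { track: ['some keyword'] });
stream.on('tweet', (tweet) => console.log(tweet.text));
```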

In the case of YouTube, Google provides the YouTube Data API, which we used to access the data.
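
For illustration, a keyword search using Google's official googleapis NPM package (assuming that client; the API key and keyword below are placeholders) looks roughly like this:

```js
const { google } = require('googleapis');

// An API key is enough for read-only search requests.
const youtube = google.youtube({ version: 'v3', auth: process.env.YOUTUBE_API_KEY });

// Search for videos matching a keyword.
youtube.search.list({
  part: 'snippet',
  q: 'some keyword',
  type: 'video',
  maxResults: 50,
}).then((res) => {
  res.data.items.forEach((item) => console.log(item.snippet.title));
}).catch(console.error);
```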
For Reddit, both the Search API and the Stream API were used to access content.
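
As an illustration, Reddit exposes search results as public JSON, so a keyword search can be as simple as the sketch below (axios is used here purely as an example HTTP client; near-real-time coverage can be approximated by polling the newest results):

```js
const axios = require('axios');

// Query Reddit's public JSON search endpoint; the keyword is a placeholder.
async function searchReddit(keyword) {
  const res = await axios.get('https://www.reddit.com/search.json', {
    params: { q: keyword, sort: 'new', limit: 100 },
    headers: { 'User-Agent': 'aggregator-demo/0.1' }, // Reddit expects a UA string
  });
  // Listing shape: { data: { children: [{ data: {...post...} }] } }
  return res.data.data.children.map((c) => c.data);
}

searchReddit('some keyword').then((posts) => console.log(posts.length, 'posts'));
```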
For collecting data from other websites, we used Webhose, a popular data aggregation service that collects data from hundreds of websites. We used the webhoseio NPM package with the Webhose API to access the data we needed.
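
A minimal sketch of the webhoseio flow, with the token and query as placeholders:

```js
const webhoseio = require('webhoseio');

// Configure the client with an API token from the environment.
const client = webhoseio.config({ token: process.env.WEBHOSE_API_KEY });

// Query crawled web content matching a keyword.
client.query('filterWebContent', { q: 'some keyword' })
  .then((output) => {
    console.log(output.posts.length, 'posts fetched');
    // client.getNext() pages through additional results.
  })
  .catch(console.error);
```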
And, finally, we built a scheduling system that runs at a specific interval and collects data from all the platforms mentioned above.
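
For illustration, a cron-style scheduler (sketched here with the node-cron package, one of several options) that kicks off a collection run every hour could look like this; collectAllPlatforms is a hypothetical wrapper around the per-platform fetchers above:

```js
const cron = require('node-cron');

// Hypothetical wrapper that invokes each platform's fetcher in turn.
async function collectAllPlatforms() {
  // await collectFacebook(); await collectTwitter(); ...
  console.log('collection run at', new Date().toISOString());
}

// Run at the top of every hour; '* * * * *' would run every minute instead.
cron.schedule('0 * * * *', () => {
  collectAllPlatforms().catch(console.error);
});
```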

After collection, the data is stored in AWS S3, Amazon's popular object storage service, which provides the flexibility and availability we needed for this project.
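
As a sketch, writing a batch of collected posts to S3 with the aws-sdk package might look like this; the bucket name and key layout are placeholders:

```js
const AWS = require('aws-sdk');

const s3 = new AWS.S3(); // credentials come from the environment or an IAM role

// Store a batch of collected posts as one JSON object per run.
async function storeBatch(platform, posts) {
  const key = `raw/${platform}/${Date.now()}.json`;
  await s3.putObject({
    Bucket: 'my-aggregator-bucket',
    Key: key,
    Body: JSON.stringify(posts),
    ContentType: 'application/json',
  }).promise();
  return key;
}
```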

The biggest problem we faced during this project was scaling the system so that it could collect millions of posts smoothly and efficiently. To get there, we spent a considerable amount of time optimizing the code.

After finishing this part, we had a system that collected millions of data points from various corners of the internet, ready to be used for analysis.
