Challenge at hand
A Malaysian news agency wanted to automate insight mining from the large volumes of data it collected from disparate sources, so that it could produce useful reports on key societal trends faster. The government often used these reports to understand trends such as drug usage. At the time, data was captured from social media, newspapers and physically scanned documents, then manually stored in spreadsheets and analysed for trends.
The existing process was labor-intensive, time-consuming and inefficient. It was also prone to errors and did not scale well.
Technical challenges in existing solution
Manual data entry: Data from different sources was manually entered into spreadsheets.
No standardization: Data came in different formats, sizes and languages, and needed to be harmonized.
Limitations on number of requests handled: The existing AWS Lambda services limited the number of requests that could be processed at any given time.
Lack of customization: There was no role-based access control.
Reports lacked insights: There was no data visualization to surface high-level trends.
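To make the standardization challenge concrete, here is a minimal sketch of what harmonizing records from different sources could look like. It is not the agency's or TechVariable's actual code: the `Record` schema, the field names (`body`, `content`, `created_at`, and so on) and the list of date formats are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Record:
    """Hypothetical common schema for one collected item."""
    source: str
    published: date
    language: str
    text: str


# Date formats assumed for illustration; real sources may use others.
_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def parse_date(raw: str) -> date:
    """Try each assumed format until one matches."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def harmonize(source: str, raw: dict) -> Record:
    """Map a source-specific dict onto the common Record schema."""
    # Each source is assumed to use different key names for the same fields.
    text = raw.get("text") or raw.get("body") or raw.get("content") or ""
    when = raw.get("date") or raw.get("published") or raw.get("created_at")
    if when is None:
        raise ValueError("record has no date field")
    lang = (raw.get("lang") or raw.get("language") or "und").lower()
    return Record(
        source=source,
        published=parse_date(when),
        language=lang,
        text=" ".join(text.split()),  # collapse runs of whitespace
    )
```

For example, `harmonize("newspaper", {"body": "  Hello   world ", "published": "03/02/2021", "language": "EN"})` yields a record dated 3 February 2021 with the language code `en` and the text `Hello world`. Mapping every source onto one schema up front is what lets later reporting steps ignore where a record came from.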
The TechVariable Solution
TechVariable built a platform that automated data collection and segregation, harmonized different file types into a cohesive set of data files, and enabled the generation of reports with data visualization in real time. Alongside an easy-to-use, interactive and customized user interface, we created three automated data pipelines:
Pipeline 1: A custom scraping engine collects data from all online sources, extracts text from hard copies, PDFs and JPEG images using Optical Character Recognition (OCR), and summarizes lengthy articles according to identified keywords.
Pipeline 2: Video transcription using Google APIs processes video and audio files from YouTube and other sources.
Pipeline 3: A custom-built mechanism for online sources segregates hashtags, keywords, stories, etc. for further processing.
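As a rough illustration of the kind of segregation Pipeline 3 performs, the sketch below splits a social media post into hashtags, keyword hits and remaining text. It is a simplified stand-in under stated assumptions, not the production mechanism: the function name `segregate`, the output layout and the idea of a caller-supplied keyword watch-list are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable

HASHTAG_RE = re.compile(r"#(\w+)")  # a '#' followed by word characters
WORD_RE = re.compile(r"\w+")


def segregate(post: str, keywords: Iterable[str]) -> Dict[str, object]:
    """Split a post into hashtags, keyword counts and plain text.

    `keywords` is a hypothetical watch-list of terms to count.
    """
    watch = {k.lower() for k in keywords}
    # Collect hashtags, then strip them so they are not counted as words.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(post)]
    body = HASHTAG_RE.sub(" ", post)
    words = (w.lower() for w in WORD_RE.findall(body))
    hits = Counter(w for w in words if w in watch)
    return {
        "hashtags": hashtags,
        "keywords": dict(hits),
        "text": " ".join(body.split()),  # collapse leftover whitespace
    }
```

Calling `segregate("New #Policy on vaping announced. Vaping up among teens #health", ["vaping", "teens"])` separates the two hashtags from the body and counts two mentions of "vaping" and one of "teens". Keeping hashtags, keyword counts and text in separate fields means each can be routed to a different downstream step.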