Introduction

In today's digital age, almost everyone has access to a device, such as a mobile phone or a professional camera, that can take pictures or record audio and video of the incidents and events in our lives. Online platforms such as YouTube and Instagram let us share these files. Just imagine the amount of data involved, and the meaningful information that can be extracted from it with the proper analytical tools. To analyze this data, it is important to understand that recorded images, video, and audio are complex and unstructured. Still, tools are available that can extract information from these types of data.

Problem Actualization

The client required an analysis tool that could process this kind of data and extract meaningful information from it. They wanted a system to which users could upload images, PDF documents, audio files, and video files for analysis.

For images and PDF documents, the client wanted the textual information contained in those files. That means they needed an OCR system.
According to Wikipedia:
“Optical character recognition or optical character reader, often abbreviated as OCR, is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast).”

For audio and video files, the client wanted a system that extracts a transcription from the recordings.
According to Wikipedia:
“A transcription service is a business service which converts speech (either live or recorded) into a written or electronic text document.”
In other words, the client needed to convert the audio into text.

Solution

After extracting the information from the data, we also needed to perform various other analyses on it.
So, our first step was to build a system where users can upload multiple types of files without specifying the format or type of each file. The client wanted a portal where a user could upload a mix of images, audio clips, documents, and videos at the same time, with the system automatically detecting the type of each file and choosing the analysis on that basis.
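
To make the detection step concrete, here is a minimal sketch of how a file could be routed by its leading bytes, with the file extension as a fallback. It is an illustration only: the function name detect_category, the category labels, and the short signature table are assumptions made for this example, not the code of the system described here.

```python
import mimetypes

# Leading "magic" bytes of a few common formats (illustrative, far from complete).
MAGIC_SIGNATURES = [
    (b"%PDF-", "document"),
    (b"\x89PNG\r\n\x1a\n", "image"),
    (b"\xff\xd8\xff", "image"),  # JPEG
    (b"GIF8", "image"),
    (b"ID3", "audio"),           # MP3 carrying an ID3 tag
    (b"fLaC", "audio"),
]


def detect_category(filename: str, head: bytes = b"") -> str:
    """Return 'image', 'document', 'audio', 'video', or 'unknown'.

    `head` is the first few bytes of the upload; `filename` is only a fallback.
    """
    # 1) Prefer the file contents: compare the leading bytes with known signatures.
    for signature, category in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return category
    # WAV and MP4/MOV containers keep their marker a few bytes into the file.
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio"
    if head[4:8] == b"ftyp":
        # Also matches .m4a audio; both go to transcription, so "video" is fine here.
        return "video"
    # 2) Fall back to the extension through the standard mimetypes table.
    mime, _ = mimetypes.guess_type(filename)
    if mime == "application/pdf":
        return "document"
    if mime and mime.split("/")[0] in ("image", "audio", "video"):
        return mime.split("/")[0]
    return "unknown"
```

With a function like this, images and documents can be sent to the OCR path and audio and video to the transcription path, while anything reported as unknown can be rejected at upload time.
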
The second step was to handle image and document files. We needed an Optical Character Recognition system to extract the textual information, and we decided to use the Google Vision API for this. Google Vision provides a very powerful OCR that can handle multiple languages, with the added advantage of being scalable. It works like this: we send each image to the Vision API, and it sends back the extracted text. PDF documents cannot be sent through the same image request, so we convert each page of a document to an image and then send those images to the Vision API.
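
The OCR flow can be sketched roughly as follows. This is a hedged example rather than our production code: it assumes the google-cloud-vision client library with application default credentials, and it uses the pdf2image package (which needs poppler installed) as one possible way to turn PDF pages into images. The function names ocr_image, pdf_to_png_pages, and ocr_pdf are made up for the illustration.

```python
import io
from typing import Callable, Iterable


def ocr_image(content: bytes) -> str:
    """Send one image to the Vision API and return the recognized text."""
    from google.cloud import vision  # lazy import; requires google-cloud-vision

    client = vision.ImageAnnotatorClient()  # picks up default credentials
    response = client.document_text_detection(image=vision.Image(content=content))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def pdf_to_png_pages(pdf_path: str, dpi: int = 200) -> Iterable[bytes]:
    """Render every page of a PDF to PNG bytes (pdf2image is an assumed choice)."""
    from pdf2image import convert_from_path  # lazy import; requires poppler

    for page in convert_from_path(pdf_path, dpi=dpi):
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        yield buffer.getvalue()


def ocr_pdf(
    pdf_path: str,
    ocr: Callable[[bytes], str] = ocr_image,
    render: Callable[[str], Iterable[bytes]] = pdf_to_png_pages,
) -> str:
    """OCR a PDF page by page and join the page texts with newlines."""
    return "\n".join(ocr(page) for page in render(pdf_path))
```

Passing the ocr and render steps in as parameters keeps the page loop independent of the cloud call, so it can be exercised locally with stand-in functions before any credentials are configured.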

For audio and video files, the client needed a transcription system, so we decided to use the Google Cloud Video Intelligence API. It is an API that generates high-quality, accurate transcriptions from audio and video files. We send each audio or video file to the API, and it sends back the transcribed text.
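
A transcription request of this kind might look like the sketch below. It assumes the google-cloud-videointelligence client library, default credentials, and a file in a format the API accepts. The helper names best_transcript and transcribe, the en-US language code, and the ten-minute timeout are illustrative assumptions, not details of the system described here.

```python
def best_transcript(speech_transcriptions) -> str:
    """Join the top-ranked alternative of every transcribed segment."""
    parts = []
    for segment in speech_transcriptions:
        if segment.alternatives:
            text = segment.alternatives[0].transcript.strip()
            if text:
                parts.append(text)
    return " ".join(parts)


def transcribe(content: bytes, language_code: str = "en-US", timeout: int = 600) -> str:
    """Send raw media bytes to the Video Intelligence API and return the text."""
    from google.cloud import videointelligence as vi  # lazy import

    client = vi.VideoIntelligenceServiceClient()
    config = vi.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    operation = client.annotate_video(
        request={
            "input_content": content,
            "features": [vi.Feature.SPEECH_TRANSCRIPTION],
            "video_context": vi.VideoContext(speech_transcription_config=config),
        }
    )
    # annotate_video is a long-running operation; block until it finishes.
    result = operation.result(timeout=timeout)
    return best_transcript(result.annotation_results[0].speech_transcriptions)
```

The API returns several alternatives per segment, so the sketch keeps only the first one for each segment and skips segments that come back empty.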

To save the extracted information, we used AWS S3. S3, or Simple Storage Service, is a flexible and scalable service provided by Amazon for storing various types of data. With that, the extracted information was in the right place, in the right format, and ready for further analysis.
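
Storing one extraction result could look like the following sketch. It assumes the boto3 library with AWS credentials already configured. The JSON layout, the key scheme, and the names result_key and save_result are hypothetical choices for this example, since the storage format itself is not specified above.

```python
import hashlib
import json


def result_key(source_name: str, kind: str, prefix: str = "extracted") -> str:
    """Build a deterministic object key of the form <prefix>/<kind>/<sha1>.json."""
    digest = hashlib.sha1(source_name.encode("utf-8")).hexdigest()
    return f"{prefix}/{kind}/{digest}.json"


def save_result(bucket: str, source_name: str, kind: str, text: str) -> str:
    """Upload the extracted text as a small JSON document and return its key."""
    import boto3  # lazy import; requires boto3 and AWS credentials

    key = result_key(source_name, kind)
    body = json.dumps(
        {"source": source_name, "kind": kind, "text": text},
        ensure_ascii=False,
    )
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

Deriving the key from the source file name means that reprocessing the same upload overwrites its earlier result instead of creating a duplicate object.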

Finally, after extracting the information from the data, we performed various other operations on the text, such as translating it into multiple other languages, generating brief summaries, and performing sentiment analysis.
