Vahan AI

Tech Used: Python, OpenCV, NLTK, Pandas, NumPy, PDF Parsing, Selenium, Hadoop, Tesseract

I interned at Vahan AI from January 2020 to June 2020. Vahan AI is a Y Combinator-funded company whose product is a blue-collar job platform, deployed as a chatbot that can get a user an interview in a matter of hours. From uploading the documents needed to verify oneself, all the way to letting the user know an interview has been set up, Vahan handles it all through a chatbot that vaguely resembles WhatsApp in feel and usability.

At Vahan, I was tasked with refactoring the User Verification and Validation vertical. The goal was to ensure that the product could identify which of its end users were real and genuinely intent on finding a job, and that the data the bot accepted into its database was clean rather than irrelevant noise like spam, greeting messages, or links.

I started off with some text analysis: identifying which sorts of chat messages the bot encountered most often, and sorting those messages into categories so the bot knew which ones warranted a reply and how to respond appropriately.
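A minimal sketch of what that categorization step might have looked like, with hypothetical category names and keyword cues (the real taxonomy was derived from Vahan's actual chat logs, not the guesses below):

```python
import re
from collections import Counter

# Hypothetical categories and keyword cues -- illustrative only.
CATEGORIES = {
    "greeting": {"hi", "hello", "namaste", "morning"},
    "job_query": {"job", "work", "salary", "interview", "vacancy"},
}
URL_RE = re.compile(r"https?://\S+")

def categorize(message):
    """Assign a chat message to a coarse category via keyword overlap."""
    if URL_RE.search(message):
        return "link"
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    best, best_score = "other", 0
    for cat, keywords in CATEGORIES.items():
        score = len(tokens & keywords)
        if score > best_score:
            best, best_score = cat, score
    return best

# A frequency count over a batch tells us what the bot sees most.
messages = ["hi bhaiya", "any job in bangalore?", "check http://spam.example"]
print(Counter(categorize(m) for m in messages))
# Counter({'greeting': 1, 'job_query': 1, 'link': 1})
```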

I was then tasked with identifying name-specific information in valid messages to the bot: if a message contained a name, extract it. This task was slightly more complicated, as it required me to find a valid corpus to serve as a reference. That search led me to building one of the largest repositories of Indian names on the web, through web scraping and PDF parsing. As the volume was enormous, I had to come up with an efficient pipeline to parse the PDFs, which is what first exposed me to Hadoop.
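Once the corpus existed, the extraction itself could stay simple. A sketch, assuming the scraped names have been normalized into a plain-text file with one name per line (the file name and the exact-match rule here are illustrative, not Vahan's production logic):

```python
def load_name_corpus(path="indian_names.txt"):
    """Load the scraped name corpus: one lowercase name per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def extract_names(message, corpus):
    """Return the tokens in a message that appear in the name corpus."""
    tokens = (t.strip(".,!?") for t in message.lower().split())
    return [t for t in tokens if t in corpus]

corpus = load_name_corpus()
print(extract_names("My name is Ramesh Kumar", corpus))
# ['ramesh', 'kumar'] -- assuming both appear in the corpus
```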

In parallel, I was also developing a POC for verifying an Aadhaar card image taken from a camera. The idea was to move from using the text on the card as the validation mechanism to using the QR code printed on the card, which reduces the number of OCR operations needed to verify whether a user had presented a valid Aadhaar card.
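A rough sketch of that QR-based check using OpenCV's built-in QRCodeDetector; parsing the Aadhaar QR payload itself (XML on older cards, a compressed byte stream on newer ones) is left out here:

```python
import cv2

def read_card_qr(image_path):
    """Try to locate and decode a QR code in a photo of the card."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(image_path)
    detector = cv2.QRCodeDetector()
    payload, points, _ = detector.detectAndDecode(img)
    # detectAndDecode returns an empty string when no QR code is found.
    return payload or None

payload = read_card_qr("aadhaar_photo.jpg")  # hypothetical file name
if payload:
    print("QR decoded: validate the payload fields here")
else:
    print("No QR found: fall back to OCR (e.g. Tesseract) on the card text")
```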

In all, Vahan exposed me to a side of ML that did not revolve around obtaining a model, but around gathering the data necessary to improve a metric or strategy. It also made me a lot more comfortable scouring the web for data rather than relying on Kaggle or GitHub for ready-made datasets.