An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep its platform a place where users can feel safe sharing their knowledge with the world. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers. In this Kaggle competition, Quora challenged data scientists to build models that identify and flag insincere questions. This helps Quora develop more scalable, machine-learning-based methods beyond manual review for detecting toxic and misleading content. Moreover, it helps Quora uphold its policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.
An insincere question is defined as one intended to make a statement rather than to look for helpful answers. Characteristics that can signal insincerity include a non-neutral or exaggerated tone, a disparaging or inflammatory premise, and claims that are not grounded in reality.
This was a kernels-only competition, so the entire solution, from preprocessing to training the deep learning models, had to run within the 2-hour runtime limit of a Kaggle kernel; otherwise the team was disqualified. The official competition page with full details: Kaggle-Quora
The competition was challenging in several respects: finding a local validation set that correlated with the leaderboard, running the whole solution within 2 hours, and producing reproducible results, to name a few. We found a well-correlated validation set about a week before the competition ended. For ensembling we used simple averaging, where each trained neural network (PyTorch) differed in learning rate, pre-processing, embedding, or architecture to maintain model diversity. One important thing we realised was that, given a small learning rate and a large number of epochs, models using a single FastText embedding were beating our GloVe+Paragram models. So we focused on tuning FastText-based models that could reach a competitive score within 5-6 epochs. We also added Gaussian noise right after the embedding layer in some models to reduce the RNN's over-dependence on specific keywords.
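To make the two main ideas concrete, here is a minimal PyTorch sketch, not our exact architecture: a small module that injects Gaussian noise right after the embedding lookup, and a simple-averaging ensemble over the sigmoid outputs of several trained models. The class names, hidden size, and noise sigma below are illustrative assumptions, not values from our final solution.

```python
import torch
import torch.nn as nn

class GaussianNoise(nn.Module):
    """Adds zero-mean Gaussian noise to its input during training only."""
    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if self.training and self.sigma > 0:
            x = x + torch.randn_like(x) * self.sigma
        return x

class QuestionClassifier(nn.Module):
    """Bi-LSTM classifier with noise injected right after the embedding lookup."""
    def __init__(self, embedding_matrix, hidden_size=64, noise_sigma=0.1):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight = nn.Parameter(
            torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False  # keep pretrained vectors (e.g. FastText) frozen
        self.noise = GaussianNoise(noise_sigma)
        self.lstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):
        h = self.noise(self.embedding(x))      # (batch, seq_len, embed_dim) with noise in training
        out, _ = self.lstm(h)
        pooled, _ = torch.max(out, dim=1)      # max-pool over the sequence dimension
        return self.fc(pooled)                 # raw logit; apply sigmoid outside

def ensemble_predict(models, batch):
    """Simple averaging: mean of sigmoid probabilities from models that differ
    in learning rate, pre-processing, embedding, or architecture."""
    probs = [torch.sigmoid(m(batch)) for m in models]  # each model should be in eval() mode
    return torch.stack(probs).mean(dim=0)
```

The averaged probability is then thresholded to produce the final insincere/sincere label, with the threshold tuned on the validation set.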
This was my first Kaggle competition. It wouldn’t have been possible to compete without my awesome teammates, Soham and Rahul, who put in a lot of effort and made this competition a great learning experience.
We ranked 33rd out of 4037 teams and earned a silver medal with a final F1-score of 0.70658 on the test set. I entered the competition about a month late, leaving only about a month and a half to develop the solution. Being my first competition, I learnt a lot, including the PyTorch framework, the effect of random initialization, how to make models reproducible, and machine learning techniques like ensembling, to name a few! Truly, Kaggle is a place where one can use all of their innovative ideas to compete with the best techniques possible :)
GitHub repo: https://github.com/soham97/Quora-Insincere-Questions-Classification-Challenge-NLP