January Challenge

Problem Statement

Welcome, Data Scientist, to the 5th SDS Club Monthly Challenge! In this month’s challenge, your job is to help determine whether a Stack Overflow post is useful or not. To do this, you will analyze thousands of questions and determine whether each question is high or low quality.


accuracy = \frac{TP + TN}{TP + TN + FP + FN}
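The accuracy formula above is written in terms of binary confusion-matrix counts. Since the Type column has three classes, an equivalent way to compute it is as the fraction of predictions that match the true label. The following is a minimal sketch of both forms; the function names `accuracy` and `binary_accuracy` are illustrative and not part of the challenge materials.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (works for any number of classes)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, `accuracy([0, 1, 2, 2], [0, 1, 1, 2])` counts three matches out of four predictions.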

Understanding the Dataset

Each column in the dataset is labeled and explained in more detail below.

Title – title of the question
Body – body of the question
Tags – tags of the question (e.g. python, java, data-science)
CreationDate – date and time the question was posted
Type – quality of question

The Type column has three possible values:
HQ: a high quality question without any edits
LQ_EDIT: a low quality question with community edits
LQ_CLOSE: a low quality question which has been closed

Dataset Files

public_questions.csv – Dataset to train and analyze
pred_questions.csv – Dataset to predict questions’ quality
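As a first look at the training data, it can help to check how many questions fall into each Type. The sketch below uses only the Python standard library and assumes the CSV has a header row with the column names listed above and is UTF-8 encoded; neither is confirmed by the challenge text. The helper name `count_types` is hypothetical.

```python
import csv
from collections import Counter


def count_types(csv_file):
    """Count how often each value of the Type column occurs.

    csv_file is an open text file (or any iterable of lines) whose
    first row is assumed to be a header containing a "Type" column.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row["Type"] for row in reader)


# Possible usage, assuming the file is in the working directory:
# with open("public_questions.csv", newline="", encoding="utf-8") as f:
#     print(count_types(f))
```

Taking an open file rather than a path keeps the helper easy to try on a small in-memory sample before running it on the full dataset.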


All submissions should be sent through email to challenges@superdatascience.com. When submitting, the file should contain predictions made on the pred_questions.csv file, and it should have the following format:

NOTE: HQ, LQ_EDIT & LQ_CLOSE should be converted into the respective number values: 0, 1, 2
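The label conversion described in the note can be sketched with a plain dictionary lookup. The names `LABEL_TO_ID` and `encode_labels` are illustrative. Raising an error on an unknown label is a design choice made here, not a requirement stated by the challenge.

```python
# Mapping taken from the note: HQ -> 0, LQ_EDIT -> 1, LQ_CLOSE -> 2
LABEL_TO_ID = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}


def encode_labels(labels):
    """Convert a sequence of Type strings into their numeric values."""
    try:
        return [LABEL_TO_ID[label] for label in labels]
    except KeyError as exc:
        # Fail loudly rather than silently emitting a wrong class id.
        raise ValueError(f"unknown label: {exc.args[0]!r}") from None
```

For example, `encode_labels(["HQ", "LQ_CLOSE", "LQ_EDIT"])` gives `[0, 2, 1]`.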




The data is an original Stack Overflow dataset, made public for researchers. This challenge is neither sponsored nor endorsed by Stack Overflow in any way.