January Challenge


Problem Statement

Welcome, Data Scientist, to the 5th SDS Club Monthly Challenge! In this month’s challenge, your job is to help determine whether a Stack Overflow post is useful or not. To do this, you will analyze thousands of questions and determine whether each one is high or low quality.

Evaluation

$$\begin{equation*}
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}$$
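
The formula above is written in binary terms (TP, TN, FP, FN), while this challenge has three classes. A common reading, and only an assumption about how submissions are scored here, is that accuracy is the number of correct predictions divided by the total number of predictions, which reduces to the formula above in the two-class case. A minimal sketch, where the function name `accuracy` and the sample labels are illustrative:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    # Count positions where the predicted label equals the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration: 5 of the 6 predictions are correct.
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]
score = accuracy(y_true, y_pred)
print(score)
```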

Understanding the Dataset

Each column in the dataset is labeled and explained in more detail below.

Title – title of the question
Body – body of the question
Tags – tags of the question (e.g. python, java, data-science)
CreationDate – date and time the question was posted
Type – quality of question

The Type column has three possible values:
HQ: a high quality question without any edits
LQ_EDIT: a low quality question with community edits
LQ_CLOSE: a low quality question which has been closed

Dataset Files

public_questions.csv – Dataset to train and analyze
pred_questions.csv – Dataset to predict questions’ quality
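
As a first look at the training data, it can help to count how many questions fall into each Type. The sketch below uses only the standard library and assumes the CSV has a header row with the column names listed above. The helper name `count_types` and the in-memory sample rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io
from collections import Counter


def count_types(csv_file):
    """Count rows per value of the Type column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Type"] for row in reader)


# Made-up rows standing in for public_questions.csv (assumed header names).
sample = io.StringIO(
    "Title,Body,Tags,CreationDate,Type\n"
    "How to sort a list?,Some body text,python,2020-01-01 10:00:00,HQ\n"
    "Code not working,Please help,java,2020-01-02 11:30:00,LQ_CLOSE\n"
    "Fix my loop,It loops forever,python,2020-01-03 09:15:00,LQ_EDIT\n"
    "Merge two dicts,Some body text,python,2020-01-04 14:45:00,HQ\n"
)
type_counts = count_types(sample)
print(type_counts)

# With the real file, the call would look like this (path is an assumption):
# with open("public_questions.csv", newline="", encoding="utf-8") as f:
#     type_counts = count_types(f)
```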

Submission

All submissions should be sent through email to challenges@superdatascience.com. When submitting, the file should contain predictions made on the pred_questions.csv file, and it should have the following format:

NOTE: HQ, LQ_EDIT, and LQ_CLOSE should be converted to the respective numeric values: 0, 1, and 2.

0
1
2
1
0
2
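
The label conversion and the format shown above can be sketched as follows. This assumes the submission is one number per line with no header, which is how the example reads but is not stated explicitly. The names `LABEL_TO_NUMBER`, `encode_labels`, and `format_submission`, and the output file name in the trailing comment, are illustrative.

```python
# Mapping stated in the note above: HQ -> 0, LQ_EDIT -> 1, LQ_CLOSE -> 2.
LABEL_TO_NUMBER = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}


def encode_labels(labels):
    """Convert Type strings to their numeric values; unknown labels raise KeyError."""
    return [LABEL_TO_NUMBER[label] for label in labels]


def format_submission(labels):
    """Return the submission text: one numeric prediction per line."""
    return "\n".join(str(number) for number in encode_labels(labels)) + "\n"


# Made-up predictions, chosen to reproduce the example above.
predicted = ["HQ", "LQ_EDIT", "LQ_CLOSE", "LQ_EDIT", "HQ", "LQ_CLOSE"]
submission_text = format_submission(predicted)
print(submission_text, end="")

# To save it to disk (file name is an assumption):
# with open("submission.csv", "w", encoding="utf-8") as f:
#     f.write(submission_text)
```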

https://github.com/edis/sds_challenges/tree/master/challenge_5

Acknowledgements

The data is an original Stack Overflow dataset, made public for researchers. This challenge is neither sponsored nor endorsed by Stack Overflow in any way.