January Challenge
Problem Statement
Welcome, Data Scientist, to the 5th SDS Club Monthly Challenge! In this month’s challenge, your job is to help determine whether a Stack Overflow post is useful or not. To do this, you will analyze thousands of questions and determine whether each question is high or low quality.
Evaluation
Understanding the Dataset
Each column in the dataset is labeled and explained in more detail below.
Title – title of the question
Body – body of the question
Tags – tags of the question (e.g. python, java, data-science)
CreationDate – date and time the question was posted
Type – quality of question
The Type column has three possible values:
HQ: a high quality question without any edits
LQ_EDIT: a low quality question with community edits
LQ_CLOSE: a low quality question which has been closed
Dataset Files
public_questions.csv – labeled dataset to train on and analyze
pred_questions.csv – dataset of questions whose quality you need to predict
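As a first look at the training data, the class balance of the Type column can be counted with the standard library alone. This is a minimal sketch under assumptions: the helper names `load_rows` and `count_types` are hypothetical, and it assumes the files are comma-separated, UTF-8 encoded, and have a header row matching the column names listed above, none of which the challenge text confirms. The inline sample rows (including the date format) are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name.

    Assumes a header row and UTF-8 encoding (not confirmed by the challenge text).
    """
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which matters if question bodies span several lines.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_types(rows):
    """Count how many questions fall into each Type value."""
    return Counter(row["Type"] for row in rows)


# Made-up sample standing in for public_questions.csv.
sample = io.StringIO(
    "Title,Body,Tags,CreationDate,Type\n"
    "Example title A,Example body A,python,2020-01-01 00:00:00,HQ\n"
    "Example title B,Example body B,java,2020-01-02 00:00:00,LQ_CLOSE\n"
    "Example title C,Example body C,python,2020-01-03 00:00:00,HQ\n"
)
sample_counts = count_types(csv.DictReader(sample))
print(sample_counts)
```

On the real file, the same count would come from `count_types(load_rows("public_questions.csv"))`. A heavily skewed count would suggest looking beyond plain accuracy when judging a model.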
Submission
All submissions should be sent by email to challenges@superdatascience.com. The submitted file should contain the predictions made on the pred_questions.csv file, and it should have the following format:
NOTE: HQ, LQ_EDIT, and LQ_CLOSE should be converted into the respective number values: 0, 1, and 2.
0 1 2 1 0 2
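The label conversion described in the note can be sketched as follows. The mapping itself (HQ to 0, LQ_EDIT to 1, LQ_CLOSE to 2) comes from the note above; the names `LABEL_TO_NUMBER`, `encode_labels`, and `format_submission` are hypothetical, and writing one number per line is an assumption, since the exact file layout is not spelled out here.

```python
# Mapping taken from the NOTE: HQ -> 0, LQ_EDIT -> 1, LQ_CLOSE -> 2.
LABEL_TO_NUMBER = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}


def encode_labels(labels):
    """Convert Type labels to their numeric values.

    Raises KeyError on any label outside the three known values,
    so a typo in a prediction is caught before submission.
    """
    return [LABEL_TO_NUMBER[label] for label in labels]


def format_submission(numbers):
    """Render predictions as text, one number per line (assumed layout)."""
    return "\n".join(str(n) for n in numbers) + "\n"


# Illustrative predictions only; real ones would come from a model
# applied to pred_questions.csv, in the same row order as that file.
predicted = ["HQ", "LQ_EDIT", "LQ_CLOSE", "LQ_EDIT", "HQ", "LQ_CLOSE"]
encoded = encode_labels(predicted)
submission_text = format_submission(encoded)
print(submission_text, end="")
```

The text can then be written to a file of your choosing with `open(path, "w").write(submission_text)` before attaching it to the email.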
Acknowledgements
The data is an original Stack Overflow dataset, made public for researchers. This challenge is neither sponsored nor endorsed by Stack Overflow in any way.