CrowdFlower Winner's Interview: 1st place, Chenglong Chen
The CrowdFlower Search Results Relevance competition asked Kagglers to evaluate the accuracy of e-commerce search engines on a scale of 1-4 using a dataset of queries and results. Chenglong Chen finished ahead of 1,423 other data scientists to take first place. He shares his approach with us from his home in Guangzhou, Guangdong, China. (To compare winning methodologies, you can read a write-up from the third place team here.)
The competition ran from May 11-July 6, 2015.
The Basics
What was your background prior to entering this challenge?
I was a Ph.D. student at Sun Yat-sen University, Guangzhou, China, and my research mainly focused on passive digital image forensics. I have applied various machine learning methods, e.g., SVM and deep learning, to detect whether a digital image has been edited/doctored, or how much the image under investigation has been resized/rotated.
Chenglong's profile on Kaggle
I am very interested in machine learning and have read quite a lot of related papers. I also love to compete on Kaggle to test out what I have learnt and to improve my coding skills. Kaggle is a great place for data scientists, and it offers real-world problems and data from various domains.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have a background in image processing and limited knowledge of NLP beyond BOW/TF-IDF kinds of things. During the competition, I frequently referred to the book Python Text Processing with NLTK 2.0 Cookbook, or to Google, for how to clean text or create features from text.
I did read the paper about ensemble selection (which is the ensembling method I used in this competition) a long time ago, but I hadn't had the opportunity to try it out myself on a real-world problem. I had previously only tried simple (weighted) averaging or majority voting. This is the first time I got so serious about the model ensembling part.
How did you get started competing on Kaggle?
It dates back a year and a half. At that time, I was taking Prof. Hsuan-Tien Lin's Machine Learning Foundations course on Coursera. He encouraged us to compete on Kaggle to apply what we had learnt to real-world problems. Since then, I have occasionally participated in competitions I find interesting. And to be honest, I learnt most of my Python and R programming skills while Kaggling.
What made you decide to enter this competition?
After I passed my Ph.D. dissertation defense in early May, I had some spare time before starting my job at an Internet company. I decided that I should learn something new and, above all, get prepared for my job. Since my job will be about advertising and mostly NLP related, I thought this challenge would be a great opportunity to familiarize myself with some basic or advanced NLP concepts. This is the main reason that drove me to enter.
Another reason was that this dataset is not very large, which is ideal for practicing ensemble skills. While I have read papers about ensembling methods, I hadn't got very serious about ensembling in previous competitions. Usually, I would try very simple (weighted) averaging. I thought this was a good chance to try some of the methods I had read about, e.g., stacked generalization and ensemble selection.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
The documentation and code for my approach are available here. Below is a high level overview of my method.
Figure 1. Flowchart of my method
For preprocessing, I mainly performed HTML tag dropping, word replacement, and stemming. For supervised learning, I used ensemble selection to generate an ensemble from a model library. The model library was built with models trained using various algorithms, various parameter settings, and various feature sets. I used Hyperopt (usually used for parameter tuning) to choose parameter settings from a pre-defined parameter space for training different models.
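As a rough illustration of the idea behind ensemble selection, here is a minimal sketch of greedy forward selection with replacement over a library of validation predictions. It is not the code from the competition: the function names, the use of MSE as the selection loss, and the rule of stopping as soon as no model strictly improves the ensemble are all assumptions made for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Plain MSE; stands in for whatever metric the ensemble should optimize."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def ensemble_selection(library_preds, y_valid, n_rounds=10, loss=mean_squared_error):
    """Greedy forward selection with replacement (illustrative sketch).

    library_preds: dict mapping a model name to its validation predictions.
    Returns the selected names (a repeated name acts as a larger weight)
    and the validation loss of the averaged ensemble.
    """
    selected = []
    ensemble_sum = [0.0] * len(y_valid)  # running sum of selected predictions
    best_loss = float("inf")

    for _ in range(n_rounds):
        best_name, best_candidate_loss = None, best_loss
        size_after_add = len(selected) + 1
        for name, preds in library_preds.items():
            # Average of the current ensemble plus this candidate model.
            candidate = [(s + p) / size_after_add
                         for s, p in zip(ensemble_sum, preds)]
            candidate_loss = loss(y_valid, candidate)
            if candidate_loss < best_candidate_loss:
                best_name, best_candidate_loss = name, candidate_loss
        if best_name is None:
            break  # assumed stopping rule: no model improves the ensemble
        selected.append(best_name)
        ensemble_sum = [s + p
                        for s, p in zip(ensemble_sum, library_preds[best_name])]
        best_loss = best_candidate_loss

    return selected, best_loss


# Toy library: two biased models whose errors cancel, and one poor model.
toy_library = {
    "model_a": [1.5, 2.5, 3.5, 4.5],
    "model_b": [0.5, 1.5, 2.5, 3.5],
    "model_c": [4.0, 3.0, 2.0, 1.0],
}
chosen, valid_loss = ensemble_selection(toy_library, [1, 2, 3, 4])
```

In the toy example the two biased models are picked because their average matches the targets, while the poor model is never added. Because selection is with replacement, a strong model can be chosen several times, which effectively gives it a larger weight in the average.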
I have tried various objectives, e.g., MSE, softmax, and pairwise ranking. MSE turned out to be the best with an appropriate decoding method. The following is the decoding method I used for MSE (i.e., regression):
Calculate the pdf/cdf of the median relevance levels: 1 is about 7.6%, 1 + 2 is about 22%, 1 + 2 + 3 is about 40%, and 1 + 2 + 3 + 4 is 100%.
Rank the raw predictions in ascending order.
Set the first 7.6% to 1, 7.6% - 22% to 2, 22% - 40% to 3, and the rest to 4.
In CV, the pdf/cdf is calculated using the training fold only, and in the final model training, it is computed using the whole training data.
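The decoding steps above can be sketched in a few lines of Python. This is an illustrative version, not the code from the competition: the function names are hypothetical, and the rounding of the cut points and the handling of tied predictions (broken by original position) are assumptions.

```python
from collections import Counter


def relevance_cdf(train_labels, levels=(1, 2, 3, 4)):
    """Cumulative share of each relevance level in the training labels."""
    counts = Counter(train_labels)
    total = len(train_labels)
    cdf, running = [], 0
    for level in levels:
        running += counts.get(level, 0)
        cdf.append(running / total)
    return cdf


def decode_by_cdf(raw_preds, cdf, levels=(1, 2, 3, 4)):
    """Rank raw predictions ascending and assign levels at the CDF cut points."""
    n = len(raw_preds)
    # Indices of the predictions, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: raw_preds[i])
    # Rank position at which each level's block ends (rounding is an assumption).
    cutoffs = [int(round(c * n)) for c in cdf]
    cutoffs[-1] = n  # the last level always takes the remainder
    decoded = [levels[-1]] * n
    start = 0
    for level, end in zip(levels, cutoffs):
        for rank in range(start, end):
            decoded[order[rank]] = level
        start = max(start, end)
    return decoded


# CDF quoted in the interview: 7.6%, 22%, 40%, 100%.
interview_cdf = [0.076, 0.22, 0.40, 1.0]
raw = [3.9, 1.2, 2.5, 3.1, 3.7, 2.9, 3.95, 1.8, 3.5, 3.3]
labels = decode_by_cdf(raw, interview_cdf)
```

With ten predictions, the cut points round to ranks 1, 2, and 4, so the lowest prediction becomes 1, the next becomes 2, the next two become 3, and the remaining six become 4. In cross-validation, `relevance_cdf` would be called on the training fold only, as described above.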