How Much Did It Rain? Winner's Interview: 1st place, Devin Anzelmo
An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog post, he gets into the details of his approach and shares key visualizations (with code!) from his analysis.
351 players on 321 teams built models to predict probabilistic distributions of hourly rainfall.
The Basics

What was your background prior to entering this challenge?

My background is primarily in cognitive science and biology, but I dabbled in many different areas while in school. My particular interests are in human learning and behavior, and in how we can use traces of human activity to learn how to shape future actions.
Devin's profile on Kaggle
My interest in gaming as a means of teaching, together with my competitive nature, has made Kaggle a great fit for my learning style. I started competing seriously on Kaggle in October 2014. I did not have much experience with programming or applied machine learning, and thought entering a competition would provide a structured introduction. Once I started competing, I found I had a difficult time stopping.
What made you decide to enter this competition?

I thought there was a decent chance I could get into the top five, and this drove me to enter. After finishing the BCI competition I had to decide between the Otto Group product challenge and this one. I chose How Much Did It Rain? because the dataset was difficult to process and it wasn't obvious how to approach the problem; these factors favored my skills. I didn't feel I could compete in Otto, where the determining factor was primarily going to be ensembling skill.
Let's Get Technical

What preprocessing and supervised learning methods did you use?

Most of the preprocessing was just feature generation. Like most other competitors I used descriptive statistics and counts of the different error codes. These made up the bulk of my features and turned out to be enough to get first place. We were given QC'd reflectivity data, but instead of using this information to limit the data used in feature generation, I included it as a feature and let the learning algorithm (Gradient Boosted Decision Trees) use it as needed.
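A minimal sketch of this kind of feature generation in pandas: aggregate each gauge reading's radar scans into descriptive statistics, and count each error code separately so the trees can use them. The column names (`Id`, `Reflectivity`) and the error-code values here are illustrative assumptions, not the competition's exact schema.

```python
import pandas as pd

# Hypothetical sentinel values marking radar errors (illustrative only).
ERROR_CODES = [-99000.0, -99901.0, -99903.0]

def make_features(scans: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-scan rows into one feature row per gauge reading."""
    ids = scans["Id"].unique()
    valid = scans[~scans["Reflectivity"].isin(ERROR_CODES)]
    # Descriptive statistics over the valid scans in each hour.
    stats = (
        valid.groupby("Id")["Reflectivity"]
        .agg(["mean", "std", "min", "max", "count"])
        .reindex(ids)  # keep rows whose scans were all errors
    )
    # Count each error code separately; trees can exploit these directly.
    for code in ERROR_CODES:
        stats[f"err_{int(code)}"] = (
            scans.loc[scans["Reflectivity"] == code].groupby("Id").size()
        )
    return stats.fillna(0.0)
```

Passing the error information through as features, rather than filtering on it, leaves the decision of when to trust a scan to the model.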
The most important decision with regard to supervised learning was how to model the output probability distribution. I decided to transform the problem into a multi-class classification problem with soft output. Since there was not enough data to perform classification with the full 70 classes, the problem had to be reduced further. It turned out there were many different ways that people solved this, and I highly recommend reading the end-of-competition thread for some other approaches.
See the code on Scripts
I ended up using a simple method in which basic component probability distributions were combined using the output of a classification algorithm. For classes that had enough data, a step function was used as the component CDF. Where there was less data, several labels were combined and replaced by a single value; in that case, an estimate of the empirical distribution for the merged class was used as the component CDF. This method worked well and I used it for most of the competition. I did try regression and classification on just the data from the minority classes, but it never performed quite as well as simply using the empirical distribution.
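The mixture idea above can be sketched as follows: each class contributes a component CDF over the 70 rainfall thresholds (a step function for well-populated classes, an empirical CDF for a merged rare class), and the classifier's soft output weights them. The threshold range and function names are assumptions for illustration.

```python
import numpy as np

THRESHOLDS = np.arange(70)  # submission asks for P(rain <= n mm), n = 0..69

def step_cdf(value_mm: float) -> np.ndarray:
    """Step-function CDF for a class represented by a single rain amount."""
    return (THRESHOLDS >= value_mm).astype(float)

def empirical_cdf(samples_mm: np.ndarray) -> np.ndarray:
    """Empirical CDF estimated from training labels of a merged rare class."""
    return np.array([(samples_mm <= t).mean() for t in THRESHOLDS])

def combine(class_probs: np.ndarray, component_cdfs: np.ndarray) -> np.ndarray:
    """Mixture CDF: weight each component CDF by the classifier's soft output."""
    return class_probs @ component_cdfs
```

Because every component is a valid CDF and the class probabilities are non-negative and sum to one, the mixture is automatically monotone and ends at 1, so no post-hoc correction of the submitted distribution is needed.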
What was your most important insight into the data?

Early in the competition I discovered that it was helpful to split the data based on the number of radar scans in each row. Each row has data spanning the hour previous to the rain gauge reading. In some cases there was only one radar scan; in others there were more than 50. There are over one hundred thousand rows in the training set with more than 17 radar scans. For this data I wanted to create features that take into account the changing of weather conditions over time. In doing this I realized it was not possible to make these features for the rows that had only 1 or 2 radar scans. This was the initial reason for splitting the dataset. When I started looking for places to split it, I found that there was also a strong positive correlation between the number of radar scans and the average rain amount. Of the rows with a single scan, 95% had 0mm of rain, while in the subset with 17 or more scans only 48% had 0mm. Interestingly, for the data with few radar scans, many of the most important features were the counts of the error codes.
See the code on Scripts
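The split described above might be sketched like this: count the scans per hourly row and partition on a threshold, so that time-evolution features are only built for the many-scan subset. The column names (`Id`, `Expected`) and the threshold of 17 are taken as assumptions from the discussion, not a known implementation detail.

```python
import pandas as pd

def split_by_scan_count(scans: pd.DataFrame, threshold: int = 17):
    """Split hourly rows into few-scan and many-scan subsets.

    `scans` is assumed to have one row per radar scan, with a gauge "Id"
    and the hourly label "Expected" repeated on every scan (illustrative
    schema). Many-scan rows support features describing how the weather
    evolves over the hour; rows with 1-2 scans cannot, and are also far
    more likely to be dry hours.
    """
    counts = scans.groupby("Id").size().rename("n_scans")
    labels = scans.groupby("Id")["Expected"].first()
    per_row = pd.concat([counts, labels], axis=1)
    few = per_row[per_row["n_scans"] < threshold]
    many = per_row[per_row["n_scans"] >= threshold]
    return few, many
```

Checking the fraction of 0mm labels in each returned subset is what surfaces the correlation the interview mentions between scan count and rain amount.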