
ICDM Winner's Interview: 3rd place, Roberto Diaz

2021-03-25 Windows Programs



This summer, the ICDM 2015 conference sponsored a competition focused on making individual user connections across multiple digital devices. Top teams were invited to submit a paper for presentation at an ICDM workshop.

Roberto Diaz, competing as team "CookieMonster", took 3rd place. In this blog, he shares how he became a Kaggle addict, what he values in a competition, and most importantly, details on his approach to this unique dataset. Congrats to Roberto for achieving his goal of becoming a top 100 Kaggle user!

407 players on 340 teams competed in ICDM 2015: Drawbridge Cross-Device Connections

The Basics

What was your background prior to entering this challenge?

In addition to being a Kaggle addict, I am a researcher at Treelogic working in the machine learning area. In parallel, I am working on my PhD thesis at the University Carlos III de Madrid, focused on the parallelization of kernel methods.

Roberto's Kaggle profile

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I didn't have any knowledge of this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there are no public datasets.

How did you get started competing on Kaggle?

I started with the first Facebook competition a long time ago. A friend of mine was taking part in the challenge and encouraged me to compete. That sparked my curiosity, so I went to the challenge's forum, read a post with a solution that scored quite well on the leaderboard, and thought, "I think I can do better than that." In the end I finished 9th on the leaderboard.

For my second challenge (the EMC Israel Data Science Challenge) I was on a team with my PhD mates. We finished 3rd and received a prize.

After that it was too late for me: I had become an addict.

What made you decide to enter this competition?

The things I value most in a challenge are:

Díaz-Morales, R., & Navia-Vázquez, A. (2015, September). Optimization of AMS using Weighted AUC optimized models. In *JMLR: Workshop and Conference Proceedings*, Vol. 42, pp. 109-127.

A domain unknown to me: It is the best way to learn about how to work with a different kind of data.

The need to preprocess and extract the features from raw data to build the dataset: It gives you the chance to use your intuition and imagination.

This challenge looked very interesting to me because all the conditions were met.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

In this challenge we had a list of devices and a list of cookies, and we had to tell which cookies belonged to the person using each device.

The most important part was the feature extraction procedure. The features had to contain information about the relation between devices and cookies (for example, the number of IP addresses visited by each one and by both of them).
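To make the idea of relational features concrete, here is a minimal Python sketch of the example mentioned above: counting the IP addresses seen for the device, for the cookie, and for both. The function name `pair_features`, the feature names, and the sample IP addresses are illustrative assumptions, not taken from the competition code.

```python
def pair_features(device_ips, cookie_ips):
    """Build relational features for one device/cookie pair.

    device_ips, cookie_ips: iterables of IP address strings seen
    for the device and for the cookie (hypothetical inputs).
    """
    device_set = set(device_ips)
    cookie_set = set(cookie_ips)
    shared = device_set & cookie_set  # IPs visited by both
    return {
        "n_device_ips": len(device_set),
        "n_cookie_ips": len(cookie_set),
        "n_shared_ips": len(shared),
    }


# Made-up data: the device and the cookie have two IPs in common.
features = pair_features(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    ["10.0.0.2", "10.0.0.3", "10.0.0.9"],
)
print(features)
```

A real feature set would combine counts like these with the device and cookie attributes before training a classifier.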

Once I had the features, I tried both simple and complex supervised machine learning algorithms (my winning methodology was a semi-supervised learning procedure using Gradient Boosting + Bagging), and the score only rose from 0.865 to 0.88.

What was your most important insight into the data?

A key part of the solution was the initial selection of candidates and the post processing:

Initial selection: It was not possible to create a training set containing every combination of devices and cookies due to the high number of them. In order to reduce the initial complexity of the problem and to create an affordable dataset, some basic rules were created to obtain an initial reduced set of candidate cookies for every device. The rules are based on the IP addresses that both device and cookie have in common and how frequent they are in other devices and cookies.

Supervised Learning: Every pattern in the training and test set represents a device/candidate cookie pair obtained by the previous step and contains information about the device (Operating System (OS), Country, ...), the cookie (Cookie Browser Version, Cookie Computer OS, ...) and the relation between them (number of IP addresses shared by both device and cookie, number of other cookies with the same handle as the cookie, ...).

Post Processing: If the initial selection of candidates did not find a candidate with enough likelihood (the logistic output of the classifier), we chose a new set of candidate cookies by selecting every cookie that shares an IP address with the device, and we scored them using the classifier.
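The candidate selection and the fallback step can be sketched roughly as follows. This is only an illustration under stated assumptions: the actual rules are not given in the interview, so the "crowded IP" cutoff (`max_cookies_per_ip`), the likelihood `threshold`, the choice to return the single best-scoring cookie, and the toy scores standing in for the classifier output are all hypothetical.

```python
from collections import defaultdict


def build_ip_index(cookie_ips):
    """Map each IP address to the set of cookies seen on it."""
    index = defaultdict(set)
    for cookie, ips in cookie_ips.items():
        for ip in ips:
            index[ip].add(cookie)
    return index


def candidates(device_ips, ip_index, max_cookies_per_ip=None):
    """Return the cookies that share at least one IP with the device.

    If max_cookies_per_ip is set, IPs seen with more cookies than
    that (assumed here to be public or crowded IPs) are skipped.
    """
    found = set()
    for ip in device_ips:
        cookies = ip_index.get(ip, set())
        if max_cookies_per_ip is not None and len(cookies) > max_cookies_per_ip:
            continue
        found |= cookies
    return found


def pick_best_cookie(device_ips, ip_index, score, threshold=0.5,
                     max_cookies_per_ip=2):
    """Score the restricted candidates, falling back to every cookie
    sharing an IP when none reaches the threshold."""
    scored = {c: score(c) for c in
              candidates(device_ips, ip_index, max_cookies_per_ip)}
    if not scored or max(scored.values()) < threshold:
        # Fallback: relax the rules and score all IP-sharing cookies.
        scored = {c: score(c) for c in candidates(device_ips, ip_index)}
    if not scored:
        return None
    return max(scored, key=scored.get)


# Made-up data: three cookies share a crowded IP, one sits on a quiet IP.
cookie_ips = {
    "c1": {"ip_home"},
    "c2": {"ip_cafe"},
    "c3": {"ip_cafe"},
    "c4": {"ip_cafe"},
}
index = build_ip_index(cookie_ips)
toy_scores = {"c1": 0.9, "c2": 0.2, "c3": 0.7, "c4": 0.4}
score = toy_scores.get  # stand-in for the classifier's logistic output

best_fallback = pick_best_cookie({"ip_cafe"}, index, score)
best_direct = pick_best_cookie({"ip_home", "ip_cafe"}, index, score)
print(best_fallback, best_direct)
```

In the first call the only shared IP is crowded, so the restricted candidate set is empty and the fallback scores every cookie on that IP. In the second call the restricted set already contains a confident match, so no fallback is needed.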
