Matt Clifford - DEN
My first hackathon was a great experience. I worked with one local teammate on site in Denver and two remote teammates in Seattle. We had from 10am to 5pm to submit predictions. I found that the tight timeline and working with a remote team energized the process significantly, creating a sense of focus and creativity it can be difficult to sustain throughout longer-term projects.
The project was loosely based on the management concept of "Good-Fast-Cheap". To tailor this concept to data science, teams were given constraints on samples, features, or algorithm. Our constraint was algorithmic: we were required to use a Random Forest classifier. We had our choice of features and samples, so we used the largest of the available datasets.
Our SF colleague Anne shared a helpful pre-analysis tool that creates “HTML profiling reports from pandas DataFrame objects.”
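For anyone curious, here is a minimal sketch of how we would invoke it, assuming the tool is the pandas-profiling package (its tagline matches the quoted description); the CSV filename is made up.

```python
# Sketch only: assumes the tool is pandas-profiling and a hypothetical filename.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("adult_census.csv")                      # census extract (hypothetical name)
profile = ProfileReport(df, title="Census Income Pre-Analysis")
profile.to_file("census_profile.html")                    # one self-contained HTML report
```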
This set had a class imbalance of about 75/25 in favor of the negative class (wage <=50K). We used the SMOTE technique, following a post by NVIDIA AI researcher Nick Becker, to oversample the minority class. As Becker discusses, it is essential to oversample AFTER applying the train-test split to the data. Otherwise, information will 'bleed' from the train set into the test set. SMOTE would not introduce outright duplicate observations into the test set, but it does use nearest neighbors of minority records to generate synthetic data, which would result in partial leakage. This might manifest as artificially accurate validation predictions for test observations that neighbor the synthetic points.
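A sketch of that ordering, with X and y standing in for the feature matrix and the binary >50K target built from the DataFrame above:

```python
# Split first, then oversample only the training fold (the ordering Becker describes).
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# SMOTE synthesizes minority-class rows from their nearest neighbors, so it
# must only ever see training data; the test set stays untouched.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```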
As our SEA colleagues churned out engineered features, we continued to tune our model. To let us focus on tuning, and to potentially compare our constrained model choice with disallowed techniques in a short time span, we started from Berlin-based NLP researcher David Batista's code. This approach produces a clean multi-model summary report that facilitates comparison across all of the .cv_results_ outputs from each tested scikit-learn model. We were technically allowed to use ExtraTrees models, so this was also helpful for tuning those in tandem. Batista's code employs GridSearchCV as the driver by default. We briefly explored driving with RandomizedSearchCV as well, but determined that this method was not likely to significantly inform our process under the circumstances. We found Mohtadi Ben Fraj's piece on Random Forest tuning helpful for its discussion of each hyperparameter.
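A condensed sketch in the spirit of that helper follows; the estimators and grids shown are illustrative, not the exact ones we ran, and it reuses the resampled training data from the SMOTE sketch above.

```python
# One GridSearchCV per estimator, with every cv_results_ row collected into a
# single comparison table.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "extra_trees": ExtraTreesClassifier(random_state=42),
}
param_grids = {
    "random_forest": {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    "extra_trees": {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
}

summaries = []
for name, model in models.items():
    gs = GridSearchCV(model, param_grids[name], cv=5, scoring="accuracy", n_jobs=-1)
    gs.fit(X_train_res, y_train_res)          # resampled training fold from earlier
    results = pd.DataFrame(gs.cv_results_)
    results["estimator"] = name               # tag rows so models can be compared side by side
    summaries.append(results)

report = pd.concat(summaries).sort_values("mean_test_score", ascending=False)
```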
We also considered using itertools to grid search various combinations of promising features, but ultimately decided to skip this. We might like to add this as a feeder function to the Batista code in a future iteration, probably in a selective/prioritized manner informed by something like the pre-analysis code Anne shared.
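A hypothetical feeder might look something like this; the candidate feature names are purely illustrative.

```python
# Enumerate feature subsets with itertools; each subset could then drive one
# grid-search run in the helper above.
from itertools import combinations

candidate_features = ["capital_net", "married", "white_collar", "husband",
                      "education-num", "hours-per-week"]   # illustrative names

feature_sets = [
    list(combo)
    for size in range(3, len(candidate_features) + 1)
    for combo in combinations(candidate_features, size)
]
```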
We would also like to extend the pipeline summary report to handle performance with/without SMOTE and other preprocessing techniques, and to generate plots to compare speed/fit/generality of the models.
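One simple way to start on the SMOTE comparison, reusing the names from the sketches above, would be to run the same grid search on resampled and unresampled training data and tag each cv_results_ row with its preprocessing variant:

```python
# Compare with/without SMOTE by tagging each grid-search result set.
variants = {
    "with_smote": (X_train_res, y_train_res),
    "no_smote": (X_train, y_train),
}

tagged = []
for label, (X_tr, y_tr) in variants.items():
    gs = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grids["random_forest"], cv=5, n_jobs=-1)
    gs.fit(X_tr, y_tr)
    res = pd.DataFrame(gs.cv_results_)
    res["preprocessing"] = label
    tagged.append(res)

smote_comparison = pd.concat(tagged)
```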
We would also like to funnel the winning model(s) from the pipeline into a more rigorous analysis, including AUC-ROC and a confusion matrix, with particular attention to recall (sensitivity).
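That follow-up might look roughly like this, with best_rf standing in for a hypothetical model refit on the winning parameters and evaluated on the untouched test set:

```python
# Sketch of the more rigorous evaluation on held-out data.
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_pred = best_rf.predict(X_test)
y_proba = best_rf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))     # per-class precision and recall
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
```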
Drew Dyson - DEN
Our very first Hack-a-thon! It wasn't what I had imagined when I first heard the term. I had pictured ten people sitting in a fluorescent-lit basement full of shadows, trying to hack into some company or government agency. Our Hack-a-thon was instead an attempt to 'hack' the problem of classifying an individual based on their income. To keep the competition interesting, each team had a constraint on its data science process. Ours was that we could only use a Random Forest as our final model. With our challenge set, we started working.
The key to our success was the time we spent communicating on Slack with our colleagues in Seattle. Instead of some faceless user ID helping us 'hack' into the Pentagon, we had two solid colleagues in Seattle (luckily with profile pictures) helping us divide and conquer the problem. I was concerned about how the communication would affect our ability to work cohesively, but to my surprise it was smooth and effective. Everyone knew their role and executed perfectly. The team in Seattle churned out impressive and effective features that combined the existing features we were given in the dataset. Some of the most effective features they produced were Net Capital, Marriage, White-Collar Job, and Husband. Marriage, White-Collar Job, and Husband all took categorical data and binarized it, allowing us to feed it into our model. Marriage and White-Collar Job each collapsed several different classifications into one entity, making both of them more effective than any one classification by itself. These inventive features were the key to our success.
In Denver we focused on the modeling side. Matt was the brains of the operation; he pulled up several articles that addressed issues that came up during the process. One issue he was able to solve for us was the large class imbalance: our target class made up only 25% of the train set, which was causing our model to over-classify in favor of the majority class. He suggested we use SMOTE, an oversampling method that compensates for class imbalance. Matt also built some incredible pipelines that allowed us to essentially grid-search hyperparameters and feature combinations over several models all in one go. While Matt was building his pipelines, I worked on examining and testing the features coming out of Seattle to determine which ones had the greatest effect on the existing model. In the end, once the pipelines were built and the new features consolidated, we successfully implemented a Random Forest model that classified whether a person made more or less than $50K a year.
David B. Elliott - SEA
From reading a bit about the wage gap, I know that married people generally make more than unmarried people. It was therefore an easy step to transform the marital-status column into a Boolean column. Being married in some form or another had a surprisingly high correlation with wages.
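Something like the following, where the category strings come from the census marital-status column but the grouping and the wage_over_50k target name are my own shorthand:

```python
# Collapse the marital-status categories into a single married/not-married flag.
# df is the census DataFrame from earlier; wage_over_50k is a hypothetical
# name for the binary target column.
married_values = {"Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"}
df["married"] = df["marital-status"].isin(married_values).astype(int)

# quick look at the correlation mentioned above
print(df[["married", "wage_over_50k"]].corr())
```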
Occupation did not readily lend itself to obvious categories. Turning the categoricals into dummies did not yield much. A colleague, Mark, however, would find some meaning in it.
Next on the docket was education. This was pretty easy to split into categories, with few surprising results. While all levels were only lightly correlated with earnings, more education was more highly correlated with income over $50,000. Having a Master's degree was, surprisingly, less correlated than a Bachelor's.
Since being male and married are still positive indicators of income level, I decided to see how strongly being a husband correlated with earning potential. Quite strongly, as it turns out.
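In the census extract we used, the relationship column already carries a "Husband" category, so the indicator is roughly a one-liner:

```python
# Binary husband indicator from the existing relationship column.
df["husband"] = (df["relationship"] == "Husband").astype(int)
print(df[["husband", "wage_over_50k"]].corr())   # wage_over_50k as above, hypothetical name
```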
I wanted to look at an interaction of the number of hours worked and education, figuring that full-time and educated would be a good indicator of the ability to earn greater than 50k. I was right. I created a column that gave the employment level as a percentage of full-time (40 hours/week) and multiplied that by the numeric education level column. Quite a high correlation resulted, which was also quite unsurprising.
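In code, roughly:

```python
# Employment level as a fraction of a 40-hour week, times the numeric
# education level, as described above (column names per the census extract).
df["pct_full_time"] = df["hours-per-week"] / 40
df["ft_x_education"] = df["pct_full_time"] * df["education-num"]

print(df[["ft_x_education", "wage_over_50k"]].corr())
```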
Mark Carr - SEA
I am currently enrolled as a student in General Assembly's Data Science Immersive program, and our most recent project was a doozy.
Our group was tasked with sifting through a large US Census Bureau database, understanding its contents, extracting the most important features that would create a 'recipe' for making more than $50,000/yr, and then generating a Random Forest classification model to predict a class of ">$50k" based on those descriptors. All in a window of a half-day with a team of four people scattered across the country!
The project started with an immediate implementation of a Random Forest model fed only a few columns. The predictive power of our model wasn't where it needed to be, so from there the task was feature engineering. As we created new features, we appended them to our model.
The initial EDA was enlightening. 'Age,' 'education,' and 'work' were the primary predictors showing higher correlations with "wages." However, on their own they were vague and not especially accurate, and there were more insights we could squeeze out of the data.
Instead of "occupations," we stuck people's jobs into bins according to "white collar," "blue collar," and "pink collar." In the end, "white collar" became a feature that squeezed a little bit more out of our accuracy score. We also used new categorical labels to create features like "capital-net" (derived from capital-gain and capital-loss), "school" (derived from finished-education categories), and "age-bracket." There was incremental but positive change to the accuracy.
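An illustrative sketch of that binning follows; the category groupings, bracket edges, and the net-capital definition (reading "derived from capital-gain and capital-loss" as a difference) are my own rough choices, not necessarily the exact mapping we used.

```python
# df is the census DataFrame from the earlier sketches.
import pandas as pd

white_collar = ["Exec-managerial", "Prof-specialty", "Adm-clerical", "Sales", "Tech-support"]
blue_collar = ["Craft-repair", "Machine-op-inspct", "Transport-moving",
               "Handlers-cleaners", "Farming-fishing"]
pink_collar = ["Other-service", "Priv-house-serv", "Protective-serv"]

def collar(occupation):
    """Map an occupation string to a collar bin (rough, assumed grouping)."""
    if occupation in white_collar:
        return "white"
    if occupation in blue_collar:
        return "blue"
    if occupation in pink_collar:
        return "pink"
    return "other"

df["collar"] = df["occupation"].apply(collar)
df["white_collar"] = (df["collar"] == "white").astype(int)

# net capital, treating capital-loss as a positive amount to subtract
df["capital_net"] = df["capital-gain"] - df["capital-loss"]

# coarse age brackets
df["age_bracket"] = pd.cut(df["age"], bins=[0, 25, 35, 45, 55, 65, 100],
                           labels=["<25", "25-34", "35-44", "45-54", "55-64", "65+"])
```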
There is never enough time to explore everything. However, we also got a chance to see how aspects of people's lives, like marriage, gender, education, and age, affect their wages. To me there is never a dull moment if you are learning something new. I enjoyed this project a lot; I just wish there had been more time. The point of the project was to teach us about the "good-fast-cheap" conundrum, and it certainly rang true here.