Introduction

The goal of this project is to find a machine learning model that can take in “visual” predictors, such as color or type of clothing, and output a predicted number of likes.

Data Set Overview

The data set that I am working with is the Polyvore Dataset, which was gathered for the paper “Learning Fashion Compatibility with Bidirectional LSTMs” by Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. The GitHub page containing the data set also includes a link to the paper.

The files containing the data have been pre-separated into training, validation, and testing sets, containing 17,316, 1,497, and 3,076 observations, respectively.

A brief overview of the website where this data comes from: Polyvore.com was a website where people could put together, upload, and share outfits using real clothing items (it has since been acquired and subsequently repurposed for selling clothes). Posts would look something like this:

Each observation in the data set is one of these outfits, which includes general information such as the view count, likes, and a link to the image, as well as each article of clothing/accessory making up the outfit. Each piece in the outfit has its own info, including the index (position in the outfit from top to bottom), name, price, and likes.

The predictors include up to 8 different outfit elements, with each element having 6 attributes; in other words, there can be up to 48 different predictors. To make this data usable for the machine learning tools I will be using, I need to do some pre-processing. I’ve decided to do this step in Python, and there is a decent amount of code involved. I’ll go over the basic steps taken to arrive at the final data frame, but the Jupyter notebook with the full process is available on my GitHub page if you’re interested.

Data Pre-Processing

First, I imported the data set and merged the pre-split training, validation, and testing sets so that all of the data is in one place. Looking at the data frame produced from the JSON data, we can see that the items column contains dictionaries as its entries.

In order to “unpack” this column, I made a function that takes the items column as input and outputs a new data frame containing all of the information expanded out into columns.

I merged this data frame with the original and dropped the columns I knew I wasn’t going to use. The next thing I needed to do was extract information from the item names, and I decided to focus on color. I made a function that looks at the words in all of the item name columns and outputs a list of the unique colors it finds in the text.

I applied this function to the data frame and turned the unique colors into 4 columns, color_1, color_2, color_3, and color_4. I then dropped the item name columns since I no longer needed them.

The original Polyvore Dataset came with a text file, category_id.txt, which lists the category ids and the item types they correspond to. I used this file to map each item’s category id to its item type to make the data more readable.

Finally, I exported the data frame as a csv which I will be importing into R.

Missing Values

You might have noticed in the pre-processing step that some colors and item names are “none”. The reason for this is that in the original data set, the nested lists are not all the same length; the outfits can each have 4 to 8 items. For the ones with fewer than 8 items, I had to fill the remaining slots with a placeholder, but I didn’t want to treat it as “missing”, since the fact that an outfit only has 4 items can itself be useful information. The same goes for the color columns: some outfits only had 1, 2, or 3 unique colors. Again, the fact that there’s only one color present in the outfit can be useful, so I’ve kept “none” as its own category.

There was no missing data in the original data set, but there were some outfits from which my function couldn’t extract any color, so I dropped them from the final set.

Exploratory Data Analysis

Now that we’ve tidied the data into a format better suited for our analysis, we can begin by making some plots to see how the data is distributed.

First we will import the data and convert the appropriate variables into factors.

# Reading the tidied data into R
outfits <- read.csv('polyvore_data/tidy_data.csv')

# converting all character columns into factors
chr_cols <- sapply(outfits, is.character)
outfits[chr_cols] <- lapply(outfits[chr_cols], as.factor)
# I will only be using set_id to retrieve photos, so we will leave it be

outfits %>% head(5)

In the final data set, there are 15 variables:

  • views: the number of times an outfit has been viewed
  • likes: the number of likes an outfit received (the response variable)
  • set_id: identifies the particular outfit
  • item_1 - item_8: the item type for up to 8 items in an outfit
  • color_1 - color_4: the unique colors found in the outfit, for up to 4 different colors

Likes

We will start by looking at the distribution of our response variable, likes.

outfits %>% 
  ggplot(aes(x = likes)) + 
  geom_histogram(binwidth = 100, fill="#2C514C") +
  labs(x = 'Number of Likes', y = 'Count', title = 'Distribution of the Number of Likes') +
  coord_cartesian(xlim = c(0, 4000)) +
  theme_polyvore()

The number of likes ranges from 0 to around 3,000, and the frequency falls off sharply as the number of likes goes up. This follows the sort of logic you’d expect: a select few outfits get lots of likes while most go relatively unseen.
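A quick numeric summary (output omitted here) is an easy way to confirm the range and the skew; for a right-skewed variable like this, the mean should sit well above the median.

# Five-number summary plus mean of the response variable
summary(outfits$likes)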

Items (Clothing Articles)

After an initial survey of the raw data, I noted that there were a lot of different item types.

# Finds the number of unique items in item_1 column
outfits$item_1 %>% unique() %>% length()
## [1] 229

And indeed, there are 229 unique items in item_1 alone! To further examine the nature of the item columns, we will look at a plot of the item frequencies in item_1. Additionally, we’ll do a “zoom in” of the top 10 items by frequency in item_1.

outfits %>% ggplot(aes(x = fct_infreq(item_1))) +
  geom_bar(fill = '#F08080') +
  labs(x = 'Item 1', y='', title = 'Frequencies of item_1 Items') +
  theme_polyvore() +
  theme(axis.text.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        axis.line = element_blank()) -> p1

item1_t10 <- (sort(table(outfits$item_1), decreasing = TRUE) %>% names)[1:10]
outfits_item1_top10 <- outfits[outfits$item_1 %in% item1_t10,]

outfits_item1_top10 %>% 
  ggplot(aes(y = fct_infreq(item_1))) +
  geom_bar(fill="#583742") +
  labs(x = '', y='', title = 'Top 10 item_1 Items') +
  theme_polyvore() +
  theme(axis.line = element_blank()) -> p2

grid.arrange(p1, p2, ncol=2)

The first plot shows that occurrences of the 229 items in item_1 are concentrated towards the top few items and fall off significantly after. We can see from the second plot that the top items of item_1 are all tops, dresses, and jackets of some type. This similarity gives us an idea of how the items have been organized, and we can get a better picture from constructing a table.

# Helper: the 5 most frequent levels of a column
top5 <- function(col) (sort(table(col), decreasing = TRUE) %>% names)[1:5]

item_freq_table <- data.frame(lapply(outfits[paste0("item_", 1:8)], top5))
item_freq_table %>% gt() %>% 
  tab_header(
    title = "Top 5 Items by Frequency"
  )
Top 5 Items by Frequency
  item_1: Tops, Sweaters, Day Dresses, T-Shirts, Blouses
  item_2: Jackets, Coats, Shorts, Skinny Jeans, Knee Length Skirts
  item_3: Sandals, Ankle Booties, Sneakers, Pumps, Skinny Jeans
  item_4: Shoulder Bags, Ankle Booties, Clutches, Handbags, Pumps
  item_5: None, Earrings, Shoulder Bags, Necklaces, Bracelets & Bangles
  item_6: None, Earrings, Necklaces, Sunglasses, Bracelets & Bangles
  item_7: None, Sunglasses, Earrings, Lipstick, Tech Accessories
  item_8: None, Lipstick, Sunglasses, Fragrance, Tech Accessories

From looking at both the raw data and this table, a general pattern emerges: within each outfit, the items are sorted from most to least significant, with item_1 being the most significant item and item_8 being the least. The top items in item_2 are still significant, with jackets being the most frequent and “bottoms” being common elements as well (shorts, jeans, skirts). item_3 dips into shoe territory, and item_4 through item_8 can be seen mostly as the “accessory section”. Another thing that is interesting to note from the table is that from item_5 and onward, “None” is the most frequent item. We can check the proportion of “None”s pretty easily.

# Gets the proportion of outfits with fewer than 4, 5, 6, 7, and 8 items,
# respectively (an outfit with fewer than n items has "None" in item_n)
props <- unname(sapply(paste0("item_", 4:8),
                       function(col) mean(outfits[[col]] == "None")))
props
## [1] 0.00000000 0.08148734 0.23016066 0.40670643 0.59554528

With some simple arithmetic (taking successive differences of these cumulative proportions; see the snippet below), this gives us a breakdown of the item count proportions:

  • 8% of the outfits have 4 items,
  • 15% have 5 items,
  • 18% have 6 items,
  • 19% have 7 items, and
  • 40% have 8 items
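Here is that arithmetic in one line: appending 1 to props (the proportion with fewer than 9 items, i.e. all of them) and taking successive differences gives the share of outfits with exactly 4 through 8 items.

# Share of outfits with exactly 4, 5, 6, 7, and 8 items, in percent
round(diff(c(props, 1)) * 100)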

Now we want to see what relationship the items have with likes. Before doing so, I will transform the data so that item_1 has only 10 categories: the 9 most frequent plus an “Other” category for everything else.

outfits2 <- outfits %>% mutate(item_1 = fct_lump_n(item_1, n = 9))
colors <- c('#335C57', '#583742', '#F08080',
             '#335C57', '#583742', '#F08080',
             '#335C57', '#583742', '#F08080',
             '#335C57')
             
outfits2 %>% 
  ggplot(aes(x=likes, y=item_1, group=item_1)) +
  geom_boxplot(fill = colors) +
  labs(y = '', title = 'Boxplot of item_1') +
  theme_polyvore() +
  theme(axis.title.y = element_blank()) -> p1

outfits2 %>% 
  ggplot(aes(x=likes, y=item_1, group=item_1)) +
  geom_boxplot(fill = colors) +
  scale_x_continuous(limits = c(0, 600)) +
  labs(y = '', title = 'Boxplot of item_1 (zoomed)') +
  theme_polyvore() +
  theme(axis.title.y = element_blank()) -> p2

grid.arrange(p1, p2, ncol = 2)

In the first boxplot, we can see the full range of likes. It’s hard to glean much from the boxes themselves, but we can see differences in the spread of outliers for each category. In the second boxplot, I’ve “zoomed in” by limiting the x axis so we can see the boxes better. Here, sweaters and day dresses have the highest median number of likes, while tank tops trail behind all of the other categories.

Colors

All of us understand that color plays an essential role in whether or not an outfit “works” (unless you’re doing wardrobe for a 1947 film noir), so it will be interesting to see what kind of colors are represented in the data.

colors1 <- c('#fb5c5c', '#4a687d', '#ffdfbf',
            'gray16', '#9b938c', '#e2ba59', '#628d55',
            '#643b34', '#e3774c', '#6e4258')

outfits %>% 
  ggplot(aes(y = fct_infreq(color_1))) +
  geom_bar(fill = colors1) +
  labs(x = '', y = '', title = 'Frequencies of Colors in color_1') +
  theme_polyvore() +
  theme(axis.line = element_blank(),
        panel.grid.major = element_blank()) -> p1

colors2 <- c('gray16', '#4e4c5e', '#ffdfbf', '#e2ba59',
             '#9b938c', '#643b34', '#4a687d', '#628d55',
             '#fb5c5c', '#e3774c', '#6e4258')

outfits %>% 
  ggplot(aes(y = fct_infreq(color_2))) +
  geom_bar(fill = colors2) +
  labs(x = '', y = '', title = 'Frequencies of Colors in color_2') +
  theme_polyvore() +
  theme(axis.line = element_blank(),
        panel.grid.major = element_blank()) -> p2

colors3 <- c('#4e4c5e', 'gray16', '#ffdfbf', '#e2ba59',
             '#9b938c', '#643b34', '#4a687d', '#6e4258',
             '#fb5c5c', '#628d55', '#e3774c')

outfits %>% 
  ggplot(aes(y = fct_infreq(color_3))) +
  geom_bar(fill = colors3) +
  labs(x = '', y = '', title = 'Frequencies of Colors in color_3') +
  theme_polyvore() +
  theme(axis.line = element_blank(),
        panel.grid.major = element_blank()) -> p3

colors4 <- c('#4e4c5e', 'gray16', '#9b938c', '#fb5c5c',
             '#e2ba59', '#4a687d', '#ffdfbf', '#6e4258',
             '#e3774c', '#643b34', '#628d55')

outfits %>% 
  ggplot(aes(y = fct_infreq(color_4))) +
  geom_bar(fill = colors4) +
  labs(x = '', y = '', title = 'Frequencies of Colors in color_4') +
  theme_polyvore() +
  theme(axis.line = element_blank(),
        panel.grid.major = element_blank()) -> p4

grid.arrange(p1, p2, p3, p4, ncol=2)

The color columns are similar to the item columns in the way they are ordered: color_1 will have higher importance than color_2 (and so on) in the outfit due to the way the information was extracted. This gives us some insight into the colors represented in these charts.

Red clearly reigns supreme for color_1, followed by blue, white, and black. “none” starts to dominate in color_2, indicating that there are lots of outfits for which only one color could be extracted (or that had a monochromatic scheme going on). In color_4, most of the outfits simply don’t have a color entry, suggesting that most outfits stuck to 1-3 extractable colors.
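We can quantify this directly. Assuming the empty level is spelled “none” in the tidied color columns (as it is in the final data set), the proportion of empty entries in each slot is:

# Proportion of outfits with no entry in color_2 through color_4
sapply(outfits[paste0("color_", 2:4)], function(col) mean(col == "none"))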

Next, we’ll take a look at the relationship between color and likes.

colors <- c('gray16', '#4a687d', '#643b34',
            '#9b938c', '#628d55', '#e3774c', '#6e4258',
            '#fb5c5c', '#ffdfbf', '#e2ba59')

outfits %>% 
  ggplot(aes(x=likes, y=color_1, group=color_1)) +
  geom_boxplot(fill=colors) +
  labs(title = 'Boxplot of color_1') +
  theme_polyvore() +
  theme(axis.title.y = element_blank()) -> p1

outfits %>% 
  ggplot(aes(x=likes, y=color_1, group=color_1)) +
  geom_boxplot(fill=colors) +
  scale_x_continuous(limits = c(0, 500)) + 
  labs(title = 'Boxplot of color_1 (zoomed)') +
  theme_polyvore() + 
  theme(axis.title.y = element_blank()) -> p2

grid.arrange(p1, p2, ncol=2)

The left graph shows behavior similar to the item_1 boxplot, with all of the colors concentrated near zero but with different spreads of outliers into the higher numbers of likes. In the right boxplot, we can see that orange has the highest median likes, while red and blue are towards the low end. This is interesting since red and blue were the most frequent colors and orange was the second least frequent, so being less common could give outfits with orange a slight edge.

Fitting Models

With the information gathered from the exploratory data analysis, we can start fitting models. There are a few key steps in fitting a model:

  • Split the data into training and testing
  • Specify a recipe for the model to use
  • Create a workflow for the model
  • Do any tuning necessary
  • Fit training data to the model(s)
  • Evaluate the best model, chosen based on its training performance metrics, on the testing data

Regression models will be used here because the response variable, likes, is numeric (we also have the interesting situation where all of our predictors are categorical). Because there are so many different categories in the item columns, we risk overfitting; for now, we will limit each of these columns to 10 categories.

Splitting the Data

Without further ado, we will start by splitting our data into training and testing sets. I’ve decided on a 70% training / 30% testing split, stratified on the likes variable. Stratifying ensures that the training and testing sets have similar distributions of likes; if the distributions were different, the training data could skew the models one way, leading to poor performance on the testing data.

# Lumping each item column into its top 9 categories; the rest become "Other"
outfits <- outfits %>% 
  mutate(across(starts_with("item_"), ~ fct_lump_n(.x, n = 9)))

set.seed(222)
outfits_split <- initial_split(outfits, prop=.70,
                               strata=likes)
outfits_train <- training(outfits_split)
outfits_test <- testing(outfits_split)
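As a quick sanity check (output omitted), we can compare the quantiles of likes in the two sets; with stratification, they should be close:

# Comparing the likes distribution across the training/testing split
quantile(outfits_train$likes)
quantile(outfits_test$likes)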

Creating a Recipe

The next step is to make a recipe that our models will use to fit the data. In our case, this is pretty analogous to constructing an outfit. Which colors should I include and how many? Should I wear a coat? What about boots? We can get this information from item_1 through item_8 and color_1 through color_4. I’m excluding views because even though it’s probably a good predictor, it isn’t really something you consider when picking out an outfit.

outfit_recipe <- recipe(likes ~ item_1 + item_2 + item_3 + item_4 +
                          item_5 + item_6 + item_7 + item_8 +
                          color_1 + color_2 + color_3 + color_4,
                        data = outfits) %>% 
    step_dummy(all_nominal_predictors()) %>% 
    step_interact(terms = ~ starts_with("color_1"):starts_with("color_2"))

I’ve also decided to include an interaction term between color_1 and color_2 to represent the “color scheme” of the outfit. This will allow the model to consider, for example, if an outfit has both red and green, or if it has black and white, and so on.
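To sanity-check what the recipe produces, we can prep and bake it and look for the interaction columns, which step_interact names by joining the two dummy columns with “_x_”. This is just a quick inspection, not part of the modeling pipeline:

# Peeking at the interaction columns the recipe creates
outfit_recipe %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  names() %>% 
  grep("_x_", ., value = TRUE) %>% 
  head()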

K-Fold Cross Validation

For our project, we will be using k-fold cross-validation with k = 10. This process splits the training data into k groups (folds); each fold takes a turn as the validation set while the model is fit on the remaining k - 1 folds, repeating the process k times. This is useful because averaging over the folds produces less variance in the estimates of the model’s performance.

set.seed(222)
outfits_fold <- vfold_cv(outfits_train, v=10)

Model Building

The models that we will be using for this data, as mentioned before, will be regression models. The following are the ones I’ve chosen to include:

  • Linear regression
    • The simplest of the models. Estimates a linear relationship between the predictors and the response, then uses that estimate to make predictions.
  • Elastic net
    • Similar to linear regression, but adds a “penalty” on the size of the coefficients. It’s “elastic” because it blends two different penalties, ridge and lasso.
  • K-nearest neighbors
    • Makes predictions by finding the k nearest observations in predictor space and taking the average of their responses as the prediction (see the toy sketch after this list).
  • Random Forest
    • Combines multiple decision trees and averages their outputs. Each tree arrives at a value by making a series of splits, such as “is the first color red?”, “is item 1 a jacket?”, “is the second item skinny jeans?”, and then predicts the average of the group it ends up in, perhaps 265 likes. In a forest, each tree asks different questions, which allows for a less biased overall answer.
  • Boosted Trees
    • Makes predictions by sequentially fitting many small/weak trees, each one correcting the errors of the ones before it, before arriving at the final prediction.
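As a toy illustration of the k-nearest-neighbors idea with a single numeric predictor (the actual fits below use the kknn engine through tidymodels; knn_toy is just an illustrative helper, not part of the pipeline):

# Predict a new point by averaging the responses of its k nearest neighbors
knn_toy <- function(x_train, y_train, x_new, k = 5) {
  dists <- abs(x_train - x_new)      # distance from each training point
  mean(y_train[order(dists)[1:k]])   # average response of the k closest
}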

First, we will set up the models. You’ll notice that some of the models have tune() passed to some of their parameters; this lets us tune those parameters later to find the values that produce the best predictions.

# Linear Model
lm_model <- linear_reg() %>% 
  set_engine("lm")


# K-nearest neighbors, tuning neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn")


# Elastic net, tuning penalty and mixture
elastic_model <- linear_reg(penalty = tune(), 
                           mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")


# Random forest, tuning mtry, trees, and min_n
rf_model <- rand_forest(mtry = tune(), 
                       trees = tune(), 
                       min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")


# Boosted trees, tuning trees, learn_rate, and mtry
boosted_model <- boost_tree(trees = tune(),
                           learn_rate = tune(),
                           mtry = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

Next, we will set up the workflows for each model.

lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(outfit_recipe)


knn_wflow <- workflow() %>% 
  add_model(knn_model) %>% 
  add_recipe(outfit_recipe)


elastic_wflow <- workflow() %>% 
  add_recipe(outfit_recipe) %>% 
  add_model(elastic_model)


rf_wflow <- workflow() %>% 
  add_recipe(outfit_recipe) %>% 
  add_model(rf_model)


boosted_wflow <- workflow() %>% 
  add_recipe(outfit_recipe) %>% 
  add_model(boosted_model)

Now we’ll make tuning grids for the models with tuning parameters. These tuning grids specify the ranges of values we want to test for each parameter being tuned. This is important because for models like knn, a higher value of k will lead to a less flexible model and a lower value will be more flexible. We don’t know how much flexibility will work the best, so tuning allows us to find out.

knn_grid <- grid_regular(neighbors(range = c(10,25)), levels = 5)


elastic_grid <- grid_regular(penalty(),
                        mixture(range = c(0, 1)),
                             levels = 10)


rf_grid <- grid_regular(mtry(range = c(3, 5)), 
                        trees(range = c(10, 150)),
                        min_n(range = c(1, 20)),
                        levels = 8)

boosted_grid <- grid_regular(mtry(range = c(2, 4)), 
                        trees(range = c(10, 100)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)

And now the part where my PC takes a minor beating: tuning.

knn_tune <- tune_grid(
    knn_wflow,
    resamples = outfits_fold,
    grid = knn_grid
)

elastic_tune <- tune_grid(
  elastic_wflow,
  resamples = outfits_fold,
  grid = elastic_grid
)

rf_tune <- tune_grid(
  rf_wflow,
  resamples = outfits_fold,
  grid = rf_grid
)

boosted_tune <- tune_grid(
  boosted_wflow,
  resamples = outfits_fold,
  grid = boosted_grid
)

6 minutes and 39 seconds! Not bad! Now, I’ll save the results and load them back in so I don’t have to donate another 6 minutes and 39 seconds of my future.

save(knn_tune, file = "tunings/knn_tune.rda")
save(elastic_tune, file = "tunings/elastic_tune.rda")
save(rf_tune, file = "tunings/rf_tune.rda")
save(boosted_tune, file = "tunings/boosted_tune.rda")
load("tunings/knn_tune.rda")
load("tunings/elastic_tune.rda")
load("tunings/rf_tune.rda")
load("tunings/boosted_tune.rda")

BAM! Models are tuned!* We now want to find the root mean square error (RMSE) for each model. The RMSE is a measure of the differences between predicted values and actual values, so this will be useful in comparing the models’ performances.

* Linear regression hasn’t been tuned since it doesn’t have tuning parameters… I bet he feels so alone right now… It’s okay though. He will still be fit.
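For reference, RMSE is just the square root of the average squared prediction error; a minimal hand-rolled version would look like this (yardstick computes it for us):

# RMSE by hand: the square root of the mean squared difference
rmse_manual <- function(truth, estimate) sqrt(mean((truth - estimate)^2))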

lm_fit <- fit_resamples(lm_wflow, resamples = outfits_fold)
save(lm_fit, file = "tunings/lm_fit.rda")

Training Performance

Now we can look at the performance of the model on the training data. First, we’ll look at the RMSE and then we’ll look at some of the autoplots of the tuned model.

load("tunings/lm_fit.rda")

lm_rmse <- collect_metrics(lm_fit) %>% 
  filter(.metric == 'rmse') %>% 
  pull(mean) %>% round(2)

knn_rmse <- show_best(knn_tune, metric = 'rmse', n = 1)$mean %>% round(2)

elastic_rmse <- show_best(elastic_tune, metric = 'rmse', n = 1)$mean %>% round(2)

rf_rmse <- show_best(rf_tune, metric = 'rmse', n = 1)$mean %>% round(2)

boosted_rmse <- show_best(boosted_tune, metric = 'rmse', n = 1)$mean %>% round(2)

rmse_table <- data.frame(model = c('Linear Regression',
                                   'KNN','Elastic Net', 
                                   'Random Forest', 
                                   'Boosted Trees'),
                         RMSE = c(lm_rmse,  knn_rmse, 
                                  elastic_rmse, rf_rmse, 
                                  boosted_rmse))

rmse_table

Comparing the RMSE values, we can see that the elastic net model had the best performance and k-nearest neighbors had the worst. Now we’ll look at some of the autoplots to see how the different tunings performed.

autoplot(elastic_tune, metric='rmse') +
  labs(title='Elastic') +
  theme_polyvore()

We can see from the elastic net plot that a penalty of 1 combined with a mixture of 1 (a pure lasso penalty) gives the best results. In other words, the model that performed the best was effectively a lasso regression.

autoplot(rf_tune, metric='rmse') +
  labs(title='Random Forest') +
  theme_polyvore()

The autoplot for the random forest tells us that 120 trees with mtry = 5 seems to perform best. This makes sense: more trees stabilize the averaged predictions, and a larger mtry helps when there are a lot of predictors and categories within those predictors.

autoplot(boosted_tune, metric='rmse') +
  labs(title='Boosted Trees') +
  theme_polyvore()

The learning rate of 0.1 worked the best by far, and 100 trees was also the best, though the differences across tree counts are not very dramatic. This indicates that a faster learning rate was beneficial for the model.

Fitting and Testing the Models

It’s finally time to fit our final models onto the training data. I will be fitting the elastic net model since it did the best, and the boosted trees since it performed similarly (and because I’m curious about the tree models).

best_elastic <- select_best(elastic_tune, metric = 'rmse')
elastic_final_wflow <- finalize_workflow(elastic_wflow, best_elastic)
elastic_fit <- fit(elastic_final_wflow, data = outfits_train)

best_boosted <- select_best(boosted_tune, metric = 'rmse')
boosted_final_wflow <- finalize_workflow(boosted_wflow, best_boosted)
boosted_fit <- fit(boosted_final_wflow, data = outfits_train)

The models have been fit; it’s now time for testing…

# Test-set RMSE for the elastic net
augment(elastic_fit, new_data = outfits_test) %>% 
  rmse(truth = likes, estimate = .pred)

# Saving the elastic net's test predictions alongside the test data
elastic_test_col <- predict(elastic_fit, new_data = outfits_test) 
elastic_test_res <- bind_cols(elastic_test_col, outfits_test)

# Test-set RMSE for the boosted trees
augment(boosted_fit, new_data = outfits_test) %>% 
  rmse(truth = likes, estimate = .pred)

# Saving the boosted trees' test predictions alongside the test data
boosted_test_col <- predict(boosted_fit, new_data = outfits_test) 
boosted_test_res <- bind_cols(boosted_test_col, outfits_test)

Both models perform pretty similarly on the testing data, with the elastic net model still beating out the boosted trees model. Both RMSEs end up being around 500, which is not great given the distribution of likes. Without looking at the actual predictions, I would guess that the model might just be low-balling all of its predictions, since the distribution of likes is right-skewed. To see if this is the case, we’ll check out a graph of the predicted likes against the actual number of likes.

elastic_test_res %>% 
  ggplot(aes(x = .pred, y = likes)) +
  geom_point(color = '#643F4B', alpha = .3) +
  geom_abline(lty = 'longdash') +
  theme_polyvore()

Even though it’s clear the model didn’t do well, I’m actually really glad to see that there is somewhat of a positive relationship in the graph. This means that there is some information in the data that could help decide if an outfit gets more likes, and perhaps that’s all you need to bridge the gap between “fashion disaster” and “acceptable”.
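One quick way to put a number on that relationship (value omitted here) is the correlation between the predicted and actual likes:

# Correlation between predicted and actual likes on the test set
cor(elastic_test_res$.pred, elastic_test_res$likes)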

I’m curious to see what variables contribute the most to the model, so we will look at a variable importance plot for the boosted tree model.

colors <- c('#F4A4A4', '#682735', '#556177',
            '#F4A4A4', '#682735', '#556177',
            '#F4A4A4', '#682735', '#556177',
            '#F4A4A4')
boosted_fit %>% extract_fit_parsnip() %>% 
  vip(aesthetics = list(fill = colors)) +
  theme_polyvore() +
  theme(axis.line = element_blank())

Interestingly enough, item_1_sweaters and item_6_sunglasses topped the list of variable importance. I’ll be honest: I didn’t see that coming. All of the variables in this list are item dummies, so maybe the takeaway here is that the items you choose to include in your outfit matter more than the colors.

Conclusion

This project was definitely a labor of love. I’m glad I got to see it through even if the models weren’t great at predicting the number of likes of a particular outfit. The elastic net model (effectively a lasso regression) ended up being the best model. I think this might be because lasso regression can shrink coefficients all the way to zero when their variables don’t help the model enough; after dummy coding, the data includes a lot of variables, so it’s probably good to trim them back. The fact that KNN was the worst performer makes sense for the same reason: too many predictors. With a high number of predictors, the data points become sparse in the high-dimensional space (the curse of dimensionality), so the “nearest” neighbors aren’t actually very near, and averaging them isn’t very informative.

One of the most engaging parts of this project was making decisions about how to transform the data to make it work with the machine learning tools at hand. This is also where I see a lot of room for future improvement. In particular, I would want to tackle the large number of categorical variables in a different way. One approach I was tempted to take was to manually collapse the item categories into simpler ones, such as shirts, pants, dresses, and jeans; this could work better if it loses less information than lumping everything into an “Other” category (a sketch of what that might look like follows below).
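A minimal sketch of that idea with forcats, where the groupings are purely illustrative and would need real thought:

# Hypothetical manual collapsing of item_1 (groupings are illustrative)
outfits_collapsed <- outfits %>% 
  mutate(item_1 = fct_collapse(item_1,
    Tops    = c("Tops", "T-Shirts", "Blouses", "Sweaters"),
    Dresses = c("Day Dresses"),
    other_level = "Other"
  ))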

One part of the data where I potentially lost a lot of information was in extracting the color information from the item names. In the future, I want to see if using semantic similarity or text analysis of some kind could extract better information that would lead to improved predictions.

I’m glad that I had the opportunity to explore this data set and explore machine learning through it. I’ve become familiar with R, the tidyverse, and tidymodels to a degree that has really surprised me. And now, for my final question: how many likes will my current outfit get, according to my model?

my_outfit <- data.frame('views' = 0, 'likes' = 0, 'set_id' = 1,
                        'item_1' = 'Sweaters', 'item_2' = 'Pants',
                        'item_3' = 'Other', 'item_4' = 'Other',
                        'item_5' = 'None', 'item_6' = 'None',
                        'item_7' = 'None', 'item_8' = 'None',
                        'color_1' = 'red', 'color_2' = 'gray',
                        'color_3' = 'white', 'color_4' = 'none')
# Matching my_outfit's character columns to the factor levels in outfits
for (col in names(my_outfit)) {
  if (is.character(my_outfit[[col]])) {
    my_outfit[[col]] <- factor(my_outfit[[col]], levels = levels(outfits[[col]]))
  }
}

predict(elastic_fit, new_data = my_outfit)

Not to brag, but my model thinks my outfit will get 331 likes.