The goal of this project is to find a machine learning model that can take in “visual” predictors, such as color or type of clothing, and output a predicted number of likes.
The data set I am working with is the Polyvore Dataset, which was gathered for the paper “Learning Fashion Compatibility with Bidirectional LSTMs” by Xintong Han, Zuxuan Wu, Yu-Gang Jiang, and Larry S. Davis. The GitHub page containing the data set also includes a link to the paper.
The files containing the data have been pre-separated into training, validation, and testing sets, containing 17,316, 1,497, and 3,076 observations, respectively.
A brief overview of the website where this data comes from: Polyvore.com was a website where people could put together, upload, and share outfits using real clothing items (it has since been acquired and subsequently repurposed for selling clothes). Posts would look something like this:
Each observation in the data set is one of these outfits, which includes general information such as the view count, likes, and a link to the image, as well as each article of clothing/accessory making up the outfit. Each piece in the outfit has its own info, including the index (position in the outfit from top to bottom), name, price, and likes.
The predictors include up to 8 different outfit elements, with each element having 6 attributes. In other words, there can be up to 48 different predictors. In order to make this data usable for the machine learning tools I will be using, I need to do some pre-processing. I’ve decided to do this step in Python, and there is a decent amount of code involved. I’ll go over the basic steps taken to arrive at the final data frame, but the Jupyter notebook with the full process is available on my GitHub page if you’re interested.
First, I imported the data set and merged the pre-split testing, training, and validation sets so I can have all of the data in one set. Looking at the data frame produced from the JSON data, we can see that the items column contains dictionaries as its entries.
In order to “unpack” this column, I made a function that takes the items column as input and outputs a new data frame containing all of the information expanded out into columns.
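As a rough illustration of that unpacking step, here is a minimal sketch in plain Python. It is not the code from the notebook: the helper name `unpack_items`, the field names (`index`, `name`, `price`, `likes`, `categoryid`), and the example record are all assumptions made for this example, and the real version works on a data frame rather than on dictionaries.

```python
MAX_ITEMS = 8  # outfits in the data set have at most 8 items

def unpack_items(outfit, fields=("name", "price", "likes", "categoryid")):
    """Flatten one outfit's list of item dicts into item_<n>_<field> columns."""
    row = {"set_id": outfit["set_id"], "likes": outfit["likes"]}
    # Sort by position so item_1 is the first item in the outfit
    items = sorted(outfit["items"], key=lambda item: item["index"])
    for slot in range(1, MAX_ITEMS + 1):
        item = items[slot - 1] if slot <= len(items) else {}
        for field in fields:
            # Pad empty slots with the string "None" instead of a missing value
            row[f"item_{slot}_{field}"] = item.get(field, "None")
    return row

# Hypothetical record, made up for illustration (ids are not real category ids)
example = {
    "set_id": "123",
    "likes": 40,
    "items": [
        {"index": 2, "name": "blue skinny jeans", "price": 60.0, "likes": 5, "categoryid": 27},
        {"index": 1, "name": "red wool sweater", "price": 45.0, "likes": 12, "categoryid": 19},
    ],
}
row = unpack_items(example)
print(row["item_1_name"], "|", row["item_3_name"])
```

Each outfit becomes one flat row, so a list of these rows can be turned straight into a data frame with one column per item slot and attribute.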
I merged this data frame to the original and dropped other columns I knew I wasn’t going to use. The next thing I needed to do was extract information from the item names, and I decided to focus on color. I made a function that looks at the words in all of the item name columns and outputs a list of the unique colors it found from the text.
I applied this function to the data frame and turned the unique colors into 4 columns, color_1, color_2, color_3, and color_4. I then dropped the item name columns since I no longer needed them.
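To give a feel for what the color extraction looks like, here is a small sketch in plain Python. The color vocabulary in `COLORS` and the function name `extract_colors` are assumptions for illustration; the list and matching rules in the actual notebook may differ.

```python
import re

# Assumed color vocabulary; the list used in the notebook may differ
COLORS = ["black", "white", "red", "blue", "green", "yellow",
          "orange", "purple", "brown", "gray"]
MAX_COLORS = 4

def extract_colors(item_names):
    """Return up to MAX_COLORS unique colors, in order of first appearance."""
    found = []
    for name in item_names:
        for word in re.findall(r"[a-z]+", name.lower()):
            if word in COLORS and word not in found:
                found.append(word)
    found = found[:MAX_COLORS]
    # Pad so every outfit ends up with exactly MAX_COLORS color columns
    return found + ["none"] * (MAX_COLORS - len(found))

names = ["Red wool sweater", "blue skinny jeans", "Red leather pumps", "None"]
print(extract_colors(names))
```

Because colors are collected in the order the items appear, the first color found comes from the earliest item in the outfit.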
The original Polyvore Dataset came with a text file, category_id.txt, which includes the category ids and the item types they correspond to. I used this file to map the item ids to their item types to make the data more readable.
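The mapping step could look something like the sketch below. It assumes each line of category_id.txt has the form "id name", which is my assumption about the file layout; the sample lines and ids are made up, and `load_category_map` is a hypothetical helper.

```python
def load_category_map(lines):
    """Parse 'id name' lines into a dict mapping id -> item type."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, since names can contain spaces
        cat_id, name = line.split(maxsplit=1)
        mapping[int(cat_id)] = name
    return mapping

# Stand-in for the contents of category_id.txt (ids and names here are made up)
sample_lines = ["11 Sweaters", "27 Skinny Jeans", "", "41 Ankle Booties"]
category_map = load_category_map(sample_lines)

item_ids = [27, 11, 99]
print([category_map.get(i, "Unknown") for i in item_ids])
```

With the real file, `sample_lines` would be replaced by the lines read from category_id.txt.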
Finally, I exported the data frame as a CSV file, which I will be importing into R.
You might have noticed in the pre-processing step that there are some colors and item names that are “none”. The reason for this is that in the original data set, the nested lists are not all the same length; the outfits can each have 4 to 8 items. For the ones with fewer than 8 items, I had to backfill them, but I didn’t want to treat it as “missing” since the information that the outfit only has 4 items can be useful. The same goes for the color columns: some outfits only had 1, 2 or 3 unique colors. Again, the information that there’s only one color present in the outfit can be useful, so I’ve kept “none” as its own category.
There were no missing data in the original data set, but there were some outfits that I couldn’t “extract” color from with my function, so I dropped them from the final set.
Now that we’ve tidied the data in a format that will be better suited for our analysis, we can begin by making some plots and seeing how the data is distributed.
First we will import the data and convert the appropriate variables into factors.
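# Packages used throughout (theme_polyvore() is a custom ggplot2 theme not shown here)
library(tidyverse)   # dplyr, ggplot2, forcats, %>%
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune
library(gridExtra)   # grid.arrange()
library(gt)          # gt() tables
library(vip)         # vip() variable importance plots
# Reading the tidied data into R
outfits <- read.csv('polyvore_data/tidy_data.csv')
# converting all character columns into factors
outfits[sapply(outfits, is.character)] <- lapply(outfits[sapply(outfits, is.character)], as.factor)
# I will only be using set_id to retrieve photos, so we will leave it be
outfits %>% head(5)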
In the final data set, there are 15 variables:
Name | Variable description |
---|---|
views | The number of times an outfit has been viewed |
likes | The number of likes an outfit received (the response variable) |
set_id | Identifies the particular outfit |
item_1 - item_8 | Gives the name of the item type for up to 8 items in an outfit |
color_1 - color_4 | Gives the unique colors found in the outfit for up to 4 different colors |
We will start by looking at the distribution of our response variable, likes.
outfits %>%
ggplot(aes(x = likes)) +
geom_histogram(binwidth = 100, fill="#2C514C") +
labs(x = 'Number of Likes', y = 'Count', title = 'Distribution of the Number of Likes') +
coord_cartesian(xlim = c(0, 4000)) +
theme_polyvore()
The number of likes ranges from 0 to around 3000, and the frequency falls off sharply (roughly exponentially) as the number of likes goes up. This follows the sort of logic you’d expect, where a select few get lots of likes while most go unseen.
After an initial survey of the raw data, I noted that there were a lot of different item ids.
# Finds the number of unique items in item_1 column
outfits$item_1 %>% unique() %>% length()
## [1] 229
And indeed, there are 229 unique items in item_1 alone! To further examine the nature of the item columns, we will look at a plot of the item frequencies in item_1. Additionally, we’ll do a “zoom in” of the top 10 items by frequency in item_1.
outfits %>% ggplot(aes(x = fct_infreq(item_1))) +
geom_bar(fill = '#F08080') +
labs(x = 'Item 1', y='', title = 'Frequencies of item_1 Items') +
theme_polyvore() +
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
axis.line = element_blank()) -> p1
item1_t10 <- (sort(table(outfits$item_1), decreasing = TRUE) %>% names)[1:10]
outfits_item1_top10 <- outfits[outfits$item_1 %in% item1_t10,]
outfits_item1_top10 %>%
ggplot(aes(y = fct_infreq(item_1))) +
geom_bar(fill="#583742") +
labs(x = '', y='', title = 'Top 10 item_1 Items') +
theme_polyvore() +
theme(axis.line = element_blank()) -> p2
grid.arrange(p1, p2, ncol=2)
The first plot shows that occurrences of the 229 items in item_1 are concentrated towards the top few items and fall off significantly after. We can see from the second plot that the top items of item_1 are all tops, dresses, and jackets of some type. This similarity gives us an idea of how the items have been organized, and we can get a better picture from constructing a table.
# Top 5 most frequent item types in each of the 8 item columns
item_cols <- paste0("item_", 1:8)
item_freq_table <- sapply(outfits[item_cols], function(col) {
  names(sort(table(col), decreasing = TRUE))[1:5]
}) %>% as.data.frame()
item_freq_table %>% gt() %>%
tab_header(
title = "Top 5 Items by Frequency"
)
Top 5 Items by Frequency

item_1 | item_2 | item_3 | item_4 | item_5 | item_6 | item_7 | item_8 |
---|---|---|---|---|---|---|---|
Tops | Jackets | Sandals | Shoulder Bags | None | None | None | None |
Sweaters | Coats | Ankle Booties | Ankle Booties | Earrings | Earrings | Sunglasses | Lipstick |
Day Dresses | Shorts | Sneakers | Clutches | Shoulder Bags | Necklaces | Earrings | Sunglasses |
T-Shirts | Skinny Jeans | Pumps | Handbags | Necklaces | Sunglasses | Lipstick | Fragrance |
Blouses | Knee Length Skirts | Skinny Jeans | Pumps | Bracelets & Bangles | Bracelets & Bangles | Tech Accessories | Tech Accessories |
From looking at both the raw data and this table, a general pattern emerges: within each outfit, the items are sorted from most to least significant, with item_1 being the most significant item and item_8 being the least. The top items in item_2 are still significant, with jackets being the most frequent and “bottoms” being common elements as well (shorts, jeans, skirts). item_3 dips into shoe territory, and item_4 through item_8 can be seen mostly as the “accessory section”. Another thing that is interesting to note from the table is that from item_5 onward, “None” is the most frequent item. We can check the proportion of “None”s pretty easily.
# Proportion of outfits where item_4, ..., item_8 is 'None',
# i.e. outfits with fewer than 4, 5, 6, 7, and 8 items
props <- unname(sapply(paste0("item_", 4:8),
                       function(col) mean(outfits[[col]] == "None")))
props
## [1] 0.00000000 0.08148734 0.23016066 0.40670643 0.59554528
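With some simple arithmetic (taking the differences between consecutive proportions), this gives us a breakdown of outfits by number of items:

Number of items | Proportion of outfits |
---|---|
4 | 8.1% |
5 | 14.9% |
6 | 17.7% |
7 | 18.9% |
8 | 40.4% |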
Now we want to see what relationship the items have with likes. Before doing so, I will transform the data so that we only have 10 different categories, which will be the top 9 occurrences plus another category, ‘other’.
outfits2 <- outfits %>% mutate(item_1 = fct_lump_n(outfits$item_1, n=9))
colors <- c('#335C57', '#583742', '#F08080',
'#335C57', '#583742', '#F08080',
'#335C57', '#583742', '#F08080',
'#335C57')
outfits2 %>%
ggplot(aes(x=likes, y=item_1, group=item_1)) +
geom_boxplot(fill = colors) +
labs(y = '', title = 'Boxplot of item_1') +
theme_polyvore() +
theme(axis.title.y = element_blank()) -> p1
outfits2 %>%
ggplot(aes(x=likes, y=item_1, group=item_1)) +
geom_boxplot(fill = colors) +
scale_x_continuous(limits = c(0, 600)) +
labs(y = '', title = 'Boxplot of item_1 (zoomed)') +
theme_polyvore() +
theme(axis.title.y = element_blank()) -> p2
grid.arrange(p1, p2, ncol = 2)
In the first boxplot, we can see the full range of likes. It’s hard to glean any information from the boxes themselves, but we can see differences in the spread of outliers for each category. In the second boxplot, I’ve “zoomed in” by limiting the x axis to 600 likes so we can see the boxes better. Note that scale_x_continuous(limits = ...) drops outfits above the limit before the boxes are computed, so these boxes describe outfits with at most 600 likes. We can see here that sweaters and day dresses have the highest median number of likes, while tank tops trail behind all of the other categories.
All of us understand that color plays an essential role in whether or not an outfit “works” (unless you’re doing wardrobe for a 1947 film noir), so it will be interesting to see what kind of colors are represented in the data.
colors1 <- c('#fb5c5c', '#4a687d', '#ffdfbf',
'gray16', '#9b938c', '#e2ba59', '#628d55',
'#643b34', '#e3774c', '#6e4258')
outfits %>%
ggplot(aes(y = fct_infreq(color_1))) +
geom_bar(fill = colors1) +
labs(x = '', y = '', title = 'Frequencies of Colors in color_1') +
theme_polyvore() +
theme(axis.line = element_blank(),
panel.grid.major = element_blank()) -> p1
colors2 <- c('gray16', '#4e4c5e', '#ffdfbf', '#e2ba59',
'#9b938c', '#643b34', '#4a687d', '#628d55',
'#fb5c5c', '#e3774c', '#6e4258')
outfits %>%
ggplot(aes(y = fct_infreq(color_2))) +
geom_bar(fill = colors2) +
labs(x = '', y = '', title = 'Frequencies of Colors in color_2') +
theme_polyvore() +
theme(axis.line = element_blank(),
panel.grid.major = element_blank()) -> p2
colors3 <- c('#4e4c5e', 'gray16', '#ffdfbf', '#e2ba59',
'#9b938c', '#643b34', '#4a687d', '#6e4258',
'#fb5c5c', '#628d55', '#e3774c')
outfits %>%
ggplot(aes(y = fct_infreq(color_3))) +
geom_bar(fill = colors3) +
labs(x = '', y = '', title = 'Frequencies of Colors in color_3') +
theme_polyvore() +
theme(axis.line = element_blank(),
panel.grid.major = element_blank()) -> p3
colors4 <- c('#4e4c5e', 'gray16', '#9b938c', '#fb5c5c',
'#e2ba59', '#4a687d', '#ffdfbf', '#6e4258',
'#e3774c', '#643b34', '#628d55')
outfits %>%
ggplot(aes(y = fct_infreq(color_4))) +
geom_bar(fill = colors4) +
labs(x = '', y = '', title = 'Frequencies of Colors in color_4') +
theme_polyvore() +
theme(axis.line = element_blank(),
panel.grid.major = element_blank()) -> p4
grid.arrange(p1, p2, p3, p4, ncol=2)
The color columns are similar to the item columns in the way they are ordered: color_1 will have higher importance than color_2 (and so on) in the outfit due to the way the information was extracted. This gives us some insight into the colors represented in these charts.
Red clearly reigns supreme for color_1, followed by blue, white, and black. We start to see “none” dominate starting in color_2, indicating that there are lots of outfits for which only one color could be extracted (or that had a monochromatic scheme going on). In color_4, we see that most of the outfits simply don’t have a color entry. This could mean that most outfits stuck to 1-3 colors.
Next, we’ll take a look at the relationship between color and likes.
colors <- c('gray16', '#4a687d', '#643b34',
'#9b938c', '#628d55', '#e3774c', '#6e4258',
'#fb5c5c', '#ffdfbf', '#e2ba59')
outfits %>%
ggplot(aes(x=likes, y=color_1, group=color_1)) +
geom_boxplot(fill=colors) +
labs(title = 'Boxplot of color_1') +
theme_polyvore() +
theme(axis.title.y = element_blank()) -> p1
outfits %>%
ggplot(aes(x=likes, y=color_1, group=color_1)) +
geom_boxplot(fill=colors) +
scale_x_continuous(limits = c(0, 500)) +
labs(title = 'Boxplot of color_1 (zoomed)') +
theme_polyvore() +
theme(axis.title.y = element_blank()) -> p2
grid.arrange(p1, p2, ncol=2)
The left graph shows similar behavior to the item_1 boxplot, with all of the colors being concentrated near zero but having different spreads of outliers into the higher numbers of likes. On the right boxplot, which is limited to outfits with at most 500 likes, we can see that orange has the highest median number of likes, while red and blue are towards the low end. This is interesting since red and blue were the most frequent colors and orange was the second least frequent, so this could potentially mean that being less common gives outfits with orange a slight edge.
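With the information gathered from the exploratory data analysis, we can start fitting models. The key steps are splitting the data, building a recipe, setting up cross-validation folds, specifying and tuning the models, and finally fitting the best models and evaluating them on the testing data.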
Regression models will be used for this case because the response variable, likes, is numeric (we also have an interesting situation where all of our predictors are categorical). Because there are so many different categories in the item columns, this could lead to a problem of overfitting. For now, we will keep each of these columns to 10 categories.
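Here is the idea behind keeping only the most frequent categories, sketched in plain Python. The function `lump_top_n` is a hypothetical stand-in for what forcats’ fct_lump_n does in R, and it ignores details such as how ties are handled.

```python
from collections import Counter

def lump_top_n(values, n=9, other="Other"):
    """Keep the n most frequent values and replace everything else with `other`."""
    counts = Counter(values)
    keep = {value for value, _ in counts.most_common(n)}
    return [v if v in keep else other for v in values]

items = ["Tops", "Tops", "Tops", "Sweaters", "Sweaters", "Blouses", "Capes"]
print(lump_top_n(items, n=2))
```

With n = 9, each item column ends up with at most 10 levels: the 9 most frequent item types plus the catch-all level.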
Without further ado, we will start by splitting our data into training and testing sets. I’ve decided to split it into 70% training and 30% testing, stratifying on the likes variable. Stratifying ensures that the training and testing sets have roughly the same distribution of likes. If the distributions were different, the training data could skew the models one way, leading to poor performance on the testing data.
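To show what stratifying on a numeric variable amounts to, here is a rough plain-Python sketch: sort by the response, cut it into quantile bins, and take 70% from each bin. The function name, the choice of four bins, and the toy data are assumptions for illustration, not a description of how rsample implements it.

```python
import random

def stratified_split(rows, key, prop=0.7, n_bins=4, seed=222):
    """Split rows into train/test, sampling `prop` of each quantile bin of `key`."""
    rng = random.Random(seed)
    ordered = sorted(rows, key=key)
    train, test = [], []
    bin_size = len(ordered) / n_bins
    for b in range(n_bins):
        # Slice out one quantile bin of the sorted rows
        chunk = ordered[round(b * bin_size):round((b + 1) * bin_size)]
        rng.shuffle(chunk)
        cut = round(prop * len(chunk))
        train.extend(chunk[:cut])
        test.extend(chunk[cut:])
    return train, test

likes = list(range(200))  # toy response values
train, test = stratified_split(likes, key=lambda x: x)
print(len(train), len(test))
```

Because every bin contributes the same share to each side, low-like and high-like outfits are represented in both sets in similar proportions.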
# Setting each item column to its top 9 categories; the rest become "Other"
outfits <- outfits %>%
  mutate(across(starts_with("item_"), ~ fct_lump_n(.x, n = 9)))
set.seed(222)
outfits_split <- initial_split(outfits, prop=.70,
strata=likes)
outfits_train <- training(outfits_split)
outfits_test <- testing(outfits_split)
The next step is to make a recipe that our models will use to fit the data. In our case, this is pretty analogous to constructing an outfit. Which colors should I include and how many? Should I wear a coat? What about boots? We can get this information from item_1 through item_8 and color_1 through color_4. I’m excluding views because even though it’s probably a good predictor, it isn’t really something you consider when picking out an outfit.
outfit_recipe <- recipe(likes ~ item_1 + item_2 + item_3 + item_4 +
item_5 + item_6 + item_7 + item_8 +
color_1 + color_2 + color_3 + color_4,
data = outfits_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(terms = ~ starts_with("color_1"):starts_with("color_2"))
I’ve also decided to include an interaction term between color_1 and color_2 to represent the “color scheme” of the outfit. This will allow the model to consider, for example, if an outfit has both red and green, or if it has black and white, and so on.
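For a concrete picture of what such an interaction looks like after dummy coding, here is a small plain-Python sketch. The shortened color list and helper names are made up; the column naming just mimics the `_x_` separator that step_interact uses.

```python
def dummy(value, levels):
    """One-hot encode `value`, dropping the first level as the reference."""
    return {level: int(value == level) for level in levels[1:]}

def interaction_features(color_1, color_2, levels):
    """Product of every color_1 dummy with every color_2 dummy."""
    d1, d2 = dummy(color_1, levels), dummy(color_2, levels)
    return {f"color_1_{a}_x_color_2_{b}": d1[a] * d2[b] for a in d1 for b in d2}

levels = ["black", "green", "red"]  # assumed, shortened list of colors
feats = interaction_features("red", "green", levels)
print([name for name, value in feats.items() if value == 1])
```

Each interaction column is 1 only when that specific pair of colors occurs together, which is how the model gets to "see" a color scheme rather than two separate colors.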
For our project, we will be using k-fold cross validation with k = 10. This process splits the training data into k groups (folds), holds one fold out as the validation set, fits the model on the remaining k - 1 folds, and repeats until each fold has served as the validation set once. Averaging the k results gives a less variable estimate of the model’s performance than a single validation split would.
set.seed(222)
outfits_fold <- vfold_cv(outfits_train, v=10)
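The fold bookkeeping behind k-fold cross validation can be sketched in a few lines of plain Python. This is only an illustration of the idea: `k_fold_indices` is a hypothetical helper, and unlike vfold_cv it does not shuffle the rows first.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, val_idx) pairs; each row lands in exactly one validation fold."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        # Training rows are everything outside the current validation fold
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(25, k=5))
print(len(splits), [len(val) for _, val in splits])
```

A model would be fit on each `train_idx` and scored on the matching `val_idx`, and the k scores averaged.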
The models that we will be using for this data, as mentioned before, will be regression models. The ones I’ve chosen to include are linear regression, k-nearest neighbors, elastic net, random forest, and boosted trees.
First, we will set up the models. You’ll notice that some of the models have tune() included in some of the parameters. This allows us to tune that parameter to find the value that produces the best predictions.
# Linear Model
lm_model <- linear_reg() %>%
set_engine("lm")
# K-nearest neighbors, tuning neighbors
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
# Elastic net, tuning penalty and mixture
elastic_model <- linear_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Random forest, tuning mtry, trees, and min_n
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
# Boosted trees, tuning trees, learn_rate, and mtry
boosted_model <- boost_tree(trees = tune(),
learn_rate = tune(),
mtry = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
Next, we will set up the workflows for each model.
lm_wflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(outfit_recipe)
knn_wflow <- workflow() %>%
add_model(knn_model) %>%
add_recipe(outfit_recipe)
elastic_wflow <- workflow() %>%
add_recipe(outfit_recipe) %>%
add_model(elastic_model)
rf_wflow <- workflow() %>%
add_recipe(outfit_recipe) %>%
add_model(rf_model)
boosted_wflow <- workflow() %>%
add_recipe(outfit_recipe) %>%
add_model(boosted_model)
Now we’ll make tuning grids for the models with tuning parameters. These tuning grids specify the ranges of values we want to test for each parameter being tuned. This is important because for models like knn, a higher value of k will lead to a less flexible model and a lower value will be more flexible. We don’t know how much flexibility will work the best, so tuning allows us to find out.
knn_grid <- grid_regular(neighbors(range = c(10,25)), levels = 5)
elastic_grid <- grid_regular(penalty(),
mixture(range = c(0, 1)),
levels = 10)
rf_grid <- grid_regular(mtry(range = c(3, 5)),
trees(range = c(10, 150)),
min_n(range = c(1, 20)),
levels = 8)
# learn_rate() ranges are on the log10 scale, so this covers 1e-10 to 0.1
boosted_grid <- grid_regular(mtry(range = c(2, 4)),
                             trees(range = c(10, 100)),
                             learn_rate(range = c(-10, -1)),
                             levels = 5)
And now the part where my PC takes a minor beating: tuning.
knn_tune <- tune_grid(
knn_wflow,
resamples = outfits_fold,
grid = knn_grid
)
elastic_tune <- tune_grid(
elastic_wflow,
resamples = outfits_fold,
grid = elastic_grid
)
rf_tune <- tune_grid(
rf_wflow,
resamples = outfits_fold,
grid = rf_grid
)
boosted_tune <- tune_grid(
boosted_wflow,
resamples = outfits_fold,
grid = boosted_grid
)
6 minutes and 39 seconds! Not bad! Now, I’ll save the results and load them back in so I don’t have to donate another 6 minutes and 39 seconds of my future.
save(knn_tune, file = "tunings/knn_tune.rda")
save(elastic_tune, file = "tunings/elastic_tune.rda")
save(rf_tune, file = "tunings/rf_tune.rda")
save(boosted_tune, file = "tunings/boosted_tune.rda")
load("tunings/knn_tune.rda")
load("tunings/elastic_tune.rda")
load("tunings/rf_tune.rda")
load("tunings/boosted_tune.rda")
BAM! Models are tuned!* We now want to find the root mean square error (RMSE) for each model. The RMSE is a measure of the differences between predicted values and actual values, so this will be useful in comparing the models’ performances.
* Linear regression hasn’t been tuned since it doesn’t have tuning parameters… I bet he feels so alone right now… It’s okay though. He will still be fit.
lm_fit <- fit_resamples(lm_wflow, resamples = outfits_fold)
save(lm_fit, file = "tunings/lm_fit.rda")
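For reference, the RMSE is simple enough to compute by hand. The sketch below is plain Python with made-up numbers, not output from the models.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [0, 100, 200]      # made-up like counts
predicted = [30, 60, 250]   # made-up predictions
print(round(rmse(actual, predicted), 2))
```

Because the errors are squared before averaging, a few badly missed high-like outfits can dominate the RMSE.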
Now we can look at the performance of the models on the training data. First, we’ll look at the RMSE and then we’ll look at some of the autoplots of the tuned models.
load("tunings/lm_fit.rda")
lm_rmse <- collect_metrics(lm_fit)$mean[1] %>% round(2)
knn_rmse <- show_best(knn_tune, metric = 'rmse', n = 1)$mean %>% round(2)
elastic_rmse <- show_best(elastic_tune, metric = 'rmse', n = 1)$mean %>% round(2)
rf_rmse <- show_best(rf_tune, metric = 'rmse', n = 1)$mean %>% round(2)
boosted_rmse <- show_best(boosted_tune, metric = 'rmse', n = 1)$mean %>% round(2)
rmse_table <- data.frame(model = c('Linear Regression',
'KNN','Elastic Net',
'Random Forest',
'Boosted Trees'),
RMSE = c(lm_rmse, knn_rmse,
elastic_rmse, rf_rmse,
boosted_rmse))
rmse_table
Using the RMSE values to discern between the models, we can see that the elastic net model had the best performance, and k-nearest neighbors had the worst. Now, we’ll look at some of the autoplots to take a look at how the different tunings performed.
autoplot(elastic_tune, metric='rmse') +
labs(title='Elastic') +
theme_polyvore()
We can see from the elastic net autoplot that a penalty of 1 and a mixture of 1 give the best results. Since a mixture of 1 corresponds to a pure lasso penalty (and 0 to pure ridge), this means that the model that performed the best was a lasso regression.
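To make the two tuning parameters a bit more concrete, here is a sketch of the elastic net penalty term in the form glmnet uses, where penalty is the overall strength and mixture is the share given to the L1 (lasso) part. The function name and coefficients are made up for illustration.

```python
def elastic_net_penalty(coefs, penalty, mixture):
    """penalty * (mixture * L1 norm + (1 - mixture) / 2 * squared L2 norm)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return penalty * (mixture * l1 + (1 - mixture) / 2 * l2_squared)

coefs = [1.0, -2.0]  # made-up coefficients
print(elastic_net_penalty(coefs, penalty=1.0, mixture=1.0))  # pure lasso
print(elastic_net_penalty(coefs, penalty=1.0, mixture=0.0))  # pure ridge
```

With mixture = 1 only the absolute-value term remains, which is the part that can push coefficients exactly to zero.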
autoplot(rf_tune, metric='rmse') +
labs(title='Random Forest') +
theme_polyvore()
The autoplot for the random forest tells us that around 120 trees with mtry=5 seems to have the best performance. This makes sense: averaging over more trees makes the forest’s predictions more stable, which helps when there are a lot of predictors and categories within those predictors.
autoplot(boosted_tune, metric='rmse') +
labs(title='Boosted Trees') +
theme_polyvore()
A learning rate of 0.1 worked the best by far, and having 100 trees was also the best, though the difference between the numbers of trees is not very dramatic. This indicates that a faster learning rate was beneficial for the model, which is plausible given that with at most 100 trees the smallest learning rates barely move the predictions.
It’s finally time to fit our final models onto the training data. I will be fitting the elastic net model since it did the best, and the boosted trees since it performed similarly (and because I’m curious about the tree models).
best_elastic <- select_best(elastic_tune, metric = 'rmse')
elastic_final_wflow <- finalize_workflow(elastic_wflow, best_elastic)
elastic_fit <- fit(elastic_final_wflow, data = outfits_train)
best_boosted <- select_best(boosted_tune, metric = 'rmse')
boosted_final_wflow <- finalize_workflow(boosted_wflow, best_boosted)
boosted_fit <- fit(boosted_final_wflow, data = outfits_train)
The models have been fit; it’s now time for testing…
augment(elastic_fit, new_data = outfits_test) %>%
rmse(truth = likes, estimate = .pred)
elastic_test_col <- predict(elastic_fit, new_data = outfits_test)
elastic_test_res <- bind_cols(elastic_test_col, outfits_test)
augment(boosted_fit, new_data = outfits_test) %>%
rmse(truth = likes, estimate = .pred)
boosted_test_col <- predict(boosted_fit, new_data = outfits_test)
boosted_test_res <- bind_cols(boosted_test_col, outfits_test)
Both models perform pretty similarly on the testing data, with the elastic net model still beating out the boosted trees model. Both RMSEs end up being around 500, which is not great given the distribution of likes. Without looking at the actual predictions made by the model, I would guess that the model might just be low-balling all of its predictions since the distribution of likes is right-skewed. To see if this is the case, we’ll check out a graph of the predicted likes against the actual number of likes.
elastic_test_res %>%
ggplot(aes(x = .pred, y = likes)) +
geom_point(color = '#643F4B', alpha = .3) +
geom_abline(lty = 'longdash') +
theme_polyvore()
Even though it’s clear the model didn’t do well, I’m actually really glad to see that there is somewhat of a positive relationship in the graph. This means that there is some information in the data that could help decide if an outfit gets more likes, and perhaps that’s all you need to bridge the gap between “fashion disaster” and “acceptable”.
I’m curious to see what variables contribute the most to the model, so we will look at a variable importance plot for the boosted tree model.
colors <- c('#F4A4A4', '#682735', '#556177',
'#F4A4A4', '#682735', '#556177',
'#F4A4A4', '#682735', '#556177',
'#F4A4A4')
boosted_fit %>% extract_fit_parsnip() %>%
vip(aesthetics = list(fill = colors)) +
theme_polyvore() +
theme(axis.line = element_blank())
Interestingly enough, item_1_sweaters and item_6_sunglasses topped the list of variable importance. I’ll be honest: didn’t see that coming. All of the variables in this list are actually items, so maybe the takeaway here is that the items you choose to include in your outfit are more important than the colors.
This project was definitely a labor of love. I’m glad I got to see it through even if the models weren’t great at predicting the number of likes of a particular outfit. The elastic net model (effectively a lasso regression) ended up being the best model. I think this might have happened because the lasso is able to shrink coefficients all the way to zero if their variables don’t help the model enough. After dummy coding, the data includes a lot of variables, so it’s probably good to trim it back. The fact that KNN was the worst performer makes sense for the same reason: too many predictors. With a high number of predictors, the data points become too sparse in the high-dimensional space for the nearest neighbors to be meaningfully “near”.
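As a small illustration of how a lasso penalty can set a coefficient to exactly zero, here is a sketch of the soft-thresholding step that coordinate-descent lasso solvers apply to each coefficient (in the simplified case of standardized predictors). The function name and numbers are made up.

```python
def soft_threshold(coef, penalty):
    """Shrink coef toward zero by penalty; anything within penalty of zero becomes 0."""
    if coef > penalty:
        return coef - penalty
    if coef < -penalty:
        return coef + penalty
    return 0.0

print([soft_threshold(b, 1.0) for b in [3.0, -0.4, 0.9, -2.5]])
```

Small coefficients get dropped entirely while large ones are only shrunk, which is the "trimming back" behavior that helps when dummy coding produces many columns.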
One of the most engaging parts of this project was making decisions on how I was going to transform the data to make it work with the machine learning tools I have at hand. This is where I can see a lot of future improvements. In particular, I would want to see if I could tackle these large numbers of categorical variables in a different way. One approach I was tempted to take was to manually collapse the item categories into simpler ones, such as shirts, pants, dresses, jeans, etc. This could be better if it means less information is lost in this process compared to creating an “other” category.
One part of the data where I potentially lost a lot of information was in extracting the color information from the item names. In the future, I want to see if using semantic similarity or text analysis of some kind could extract better information that would lead to improved predictions.
I’m glad that I had the opportunity to explore this data set and explore machine learning through it. I’ve become familiarized with R, tidyverse, and tidymodels to a degree of confidence that I’ve really surprised myself with. And now, for my final question: how many likes will my current outfit get according to my model?
my_outfit <- data.frame('views' = 0, 'likes' = 0, 'set_id' = 1,
'item_1' = 'Sweaters', 'item_2' = 'Pants',
'item_3' = 'Other', 'item_4' = 'Other',
'item_5' = 'None', 'item_6' = 'None',
'item_7' = 'None', 'item_8' = 'None',
'color_1' = 'red', 'color_2' = 'gray',
'color_3' = 'white', 'color_4' = 'none')
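# Convert the character columns of my_outfit to factors, reusing the levels found in outfits
chr_cols <- names(my_outfit)[sapply(my_outfit, is.character)]
my_outfit[chr_cols] <- lapply(chr_cols, function(col) factor(my_outfit[[col]], levels = levels(outfits[[col]])))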
predict(elastic_fit, new_data = my_outfit)
Not to brag, but my model thinks my outfit will get 331 likes.