Introduction

(Edited April 18th, 2018)

This is a continuation of my previous post, in which we scraped data about the mobile game Fire Emblem Heroes using the R package rvest. This time, we’re going to do some exploratory data analysis to get a better idea of the characters in the game, and then run a clustering analysis on them.

Loading libraries

library(ggplot2)
library(reshape2)
library(ggthemes)
library(proxy)
library(ggdendro)
library(WGCNA)
library(dynamicTreeCut)
library(ggrepel)

Data analysis

In my last post, we obtained character stats data, as well as movement and weapon type information:

head(max_stats)

##                          Name HP ATK SPD DEF RES BST WeaponType MoveType
## 1                        Abel 39  33  32  25  25 154 Blue Lance  Cavalry
## 2                     Alfonse 43  35  25  32  22 157  Red Sword Infantry
## 3 Alfonse (Hares at the Fair) 41  35  33  30  18 157  Green Axe  Cavalry
## 4                         Alm 45  33  30  28  22 158  Red Sword Infantry
## 5                      Amelia 47  34  34  35  23 173  Green Axe  Armored
## 6                        Anna 41  29  38  22  28 158  Green Axe Infantry

Distributions

Let’s start by looking at the distributions of each stat type to see how these values are spread out. I’ll remove the BST (base stat total) column for now. If you remember from my previous post, the BST is the sum of all other stats, and therefore it has a different range of values, which will get in the way of our visualization.

# Dropping the BST column (column 7) and naming the id columns explicitly
melt_stats <- melt(max_stats[, -7], id.vars=c("Name", "WeaponType", "MoveType"))
ggplot(data=melt_stats, aes(value)) +
    geom_density(size=1, color="red") +
    facet_grid(~variable) +
    theme_few() +
    xlab("Value") +
    ylab("Density") +
    ggtitle("Stat densities") +
    theme(plot.title = element_text(hjust = 0.5))

We can see that most heroes have an ATK stat of around 32, and that this stat’s distribution is fairly narrow, meaning that there is not much of a difference between individual units’ offensive power. The DEF stat, however, is more spread out across the characters. We can confirm this by computing each stat’s variance.

var(max_stats$ATK)

## [1] 11.82872

var(max_stats$DEF)

## [1] 40.5983
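To compare all five stats at once rather than one at a time, sapply can apply var to every stat column. The sketch below is self-contained, so it only uses the six rows printed by head(max_stats) above; the name heroes is my own, and the variances from six rows will not match the full-data values shown above.

```r
# The six rows printed by head(max_stats) earlier in the post
heroes <- data.frame(
    Name = c("Abel", "Alfonse", "Alfonse (Hares at the Fair)",
             "Alm", "Amelia", "Anna"),
    HP   = c(39, 43, 41, 45, 47, 41),
    ATK  = c(33, 35, 35, 33, 34, 29),
    SPD  = c(32, 25, 33, 30, 34, 38),
    DEF  = c(25, 32, 30, 28, 35, 22),
    RES  = c(25, 22, 18, 22, 23, 28),
    stringsAsFactors = FALSE
)

stat_cols <- c("HP", "ATK", "SPD", "DEF", "RES")

# One variance per stat column, sorted from most to least spread out
stat_var <- sapply(heroes[, stat_cols], var)
sort(stat_var, decreasing = TRUE)
```

On the real data you would run the same sapply call on max_stats instead of heroes.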

Anyway, let’s take a quick look at the BST column now. This can give us a quick-and-dirty idea of a unit’s overall power. It’s worth noting that the game also uses the BST as a matchmaking measure in the Arena mode, where players using characters with higher BSTs are pitted against each other. It is also reflected in the number of points you get in the Arena, which allows you to challenge even stronger players over time. So let’s take a look at the heroes with the highest BST scores, which will maximize our Arena score (without considering other BST-increasing factors, such as weapon power and skills).

top_heroes <- max_stats[order(max_stats$BST, decreasing = FALSE), ]
top_heroes <- tail(top_heroes, 20)

# Refer to columns by name inside aes() instead of using top_heroes$...
ggplot(top_heroes, 
       aes(factor(Name, levels=Name), BST)) +
    geom_bar(stat="identity", fill="red") +
    coord_flip() + 
    scale_y_continuous(limits=c(0, 190), breaks=seq(0, 190, 20)) +
    ggthemes::theme_few() +
    xlab("Heroes") +
    ggtitle("Top 20 heroes with highest BST") +
    theme(plot.title = element_text(hjust = 0.5))
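As a quick sanity check on the ranking step, here is a minimal base R sketch that rebuilds the BST as the row sum of the five stats and sorts by it. It only uses the six rows printed by head(max_stats) earlier, and the names heroes and top3 are my own, so treat it as an illustration rather than a replacement for the code above.

```r
# The six rows printed by head(max_stats) earlier in the post
heroes <- data.frame(
    Name = c("Abel", "Alfonse", "Alfonse (Hares at the Fair)",
             "Alm", "Amelia", "Anna"),
    HP   = c(39, 43, 41, 45, 47, 41),
    ATK  = c(33, 35, 35, 33, 34, 29),
    SPD  = c(32, 25, 33, 30, 34, 38),
    DEF  = c(25, 32, 30, 28, 35, 22),
    RES  = c(25, 22, 18, 22, 23, 28),
    stringsAsFactors = FALSE
)

stat_cols <- c("HP", "ATK", "SPD", "DEF", "RES")

# BST is the sum of the five stats
heroes$BST <- rowSums(heroes[, stat_cols])

# Sort with the highest BST first and keep the top 3
top3 <- head(heroes[order(heroes$BST, decreasing = TRUE), c("Name", "BST")], 3)
top3
```

Sorting in decreasing order and taking head gives the same set of heroes as sorting in increasing order and taking tail; the increasing order is only needed above so that the flipped barplot shows the strongest hero at the top.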

The barplot we just created might not seem very informative at first, given the similar BST values shown, but we do find something interesting once we include units’ movement type information:

ggplot(top_heroes, 
       aes(factor(Name, levels=Name), BST, fill=MoveType)) +
    geom_bar(stat="identity") +
    coord_flip() + 
    scale_y_continuous(limits=c(0, 190), breaks=seq(0, 190, 20)) +
    ggthemes::theme_few() +
    xlab("Heroes") +
    ggtitle("Top 20 heroes with highest BST + move type") +
    theme(plot.title = element_text(hjust = 0.5))

Armored units have the highest BST! …This really should not come as a surprise to most players; the whole idea of the Armored class is to have higher stats than average to compensate for their limited movement. One cool thing of note though is the inclusion of Myrrh, a flying type unit, in our top 20 BST list. Myrrh is a “Breath” weapon user (a.k.a. a ~~dragon~~ manakete), which explains a bit why her BST is higher than average (although the next highest “Breath” user, Nowi, only appears at #30, 12 units below).

Clustering analysis

Let’s try some clustering analysis techniques to see if we can find clusters of units based on their stats. With this we should be able to identify similarly built characters, or archetypes. First we’ll calculate the correlation between every pair of heroes’ stat profiles to see how similar they are. Then we can use agglomerative clustering with the hclust function and Euclidean distance to group our characters and visualize them in a dendrogram. Let’s remove our categorical variables and the BST column for now.

stats_only <- max_stats[, -c(7, 8, 9)]
rownames(stats_only) <- stats_only$Name
stats_only$Name <- NULL

stats_cor <- cor(t(stats_only))

d <- proxy::dist(stats_cor, method="euclidean")
fit <- hclust(d, method="ward.D")
dend <- ggdendro::dendro_data(fit)

dend_order <- fit$labels[fit$order]

dendro <-
    ggdendro::ggdendrogram(dend, rotate=FALSE) +
    scale_x_continuous(expand = c(0, 0.5), 
                       labels=dend_order, 
                       breaks=1:length(dend_order)) +
    scale_y_continuous(expand = c(0.02, 0)) +
    ggtitle("Hero similarity dendrogram") +
    theme(plot.title = element_text(hjust = 0.5))

dendro
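To make the correlation, distance and clustering steps easier to follow, here is a minimal base R sketch on just the six heroes printed by head(max_stats). It swaps proxy::dist for stats::dist (both compute Euclidean distances between rows here) and cuts the tree with cutree and an arbitrary k = 2, which is my own choice for illustration and not what the rest of this post does.

```r
# Stats of the six heroes printed by head(max_stats), with names as row names
stats_only <- data.frame(
    HP  = c(39, 43, 41, 45, 47, 41),
    ATK = c(33, 35, 35, 33, 34, 29),
    SPD = c(32, 25, 33, 30, 34, 38),
    DEF = c(25, 32, 30, 28, 35, 22),
    RES = c(25, 22, 18, 22, 23, 28),
    row.names = c("Abel", "Alfonse", "Alfonse (Hares at the Fair)",
                  "Alm", "Amelia", "Anna")
)

# Transposing puts heroes in columns, so cor() returns a hero-by-hero matrix
stats_cor <- cor(t(stats_only))

# Euclidean distance between the rows of the correlation matrix
d <- dist(stats_cor, method = "euclidean")

# Agglomerative clustering with Ward's criterion
fit <- hclust(d, method = "ward.D")

# Leaf order of the dendrogram, left to right
fit$labels[fit$order]

# Cut into a fixed number of groups (k = 2 is an arbitrary choice)
groups <- cutree(fit, k = 2)
groups
```

Each row of stats_cor describes how one hero’s stat profile correlates with everyone else’s, so two heroes end up close together when they relate to the rest of the cast in a similar way.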

I’ll use the cutreeDynamic function from the dynamicTreeCut package to cut our dendrogram into clusters. This package attempts to automatically define the best number of clusters based on the shape of the dendrogram, so we don’t have to worry about selecting an arbitrary number of clusters. I’ll also use the labels2colors function from the WGCNA package to obtain a color for each cluster, which we’ll display underneath the dendrogram. These packages were originally developed to visualize clusters of genes in bioinformatics analyses. Gotta love that interdisciplinarity!

The WGCNA package also provides a function that plots the dendrogram with colors using base R graphics, but since I prefer ggplot2, I’ll be using a few functions from the grid and gridExtra packages. A BIG shout out to the StackOverflow members in this and this topic for the insights on how to align a dendrogram and how to combine and align two ggplot graph images.

dynamicMods <- dynamicTreeCut::cutreeDynamic(dendro=fit)                   
dynamicColors <- WGCNA::labels2colors(dynamicMods)

out <- data.frame(chars=rownames(stats_only), 
                  modules=dynamicMods, 
                  colors=dynamicColors,
                  stringsAsFactors=FALSE)

out <- out[match(dend_order, out$chars), ]

clusters <- ggplot(out, aes(factor(chars, levels=chars), y=1, fill=colors)) +
            geom_tile() +
            scale_fill_identity() +
            scale_y_continuous(expand=c(0, 0)) +
            theme(axis.title=element_blank(),
                axis.ticks=element_blank(),
                axis.text=element_blank(),
                legend.position="none",
                plot.title = element_text(hjust = 0.5))

gp1 <- ggplot2::ggplotGrob(dendro)
gp2 <- ggplot2::ggplotGrob(clusters)

maxWidth <- grid::unit.pmax(gp1$widths[2:5], gp2$widths[2:5])
gp1$widths[2:5] <- as.list(maxWidth)
gp2$widths[2:5] <- as.list(maxWidth)
g <- gridExtra::arrangeGrob(gp1, gp2, ncol=1,heights=c(9/10, 1/10))
grid::grid.draw(g)
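The match call is what lines the color strip up with the dendrogram leaves. Here is a minimal sketch of just that step with three heroes; the leaf order, module numbers and colors are made up for illustration and are not actual output from cutreeDynamic or labels2colors.

```r
# Hypothetical leaf order, in the form fit$labels[fit$order] returns it
dend_order <- c("Anna", "Abel", "Alm")

# Hypothetical cluster assignments, in the original row order of the data
out <- data.frame(chars   = c("Abel", "Alm", "Anna"),
                  modules = c(1, 2, 1),
                  colors  = c("turquoise", "blue", "turquoise"),
                  stringsAsFactors = FALSE)

# match() returns, for each leaf, the row of `out` that holds that hero,
# so indexing with it reorders the rows to follow the dendrogram
out <- out[match(dend_order, out$chars), ]
out$chars
```

Using factor(chars, levels=chars) in the plot then keeps that order on the x axis instead of letting ggplot2 sort the names alphabetically.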

As we can see, we got 7 nice clusters, with the turquoise cluster being the largest. As a side note, cutreeDynamic gives the label 0 to elements it could not assign to any cluster, and labels2colors paints that label “grey”. Usually, one would discard this group, but I’ll keep it in this analysis just so we don’t lose any heroes. Let’s add the cluster information to our data and take a look at a few boxplots so we can get a feel for each cluster.

max_stats$label <- dynamicColors

max_stats2 <- max_stats
max_stats2$BST <- NULL

melt_stats2 <- melt(max_stats2)

ggplot(melt_stats2, aes(x=variable, y=value, fill=variable)) +
    geom_boxplot() +
    facet_grid(~label) + 
    theme_few() +
    theme(legend.title=element_blank(),
          legend.position = "bottom",
          axis.title=element_blank(), 
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank(),
          plot.title = element_text(hjust = 0.5)) +
    ggtitle("Stat boxplots by cluster")

We can also do principal component analysis to see how our clusters represent (most of) the variability of the data. We can see it’s not perfect, but there does seem to be some sense in our clustered points.

pca <- prcomp(stats_only)

df_out <- as.data.frame(pca$x)
df_out$label <- max_stats$label
head(df_out)

##                                   PC1       PC2       PC3        PC4
## Abel                        -1.727858  1.253100 -1.148936 -0.6118986
## Alfonse                      8.626724 -2.674268 -2.592615 -0.7335428
## Alfonse (Hares at the Fair)  5.765115  6.114256 -2.334997 -1.8918197
## Alm                          5.058571  1.167998 -1.279048  2.4418803
## Amelia                       9.378496  2.547198  2.816638 -1.8185376
## Anna                        -6.465976  3.767082  3.455966  2.1557244
##                                     PC5  label
## Abel                        -0.33728966  green
## Alfonse                      1.16584188    red
## Alfonse (Hares at the Fair)  0.08645923  brown
## Alm                         -1.94807904 yellow
## Amelia                      -5.88638553  brown
## Anna                        -4.38269169   grey

ggplot(df_out,aes(x=PC1, y=PC2, color=label)) +
    scale_color_identity() +
    geom_point(size = 5) + 
    theme_few() +
    ggtitle("Principal Component Analysis") +
    theme(plot.title = element_text(hjust = 0.5))
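Since the plot only shows PC1 and PC2, it helps to know how much of the total variance those two components carry. A minimal sketch, assuming only the six heroes printed by head(max_stats) (so the numbers will differ from the full data), computes this from the sdev element that prcomp returns:

```r
# Stats of the six heroes printed by head(max_stats)
stats_only <- data.frame(
    HP  = c(39, 43, 41, 45, 47, 41),
    ATK = c(33, 35, 35, 33, 34, 29),
    SPD = c(32, 25, 33, 30, 34, 38),
    DEF = c(25, 32, 30, 28, 35, 22),
    RES = c(25, 22, 18, 22, 23, 28),
    row.names = c("Abel", "Alfonse", "Alfonse (Hares at the Fair)",
                  "Alm", "Amelia", "Anna")
)

pca <- prcomp(stats_only)

# Squared standard deviations are the variances of each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative share of variance covered by the first 1, 2, ... components
round(cumsum(var_explained), 3)
```

The second value of the cumulative sum is the share of the variability that a PC1 versus PC2 scatterplot can actually show.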

We can use geom_text to check out the units in each cluster:

ggplot(df_out,aes(x=PC1, y=PC2, color=label)) +
    scale_color_identity() +
    geom_point(size = 5) + 
    geom_text(label=rownames(df_out), size=3, color="black") +
    theme_few() +
    ggtitle("Principal Component Analysis") +
    theme(plot.title = element_text(hjust = 0.5))

We can now use the by function to find the average stats of heroes within each cluster:

stats_label <- as.data.frame(cbind(stats_only, max_stats$label))
names(stats_label) <- c("HP", "ATK", "SPD", "DEF", "RES", "label")
res <- by(stats_label[, 1:5], stats_label$label, colMeans)

The stats inside each element of the res list are the average stats of the heroes within each cluster. We can therefore treat these stats as a sort of representative character for each color group. Let’s now see which hero within each cluster is the closest to this average. To do this, we’ll take each group of heroes, subtract the stats of our average hero, take the absolute value of the results, average them for each hero to get a distance measure, and see which hero has the lowest distance.

reps <- sapply(unique(as.character(stats_label$label)), USE.NAMES=FALSE, function(color){
    color_rep <- res[[color]]                          # Mean stats of this cluster
    color_chars <- stats_label[stats_label$label == color,]
    color_chars$label <- NULL
    diff_df <- t(abs(t(color_chars) - color_rep))      # Absolute deviation from the mean, per stat
    char_dist <- rowSums(diff_df)/ncol(diff_df)        # Mean absolute deviation of each hero
    char_dist[which.min(char_dist)]                    # Smallest distance, named after its hero
})

reps_df <- as.data.frame(cbind(unique(as.character(stats_label$label)), reps))
names(reps_df) <- c("cluster", "distance_to_mean")
reps_df

##                                cluster  distance_to_mean
## Roderick                         green 0.514285714285715
## Michalis                           red 0.630769230769231
## Sharena                          brown  1.05789473684211
## Reinhardt (World of Thracia)    yellow  1.68421052631579
## Titania                           grey 0.861538461538463
## L'Arachel                    turquoise  1.10222222222222
## Catria                            blue  1.33809523809524
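The same “closest to the cluster average” idea can be sketched a bit more compactly with split and sweep. This is a minimal example on the six heroes printed by head(max_stats); the two cluster labels "a" and "b" and the helper name closest_to_mean are hypothetical and only there to make the example run on its own.

```r
# Stats of the six heroes printed by head(max_stats)
stats_only <- data.frame(
    HP  = c(39, 43, 41, 45, 47, 41),
    ATK = c(33, 35, 35, 33, 34, 29),
    SPD = c(32, 25, 33, 30, 34, 38),
    DEF = c(25, 32, 30, 28, 35, 22),
    RES = c(25, 22, 18, 22, 23, 28),
    row.names = c("Abel", "Alfonse", "Alfonse (Hares at the Fair)",
                  "Alm", "Amelia", "Anna")
)

# Hypothetical cluster labels, one per hero (not the clusters found above)
labels <- c("a", "b", "b", "a", "b", "a")

# Hypothetical helper: for each cluster, the hero with the smallest
# mean absolute deviation from the cluster's average stats
closest_to_mean <- function(stats, labels) {
    sapply(split(stats, labels), function(grp) {
        centre <- colMeans(grp)
        # sweep() subtracts the cluster mean from every row, column by column
        mad <- rowMeans(abs(sweep(as.matrix(grp), 2, centre)))
        names(which.min(mad))
    })
}

reps <- closest_to_mean(stats_only, labels)
reps
```

Averaging the absolute differences treats every stat equally, so a hero who is slightly off in all five stats can still beat one who matches four stats exactly but is far off in the fifth.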

We can now take another look at our PCA plot and check that our representative heroes lie somewhere near the center of their own clusters:

reps <- rownames(reps_df)
chars <- rownames(df_out)

chars[!chars %in% reps] <- ""

ggplot(df_out,aes(x=PC1, y=PC2, color=label)) +
    scale_color_identity() +
    geom_point(size = 5) + 
    geom_text_repel(label=chars, size=3, color="black") +
    theme_few() +
    ggtitle("Principal Component Analysis") +
    theme(plot.title = element_text(hjust = 0.5))

To finish things off, let’s take a look at how movement and weapon types are arranged within each cluster. I couldn’t really decide on a better way to visualize this information, so here are two I could come up with:

A faceted stacked barplot

ggplot(max_stats, aes(x=factor(WeaponType))) +
    geom_bar(stat="count", aes(fill=MoveType)) +
    xlab(label="Weapon type") +
    ylab(label="Count") +
    theme(legend.title = element_blank(),
          plot.title = element_text(hjust = 0.5)) + 
    coord_flip() +
    facet_wrap(~label) +
    ggtitle("Movement and weapon types by cluster")

A faceted bubble chart (once again, thanks to SO for this link)

counts <- paste(max_stats$WeaponType, max_stats$MoveType, max_stats$label)

max_stats$category <- counts
max_stats$size <- as.numeric(table(max_stats$category)[max_stats$category])
max_stats$radius <- sqrt(max_stats$size / pi)

ggplot(max_stats, aes(x=factor(WeaponType), y=MoveType)) +
    geom_point(aes(size=radius*7.5), shape=21, fill="white") + 
    geom_text(aes(label=size), size=4, color = "black") +
    scale_size_identity() +
    xlab(label="Weapon type") +
    ylab(label="Move type") +
    theme(axis.text.x = element_text(angle=90),
          plot.title = element_text(hjust = 0.5)) + 
    facet_wrap(~label) +
    ggtitle("Movement and weapon types by cluster")
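The size and radius columns are the part of the bubble chart that is easiest to get wrong, so here is a minimal sketch of that step on its own. It uses the weapon and movement types of the six heroes printed by head(max_stats) and leaves the cluster label out of the category to stay self-contained.

```r
# Weapon and movement types of the six heroes printed by head(max_stats)
weapon <- c("Blue Lance", "Red Sword", "Green Axe",
            "Red Sword", "Green Axe", "Green Axe")
move   <- c("Cavalry", "Infantry", "Cavalry",
            "Infantry", "Armored", "Infantry")

# One category string per hero
category <- paste(weapon, move)

# table() counts each category; indexing by name maps the count back to each hero
size <- as.numeric(table(category)[category])

# A circle's area is pi * r^2, so this radius makes the area proportional to the count
radius <- sqrt(size / pi)

data.frame(category, size, radius)
```

Scaling the radius by the square root of the count keeps a bubble with twice as many heroes at twice the area, rather than four times the area as it would be if the count were used as the radius directly.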

These plots give us a pretty good idea about the composition of each of our clusters. We can see that while Red Sword Infantry characters are dominant in the cast, they’re mostly concentrated in the blue, brown and green clusters. Also, the yellow group seems to be where most armored units wound up. Finally, the most common characters in the turquoise group are mages and staff users. As we can see in our previous boxplots, the turquoise group is also the one whose characters have low defense and HP and very high resistance, a common stat distribution for magic users like these.
