Suppose we have some data containing multiple variable values (measurements) for a number of elements (observations). We’d like to visualize this data, and when doing so, assign different colors to different elements.
This requires us to manually assign a color to each element, or if the data lends itself to it, compute some linear score for each element and use that as an index into a color palette. For large multidimensional data, neither approach is practical.
The package provides a quick-and-dirty solution which automatically computes a distinct color to each data element, where these colors try to reflect the “coarse” structure of the data; that is, “more similar” data elements are assigned “more similar” colors.
We’ll demonstrate this on the seatbelt data, one of the built-in data sets provided by R. To begin, we’ll load the package and access the seatbelts data. Since the package only works on simple matrices, we’ll have to convert the seatbelts time series data to a simple matrix:
library(chameleon) seatbelts <- matrix(as.numeric(Seatbelts), nrow=nrow(Seatbelts), ncol=ncol(Seatbelts)) colnames(seatbelts) <- colnames(Seatbelts) dim(seatbelts) #>  192 8 head(seatbelts, 4) #> DriversKilled drivers front rear kms PetrolPrice VanKilled law #> [1,] 107 1687 867 269 9059 0.1029718 12 0 #> [2,] 97 1508 825 265 7685 0.1023630 6 0 #> [3,] 102 1507 806 319 9963 0.1020625 12 0 #> [4,] 87 1385 814 407 10955 0.1008733 8 0
We now have a matrix with 192 elements and 8 measurements for each one. Using the function, we can assign a color to each element and use it to visualize the data, for example using a 2D UMAP projection:
colors <- data_colors(seatbelts) library(umap) layout <- umap(seatbelts, min_dist=0.99, random_state=123456)$layout plot(layout, asp=1, col=colors, pch=19, cex=1)
Assuming that UMAP has indeed captured some underlying structure of the data, we can see that the chosen colors correspond well to this structure, and possibly hint at some additional structure not captured well in the 2D projection.
However, 192 colors is “a bit much”, so we can’t expect them to be very distinct from each other. We can reduce the number of colors by grouping the data elements.
For example, since the original seatbelts data is a time series, we can compute for each row the year it applies to:
And then compute and show a color for each year:
year_colors <- data_colors(seatbelts, group=years) plot(layout, asp=1, col=year_colors[as.character(years)], pch=19, cex=1) legend('bottomleft', legend=names(year_colors), col=year_colors, lty=1, lwd=3, cex=0.75)
We see each year’s data is spread over a large part of the projection, but not uniformly so, suggesting that while there is some year-to-year variation, it probably isn’t the right way to group this data into distinct clusters.
For a more successful grouping example, we’ll use some single-cell RNA sequence (sc-RNA) data (provided as part of the package). This data contains a matrix, containing ~1.5K metacells (rows), and for each one, the UMI count (# of detected RNA molecules) for each of ~600 different “feature” genes (columns). In addition, it provides a vector of cell which were assigned to the metacells using a supervised analysis pipeline, and a 2-column matrix containing the 2D UMAP projection chosen to visualize the data (a common practice for scRNA data analysis).
Let’s compute a color for each cell type. For better results, we first convert the raw UMI counts to a log of the fraction of of the total UMIs in each metacell:
fractions <- pbmc$umis / rowSums(pbmc$umis) log_fractions <- log2(fractions + 1e-5) type_colors <- data_colors(log_fractions, group=pbmc$types)
We can then use this to color the provided 2D UMAP projection. Here we’ll use :
library(ggplot2) frame <- as.data.frame(pbmc$umap) frame$type <- pbmc$types ggplot(frame, aes(x=xs, y=ys, color=type)) + geom_point(size=0.75) + scale_color_manual(values=type_colors) + theme_light() + guides(color=guide_legend(override.aes=list(size=3))) + theme(legend.text=element_text(size=12), legend.key.height=unit(14, 'pt')) #> Warning: Removed 12 rows containing missing values (geom_point).
Here we see that, in contrast to the seatbelt years case above, each type (group of metacells) maps to a distinct region in the 2D UMAP projection, suggesting that grouping the metacells by the type annotations does capture some significant structure of the data.
The package also provides the lower-lever function which attempts to select a number of distinct colors, which can be directly assigned to unordered categorical data. For example:
distinct_colors(8) #> $lab #> l a b #> [1,] 45.31069 16.83526 -53.526963 #> [2,] 45.31069 59.32081 -53.526963 #> [3,] 45.31069 38.07803 -16.733398 #> [4,] 45.31069 59.32081 20.060166 #> [5,] 80.00000 -68.13584 44.589209 #> [6,] 80.00000 -25.65029 -28.997920 #> [7,] 80.00000 -46.89306 7.795645 #> [8,] 80.00000 -25.65029 44.589209 #> #> $name #>  "#3E67C5" "#9C3FC6" "#9B5188" "#C6314D" "#30E36E" "#4DD6FB" "#51DDB6" #>  "#B3D270"
By default, this excludes low-saturation colors (with a low value in the CIELAB color space), as well as too-dark and too-light colors (with too-low or too-high value in the CIELAB color space). This can be controlled by specifying explicit , and parameters.
The convenience function wraps this to allow it to be used this as a color scale for unordered (categorical) data. For example:
ggplot(frame, aes(x=xs, y=ys, color=type)) + geom_point(size=0.75) + scale_color_chameleon() + theme_light() + guides(color=guide_legend(override.aes=list(size=3))) + theme(legend.text=element_text(size=12), legend.key.height=unit(14, 'pt')) #> Warning: Removed 12 rows containing missing values (geom_point).
And the similar function allows it to be used for fill colors:
ggplot(frame, aes(x=xs, y=ys, fill=type)) + geom_point(shape=21, size=2, stroke=0.2, color="black") + scale_fill_chameleon() + theme_light() + theme(legend.text=element_text(size=12), legend.key.height=unit(14, 'pt')) #> Warning: Removed 12 rows containing missing values (geom_point).
Using any of these functions arbitrarily maps the colors to the values, making no attempt to have similar colors reflect similarities in the underlying data.