Introduction

This note walks through the common scenario of disabling specific gene modules from affecting the partitioning of cells into metacells. In many applications we would want to mask technical effects of stress (e.g. heat shock) from affecting the partitioning, or to mask cell cycle related genes in order to get metacells that represent cell types/states and not stages at the cell cycle.

We’ll start with finding genes that are correlated to a few cell cycle and stress halmark genes, then generate gene correlation heatmaps that will help to further filter the genes, and lastly show how to create metacells that are not affected by these genes.

We use here the small 8K PBMC dataset that is used in the Running metacell analysis: guided tutorial on 8K PBMCs vignette, please make sure to run that vignette before executing this one, since we’ll use several objects produced there.

Setting up

We start by loading the library and initializating the database and the figures directories:

library("metacell")

scdb_init("testdb/", force_reinit=T)
#> initializing scdb to testdb/

figs_dir = "figs"
scfigs_init(figs_dir)

Creating the gene correlation matrix

We’re going to generate a gene-gene correlation matrix using the umi matrix generated in the Running metacell analysis: guided tutorial on 8K PBMCs vignette. The genes will then be filtered by thresholding their correlation with the user supplied genes (termed anchor genes). In this vignette we’ll build a set of cell cycle and stress related genes. So let’s define these gene anchors, and some additional variables:

genes_anchors = c('MKI67', 'PCNA', 'TOP2A', 'TXN', 'HSP90AB1', 'FOS')
tab_fn = paste(figs_dir, "lateral_gmods.txt", sep="/")
gset_nm = "lateral"

tab_fn is the output file name of the correlation matrix, and gset_nm is the id of the gene set that will be generated.

Now we’ll call the function that computes the gene-gene correlation matrix:

mcell_mat_rpt_cor_anchors(mat_id="test", gene_anchors = genes_anchors, cor_thresh = 0.1, gene_anti = c(), tab_fn = tab_fn, sz_cor_thresh = 0.2)

The function computes the correlations of all genes in the umi matrix with the gene anchors after downsampling the umi counts. It then filters genes that their correlation to the anchor genes is above cor_thresh. Genes can also be filtered based on their correlation to a set of negative control genes, that can be specified in the gene_anti parameter. In this case, genes that their top correlated gene is a negative control and not one of the anchor genes are discarded. If the sz_cor_thresh is not NA, genes with correlation to the cells total umi counts above sz_cor_thresh are also selected. In our case, we expect some cell cycle genes to be correlated with cell size, so we specify a positive threshold.

Let’s read the filtered gene correlation matrix and examine it:

gcor_mat = read.table(tab_fn, header=T)
head(gcor_mat)
#>                  sz_cor        MKI67          PCNA        TOP2A        TXN
#> ABCB9       0.002267035  0.113515805  0.0077224535  0.060206640 0.01030069
#> ABHD5      -0.004551788 -0.006990550 -0.0052722943 -0.007749114 0.04990038
#> AC098614.2  0.007561982  0.072206551  0.0077146734  0.132693409 0.01243514
#> ACSL1      -0.015851397 -0.005267215 -0.0008923261 -0.005838775 0.03315487
#> ACTB        0.253784528  0.108718458  0.0454909890  0.101454370 0.15426727
#> ACTG1       0.221359303  0.110300254  0.0220282723  0.108523623 0.04266898
#>                HSP90AB1           FOS       max neg_max
#> ABCB9       0.006374946 -0.0080735106 0.1135158       0
#> ABHD5      -0.016661139  0.1031994092 0.1031994       0
#> AC098614.2 -0.002731917 -0.0004316086 0.1326934       0
#> ACSL1      -0.049008246  0.1054241285 0.1054241       0
#> ACTB       -0.068296466  0.1408468643 0.1542673       0
#> ACTG1       0.030165625 -0.0957211702 0.1103003       0
print(dim(gcor_mat))
#> [1] 582   9

The row names are genes, and the columns are the correlation to the total umi counts (sz_cor), the correlations to the anchor genes and their maximal one (max). Since we didn’t specify negative anchor genes in this example the maximal correlation to the negative anchors is zero (max_neg).

Now we’ll add a gene set to the database, comprised of the genes in the filtered matrix:

foc_genes = apply(gcor_mat[, genes_anchors], 1, which.max)

gset = gset_new_gset(sets = foc_genes, desc = "Cell cycle and stress correlated genes")
print(gset)
#> An object of class "tgGeneSets"
#> Slot "description":
#> [1] "Cell cycle and stress correlated genes"
#> 
#> Slot "set_names":
#> [1] 1 6 3 4 2 5
#> 
#> Slot "gene_set":
#>         ABCB9         ABHD5    AC098614.2         ACSL1          ACTB 
#>             1             6             3             6             4 
#>         ACTG1         ACTR2        ADGRF3        AGTRAP         AHNAK 
#>             1             6             3             6             6 
#>           AHR          AIF1       ALDH1A1         ALDH2       ALDH3B1 
#>             6             6             6             6             6 
#>         ALDOA       ALOX5AP       ANKRD28         ANPEP         ANXA1 
#>             6             4             6             6             6 
#>         ANXA2         ANXA5          AOAH         AP1S2         AP2S1 
#>             4             6             6             6             6 
#>         APLP2           APP     ARHGAP11A       ARHGDIB      ARHGEF40 
#>             6             4             1             1             6 
#>         ARL5B        ARRDC4         ASAH1         ASF1B         ASGR1 
#>             6             6             6             1             6 
#>          ASIP          ATF3          ATG3         ATOX1    ATP2B1-AS1 
#>             4             6             6             6             6 
#>         ATP5E        ATP5G2         ATP5I         ATP5L       ATP6V0B 
#>             6             6             6             6             6 
#>      ATP6V0E1       ATP6V1A      ATP6V1B2       ATP6V1F         AURKA 
#>             6             6             6             6             3 
#>         BACH1        BCL2A1         BIRC5       BLOC1S1         BLVRB 
#>             6             6             1             6             6 
#>        BNIP3L          BRI3          BST1      C10orf11      C12orf75 
#>             6             6             6             6             4 
#>       C14orf2      C19orf38      C1orf162       C4orf48         C5AR1 
#>             6             6             6             6             6 
#>       C9orf72         CALM2          CAPG        CAPZA2        CARD16 
#>             6             6             6             6             6 
#>         CARS2         CASP1         CASZ1      CATSPER1        CCDC50 
#>             6             6             3             6             4 
#>       CCDC88A          CCL3        CCL3L3         CCNA2         CCNB1 
#>             6             6             6             3             1 
#>         CCNL1          CCNY          CCR9          CD14         CD163 
#>             6             6             3             6             6 
#>          CD1D        CD300E       CD300LF         CD302          CD33 
#>             6             6             6             6             6 
#>          CD36           CD4          CD44          CD63          CD83 
#>             6             6             6             6             6 
#>          CD86          CD93           CDA         CDCA5         CDCA8 
#>             6             6             6             1             1 
#>        CDKN2C         CDKN3          CDT1         CEBPB         CEBPD 
#>             3             1             1             6             6 
#>         CECR1         CENPF         CENPK           CFD           CFP 
#>             6             1             3             6             6 
#>         CHEK1        CHMP1B          CHP1          CIB2         CKAP2 
#>             3             6             6             4             3 
#>         CKAP4         CKS1B          CKS2       CLEC10A       CLEC12A 
#>             6             3             1             4             6 
#>        CLEC4A        CLEC4C        CLEC4D        CLEC4E        CLEC7A 
#>             6             4             6             6             6 
#>         CLIC2         CMTM6         CNPY3         COTL1         COX5B 
#>             4             6             4             6             6 
#>        COX6B1         COX7B        CPPED1          CPVL         CREB5 
#>             6             6             6             6             6 
#>         CREG1         CRTAP        CSF2RB         CSF3R          CST3 
#>             6             6             4             6             6 
#>          CSTA    CTB-61M7.2  CTC-510F12.4  CTD-3252C9.4          CTSB 
#>             6             6             6             6             6 
#>          CTSC          CTSD          CTSH          CTSS          CTSZ 
#>             4             6             6             6             6 
#>         CXCL8          CYBA          CYBB        CYP1B1        DAZAP2 
#>             6             6             6             6             6 
#>         DDIAS         DERL3         DMXL2      DNASE1L3          DPYD 
#>             1             4             6             4             6 
#>        DPYSL2         DSCC1         DUSP1         DUSP6           DUT 
#>             6             1             6             6             1 
#>        DYNLT1          E2F1          ECT2     EEF1A1P42       EHBP1L1 
#>             6             2             1             3             6 
#>          EIF1      EIF4EBP1          ELOB          ENHO        ENTPD1 
#>             6             6             6             4             6 
#>          EREG         EVI2A         EVI2B          EZH2         F13A1 
#>             6             6             6             1             6 
#>       FAM120A       FAM198B        FAM45A         FANCI         FBXL5 
#>             6             6             6             1             6 
#>         FBXO5          FCAR        FCER1A        FCER1G        FCGR1A 
#>             1             6             4             6             6 
#>        FCGR2A         FCGRT          FCN1          FGD4          FGL2 
#>             6             6             6             6             6 
#>           FGR        FKBP1A          FLT3         FOLR3           FOS 
#>             6             6             4             6             6 
#>          FOSB         FOSL2          FPR1          FTH1           FTL 
#>             6             6             6             6             6 
#>         FTLP3          G0S2     GABARAPL1         GAPDH          GAS6 
#>             6             6             6             6             4 
#>          GAS7           GCA         GINS2        GLIPR1          GLMP 
#>             6             6             2             6             3 
#>          GLRX        GLT1D1          GLUL          GMFG         GNAI2 
#>             6             6             6             6             6 
#>          GNAQ           GNS        GPBAR1        GPCPD1          GPX1 
#>             6             6             6             6             6 
#>          GPX4           GRN           GSN         GSTO1         GSTP1 
#>             6             6             4             6             6 
#>          GZMB         H2AFY         H2AFZ         H3F3B         HACD4 
#>             4             6             1             6             6 
#>           HCK         HIPK3      HIST1H4C           HK2      HLA-DPA1 
#>             6             6             3             6             4 
#>      HLA-DPB1      HLA-DQA1       HLA-DRA      HLA-DRB1      HLA-DRB6 
#>             4             4             4             4             4 
#>         HMGB2          HNMT         HSBP1      HSP90AB1         HSPA8 
#>             1             6             6             5             5 
#>           ID1          IER2          IER3          IER5         IFI30 
#>             6             6             6             6             6 
#>        IFITM3        IFNGR1        IFNGR2         IGSF6       IL13RA1 
#>             6             6             6             6             6 
#>          IL1B         IL1RN        IQGAP1         IRAK3       IRF2BP2 
#>             6             6             6             6             6 
#>          IRF7          IRF8          IRS2         ITGAM         ITM2B 
#>             4             4             6             6             6 
#>         ITM2C          JAML           JUN          JUNB          JUND 
#>             4             6             6             6             6 
#>         KCNE3        KCTD12      KIAA0930        KIF18A         KIF23 
#>             6             6             6             1             3 
#>         KIFC1         KLF10          KLF4          KLF6         KPNA2 
#>             1             6             6             6             3 
#>         LACTB         LAMP2         LAMP5       LAMTOR2       LAMTOR4 
#>             6             6             4             6             6 
#>          LCP1        LGALS1        LGALS2        LGALS3        LILRA2 
#>             6             6             6             6             6 
#>        LILRA4        LILRA5        LILRB2        LILRB3        LILRB4 
#>             4             6             6             6             4 
#>     LINC00877     LINC00996          LMO2        LPCAT2          LRP1 
#>             6             4             6             6             6 
#>        LRRC25        LRRC26         LRRK2          LSM6          LST1 
#>             6             4             6             6             6 
#>         LTA4H          LY96          LYST           LYZ        MAD2L1 
#>             6             6             6             6             1 
#>          MAFB         MAP1A        MARCKS        MCEMP1          MCL1 
#>             6             4             6             6             6 
#>          MCM7         MEGF9        METTL9         MFSD1         MGST1 
#>             2             6             6             6             6 
#>         MGST2          MIDN         MKI67          MNDA         MPEG1 
#>             6             6             1             6             6 
#>        MS4A6A         MS4A7         MSRB1         MYADM         MYBL2 
#>             6             6             6             6             4 
#>          MYL6          MZB1         NAMPT         NAPRT         NAPSB 
#>             4             4             6             6             4 
#>          NCF2       NDUFA11        NDUFB1        NDUFS6          NEFH 
#>             6             6             6             6             3 
#>         NFAM1          NFE2        NFKBIA        NFKBIZ         NINJ1 
#>             6             6             6             6             6 
#>         NLRP3          NPC2          NPM1         NR4A2          NRG1 
#>             6             4             5             6             6 
#>        NUDT16          NUF2        NUP214        NUSAP1          OAZ1 
#>             6             1             6             1             6 
#>         ODF3B        OGFRL1         OSCAR         OTUD1        P2RY13 
#>             6             6             6             6             6 
#>       PACSIN1          PAK1         PCBP1         PCLAF          PCNA 
#>             4             6             6             1             2 
#>         PFDN5          PFN1           PGD          PGLS          PHEX 
#>             6             4             6             6             4 
#>        PICALM          PID1         PILRA        PLA2G7         PLAUR 
#>             6             6             6             6             6 
#>         PLBD1          PLD4          PLEK          PLP2        PLSCR1 
#>             6             4             4             4             6 
#>        PLXDC2        PLXNB2         PNRC1         POLE4        POLR2L 
#>             6             6             6             6             6 
#>          POMP          PPIB      PPP1R14B      PPP1R15A         PRAM1 
#>             6             4             4             6             6 
#>         PRDX3       PRELID1          PROC          PSAP         PSMB3 
#>             6             6             4             6             6 
#>         PTAFR         PTGDS         PTGS2          PTMS         PTPRE 
#>             6             4             6             4             6 
#>         PTTG1        PYCARD          PYGL           QKI          QPCT 
#>             1             6             6             6             6 
#>     RAB11FIP1         RAB31         RAB32         RAB3D          RAC1 
#>             6             6             6             6             6 
#>        RASSF2        RASSF4         RBM47          RBP7        RCBTB2 
#>             6             6             6             6             6 
#>          RETN          RGS2          RHOA          RHOB          RHOG 
#>             6             6             6             6             6 
#>         RIPK2     RN7SL368P     RNA5SP151        RNASE2        RNASE6 
#>             6             6             6             6             6 
#>        RNF130        RNF181        RNF19B      RNU4ATAC      RNVU1-19 
#>             6             6             6             6             6 
#> RP11-1143G9.4  RP11-160E2.6 RP11-363K21.1   RP11-40C6.2 RP11-522I20.3 
#>             6             6             6             5             3 
#>  RP11-700H6.1  RP11-73G16.2        RPL10A        RPL23A        RPL27A 
#>             1             4             5             5             5 
#>          RPL3         RPL30         RPL31          RPL5         RPLP0 
#>             5             5             5             5             5 
#>         RPLP2         RPS12         RPS18          RPS2         RPS24 
#>             5             5             5             5             6 
#>         RPS27        RPS27A         RPS29          RPS3         RPS4X 
#>             5             5             5             5             5 
#>          RPS5          RPS6          RPSA          RRM2          RTN3 
#>             5             5             5             3             6 
#>          RTN4          RXRA       S100A10       S100A11       S100A12 
#>             6             6             4             6             6 
#>        S100A4        S100A6        S100A8        S100A9        SAMHD1 
#>             6             6             6             6             6 
#>         SAP30          SAT1        SCAMP5        SCPEP1           SCT 
#>             3             6             4             6             4 
#>         SDCBP        SEC11A        SEC61B        SEC61G        SEMA4A 
#>             6             6             4             4             6 
#>         SERF2         SERP1      SERPINA1      SERPINF1          SGK1 
#>             6             6             6             4             6 
#>          SGO1          SGO2      SH3BGRL3         SIRPA        SIRPB1 
#>             1             1             4             6             6 
#>       SLC11A1       SLC15A3       SLC15A4       SLC16A3       SLC31A2 
#>             6             6             4             6             6 
#>        SLC7A7        SLC8A1         SMCO4        SMIM25         SMIM5 
#>             6             6             6             6             4 
#>         SMPD3          SNCA     SNORD3B-2         SNX10          SOD2 
#>             4             4             6             6             6 
#>         SPCS1         SPG21          SPI1          SRGN        SRPK2P 
#>             4             6             6             6             1 
#>         STAB1         STMN1         STX11        STXBP2         SULF2 
#>             6             1             6             6             6 
#>       SULT1A1           SYK        TALDO1          TCF4          TET2 
#>             6             6             6             4             6 
#>          TFEC         TGFBI        TIMM8B         TIMP1         TIMP2 
#>             6             6             6             6             6 
#>           TK1           TKT          TLR2          TLR8       TMEM107 
#>             3             6             6             6             6 
#>      TMEM167A      TMEM170B       TMEM205       TNFAIP2       TNFAIP3 
#>             6             6             6             6             6 
#>      TNFRSF1B      TNFRSF21       TNFSF10      TNFSF13B         TONSL 
#>             6             4             6             6             1 
#>         TOP2A          TPI1          TPM2          TPP1          TPX2 
#>             3             6             4             6             3 
#>         TREM1         TRIB1          TSPO        TUBA1B        TUBA1C 
#>             6             6             6             3             1 
#>          TUBB        TUBB4B           TXN          TYMP          TYMS 
#>             1             1             4             6             1 
#>        TYROBP          UGCG       UNC93B1        UQCR11          USP1 
#>             6             4             4             6             3 
#>         VAMP8          VCAN           VIM          VSIR         VSTM1 
#>             4             6             6             6             6 
#>          WARS          YBX3         YPEL5         YWHAE          ZEB2 
#>             6             6             6             6             6 
#>        ZFAND5         ZFP36       ZFP36L1       ZNF385A         ZNF81 
#>             6             6             6             6             1 
#>        ZSWIM6         ZWINT 
#>             6             3

scdb_add_gset(gset_nm, gset)

Note that we constructed a membership vector with genes as names and the number of the top correlated anchor gene as the value. This is optional, we could have created a gene set with all genes in the same set (e.g. 1). It is important to add the gene set to the database for the next steps. We use gset_nm as the gene set id.

Fine clustering and filtering of the gene set

To further examine the gene set we got, we’ll cluster the genes. First, we’ll create a sub-matrix from the umi matrix that will contain only the genes in the gene set:

sub_mat_id = paste("test", gset_nm, sep="_")
mcell_mat_ignore_genes(new_mat_id = sub_mat_id, mat_id = "test", ig_genes = names(foc_genes), reverse = T)

Now we’ll cluster the genes on the downsampled sub-matrix, using hierarchical clustering and cutting the tree to 20 clusters:

mcell_gset_split_by_dsmat(gset_id = gset_nm, mat_id = sub_mat_id, K = 20)
#> will downsample the matrix, N= 750
#> 100%

The number of clusters should be proportional to the number of genes we cluster, as we want to get homogeneous clusters of coodinated genes.

The cluster membership information is updated in the gene set. Let’s re-load it and have a look:

gset = scdb_gset(gset_nm)
print(gset)
#> An object of class "tgGeneSets"
#> Slot "description":
#> [1] "Cell cycle and stress correlated genes hc K=20"
#> 
#> Slot "set_names":
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#> 
#> Slot "gene_set":
#>         ABCB9         ABHD5    AC098614.2         ACSL1          ACTB 
#>             1             2             3             2             4 
#>         ACTG1         ACTR2        ADGRF3        AGTRAP         AHNAK 
#>             5             6             3             7             4 
#>           AHR          AIF1       ALDH1A1         ALDH2       ALDH3B1 
#>             2             8             7             9             2 
#>         ALDOA       ALOX5AP       ANKRD28         ANPEP         ANXA1 
#>             4            10             2            11             4 
#>         ANXA2         ANXA5          AOAH         AP1S2         AP2S1 
#>             4             4             2            12             9 
#>         APLP2           APP     ARHGAP11A       ARHGDIB      ARHGEF40 
#>             7            13            14             5             2 
#>         ARL5B        ARRDC4         ASAH1         ASF1B         ASGR1 
#>             2             7            11            14            15 
#>          ASIP          ATF3          ATG3         ATOX1    ATP2B1-AS1 
#>            13             2             4             6            15 
#>         ATP5E        ATP5G2         ATP5I         ATP5L       ATP6V0B 
#>             6             3             6             3             9 
#>      ATP6V0E1       ATP6V1A      ATP6V1B2       ATP6V1F         AURKA 
#>             1             2             2             6             3 
#>         BACH1        BCL2A1         BIRC5       BLOC1S1         BLVRB 
#>             2             2            16             2            12 
#>        BNIP3L          BRI3          BST1      C10orf11      C12orf75 
#>             2            11             7             2            10 
#>       C14orf2      C19orf38      C1orf162       C4orf48         C5AR1 
#>             7            11             4             7            11 
#>       C9orf72         CALM2          CAPG        CAPZA2        CARD16 
#>             2             4             9             6            11 
#>         CARS2         CASP1         CASZ1      CATSPER1        CCDC50 
#>             2            11             3             7            13 
#>       CCDC88A          CCL3        CCL3L3         CCNA2         CCNB1 
#>             9            15            15            16            16 
#>         CCNL1          CCNY          CCR9          CD14         CD163 
#>            17             2             3            12             7 
#>          CD1D        CD300E       CD300LF         CD302          CD33 
#>             2            11            11             9             2 
#>          CD36           CD4          CD44          CD63          CD83 
#>            12             9             3             6            15 
#>          CD86          CD93           CDA         CDCA5         CDCA8 
#>             2             7            11             1            16 
#>        CDKN2C         CDKN3          CDT1         CEBPB         CEBPD 
#>             3            16            14            11             9 
#>         CECR1         CENPF         CENPK           CFD           CFP 
#>             7            16            14             8             8 
#>         CHEK1        CHMP1B          CHP1          CIB2         CKAP2 
#>            14             1             2            13             3 
#>         CKAP4         CKS1B          CKS2       CLEC10A       CLEC12A 
#>             7             3             3            18            11 
#>        CLEC4A        CLEC4C        CLEC4D        CLEC4E        CLEC7A 
#>             9            13             7             7             8 
#>         CLIC2         CMTM6         CNPY3         COTL1         COX5B 
#>            18             2             2             4             1 
#>        COX6B1         COX7B        CPPED1          CPVL         CREB5 
#>             3             3             9             8             2 
#>         CREG1         CRTAP        CSF2RB         CSF3R          CST3 
#>             2             2             9            12             8 
#>          CSTA    CTB-61M7.2  CTC-510F12.4  CTD-3252C9.4          CTSB 
#>             8             7             7            15             9 
#>          CTSC          CTSD          CTSH          CTSS          CTSZ 
#>             6            12             9             8             9 
#>         CXCL8          CYBA          CYBB        CYP1B1        DAZAP2 
#>            15             9             8             7             1 
#>         DDIAS         DERL3         DMXL2      DNASE1L3          DPYD 
#>             1            13             2            13             1 
#>        DPYSL2         DSCC1         DUSP1         DUSP6           DUT 
#>             9             3            17            15             3 
#>        DYNLT1          E2F1          ECT2     EEF1A1P42       EHBP1L1 
#>             2            14             3             3             2 
#>          EIF1      EIF4EBP1          ELOB          ENHO        ENTPD1 
#>            17            15             7            18             2 
#>          EREG         EVI2A         EVI2B          EZH2         F13A1 
#>             2             1             2             3             1 
#>       FAM120A       FAM198B        FAM45A         FANCI         FBXL5 
#>             7             7             2            14             1 
#>         FBXO5          FCAR        FCER1A        FCER1G        FCGR1A 
#>             3             2            18             8             7 
#>        FCGR2A         FCGRT          FCN1          FGD4          FGL2 
#>             2             9             8             2             8 
#>           FGR        FKBP1A          FLT3         FOLR3           FOS 
#>             9             4            18             7            17 
#>          FOSB         FOSL2          FPR1          FTH1           FTL 
#>            17             2             7             8             8 
#>         FTLP3          G0S2     GABARAPL1         GAPDH          GAS6 
#>            11             2             2             4            13 
#>          GAS7           GCA         GINS2        GLIPR1          GLMP 
#>             2            12            14             9             3 
#>          GLRX        GLT1D1          GLUL          GMFG         GNAI2 
#>             1             2             2             3             6 
#>          GNAQ           GNS        GPBAR1        GPCPD1          GPX1 
#>             7             2            11             7            12 
#>          GPX4           GRN           GSN         GSTO1         GSTP1 
#>             1            12            18             6             8 
#>          GZMB         H2AFY         H2AFZ         H3F3B         HACD4 
#>            13             6             3            17             2 
#>           HCK         HIPK3      HIST1H4C           HK2      HLA-DPA1 
#>            11             7             5             7            19 
#>      HLA-DPB1      HLA-DQA1       HLA-DRA      HLA-DRB1      HLA-DRB6 
#>            19            19            19            19            19 
#>         HMGB2          HNMT         HSBP1      HSP90AB1         HSPA8 
#>             1             7             7             5            20 
#>           ID1          IER2          IER3          IER5         IFI30 
#>             2            17             2             2            11 
#>        IFITM3        IFNGR1        IFNGR2         IGSF6       IL13RA1 
#>            11             2             2             8             2 
#>          IL1B         IL1RN        IQGAP1         IRAK3       IRF2BP2 
#>            15             2             2             7             2 
#>          IRF7          IRF8          IRS2         ITGAM         ITM2B 
#>            13            13             1             7             3 
#>         ITM2C          JAML           JUN          JUNB          JUND 
#>            13             9            17            17             7 
#>         KCNE3        KCTD12      KIAA0930        KIF18A         KIF23 
#>             2             9             2             3             3 
#>         KIFC1         KLF10          KLF4          KLF6         KPNA2 
#>            16             2            15            17             3 
#>         LACTB         LAMP2         LAMP5       LAMTOR2       LAMTOR4 
#>             2             2            13             2             7 
#>          LCP1        LGALS1        LGALS2        LGALS3        LILRA2 
#>             6             8             8             8            11 
#>        LILRA4        LILRA5        LILRB2        LILRB3        LILRB4 
#>            13             7            11             2            18 
#>     LINC00877     LINC00996          LMO2        LPCAT2          LRP1 
#>             2            13             7             2             7 
#>        LRRC25        LRRC26         LRRK2          LSM6          LST1 
#>            11            13            15             1             8 
#>         LTA4H          LY96          LYST           LYZ        MAD2L1 
#>             2             7            11             8             3 
#>          MAFB         MAP1A        MARCKS        MCEMP1          MCL1 
#>            11            13            15             7             7 
#>          MCM7         MEGF9        METTL9         MFSD1         MGST1 
#>            14             7             1             2            12 
#>         MGST2          MIDN         MKI67          MNDA         MPEG1 
#>             9             7            16             8             9 
#>        MS4A6A         MS4A7         MSRB1         MYADM         MYBL2 
#>            12            11             7            17            13 
#>          MYL6          MZB1         NAMPT         NAPRT         NAPSB 
#>             6            13            15             2            13 
#>          NCF2       NDUFA11        NDUFB1        NDUFS6          NEFH 
#>            11             1             2             6             3 
#>         NFAM1          NFE2        NFKBIA        NFKBIZ         NINJ1 
#>             7             2            15             2             1 
#>         NLRP3          NPC2          NPM1         NR4A2          NRG1 
#>             2             8            20            17             2 
#>        NUDT16          NUF2        NUP214        NUSAP1          OAZ1 
#>             2             3            12            16             9 
#>         ODF3B        OGFRL1         OSCAR         OTUD1        P2RY13 
#>            15            15             7             2             2 
#>       PACSIN1          PAK1         PCBP1         PCLAF          PCNA 
#>            13             4             1            14            14 
#>         PFDN5          PFN1           PGD          PGLS          PHEX 
#>             3             6             7             2            13 
#>        PICALM          PID1         PILRA        PLA2G7         PLAUR 
#>             2            15            11             7            15 
#>         PLBD1          PLD4          PLEK          PLP2        PLSCR1 
#>            12            13             9             3             2 
#>        PLXDC2        PLXNB2         PNRC1         POLE4        POLR2L 
#>             7             2             3             2             6 
#>          POMP          PPIB      PPP1R14B      PPP1R15A         PRAM1 
#>             6            10            13            17            11 
#>         PRDX3       PRELID1          PROC          PSAP         PSMB3 
#>             1             2            13             8             6 
#>         PTAFR         PTGDS         PTGS2          PTMS         PTPRE 
#>             2            13             2            10             9 
#>         PTTG1        PYCARD          PYGL           QKI          QPCT 
#>            16             9             2             2             7 
#>     RAB11FIP1         RAB31         RAB32         RAB3D          RAC1 
#>             2             9            15             2             9 
#>        RASSF2        RASSF4         RBM47          RBP7        RCBTB2 
#>             7             7             7            12             1 
#>          RETN          RGS2          RHOA          RHOB          RHOG 
#>             7            15             6            15             4 
#>         RIPK2     RN7SL368P     RNA5SP151        RNASE2        RNASE6 
#>             2             2             2             7             9 
#>        RNF130        RNF181        RNF19B      RNU4ATAC      RNVU1-19 
#>             9             1             2             1             2 
#> RP11-1143G9.4  RP11-160E2.6 RP11-363K21.1   RP11-40C6.2 RP11-522I20.3 
#>            12             2             2            20             3 
#>  RP11-700H6.1  RP11-73G16.2        RPL10A        RPL23A        RPL27A 
#>             3            13            20            20            20 
#>          RPL3         RPL30         RPL31          RPL5         RPLP0 
#>            20            20            20            20            20 
#>         RPLP2         RPS12         RPS18          RPS2         RPS24 
#>            20            20            20            20             3 
#>         RPS27        RPS27A         RPS29          RPS3         RPS4X 
#>            20            20            20            20            20 
#>          RPS5          RPS6          RPSA          RRM2          RTN3 
#>            20            20            20            16             7 
#>          RTN4          RXRA       S100A10       S100A11       S100A12 
#>             6             7             4             8            12 
#>        S100A4        S100A6        S100A8        S100A9        SAMHD1 
#>             8             8            12            12             9 
#>         SAP30          SAT1        SCAMP5        SCPEP1           SCT 
#>             3             8            13             2            13 
#>         SDCBP        SEC11A        SEC61B        SEC61G        SEMA4A 
#>             2             6            10            10             2 
#>         SERF2         SERP1      SERPINA1      SERPINF1          SGK1 
#>             9             3             8            13            15 
#>          SGO1          SGO2      SH3BGRL3         SIRPA        SIRPB1 
#>             3             3             4             2             7 
#>       SLC11A1       SLC15A3       SLC15A4       SLC16A3       SLC31A2 
#>            11             2            13             2             2 
#>        SLC7A7        SLC8A1         SMCO4        SMIM25         SMIM5 
#>            11             2            11            11            13 
#>         SMPD3          SNCA     SNORD3B-2         SNX10          SOD2 
#>            13             1             2             2            15 
#>         SPCS1         SPG21          SPI1          SRGN        SRPK2P 
#>            10             2             8             8             3 
#>         STAB1         STMN1         STX11        STXBP2         SULF2 
#>             7             5             4             8             2 
#>       SULT1A1           SYK        TALDO1          TCF4          TET2 
#>            11             2             7            13             1 
#>          TFEC         TGFBI        TIMM8B         TIMP1         TIMP2 
#>             2             9             6             4             2 
#>           TK1           TKT          TLR2          TLR8       TMEM107 
#>            14            11             2             7             1 
#>      TMEM167A      TMEM170B       TMEM205       TNFAIP2       TNFAIP3 
#>             7             2             2            15            17 
#>      TNFRSF1B      TNFRSF21       TNFSF10      TNFSF13B         TONSL 
#>            11            13             2            12             3 
#>         TOP2A          TPI1          TPM2          TPP1          TPX2 
#>            16             4            13             2            16 
#>         TREM1         TRIB1          TSPO        TUBA1B        TUBA1C 
#>             2             2             8             3             3 
#>          TUBB        TUBB4B           TXN          TYMP          TYMS 
#>             5             3            10             8            16 
#>        TYROBP          UGCG       UNC93B1        UQCR11          USP1 
#>             8            13            18             1             3 
#>         VAMP8          VCAN           VIM          VSIR         VSTM1 
#>             9            12             4             4             7 
#>          WARS          YBX3         YPEL5         YWHAE          ZEB2 
#>            11            15            17             6             9 
#>        ZFAND5         ZFP36       ZFP36L1       ZNF385A         ZNF81 
#>             7            17             2             9             3 
#>        ZSWIM6         ZWINT 
#>             2            14

However, it is much more convenient to look at the correlation heatmaps of these clusters, so let’s generate those:

mcell_plot_gset_cor_mats(gset_id = gset_nm, scmat_id = sub_mat_id)
#> will downsample the matrix, N= 750

The plots are generated under a subdirectory of the figures directory named gset_nm.gset_cors. Common practice is to examine them and select the clusters that we want to keep. In our example, In this case we would probably keep clusters 3, 5, 10, 14, 16 and 17 in the gene set: clus3clus5clus10clus14clus16clus17

The rest of the clusters contain mostly genes that are correlated to cell size but not related to the cell cycle.

To create the final filtered gene set, we’ll keep only the clusters we’ve selected in the gene set, and we’ll save the filtered gene set in the database:

mcell_gset_remove_clusts(gset_id = gset_nm, filt_clusts = c(3, 5, 10, 14, 16, 17), new_id = paste0(gset_nm, "_filtered"), reverse=T)

lateral_gset_id = paste0(gset_nm, "_filtered")
lateral_gset = scdb_gset(lateral_gset_id)
print(lateral_gset)
#> An object of class "tgGeneSets"
#> Slot "description":
#> [1] "Cell cycle and stress correlated genes hc K=20 cl_filt"
#> 
#> Slot "set_names":
#> [1]  3  5 10 14 16 17
#> 
#> Slot "gene_set":
#>    AC098614.2         ACTG1        ADGRF3       ALOX5AP     ARHGAP11A 
#>             3             5             3            10            14 
#>       ARHGDIB         ASF1B        ATP5G2         ATP5L         AURKA 
#>             5            14             3             3             3 
#>         BIRC5      C12orf75         CASZ1         CCNA2         CCNB1 
#>            16            10             3            16            16 
#>         CCNL1          CCR9          CD44         CDCA8        CDKN2C 
#>            17             3             3            16             3 
#>         CDKN3          CDT1         CENPF         CENPK         CHEK1 
#>            16            14            16            14            14 
#>         CKAP2         CKS1B          CKS2        COX6B1         COX7B 
#>             3             3             3             3             3 
#>         DSCC1         DUSP1           DUT          E2F1          ECT2 
#>             3            17             3            14             3 
#>     EEF1A1P42          EIF1          EZH2         FANCI         FBXO5 
#>             3            17             3            14             3 
#>           FOS          FOSB         GINS2          GLMP          GMFG 
#>            17            17            14             3             3 
#>         H2AFZ         H3F3B      HIST1H4C      HSP90AB1          IER2 
#>             3            17             5             5            17 
#>         ITM2B           JUN          JUNB        KIF18A         KIF23 
#>             3            17            17             3             3 
#>         KIFC1          KLF6         KPNA2        MAD2L1          MCM7 
#>            16            17             3             3            14 
#>         MKI67         MYADM          NEFH         NR4A2          NUF2 
#>            16            17             3            17             3 
#>        NUSAP1         PCLAF          PCNA         PFDN5          PLP2 
#>            16            14            14             3             3 
#>         PNRC1          PPIB      PPP1R15A          PTMS         PTTG1 
#>             3            10            17            10            16 
#> RP11-522I20.3  RP11-700H6.1         RPS24          RRM2         SAP30 
#>             3             3             3            16             3 
#>        SEC61B        SEC61G         SERP1          SGO1          SGO2 
#>            10            10             3             3             3 
#>         SPCS1        SRPK2P         STMN1           TK1       TNFAIP3 
#>            10             3             5            14            17 
#>         TONSL         TOP2A          TPX2        TUBA1B        TUBA1C 
#>             3            16            16             3             3 
#>          TUBB        TUBB4B           TXN          TYMS          USP1 
#>             5             3            10            16             3 
#>         YPEL5         ZFP36         ZNF81         ZWINT 
#>            17            17             3            14

Generating metacells while blacklisting our lateral gene set

This section will follow the exact same steps performed in the Running metacell analysis: guided tutorial on 8K PBMCs vignette in order to generate the K-nn graph, the co-clustering matrix and the metacells, with the small change of disabling the lateral genes from affecting the process.

We start by removing the lateral genes from the list of feature genes:

marker_gset = scdb_gset("test_feats")   
marker_gset = gset_new_restrict_gset(gset = marker_gset, filt_gset = lateral_gset, inverse = T, desc = "cgraph gene markers w/o lateral genes")
scdb_add_gset("test_feats_filtered", marker_gset)

We’ll now generate the graph using the filtered list of feature genes:

mcell_add_cgraph_from_mat_bknn(mat_id="test", 
                gset_id = "test_feats_filtered", 
                graph_id="test_graph_filtered",
                K=100,
                dsamp=T)
#> will downsample the matrix, N= 1877
#> will build balanced knn graph on 8276 cells and 893 genes, this can be a bit heavy for >20,000 cells
#> sim graph is missing 39 nodes, out of 8276

And now simply following the steps of the Running metacell analysis: guided tutorial on 8K PBMCs vignette from the K-nn graph until the colored metacell object:

mcell_coclust_from_graph_resamp(
                coc_id="test_coc500_filtered", 
                graph_id="test_graph_filtered",
                min_mc_size=20, 
                p_resamp=0.75, n_resamp=500)
#> running bootstrap to generate cocluster
#> 0%...28%...40%...53%...65%...78%...91%...100%
#> done resampling

mcell_mc_from_coclust_balanced(
                coc_id="test_coc500_filtered", 
                mat_id= "test",
                mc_id= "test_filtered_mc", 
                K=30, min_mc_size=30, alpha=2)
#> filtered 4999536 left with 741923 based on co-cluster imbalance
#> building metacell object, #mc 77
#> add batch counts
#> compute footprints
#> compute absolute ps
#> compute coverage ps
#> reordering metacells by hclust and most variable two markers
#> reorder on HLA-DRB1 vs LDHB

mcell_mc_split_filt(new_mc_id="test_filtered_mc_f", 
            mc_id="test_filtered_mc", 
            mat_id="test",
            T_lfc=3, plot_mats=F)
#> starting split outliers
#> splitting metacell 2
#> add batch counts
#> compute footprints
#> compute absolute ps
#> compute coverage ps

marks_colors = read.table(system.file("extdata", "pbmc_mc_colorize.txt", package="metacell"), sep="\t", h=T, stringsAsFactors=F)
mc_colorize("test_filtered_mc_f", marker_colors=marks_colors)

It is convenient not to restrict the selection of genes that will be displayed in the heatmap of genes and metacells, but to mark the lateral genes that did not affect the partitioning of cells to metacells:

mcell_gset_from_mc_markers(gset_id="test_filtered_markers", mc_id="test_filtered_mc_f", blacklist_gset_id=lateral_gset_id)
mcell_gset_from_mc_markers(gset_id="test_filtered_markers_lat", mc_id="test_filtered_mc_f", filt_gset_id=lateral_gset_id)
    
mcell_mc_plot_marks(mc_id="test_filtered_mc_f", gset_id="test_filtered_markers", mat_id="test", lateral_gset="test_filtered_markers_lat")

heatmap_marks The marker genes in the test_filtered_markers gene set do not contain any lateral gene (shown in black in the heatmap), and the test_filtered_markers_lat gene set contain the lateral genes that were selected as markers (shown in red).