Guided Analysis for Gene Signature Based Repositioning

1 Abstract

Identifying drugs targeting specific gene perturbations is facilitated by the relinc and slinky packages. This document describes how to conduct such an analysis.

2 Pre-requisites.

The relinc package is designed to expedite analysis by storing zscores on a redis server. Do not open your redis server to the world. If working on AWS, open port 6379 only to other servers within a specified security group. Use similar strategies in other environments to keep your redis instance behind a firewall. If you do not, you will find your redis server hacked within minutes.

The following tutorial was completed using an m5.4xlarge instance type ($0.77/hour for on demand, presently $0.25/hour for spot pricing) running Ubuntu Server 20.04 LTS (HVM) (ami-0801628222e2e96d6), with a 100GB root volume.

First we need to install some system libraries for R, and redis.

sudo -s
apt-get update
apt-get -y install build-essential
apt-get -y install gfortran
apt-get -y install libssl-dev
apt-get -y install libcurl4-openssl-dev
apt-get -y install libxml2-dev
apt-get -y install libtool-bin
apt-get -y install libhiredis-dev
apt-get -y install emacs-nox

sudo apt-get -y install redis-server

We can then install the latest R (at least, the latest relase from version 4)

# a little shell magic to get our Ubuntu release's "codename"
RELEASE=$(lsb_release -as 2>/dev/null | tail -n1)

echo "deb https://cloud.r-project.org/bin/linux/ubuntu ${RELEASE}-cran40/" >> /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9
gpg -a --export 51716619E084DAB9 | sudo apt-key add -
apt-get update
apt-get -y install r-base

Optionally, you can install RStudio server to make debugging and plotting simpler.

sudo apt-get install gdebi-core
wget https://download2.rstudio.org/server/bionic/amd64/rstudio-server-1.4.1717-amd64.deb
sudo gdebi rstudio-server-1.4.1717-amd64.deb

We are now ready to install libraries we will need to complete the analysis. From within an R session, run the following to check for required libraries and install any that are missing.


req <- c(
    "R6",
    "doMC",
    "rmdformats",
    "foreach",
    "parallel",
    "doParallel",
    "slinky",
    "redux",
    "flock",
    "dplyr",
    "devtools",
    "org.Hs.eg.db",
    "snow", 
    "doSNOW",
    "redis")

ix <- which(req %in% installed.packages())
if(length(ix < length(req))) {
  setRepositories(ind = c(1:6))
  install.packages(req[-ix])
}

if(!require(relinc))
  devtools::install_github("vanandelinstitute/relinc")

Make sure the redis-server is configured to store the snapshot somewhere with enough room (the default location, /var/lib/redis, is fine if your root drive has 50GB or so free), and to allow connections from outside (by commenting out the “bind” line in the network section of redis.conf) if you are going to use separate servers to do the analysis (see below). Remember, “outside” means “other servers in your security group or behind your firewall”, not the entire world.

Significance testing requires boostrapping large number of samples, and thus having many cores is desirable. For example, we use 100,000 boostrapped samples for gene significance testing, which takes about 8 hours with 16 cores. If you are working in the cloud, you may not want to keep such a beefy server running all the time. The recommended approach is to have your redis server on a smaller server (a couple of processors and 30GB of RAM is recommended) which you leave running until prolonged periods of downtime are anticipated, and then one or more larger multi-core servers in the same security group when you are actively selecting drugs.

The analysis also requires the LINCS L1000 data files (expression data and instance “info”). These need to be fetched and stored somewhere memorable that has enough room for them. Once you have loaded the z-scores, you no longer need the expression data file.

mkdir data
cd data
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92742/suppl/GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx.gz
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92742/suppl/GSE92742_Broad_LINCS_inst_info.txt.gz
gunzip GSE92742_Broad_LINCS_Level3_INF_mlr12k_n1319138x12328.gctx.gz

Finally, the worked examples below assume that the relinc github repository is available in your current working directory so that it can access data files provided in that repository. This is simply achieved from a terminal session as follows:

git clone https://github.com/vanandelinstitute/relinc

3 Calculating zscores

Once your redis server is up and running, you can configure the client and calculate and store robust zscores as follows. The first command will prompt you for details about the location of your redis server and data files. You will also be prompted to save this information so you do not need to enter it each time.

library(relinc)
rl <- Relincr$new(interactive = TRUE)
rl$zscore_all_drugs()
rl$zscore_all_genes()

That will take quite a while–several hours or so, depending on how many cores you have. Note that the you will experience diminishing returns at this step with lots of cores, since this requires calls to the clue.io api to identify appropriate control samples for each instance. Relincr throttles these requests to to more than 1 every 2 seconds (even if using multiple cores). Since calculating the zscores also take a short bit of time, this is more efficient if you have multiple cores, but probably having more than 4-8 cores will not help any. (This is not true when bootstrapping p-values later on– for that task, the more parallelism the better)

Although it takes a while, this only needs to be done once. Once you have zscores calculated and loaded, you can use them forever assuming you hang on to the redis dump file–dump.rdb by default–so you can reload the data automatically in the event of a server reboot. For this reason, you may want to configure redis to store the dump file somewhere non-ethereal (such as a dedicated EBS volume you create and mount for this purpose). If you do decide to do that, remember that you must both update the dir setting in redis.conf and add your dump file directory to /etc/systemd/system/redis-server with a line something like this (but with the actual path to the mount you created):

ReadWriteDirectories=/mnt/data/redis/dump/directory

4 Gene scoring

The next step is to identify the gene signature for the gene perturbation of analysis. For this example, we derive our signature from instances in the LINCS database that were treated with short hairpins targetting the LMNA gene.


library(rredis)
r_host = "localhost"
r_port = 6379
redisConnect(host=r_host, port=r_port)

if(!exists("metadata")) metadata <- readRDS("relinc/docs/metadata.rds")
metadata$pert_desc <- tolower(metadata$pert_desc)

lam_inst <- which(metadata$pert_desc == "lmna" &  
                    metadata$is_gold & 
                    metadata$pert_type == "trt_sh")
lam_keys <- paste(metadata$distil_id[lam_inst])
data <- do.call(cbind, redisMGet(lam_keys))
ids <- rownames(data)
data <- apply(data, 2, as.numeric)
rownames(data) <- ids
saveRDS(data, file="relinc/docs/lamin.rds")

Robustly identifying differentially expressed genes (DEGs) continues to be an active area of research in itself. It is widely known that large scale gene expression efforts are plagued with “batch effects” wherein the variation between samples is overwhelmed by variation between plates, technicians, dates, labs, etc. We took two steps to minimize batch effects. First, we calculated z-scores using control samples from the same plate as the treated cells, thereby eliminating systematic biases between plates. Second, we converted our z-scores to ranks, thereby eliminating any isotonic (rank invariant) shifts in z-scores that may arise between batches. As we converted to rank space to eliminate isotonic batch effects, we proceeded with non-parametric approach to DEG identification.

To identify genes exhibiting consistent differential expression, we performed a “rank of ranks” analysis. First, the data was ranked sample-wise as described above. Second, the entire matrix of ranks (978 genes by 84 shRNA samples) was ranked and Kolmogorov Smirnov analysis was performed on the position of each occurrence of each gene within this vector of ranks. The resulting analysis quantifies the extent to which the expression of each gene was consistently biased up or down relative to all other genes.

ks <- function(x, ix) {
  n <- length(x)
  scores <- -rep(1/(n-length(ix)), n)
  inc <- 1/length(ix)
  
  # need to account for ties
  ix <- floor(x[ix])
  scores[ix] <- 0
  for(i in ix) {
    scores[i] = scores[i] + inc
  }
  if(-min(cumsum(scores)) >= max(cumsum(scores))) {
    return(0)
  } else {
    return(max(cumsum(scores)))
  }
}

# we want the largest values to have the smallest rank 
# (i.e. 1 = the largest score)
ranks <- apply(-data, 2, rank)
ranks <- rank(as.vector(ranks))
genes <- rep(rownames(data), ncol(data))

# for selecting down regulated genes, we invert the rankings
ranks_d <- rank(-ranks)

# now we quantify how biased (up or down regulated) each gene is across all 
# instances
up <- numeric(0)
down <- numeric(0)
for(g in unique(genes)) {
  up <- c(up, ks(ranks, which(genes==g) ) )
  down <- c(down, ks(ranks_d, which(genes==g) ) )
}
names(up) <- unique(genes)
names(down) <- unique(genes)

saveRDS(up, "./relinc/docs/lamin_sh_scores_up.rds")
saveRDS(down, "./relinc/docs/lamin_sh_scores_down.rds")

We want to identify those genes that change in expression specifically in response to our treatment of interest (in this case, treatment with short hairpins targetting LMNA), as opposed to being a generic response to perturbation. To do that, we bootstrap the distribution of scores for each gene based on 100,000 random sets of samples with the same class of treatment (short hairpins) and the same size as our sample set of interest (84 samples).

Note that this process takes a while and benefits from parallelization as demonstrated before. Long running processes such as this are best run within a detachable terminal session (such as screen or tmux) so you can log out and pick up the results later. The ubuntu environment created above includes both the screen and tmux terminal applications.

library(parallel)
library(doParallel)
library(rredis)

cores <- detectCores() - 1
registerDoParallel(cores)

# this is the same as above. Just copying here for clarity
ks <- function(x, ix) {
  n <- length(x)
  scores <- -rep(1/(n-length(ix)), n)
  inc <- 1/length(ix)
  
  # need to account for ties
  ix <- floor(x[ix])
  scores[ix] <- 0
  for(i in ix) {
    scores[i] = scores[i] + inc
  }
  if(-min(cumsum(scores)) >= max(cumsum(scores))) {
    return(0)
  } else {
    return(max(cumsum(scores)))
  }
}

metadata <- readRDS("./relinc/docs/metadata.rds")

r <- foreach(i=1:100000, .packages="rredis") %dopar% {
  r_host = "localhost"
  r_port = 6379

  # keep data set size the same as our dataset of interest
  rand_inst <- sample(which(metadata$pert_type == "trt_sh"), 84)
  rand_keys <- metadata$distil_id[rand_inst]

  redisConnect(host=r_host, port=r_port)
  rand_data <- do.call(cbind, redisMGet(rand_keys))
  redisClose()

  ids <- rownames(rand_data)
  rand_data <- apply(rand_data, 2, as.numeric)
  rownames(rand_data) <- ids

# invert sign so that largest values have smallest rank
  ranks <- apply(-rand_data, 2, rank)
  ranks <- rank(as.vector(ranks))
  genes <- rep(rownames(rand_data), ncol(rand_data))

  ranks_down <- rank(-ranks)
  up_r <- numeric(0)
  down_r <- numeric(0)

  for(g in unique(genes)) {
    up_r <- c(up_r, ks(ranks, which(genes==g) ) )
    down_r <- c(down_r, ks(ranks_down, which(genes==g) ) )
  }
  return(list(up=up_r, down=down_r, gene_ids=rownames(rand_data)))
}


f1 <- function(x) { return(x$up)}
f2 <- function(x) { return(x$down)}

up <- do.call(cbind, lapply(r, f1))
down <- do.call(cbind, lapply(r, f2))
rownames(up) <- rownames(down) <- r[[1]]$gene_ids

saveRDS(up, file="./relinc/docs/ks_dist_sh_84_up.rds")
saveRDS(down, file="./relinc/docs/ks_dist_sh_84_down.rds")

Now with these bootstrapped estimates of the probability distribution of ks scores under the current conditions of interest, we can convert our ks scores for each gene in the LMNA perturbed instances to adjusted p-values as follows.

up <- readRDS("./relinc/docs/ks_dist_sh_84_up.rds")
down <- readRDS("./relinc/docs/ks_dist_sh_84_down.rds")

f1 <- function(x) { x - up }
up_p <- apply(kspdf_up, 2, f1)
up_p <- apply(up_p, 1, function(x) { sum(x >= 0)} )
up_p <- (up_p + 1) / 100000  # constrain to >= 0.00001 as that is limit of 
                             # detection
up_p_adj <- p.adjust(up_p, "fdr")

  
f2 <- function(x) { x - down }
down_p <- apply(kspdf_down, 2, f2)
down_p <- apply(down_p, 1, function(x) { sum(x > 0)} )
down_p <- (down_p + 1) / 100000  # constrain to >= 0.00001 as that is limit 
                                 # of detection
down_p_adj <<- p.adjust(down_p, "fdr")

Final gene selection for the LMNA signature can then be determined by setting appropriate thresholds on the KS score and adjusted p-value for each gene. For example, you might require that the KS score be greater than 0.2 and the adjusted p-value be less than 0.001. This filtering would be done on both the up and down scores to obtain the up-regulated and down-regulated genes for the signature, respectively.

5 Drug scoring

For this example we use a signature of genes that are up and down regulated in stenotic vs. normal atrioventricular valve tissue. The signatures are provided as gene symbols which must be converted to entrez gene ids and then filtered on the genes that are in the 978 landmark genes directly profiled by the L1000 project.


# get the rownames of the stored zscores

rn <- rownames(rl$fetch_rand(1))

up <- read.delim("av_stenosis_up.grp")[,1]
up.eg <- unlist(mget(up, org.Hs.egSYMBOL2EG, ifnotfound=NA))
ix <- which(up.eg %in% rn)
up.eg <- up.eg[ix]
writeLines(up.eg, "av_stenosis_up_eg.grp")

down <- read.delim("av_stenosis_down.grp")[,1]
down.eg <- unlist(mget(down, org.Hs.egSYMBOL2EG, ifnotfound=NA))
ix <- which(down.eg %in% rn)
down.eg <- down.eg[ix]
writeLines(down.eg, "av_stenosis_down_eg.grp")

This results in up and down gene lists which happen to both be 13 genes in length. We must create a matrix of enrichment scores derived from random gene signatures of this length to use downstream for bootstrap p-value estimation. We want to restrict our sampling to instances that have been treated with FDA approved drugs as that is the universe from which we will be calculating scores in subsequent analysis.


library(parallel)
library(doParallel)
library(rredis)

metadata <- readRDS("relinc/docs/metadata.rds")
metadata$pert_desc <- tolower(metadata$pert_desc)

fda_ix <- which(metadata$is_fda & metadata$is_gold)

fda_keys <-metadata$distil_id[fda_ix]

bigFetch <- function(keys, stride=2000) {
  data <- matrix(0, ncol = length(keys), nrow = 978)
  keys <- split(keys, ceiling(seq_along(fda_keys)/stride))

  for (i in 1:length(keys)) {
    r <- redisMGet(keys[[i]])
    for (ii in 1:length(r)) {
      if (length(r[[ii]]) >  0) {
        data[,((i - 1) * stride + ii)] <- r[[ii]]
      }
    }
    print(i*stride)
  }
  data
}

redisConnect()
data <- bigFetch(fda_keys)
data <- apply(data, 2, as.numeric)
missing <- which(apply(data, 2, sum)==0)
data <- data[,-missing]
fda_ix <- fda_ix[-missing]
rownames(data) <- names(redisGet(fda_keys[1]))
redisClose()

The enrichment score we are using is the very simple XSUM statistic which Agarwal et al. have demonstrated is quite performant compared to other metrics they tested. We generate 10,000 random signatures (each having 13 “up” genes and 13 “down” genes to match our actual disease signature from above), and for each random signature we calculate and store the XSUM statistic for that signature.

Note that this take a while and benefit from parallelization, as implemented below.

xsum <- function(x, up, down, n=489) {
  up.ix <- which(rownames(x) %in% up)
  down.ix <- which(rownames(x) %in% down)
  f <- function(a) {
    a_r <- rank(a)
    changed <- a * (a_r > ( length(a_r) - n) | a_r < n)
    sum(changed[up.ix]) - sum(changed[down.ix])  
  }
  apply(x, 2, f)
}

cores <- detectCores() - 1
registerDoParallel(cores)
  
r <- foreach(i=1:10000, .packages="rredis", .combine="cbind") %dopar% {  
  up <- sample(rownames(data), 12)
  down <- sample(rownames(data), 12)
  scores <- xsum(data, up, down)
  names(scores) <- metadata$pert_desc[fda_ix]
  
  drugs <- metadata$pert_desc[fda_ix]
  spec <- numeric()
  for(s in unique(drugs)) {
    spec <- c(spec, median(scores[which(drugs == s)], na.rm=TRUE))
  }
  names(spec) <- unique(drugs)
  spec
}

saveRDS(r, file="xsum_raw_median_489_13_13.rds")

This matrix of bootstrapped values can then be used to estimate p-values when we calculate the enrichment score for each drug against our disease signature as follows.

We then score each drug against the disease signature described above.


# NOTE: We swap the signatures because we want to REVERSE it, not mimic it
up <- read.delim("./relinc/docs/av_stenosis_down_eg.grp")[,1]
down <- read.delim("./relinc/docs/av_stenosis_up_eg.grp")[,1]
scores <- xsum(data, up, down)
names(scores) <- metadata$pert_desc[fda_ix]

drugs <- metadata$pert_desc[fda_ix]
spec <- numeric()
for(s in unique(drugs)) {
  spec <- c(spec, median(scores[which(drugs == s)], na.rm=TRUE))
}
names(spec) <- unique(drugs)

specpdf <- readRDS("relinc/docs/xsum_raw_median_489_13_13.rds")
ix <- match(names(spec), rownames(specpdf))
specpdf <- specpdf[ix,]
spec_p <- apply(sweep(specpdf, 1, spec, "-" ), 1, 
                function(x) { sum(x>0)}) /  10001
spec_p[which(spec < 0)] <- NA
res <- cbind(spec, spec_p)

drugs <- head(rownames(res[order(res[, 2]),]))
drugs <- c("cyproheptadine", "fexofenadine")
ix <- match(gg, rownames(data))
data.sym <- data
rownames(data.sym) <- mget(rownames(data), org.Hs.egSYMBOL)
par(mfrow=c(3,2))
for(d in drugs) {
  ix.d <- which(names(scores) == d)
  pheatmap(data.sym[ix, ix.d], 
           #legend = FALSE, 
           gaps_row = c(12),
           cluster_rows = FALSE, 
           cluster_cols = FALSE,
           color = colorRampPalette(c("blue", "white", "red"))(11),
           breaks=seq(-5, 5),
           main = d)
}


write.table(tt, quote=FALSE, 
            sep="\t", 
            row.names=TRUE, 
            col.names=TRUE, 
            file = "relinc/docs/scores.tab")


pheatmap(data[ix, ix.cyp], cluster_rows = FALSE, cluster_cols = FALSE,
         color = colorRampPalette(c("blue", "white", "red"))(21),
         breaks=seq(-10, 10))

We can then perform some QA on these results to ensure the calculation was performed as expected.


redisConnect()
ix <- which(metadata$pert_desc = "cyproheptadine" & metadata$is_gold)
keys <- metadata$distil_id[ix]
dat <- redisMGet(keys)
dat <- as.data.frame(dat)
rownames(dat) <- names(redisGet(keys[1]))

gg <- c(up, down)
ix <- match(gg, rownames(dat))
dat <- dat[ix,]
saveRDS(dat, "cyproheptadine.rds")

dat <- readRDS("cyproheptadine.rds")