This is an R Markdown Notebook for the Springer chapter “Tailoring codon usage to the underlying biology for protein expression optimization”. When you execute code within the notebook, the results appear beneath the code.
This notebook relies on the mboost package. If you have not installed it yet, just run
if (!requireNamespace('mboost', quietly = TRUE)) {
  install.packages('mboost')
}
Our goal is to predict protein expression based on sequence features. You can either start from scratch, which means you first have to calculate the sequence-based predictors, e.g., for a completely new organism, or you might want to re-train parts of the model. In both cases, you first of all need protein expression data. A good source for this is PaxDb.
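As an illustration, a PaxDb abundance table could be loaded like this. This is a minimal sketch: the file name 'paxdb_yeast.txt' and the exact column layout are assumptions, so check the header of the file you actually download.
## minimal sketch of reading a PaxDb export: PaxDb files are tab-separated
## with '#' comment lines; the column names used here are assumed, not guaranteed
pax <- read.delim('paxdb_yeast.txt', comment.char = '#', header = FALSE,
                  col.names = c('internal_id', 'protein_id', 'abundance'))
## the notebook works with log protein abundance
pax$lprot <- log10(pax$abundance)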
require(mboost)
## Loading required package: mboost
## Warning: package 'mboost' was built under R version 3.6.3
## Loading required package: parallel
## Loading required package: stabs
## Warning: package 'stabs' was built under R version 3.6.3
## This is mboost 2.9-2. See 'package?mboost' and 'news(package = "mboost")'
## for a complete list of changes.
In the next step, we read the joint data frame of protein abundance and sequence-based predictors.
x <- readRDS('yeast_clean.RDS')
labels <- c('Log Protein Abundance',
            'COSEM current',
            'Average elongation rate',
            'Length',
            'Bottleneck index',
            'Ramp index',
            'Accuracy',
            'foo',
            'Log Transcript abundance',
            'GC3 content',
            '5\' RNA folding energy',
            'Number of Hairpins',
            'Effective Translation Time',
            'Total Translation Time')
names(labels) <- colnames(x)
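Since the label names are taken directly from colnames(x), a quick sanity check guards against a mismatch between labels and columns:
## sanity check: exactly one label per column of x
stopifnot(length(labels) == ncol(x))
str(x, give.attr = FALSE)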
Now, we split the data frame into training and test sets with a 70/30 split. For reproducibility, we set a defined seed.
set.seed(1234)
## create train and test sets
train.idx <- sample.int(nrow(x), size = floor(0.7 * nrow(x)), replace = FALSE)
train <- x[train.idx, ]
test <- x[-train.idx, ]
For actually fitting the model, we have some options: we can use a partial set of predictors, or choose a specific base learner for every predictor. mboost allows us to pick between ordinary least squares linear effects with bols (in fact, bols also supports ordinary least squares with categorical effects via dummy coding, group-wise interactions, and more), B-splines with bbs, tree-based learners with btree, and it also lets us impose monotonicity constraints via bmono for monotonic B-splines. The latter is especially valuable if we have prior knowledge about the predictors: e.g., it makes sense to impose a monotonicity constraint on the transcript abundance, since it is unlikely that higher transcript abundance will ever lead to lower protein abundance when fitted for the whole organism.
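To see what the monotonicity constraint does in isolation, here is a small synthetic example (the data are invented purely for illustration): bmono yields a non-decreasing fit by construction, while the unconstrained bbs fit is free to wiggle.
## toy comparison of a monotonic and an unconstrained B-spline base learner
set.seed(42)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- log1p(toy$x) + rnorm(200, sd = 0.3)
fit_mono <- gamboost(y ~ bmono(x), data = toy)  # non-decreasing by construction
fit_free <- gamboost(y ~ bbs(x), data = toy)    # unconstrained B-splines
plot(fit_mono, type = 'l')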
For trying out multiple models in parallel, we recommend the following way of creating a vector of model formulas:
model_fms <- c(paste0("lprot ~ bmono(COSEMcurrent) + bmono(AvgElRate) + BottleneckIndex + ",
                      "Accuracy + RampIndex + FiveRnaFolding + bmono(TranscriptAbundance) + ",
                      "GC3 + bols(NHairpins) + Length"),
               paste0("lprot ~ bmono(COSEMcurrent) + bmono(AvgElRate) + BottleneckIndex + ",
                      "Accuracy + RampIndex + FiveRnaFolding + ",
                      "GC3 + bols(NHairpins) + Length"),
               "lprot ~ bmono(COSEMcurrent) + bmono(TranscriptAbundance)",
               paste0("lprot ~ bmono(COSEMcurrent) + bmono(AvgElRate) + BottleneckIndex + ",
                      "Accuracy + RampIndex + FiveRnaFolding + bmono(TranscriptAbundance) + ",
                      "GC3 + bols(NHairpins)"),
               paste0("lprot ~ bols(COSEMcurrent) + bols(AvgElRate) + bols(BottleneckIndex) + ",
                      "bols(Accuracy) + bols(RampIndex) + bols(FiveRnaFolding) + bols(TranscriptAbundance) + ",
                      "GC3 + bols(NHairpins)"),
               "lprot ~ bmono(COSEMcurrent)")
names(model_fms) <- c("All variables", "Without Transcript Abundance",
                      "Only COSEM and Transcript Abundance", "Even More",
                      "Linear Approximation", "Only current")
model <- cvm <- r.squared <- vector("list", 6)
names(model) <- names(cvm) <- names(r.squared) <- names(model_fms)
Fitting the individual models consists of three steps. First, we fit each model with gamboost, since we are using different base learners; we run the optimization procedure for a maximum of mstop = 500 steps and keep track of all intermediate models in [1, 500]. Second, we choose the optimal stopping iteration mstop: there are multiple ways in which this can be done; we use the function cvrisk for performing cross-validation. Third, we set each model to its optimal mstop and calculate the R squared performance indicator on the test data. The calculation might take some time (a few minutes on a 2018 laptop).
for (i in 1:length(model_fms)) {
  ## fit model
  model[[i]] <- gamboost(as.formula(model_fms[i]),
                         data = train,
                         control = boost_control(mstop = 500, trace = FALSE))
  ## use cross-validation to obtain the optimal stopping iteration
  cvm[[i]] <- cvrisk(model[[i]])
  plot(cvm[[i]], main = names(model_fms[i]))
  ## set model to optimal stopping iteration
  mstop(model[[i]]) <- mstop(cvm[[i]])
  ## compute R2 on the test set
  suppressWarnings(r.squared[[i]] <- 1 - sum((test$lprot - predict(model[[i]], newdata = test))^2) /
                     sum((test$lprot - mean(test$lprot))^2))
}
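Before comparing the models, it can be instructive to check which base learners the boosting algorithm actually selected. summary() reports the selection frequencies; recent mboost versions additionally offer varimp() for a variable importance measure (assumed to be available here).
## selection frequencies of the individual base learners
summary(model[[1]])
## variable importance; varimp() is assumed available (newer mboost releases)
plot(varimp(model[[1]]))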
# print R2
r.squared
## $`All variables`
## [1] 0.490187
##
## $`Without Transcript Abundance`
## [1] 0.3632215
##
## $`Only COSEM and Transcript Abundance`
## [1] 0.4456215
##
## $`Even More`
## [1] 0.4904975
##
## $`Linear Approximation`
## [1] 0.4639162
##
## $`Only current`
## [1] 0.1375663
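For a quick overview, the test-set R squared values can also be ranked:
## rank the candidate models by test-set R squared
sort(unlist(r.squared), decreasing = TRUE)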
In addition, we can plot the estimated effect of each predictor across the different models, and compare the predicted with the measured protein abundance on the test set.
par(mfrow = c(3, 4), mar = c(5, 5, 0.5, 0.1))
for (var in variable.names(model[[1]])) {
  for (i in 1:length(model_fms)) {
    ## plot effect estimate for predictor 'var' in model i
    if (i == 1) {
      plot(model[[i]], type = "l", which = var, ylim = c(-1, 1),
           xlab = labels[var], lwd = 1.5)
    } else {
      if (var %in% variable.names(model[[i]]))
        lines(model[[i]], which = var, lty = i, col = i, lwd = 1.5)
    }
  }
}
legend("topright", names(model_fms), title = "Variables in model",
       lty = 1:6, col = 1:6)
par(mfrow = c(3, 1), mar = c(5, 5, 2.5, 0.1))
for (i in 1:length(model_fms)) {
  ## plot observed vs. predicted protein abundance on the test set
  plot(x = test$lprot, y = predict(model[[i]], newdata = test),
       pch = 20, xlab = 'Observed', ylab = 'Predicted',
       main = paste0('Test set (Variables: ', names(model_fms[i]), ')'))
  abline(0, 1)
  text(-1.5, 3, sprintf("R2 = %.4f", round(r.squared[[i]], 4)))
}
## Warning in bsplines(mf[[i]], knots = args$knots[[i]]$knots, boundary.knots =
## args$knots[[i]]$boundary.knots, : Some 'x' values are beyond 'boundary.knots';
## Linear extrapolation used.
## (this warning is emitted repeatedly, once per spline base learner and model)
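These warnings are expected: some test-set values lie outside the boundary knots that the B-spline base learners determined on the training data, so predictions there use linear extrapolation. If they clutter the output, the predictions can be wrapped in suppressWarnings(), as was already done for the R squared computation above:
## silence the extrapolation warnings when predicting on the test set
pred <- suppressWarnings(predict(model[[1]], newdata = test))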