Function to predict ontology terms for genomes with domain architectures (including individual domains)

Description

dcAlgoPredictGenome predicts ontology terms for genomes with domain architectures (including individual domains).

Usage

dcAlgoPredictGenome(input.file, RData.HIS = c(NULL, "Feature2GOBP.sf", "Feature2GOMF.sf", 
  "Feature2GOCC.sf", "Feature2HPPA.sf", "Feature2GOBP.pfam", "Feature2GOMF.pfam", 
      "Feature2GOCC.pfam", "Feature2HPPA.pfam", "Feature2GOBP.interpro", "Feature2GOMF.interpro", 
      "Feature2GOCC.interpro", "Feature2HPPA.interpro"), weight.method = c("none", 
      "copynum", "ic", "both"), merge.method = c("sum", "max", "sequential"), scale.method = c("log", 
      "linear", "none"), feature.mode = c("supra", "individual", "comb"), slim.level = NULL, 
      max.num = NULL, parallel = TRUE, multicores = NULL, verbose = T, RData.HIS.customised = NULL, 
      RData.location = "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR")

Arguments

input.file
an input file containing genomes and their domain architectures (including individual domains). For example, a file containing Hominidae genomes and their domain architectures can be found at http://dcgor.r-forge.r-project.org/data/Feature/Hominidae.txt. As this example shows, the input file must contain a header (in the first row) and two columns: the 1st column for 'Genome' (a genome acting as a container) and the 2nd column for 'Architecture' (SCOP domain architectures, each represented as comma-separated domains). Alternatively, input.file can be a matrix or data frame, on the assumption that the input file has already been read. Note: the file should use the tab character as the field separator between columns
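The layout described for input.file can be sketched with a small in-memory data frame. This is only an illustration: the genome labels (g1, g2) and domain identifiers (d1, d2, d3) are made-up placeholders, not real SCOP identifiers.

```r
# Toy input with the two required columns. Genome labels and domain ids are
# placeholders for illustration only, not real SCOP identifiers.
input.df <- data.frame(
  Genome       = c("g1", "g1", "g2"),
  Architecture = c("d1,d2", "d3", "d1,d2,d3"),
  stringsAsFactors = FALSE
)

# Write it as a tab-delimited file with a header row.
tmp <- tempfile(fileext = ".txt")
write.table(input.df, file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Reading it back shows the same two-column structure; the commas inside
# 'Architecture' are preserved because the field separator is a tab.
roundtrip <- read.delim(tmp, header = TRUE, stringsAsFactors = FALSE)
```

Under these assumptions, either the file path in `tmp` or the data frame `input.df` has the shape that input.file is described as accepting.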
RData.HIS
RData to load. This RData conveys two bits of information: 1) feature (domain) type; 2) ontology. It stores the hypergeometric scores (hscore) between features (individual domains or consecutive domain combinations) and ontology terms. The RData name tells which domain type and which ontology to use. It can be: SCOP sf domains/combinations (including "Feature2GOBP.sf", "Feature2GOMF.sf", "Feature2GOCC.sf", "Feature2HPPA.sf"), Pfam domains/combinations (including "Feature2GOBP.pfam", "Feature2GOMF.pfam", "Feature2GOCC.pfam", "Feature2HPPA.pfam"), InterPro domains (including "Feature2GOBP.interpro", "Feature2GOMF.interpro", "Feature2GOCC.interpro", "Feature2HPPA.interpro"). If NA, then the user has to input a customised RData-formatted file (see RData.HIS.customised below)
weight.method
the method used to weight predictions. It can be one of "none" (no weighting; the default), "copynum" for weighting by the copy number of architectures, "ic" for weighting by the information content (ic) of the term, and "both" for weighting by both copy number and ic
merge.method
the method used to merge predictions for each component feature (individual domains and their combinations derived from domain architecture). It can be one of "sum" for summing up, "max" for the maximum, and "sequential" for the sequential merging. The sequential merging is done via: \sum_{i=1}{\frac{R_{i}}{i}}, where R_{i} is the i^{th} ranked highest hscore
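The three merge.method options can be compared on a toy vector of hscores. This is a minimal sketch of the formula as stated above, not the package's internal code; the helper name `merge_sequential` and the score values are assumptions made for illustration.

```r
# Sequential merging: rank hscores from highest to lowest, divide the
# i-th ranked score by its rank i, and sum the results.
# (Hypothetical helper; not part of the package.)
merge_sequential <- function(hscores) {
  ranked <- sort(hscores, decreasing = TRUE)
  sum(ranked / seq_along(ranked))
}

# Made-up hscores for one term across three component features.
hscores <- c(2, 4, 1)

merged <- c(
  sum        = sum(hscores),
  max        = max(hscores),
  sequential = merge_sequential(hscores)  # 4/1 + 2/2 + 1/3
)
```

For these values "sum" gives 7, "max" gives 4, and "sequential" falls in between, since lower-ranked scores contribute progressively less.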
scale.method
the method used to scale the predictive scores. It can be: "none" for no scaling, "linear" for being linearly scaled into the range between 0 and 1, "log" for the same as "linear" but with the scores log-transformed before being scaled. The scaling between 0 and 1 is done via: \frac{S - S_{min}}{S_{max} - S_{min}}, where S_{min} and S_{max} are the minimum and maximum values for S
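The "linear" and "log" options of scale.method can be sketched as follows. The helper names are hypothetical, and the use of the natural logarithm on strictly positive scores is an assumption, since the base of the log transform is not stated here.

```r
# Min-max scaling into [0, 1], following the formula given for scale.method.
scale_linear <- function(s) (s - min(s)) / (max(s) - min(s))

# Log variant: transform first, then apply the same min-max scaling.
# Assumption: natural log, and all scores are strictly positive.
scale_log <- function(s) scale_linear(log(s))

# Made-up predictive scores.
scores <- c(1, 10, 100)

linear_scaled <- scale_linear(scores)  # 0, 1/11, 1
log_scaled    <- scale_log(scores)     # 0, 0.5, 1
```

The log variant spreads out scores that the linear variant compresses towards 0. Note that this sketch divides by zero when all scores are equal.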
feature.mode
the mode used to define the features. It can be: "supra" for combinations of one or two successive domains (including individual domains; considering the order), "individual" for individual domains only, and "comb" for all possible combinations (including individual domains; ignoring the order)
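As a rough illustration of feature.mode, the sketch below derives the "individual" and "supra" features from one comma-separated architecture. The domain ids are placeholders, and this is one reading of the description above rather than the package's own implementation; "comb" is left out because the exact enumeration of combinations is not spelled out here.

```r
# One architecture, written as comma-separated domains (placeholder ids).
architecture <- "d1,d2,d3"
domains <- strsplit(architecture, ",", fixed = TRUE)[[1]]

# "individual": each distinct domain on its own.
individual <- unique(domains)

# "supra": individual domains plus ordered pairs of successive domains.
pairs <- if (length(domains) > 1) {
  paste(domains[-length(domains)], domains[-1], sep = ",")
} else {
  character(0)
}
supra <- unique(c(domains, pairs))
```

For "d1,d2,d3" this yields three individual features and five supra features (the three domains plus "d1,d2" and "d2,d3").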
slim.level
whether only slim terms are returned. By default, it is NULL and all predicted terms will be reported. If it is specified as a vector containing any values from 1 to 4, then only slim terms at these levels will be reported. Here is the meaning of these values: '1' for very general terms, '2' for general terms, '3' for specific terms, and '4' for very specific terms
max.num
whether only top terms per sequence are returned. By default, it is NULL and no constraint is imposed. If an integer is specified, then all predicted terms (with scores in decreasing order) beyond this number will be discarded. Notably, this parameter is applied after the preceding parameter slim.level
parallel
logical to indicate whether parallel computation with multicores is used. By default, it is set to TRUE, but parallel computation is not necessarily used. This is partly because the available parallel backends are system-specific (currently only Linux or Mac OS), and partly because it depends on whether the two packages "foreach" and "doMC" have been installed. They can be installed via: source("http://bioconductor.org/biocLite.R"); biocLite(c("foreach","doMC")). If they are not installed, this option will be disabled
multicores
an integer to specify how many cores will be registered as the multicore parallel backend to the 'foreach' package. If NULL, it will use half of the cores available on the user's computer. This option only works when parallel computation is enabled
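The "half of the cores" default can be approximated as below. This is only a sketch: it assumes the base 'parallel' package is used for core detection, which may differ from how the function itself determines the number of cores.

```r
# Number of cores reported by the base 'parallel' package.
cores <- parallel::detectCores()

# detectCores() may return NA on some platforms, so fall back to 1;
# otherwise take half of the cores, but never fewer than one.
multicores <- if (is.na(cores)) 1L else max(1L, as.integer(floor(cores / 2)))
```

A value computed this way could be passed explicitly as multicores when the automatic choice is not wanted.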
verbose
logical to indicate whether the messages will be displayed on the screen. By default, it is set to TRUE for display
RData.HIS.customised
a file name for RData-formatted file containing an object of S3 class 'HIS'. By default, it is NULL. It is only needed when the user wants to perform customised analysis. See dcAlgoPropagate on how this object is created
RData.location
a character string specifying the location of the built-in RData files. By default, the files are located remotely at "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR" and "http://dcgor.r-forge.r-project.org/data". For a user with a fast internet connection, this option can be left as the default. However, it is always advisable to download these files locally, especially when the function needs to be run many times: there is then no need to download the files remotely on every call (which would unnecessarily increase the runtime). For example, these files (all or some of them) can first be downloaded into your current working directory, and this option then set as: RData.location=".". If the RData to load is already part of the package itself, this parameter can be ignored (since this function will try to load it via the function data first). Here is the UNIX command for downloading all RData files (preserving the directory structure): wget -r -l2 -A "*.RData" -np -nH --cut-dirs=0 "http://dcgor.r-forge.r-project.org/data"
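Downloading a single RData file once and then pointing RData.location at the local copy might look like the sketch below. The helpers `rdata_url` and `fetch_rdata` are hypothetical names introduced here; the URL pattern (file name followed by ".RData?raw=true") is taken from the log messages shown in the Examples section and is assumed to still hold.

```r
# Build the remote URL for one built-in RData file (hypothetical helper).
rdata_url <- function(name,
    location = "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR") {
  paste0(location, "/", name, ".RData?raw=true")
}

# Download the file into 'destdir' unless a local copy already exists
# (hypothetical helper; requires network access when the file is missing).
fetch_rdata <- function(name, destdir = ".") {
  dest <- file.path(destdir, paste0(name, ".RData"))
  if (!file.exists(dest)) {
    utils::download.file(rdata_url(name), destfile = dest, mode = "wb")
  }
  dest
}

# Not run here (needs network access and the package):
# fetch_rdata("Feature2GOMF.sf")
# output <- dcAlgoPredictGenome(input.file, RData.HIS = "Feature2GOMF.sf",
#                               RData.location = ".", parallel = FALSE)
```

Repeated calls then reuse the local copy instead of downloading the file again.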

Value

a matrix of terms X genomes, containing the predicted scores (per genome) as a whole
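Because the result is a plain numeric matrix with terms as rows and genomes as columns, it can be queried with ordinary matrix operations. The sketch below uses three rows copied from the example output on this page; the 0.8 threshold is arbitrary and chosen only for illustration.

```r
# Three rows copied from the example output (terms in rows, genomes in columns).
output <- matrix(
  c(1.0000, 1.0000, 1.0000, 1.0000,
    0.8282, 0.8334, 0.8267, 0.8246,
    0.7433, 0.7332, 0.7443, 0.7430),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GO:0003674", "GO:0005488", "GO:0003824"),
                  c("gx", "hs", "of", "xp"))
)

# Scores for one genome, highest first.
hs_ranked <- sort(output[, "hs"], decreasing = TRUE)

# Terms scoring above an arbitrary threshold (0.8) in every genome.
shared <- rownames(output)[apply(output > 0.8, 1, all)]
```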

Note

none

Examples

# 1) Prepare an input file containing domain architectures
input.file <- "http://dcgor.r-forge.r-project.org/data/Feature/Hominidae.txt"
# 2) Do prediction using built-in data
output <- dcAlgoPredictGenome(input.file, RData.HIS="Feature2GOMF.sf", parallel=FALSE)
Start at 2015-07-23 12:28:29
Read the input file 'http://dcgor.r-forge.r-project.org/data/Feature/Hominidae.txt' ...
Predictions for 4 sequences (9214 distinct architectures) using 'Feature2GOMF.sf' RData, 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:28:29) ...
##############################
'dcAlgoPredict' is being called...
##############################
Start at 2015-07-23 12:28:29
Load the HIS object 'Feature2GOMF.sf' (2015-07-23 12:28:29) ...
'Feature2GOMF.sf' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR/Feature2GOMF.sf.RData?raw=true) has been loaded into the working environment
Predictions for 9214 architectures using 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:28:30)...
1 out of 9214 (2015-07-23 12:28:30)
922 out of 9214 (2015-07-23 12:28:37)
1844 out of 9214 (2015-07-23 12:28:40)
2766 out of 9214 (2015-07-23 12:28:43)
3688 out of 9214 (2015-07-23 12:28:46)
4610 out of 9214 (2015-07-23 12:28:48)
5532 out of 9214 (2015-07-23 12:28:51)
6454 out of 9214 (2015-07-23 12:28:55)
7376 out of 9214 (2015-07-23 12:28:58)
8298 out of 9214 (2015-07-23 12:29:00)
9214 out of 9214 (2015-07-23 12:29:03)
End at 2015-07-23 12:29:03
Runtime in total is: 34 secs
##############################
'dcAlgoPredict' has been completed!
##############################
A summary in terms of ontology terms using 'none' weight method (2015-07-23 12:29:03)...
Load the HIS object 'Feature2GOMF.sf' (2015-07-23 12:29:03) ...
'Feature2GOMF.sf' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR/Feature2GOMF.sf.RData?raw=true) has been loaded into the working environment
1 out of 4 (2015-07-23 12:29:03)
2 out of 4 (2015-07-23 12:29:04)
3 out of 4 (2015-07-23 12:29:06)
4 out of 4 (2015-07-23 12:29:07)
End at 2015-07-23 12:29:08
Runtime in total is: 39 secs
dim(output)
[1] 2836 4
output[1:10,]
               gx     hs     of     xp
GO:0003674 1.0000 1.0000 1.0000 1.0000
GO:0005488 0.8282 0.8334 0.8267 0.8246
GO:0005515 0.6853 0.6901 0.6819 0.6830
GO:0003824 0.7433 0.7332 0.7443 0.7430
GO:0043167 0.5006 0.4976 0.4971 0.4925
GO:0016787 0.4827 0.4732 0.4812 0.4798
GO:0016740 0.5363 0.5200 0.5361 0.5350
GO:0097159 0.4650 0.4481 0.4638 0.4606
GO:1901363 0.4538 0.4368 0.4524 0.4502
GO:0043168 0.4166 0.4213 0.4147 0.4107
# 3) Advanced usage: using customised data
x <- base::load(base::url("http://dcgor.r-forge.r-project.org/data/Feature2GOMF.sf.RData"))
Error: the input does not start with a magic number compatible with loading from a connection
RData.HIS.customised <- 'Feature2GOMF.sf.RData'
base::save(list=x, file=RData.HIS.customised)
Error in base::save(list = x, file = RData.HIS.customised): object 'x' not found
#list.files(pattern='*.RData') ## you will see an RData file 'Feature2GOMF.sf.RData' in local directory
output <- dcAlgoPredictGenome(input.file, parallel=FALSE, RData.HIS.customised=RData.HIS.customised)
Start at 2015-07-23 12:29:08
Read the input file 'http://dcgor.r-forge.r-project.org/data/Feature/Hominidae.txt' ...
Predictions for 4 sequences (9214 distinct architectures) using 'Feature2GOBP.sf' RData, 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:29:09) ...
##############################
'dcAlgoPredict' is being called...
##############################
Start at 2015-07-23 12:29:09
Load the HIS object 'Feature2GOBP.sf' (2015-07-23 12:29:09) ...
'Feature2GOBP.sf' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR/Feature2GOBP.sf.RData?raw=true) has been loaded into the working environment
Predictions for 9214 architectures using 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:29:11)...
1 out of 9214 (2015-07-23 12:29:11)
922 out of 9214 (2015-07-23 12:29:24)
1844 out of 9214 (2015-07-23 12:29:41)
2766 out of 9214 (2015-07-23 12:29:53)
3688 out of 9214 (2015-07-23 12:30:08)
4610 out of 9214 (2015-07-23 12:30:18)
5532 out of 9214 (2015-07-23 12:30:33)
6454 out of 9214 (2015-07-23 12:30:51)
7376 out of 9214 (2015-07-23 12:31:04)
8298 out of 9214 (2015-07-23 12:31:16)
9214 out of 9214 (2015-07-23 12:31:26)
End at 2015-07-23 12:31:26
Runtime in total is: 137 secs
##############################
'dcAlgoPredict' has been completed!
##############################
A summary in terms of ontology terms using 'none' weight method (2015-07-23 12:31:26)...
Load the HIS object 'Feature2GOBP.sf' (2015-07-23 12:31:26) ...
'Feature2GOBP.sf' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR/Feature2GOBP.sf.RData?raw=true) has been loaded into the working environment
1 out of 4 (2015-07-23 12:31:27)
2 out of 4 (2015-07-23 12:31:37)
3 out of 4 (2015-07-23 12:31:50)
4 out of 4 (2015-07-23 12:31:58)
End at 2015-07-23 12:32:06
Runtime in total is: 178 secs
dim(output)
[1] 11203 4
output[1:10,]
               gx     hs     of     xp
GO:0008150 1.0000 1.0000 1.0000 1.0000
GO:0009987 0.9451 0.9440 0.9455 0.9458
GO:0044699 0.8750 0.8741 0.8725 0.8713
GO:0044763 0.8430 0.8397 0.8412 0.8404
GO:0065007 0.8517 0.8472 0.8487 0.8494
GO:0008152 0.8505 0.8403 0.8490 0.8485
GO:0050789 0.8296 0.8250 0.8267 0.8272
GO:0032501 0.7676 0.7672 0.7651 0.7618
GO:0044707 0.7554 0.7545 0.7519 0.7490
GO:0032502 0.7531 0.7511 0.7505 0.7466