Function to propagate ontology annotations according to an input file

Description

dcAlgoPropagate propagates ontology annotations, given an input file. This input file contains the original annotations between domains/features and ontology terms, along with the hypergeometric scores (hscore) in support of those annotations. The annotations are propagated to the ontology root (either retaining the maximum hscore or additively accumulating the hscore). After the propagation, ontology terms of increasing levels are determined based on the concept of Information Content (IC) to produce a slim version of the ontology. It returns an object of S3 class "HIS" with three components: "hscore", "ic" and "slim".
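The propagation step described above can be illustrated with a minimal sketch in base R. It is not the dcGOR implementation: the toy ontology, the term names and the helper functions ancestors_of and propagate_scores are all hypothetical. It also assumes that, under "sum", each direct annotation is counted once per ancestor; how the actual function treats terms reachable through several paths is not stated on this page.

```r
# Toy ontology as a child -> parents list (hypothetical term names)
parents <- list(
  root = character(0),
  A    = "root",
  B    = "root",
  C    = c("A", "B")
)

# Collect a term together with all of its ancestors
ancestors_of <- function(term, parents) {
  seen <- character(0)
  todo <- term
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      todo <- c(todo, parents[[current]])
    }
  }
  seen
}

# Propagate one feature's direct hscores up to the root
propagate_scores <- function(direct, parents, propagation = c("max", "sum")) {
  propagation <- match.arg(propagation)
  combine <- if (propagation == "max") max else sum
  # For each directly annotated term, the set of terms it reaches (itself + ancestors)
  reached <- lapply(names(direct), ancestors_of, parents = parents)
  all_terms <- unique(unlist(reached))
  # A term receives the direct scores of every annotated term that reaches it
  sapply(all_terms, function(term) {
    contributes <- vapply(reached, function(r) term %in% r, logical(1))
    combine(direct[contributes])
  })
}

# Direct annotations of a single feature (made-up scores)
direct <- c(C = 2.5, B = 4.0)
propagate_scores(direct, parents, "max")
propagate_scores(direct, parents, "sum")
```

With "max", the root inherits the largest direct score among its descendants; with "sum", it accumulates all of them.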

Usage

dcAlgoPropagate(input.file, ontology = c(NA, "GOBP", "GOMF", "GOCC", "DO", "HPPA",
    "HPMI", "HPON", "MP", "EC", "KW", "UP"), propagation = c("max", "sum"),
    output.file = "HIS.RData", verbose = T, RData.ontology.customised = NULL,
    RData.location = "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR")

Arguments

input.file
an input file used to build the object. It contains the original annotations between domains/features and ontology terms, along with the hypergeometric scores (hscore) in support of those annotations. For example, a file containing original annotations between SCOP domain architectures and GO terms can be found at http://dcgor.r-forge.r-project.org/data/Feature/Feature2GO.sf.txt. As this example shows, the input file must contain a header (in the first row) and three columns: the 1st column for 'Feature_id' (here SCOP domain architectures), the 2nd column for 'Term_id' (GO terms), and the 3rd column for 'Score' (hscore). Alternatively, input.file can be a matrix or data frame, on the assumption that the input file has already been read in. Note: the file should use the tab character as the field separator between columns
ontology
the ontology identity. It can be "GOBP" for Gene Ontology Biological Process, "GOMF" for Gene Ontology Molecular Function, "GOCC" for Gene Ontology Cellular Component, "DO" for Disease Ontology, "HPPA" for Human Phenotype Phenotypic Abnormality, "HPMI" for Human Phenotype Mode of Inheritance, "HPON" for Human Phenotype ONset and clinical course, "MP" for Mammalian Phenotype, "EC" for Enzyme Commission, "KW" for UniProtKB KeyWords, or "UP" for UniProtKB UniPathway. For details on which pairs of input domain and ontology are eligible, please refer to the online documentation at http://supfam.org/dcGOR/docs.html. If NA, the user has to supply a customised RData-formatted file (see RData.ontology.customised below)
propagation
how to propagate the score. It can be "max" for retaining the maximum hscore (the default) or "sum" for additively accumulating the hscore
output.file
an output file used to save the HIS object as an RData-formatted file (see 'Value' for details). If NULL, the object is saved to "HIS.RData" in the current working directory. If NA, no output file is written
verbose
logical indicating whether messages are displayed on the screen. By default, it is set to TRUE
RData.ontology.customised
the name of an RData-formatted file containing an object of S4 class 'Onto' (i.e. an ontology). By default, it is NULL. It is only needed when the user wants to perform a customised analysis using their own ontology. See dcBuildOnto for how to create this object
RData.location
a character string giving the location of the built-in RData files. By default, they are located remotely at "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR" and "http://dcgor.r-forge.r-project.org/data". Users with a fast internet connection can leave this option at its default. It is nevertheless advisable to download these files locally, especially when the function is run many times, since fetching them remotely on every call needlessly increases the runtime. For example, these files (all or some of them) can first be downloaded into the current working directory, and this option then set as: RData.location=".". If the RData to load is already part of the package itself, this parameter can be ignored (since the function first tries to load it via the function data). Here is the UNIX command for downloading all RData files (preserving the directory structure): wget -r -l2 -A "*.RData" -np -nH --cut-dirs=0 "http://dcgor.r-forge.r-project.org/data"
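The layout required of input.file can be checked before calling the function. The sketch below builds a small annotation table in base R, writes it as a tab-delimited file with a header row, and reads it back. The feature identifiers and scores are made up for illustration, and the two dcAlgoPropagate calls are left commented out because they need the dcGOR package and its ontology data.

```r
# Hypothetical annotations: feature IDs and scores are made up
input_df <- data.frame(
  Feature_id = c("f1", "f1", "f2"),
  Term_id    = c("GO:0003824", "GO:0016740", "GO:0005515"),
  Score      = c(14.83, 6.85, 3.20),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row, as described for input.file
tmp <- tempfile(fileext = ".txt")
write.table(input_df, file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back and confirm that the three expected columns are present
check <- read.delim(tmp, header = TRUE, stringsAsFactors = FALSE)
stopifnot(identical(colnames(check), c("Feature_id", "Term_id", "Score")))

# With dcGOR installed, either form could then be passed on (not run here):
# his <- dcAlgoPropagate(input.file = tmp, ontology = "GOMF", output.file = NA)
# his <- dcAlgoPropagate(input.file = input_df, ontology = "GOMF", output.file = NA)
```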

Value

an object of S3 class HIS, with the following components:

  • hscore: a list of features, each with a term-named vector containing hscore
  • ic: a term-named vector containing information content (IC). Terms are ordered first by IC and then by longest-path level, making sure that for terms with the same IC, parental terms always come first
  • slim: a list of four slims, each with a term-named vector containing information content (IC). Slim '1' for very general terms, '2' for general terms, '3' for specific terms, '4' for very specific terms
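To make the "ic" component more concrete, the sketch below derives an information content value per term from a toy "hscore" list. It assumes that IC is the negative log10 of the fraction of features annotated by a term; this page does not state the formula, so treat the definition as an assumption. The helper terms_in_window is hypothetical and only shows how terms whose IC falls inside a chosen interval might be picked out, in the spirit of a slim. Unlike the real "ic" component, the result here is ordered by IC only, not by longest-path level.

```r
# Toy propagated hscores: a list of features, each a term-named vector (made-up values)
hscore <- list(
  f1 = c(root = 5, A = 5, C = 5),
  f2 = c(root = 3, A = 3),
  f3 = c(root = 2, B = 2),
  f4 = c(root = 1)
)

# Number of features annotated by each term
counts <- table(unlist(lapply(hscore, names)))

# Assumed definition: IC = -log10(fraction of features annotated by the term)
ic <- setNames(-log10(as.vector(counts) / length(hscore)), names(counts))

# General terms (low IC) first
ic <- sort(ic)

# Hypothetical helper: terms whose IC lies within [lower, upper]
terms_in_window <- function(ic, lower, upper) {
  ic[ic >= lower & ic <= upper]
}

ic
terms_in_window(ic, 0.2, 0.4)
```

Under this definition, a term annotating every feature (the root) has an IC of zero, and rarer terms have higher IC.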

Note

None

Examples

# build an "HIS" object for GO Molecular Function
input.file <- "http://dcgor.r-forge.r-project.org/data/Feature/Feature2GO.sf.txt"
Feature2GOMF.sf <- dcAlgoPropagate(input.file=input.file, ontology="GOMF", output.file="Feature2GOMF.sf.RData")
Start at 2015-07-23 12:34:03

Read the input file 'http://dcgor.r-forge.r-project.org/data/Feature/Feature2GO.sf.txt' (2015-07-23 12:34:03) ...
Load the ontology 'GOMF' (2015-07-23 12:34:06) ...
'onto.GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
Do propagation via 'max' operation (2015-07-23 12:34:11) ...
  At level 15, there are 3 nodes, and 4 incoming neighbors (2015-07-23 12:34:12).
  At level 14, there are 6 nodes, and 7 incoming neighbors (2015-07-23 12:34:12).
  At level 13, there are 10 nodes, and 12 incoming neighbors (2015-07-23 12:34:12).
  At level 12, there are 24 nodes, and 28 incoming neighbors (2015-07-23 12:34:13).
  At level 11, there are 32 nodes, and 34 incoming neighbors (2015-07-23 12:34:13).
  At level 10, there are 76 nodes, and 63 incoming neighbors (2015-07-23 12:34:13).
  At level 9, there are 132 nodes, and 99 incoming neighbors (2015-07-23 12:34:14).
  At level 8, there are 270 nodes, and 175 incoming neighbors (2015-07-23 12:34:15).
  At level 7, there are 459 nodes, and 229 incoming neighbors (2015-07-23 12:34:17).
  At level 6, there are 842 nodes, and 254 incoming neighbors (2015-07-23 12:34:20).
  At level 5, there are 568 nodes, and 172 incoming neighbors (2015-07-23 12:34:25).
  At level 4, there are 273 nodes, and 60 incoming neighbors (2015-07-23 12:34:27).
  At level 3, there are 120 nodes, and 13 incoming neighbors (2015-07-23 12:34:28).
  At level 2, there are 20 nodes, and 1 incoming neighbors (2015-07-23 12:34:29).
  after propagation, there are 6018 features annotated by 2836 terms.
Determining IC-based slim levels (2015-07-23 12:34:29) ...
  1 level with 6 terms with IC falling around 0.47 (between 0.00 and 0.94).
  2 level with 38 terms with IC falling around 1.42 (between 1.18 and 1.65).
  3 level with 217 terms with IC falling around 2.36 (between 2.13 and 2.60).
  4 level with 838 terms with IC falling around 3.31 (between 3.07 and 3.54).
An object of S3 class 'HIS' has been built and saved into '/Users/hfang/Sites/SUPERFAMILY/dcGO/dcGOR/Feature2GOMF.sf.RData'.

End at 2015-07-23 12:41:49
Runtime in total is: 466 secs
names(Feature2GOMF.sf)
[1] "hscore" "ic" "slim"
Feature2GOMF.sf$hscore[1]
$`100879`
GO:0003674 GO:0003824 GO:0016740 GO:0016772 GO:0016779 GO:0017125 GO:0034061
     14.83      14.83      14.83      14.83      14.83       6.85       8.34
GO:0003887
      1.84
Feature2GOMF.sf$ic[1:10]
GO:0003674 GO:0005488 GO:0005515 GO:0003824 GO:0043167 GO:0016787 GO:0016740
 0.0000000  0.2416331  0.3770188  0.3901089  0.7416274  0.7521026  0.7922330
GO:0097159 GO:1901363 GO:0043168
 0.7980867  0.8008152  0.9179178
Feature2GOMF.sf$slim[1]
$`1`
GO:0003824 GO:0005515 GO:0043167 GO:0097159 GO:1901363 GO:0004872
 0.3901089  0.3770188  0.7416274  0.7980867  0.8008152  0.9331151
# extract hscore as a matrix with 3 columns (Feature_id, Term_id, Score)
hscore <- Feature2GOMF.sf$hscore
hscore_mat <- dcList2Matrix(hscore)
The input list has been converted into a matrix of 75504 X 3.
colnames(hscore_mat) <- c("Feature_id", "Term_id", "Score")
dim(hscore_mat)
[1] 75504 3
hscore_mat[1:10,]
      Feature_id Term_id      Score
 [1,] "100879"   "GO:0003674" "14.83"
 [2,] "100879"   "GO:0003824" "14.83"
 [3,] "100879"   "GO:0016740" "14.83"
 [4,] "100879"   "GO:0016772" "14.83"
 [5,] "100879"   "GO:0016779" "14.83"
 [6,] "100879"   "GO:0017125" "6.85"
 [7,] "100879"   "GO:0034061" "8.34"
 [8,] "100879"   "GO:0003887" "1.84"
 [9,] "100895"   "GO:0003674" "33.55"
[10,] "100895"   "GO:0003824" "6.73"
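The matrix above stores every column as character strings, including the scores. Where numeric scores are needed, one option is to flatten the "hscore" list into a data frame directly. The following is a minimal base R sketch of that idea, not a replacement for dcList2Matrix; it uses a small stand-in list with a few of the values shown above, and the cutoff of 10 is arbitrary.

```r
# Stand-in for Feature2GOMF.sf$hscore, using a few of the values printed above
hscore <- list(
  "100879" = c("GO:0003674" = 14.83, "GO:0003824" = 14.83),
  "100895" = c("GO:0003674" = 33.55, "GO:0003824" = 6.73)
)

# One row per (feature, term) pair, with Score kept numeric
hscore_df <- data.frame(
  Feature_id = rep(names(hscore), times = lengths(hscore)),
  Term_id    = unlist(lapply(hscore, names), use.names = FALSE),
  Score      = unlist(hscore, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep the annotations scoring at or above an arbitrary cutoff
subset(hscore_df, Score >= 10)
```

If the character matrix is already at hand, as.numeric(hscore_mat[, "Score"]) recovers the scores as numbers.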