Function to calculate pair-wise semantic similarity between domains based on a direct acyclic graph (DAG) with annotated data

Description

dcDAGdomainSim is supposed to calculate pair-wise semantic similarity between domains based on a direct acyclic graph (DAG) with annotated data. It first calculates semantic similarity between terms and then derives semantic similarity between domains from terms-term semantic similarity. Parallel computing is also supported for Linux or Mac operating systems.

Usage

dcDAGdomainSim(g, domains = NULL, method.domain = c("BM.average", "BM.max", "BM.complete", 
  "average", "max"), method.term = c("Resnik", "Lin", "Schlicker", "Jiang", "Pesquita"), 
      force = TRUE, fast = TRUE, parallel = TRUE, multicores = NULL, verbose = TRUE)

Arguments

g
an object of class "igraph" or Onto. It must contain a node attribute called 'annotations' for storing annotation data (see example for howto)
domains
the domains between which pair-wise semantic similarity is calculated. If NULL, all domains annotatable in the input dag will be used for calcluation, which is very prohibitively expensive!
method.domain
the method used for how to derive semantic similarity between domains from semantic similarity between terms. It can be "average" for average similarity between any two terms (one from domain 1, the other from domain 2), "max" for the maximum similarity between any two terms, "BM.average" for best-matching (BM) based average similarity (i.e. for each term of either domain, first calculate maximum similarity to any term in the other domain, then take average of maximum similarity; the final BM-based average similiary is the pre-calculated average between two domains in pair), "BM.max" for BM based maximum similarity (i.e. the same as "BM.average", but the final BM-based maximum similiary is the maximum of the pre-calculated average between two domains in pair), "BM.complete" for BM-based complete-linkage similarity (inspired by complete-linkage concept: the least of any maximum similarity between a term of one domain and a term of the other domain). When comparing BM-based similarity between domains, "BM.average" and "BM.max" are sensitive to the number of terms invovled; instead, "BM.complete" is much robust in this aspect. By default, it uses "BM.average".
method.term
the method used to measure semantic similarity between terms. It can be "Resnik" for information content (IC) of most informative common ancestor (MICA) (see http://arxiv.org/pdf/cmp-lg/9511007.pdf), "Lin" for 2*IC at MICA divided by the sum of IC at pairs of terms (see http://webdocs.cs.ualberta.ca/~lindek/papers/sim.pdf), "Schlicker" for weighted version of 'Lin' by the 1-prob(MICA) (see http://www.ncbi.nlm.nih.gov/pubmed/16776819), "Jiang" for 1 - difference between the sum of IC at pairs of terms and 2*IC at MICA (see http://arxiv.org/pdf/cmp-lg/9709008.pdf), "Pesquita" for graph information content similarity related to Tanimoto-Jacard index (ie. summed information content of common ancestors divided by summed information content of all ancestors of term1 and term2 (see http://www.ncbi.nlm.nih.gov/pubmed/18460186))
force
logical to indicate whether the only most specific terms (for each domain) will be used. By default, it sets to true. It is always advisable to use this since it is computationally fast but without compromising accuracy (considering the fact that true-path-rule has been applied when running dcDAGannotate)
fast
logical to indicate whether a vectorised fast computation is used. By default, it sets to true. It is always advisable to use this vectorised fast computation; since the conventional computation is just used for understanding scripts
parallel
logical to indicate whether parallel computation with multicores is used. By default, it sets to true, but not necessarily does so. Partly because parallel backends available will be system-specific (now only Linux or Mac OS). Also, it will depend on whether these two packages "foreach" and "doMC" have been installed. It can be installed via: source("http://bioconductor.org/biocLite.R"); biocLite(c("foreach","doMC")). If not yet installed, this option will be disabled
multicores
an integer to specify how many cores will be registered as the multicore parallel backend to the 'foreach' package. If NULL, it will use a half of cores available in a user's computer. This option only works when parallel computation is enabled
verbose
logical to indicate whether the messages will be displayed in the screen. By default, it sets to true for display

Value

an object of S4 class Dnetwork. It is a weighted and undirect graph, with following slots:

  • nodeInfo: an object of S4 class, describing information on nodes/domains
  • adjMatrix: an object of S4 class AdjData, containing symmetric adjacency data matrix for pair-wise semantic similarity between domains

Note

For the mode "shortest_paths", the induced subgraph is the most concise, and thus informative for visualisation when there are many nodes in query, while the mode "all_paths" results in the complete subgraph.

Examples

# 1) Semantic similarity between SCOP domain superfamilies (sf) ## 1a) load onto.GOMF (as 'Onto' object) g <- dcRDataLoader('onto.GOMF')
'onto.GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 1b) load SCOP superfamilies annotated by GOMF (as 'Anno' object) Anno <- dcRDataLoader('SCOP.sf2GOMF')
'SCOP.sf2GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 1c) prepare for ontology appended with annotation information dag <- dcDAGannotate(g, annotations=Anno, path.mode="shortest_paths", verbose=FALSE) ## 1d) calculate pair-wise semantic similarity between 8 randomly chosen domains alldomains <- unique(unlist(nInfo(dag)$annotations)) domains <- sample(alldomains,8) dnetwork <- dcDAGdomainSim(g=dag, domains=domains, method.domain="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
Start at 2015-07-23 12:48:32 First, extract all annotatable domains (2015-07-23 12:48:32)... there are 8 input domains amongst 1083 annotatable domains Second, pre-compute semantic similarity between 93 terms (forced to be the most specific for each domain) using Resnik method (2015-07-23 12:48:39)... Last, calculate pair-wise semantic similarity between 8 domains using BM.average method (2015-07-23 12:48:43)... 1 out of 8 (2015-07-23 12:48:43) 2 out of 8 (2015-07-23 12:48:43) 3 out of 8 (2015-07-23 12:48:43) 4 out of 8 (2015-07-23 12:48:43) 5 out of 8 (2015-07-23 12:48:43) 6 out of 8 (2015-07-23 12:48:43) 7 out of 8 (2015-07-23 12:48:43) Finish at 2015-07-23 12:48:43 Runtime in total is: 11 secs
dnetwork
An object of S4 class 'Dnetwork' @adjMatrix: a weighted symmetric matrix of 8 domains X 8 domains @nodeInfo (InfoDataFrame) nodeNames: 140741 81878 55608 ... 161070 56542 (8 total) nodeAttr: id
## 1e) convert it to an object of class 'igraph' ig <- dcConverter(dnetwork, from='Dnetwork', to='igraph')
Your input object 'dnetwork' of class 'Dnetwork' has been converted into an object of class 'igraph'.
ig
IGRAPH UNW- 8 28 -- + attr: name (v/c), id (v/c), weight (e/n) + edges (vertex names): [1] 140741--81878 140741--55608 81878 --55608 140741--51658 81878 --51658 [6] 55608 --51658 140741--161084 81878 --161084 55608 --161084 51658 --161084 [11] 140741--54593 81878 --54593 55608 --54593 51658 --54593 161084--54593 [16] 140741--161070 81878 --161070 55608 --161070 51658 --161070 161084--161070 [21] 54593 --161070 140741--56542 81878 --56542 55608 --56542 51658 --56542 [26] 161084--56542 54593 --56542 161070--56542
## 1f) visualise the domain network ### extract edge weight (with 2-digit precision) x <- signif(E(ig)$weight, digits=2) ### rescale into an interval [1,4] as edge width edge.width <- 1 + (x-min(x))/(max(x)-min(x))*3 ### do visualisation dnet::visNet(g=ig, vertex.shape="sphere", edge.width=edge.width, edge.label=x, edge.label.cex=0.7)
########################################################### # 2) Semantic similarity between Pfam domains (Pfam) ## 2a) load onto.GOMF (as 'Onto' object) g <- dcRDataLoader('onto.GOMF')
'onto.GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 2b) load Pfam domains annotated by GOMF (as 'Anno' object) Anno <- dcRDataLoader('Pfam2GOMF')
'Pfam2GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 2c) prepare for ontology appended with annotation information dag <- dcDAGannotate(g, annotations=Anno, path.mode="shortest_paths", verbose=FALSE) ## 2d) calculate pair-wise semantic similarity between 8 randomly chosen domains alldomains <- unique(unlist(nInfo(dag)$annotations)) domains <- sample(alldomains,8) dnetwork <- dcDAGdomainSim(g=dag, domains=domains, method.domain="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
Start at 2015-07-23 12:49:02 First, extract all annotatable domains (2015-07-23 12:49:02)... there are 8 input domains amongst 3359 annotatable domains Second, pre-compute semantic similarity between 9 terms (forced to be the most specific for each domain) using Resnik method (2015-07-23 12:49:02)... Last, calculate pair-wise semantic similarity between 8 domains using BM.average method (2015-07-23 12:49:02)... 1 out of 8 (2015-07-23 12:49:02) 2 out of 8 (2015-07-23 12:49:02) 3 out of 8 (2015-07-23 12:49:02) 4 out of 8 (2015-07-23 12:49:02) 5 out of 8 (2015-07-23 12:49:02) 6 out of 8 (2015-07-23 12:49:02) 7 out of 8 (2015-07-23 12:49:02) Finish at 2015-07-23 12:49:02 Runtime in total is: 0 secs
dnetwork
An object of S4 class 'Dnetwork' @adjMatrix: a weighted symmetric matrix of 8 domains X 8 domains @nodeInfo (InfoDataFrame) nodeNames: PF01588 PF01040 PF03157 ... PF05301 PF08515 (8 total) nodeAttr: id
## 2e) convert it to an object of class 'igraph' ig <- dcConverter(dnetwork, from='Dnetwork', to='igraph')
Your input object 'dnetwork' of class 'Dnetwork' has been converted into an object of class 'igraph'.
ig
IGRAPH UNW- 8 10 -- + attr: name (v/c), id (v/c), weight (e/n) + edges (vertex names): [1] PF01588--PF01040 PF02590--PF00933 PF01588--PF00909 PF01040--PF00909 [5] PF03157--PF05301 PF01588--PF08515 PF01040--PF08515 PF03157--PF08515 [9] PF00909--PF08515 PF05301--PF08515
## 2f) visualise the domain network ### extract edge weight (with 2-digit precision) x <- signif(E(ig)$weight, digits=2) ### rescale into an interval [1,4] as edge width edge.width <- 1 + (x-min(x))/(max(x)-min(x))*3 ### do visualisation dnet::visNet(g=ig, vertex.shape="sphere", edge.width=edge.width, edge.label=x, edge.label.cex=0.7)
########################################################### # 3) Semantic similarity between InterPro domains (InterPro) ## 3a) load onto.GOMF (as 'Onto' object) g <- dcRDataLoader('onto.GOMF')
'onto.GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 3b) load InterPro domains annotated by GOMF (as 'Anno' object) Anno <- dcRDataLoader('InterPro2GOMF')
'InterPro2GOMF' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 3c) prepare for ontology appended with annotation information dag <- dcDAGannotate(g, annotations=Anno, path.mode="shortest_paths", verbose=FALSE) ## 3d) calculate pair-wise semantic similarity between 8 randomly chosen domains alldomains <- unique(unlist(nInfo(dag)$annotations)) domains <- sample(alldomains,8) dnetwork <- dcDAGdomainSim(g=dag, domains=domains, method.domain="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
Start at 2015-07-23 12:49:31 First, extract all annotatable domains (2015-07-23 12:49:31)... there are 8 input domains amongst 8899 annotatable domains Second, pre-compute semantic similarity between 12 terms (forced to be the most specific for each domain) using Resnik method (2015-07-23 12:49:33)... Last, calculate pair-wise semantic similarity between 8 domains using BM.average method (2015-07-23 12:49:33)... 1 out of 8 (2015-07-23 12:49:33) 2 out of 8 (2015-07-23 12:49:33) 3 out of 8 (2015-07-23 12:49:33) 4 out of 8 (2015-07-23 12:49:33) 5 out of 8 (2015-07-23 12:49:33) 6 out of 8 (2015-07-23 12:49:33) 7 out of 8 (2015-07-23 12:49:33) Finish at 2015-07-23 12:49:34 Runtime in total is: 3 secs
dnetwork
An object of S4 class 'Dnetwork' @adjMatrix: a weighted symmetric matrix of 8 domains X 8 domains @nodeInfo (InfoDataFrame) nodeNames: IPR011772 IPR013670 IPR001384 ... IPR004616 IPR004105 (8 total) nodeAttr: id
## 3e) convert it to an object of class 'igraph' ig <- dcConverter(dnetwork, from='Dnetwork', to='igraph')
Your input object 'dnetwork' of class 'Dnetwork' has been converted into an object of class 'igraph'.
ig
IGRAPH UNW- 8 28 -- + attr: name (v/c), id (v/c), weight (e/n) + edges (vertex names): [1] IPR011772--IPR013670 IPR011772--IPR001384 IPR013670--IPR001384 [4] IPR011772--IPR003038 IPR013670--IPR003038 IPR001384--IPR003038 [7] IPR011772--IPR006367 IPR013670--IPR006367 IPR001384--IPR006367 [10] IPR003038--IPR006367 IPR011772--IPR010401 IPR013670--IPR010401 [13] IPR001384--IPR010401 IPR003038--IPR010401 IPR006367--IPR010401 [16] IPR011772--IPR004616 IPR013670--IPR004616 IPR001384--IPR004616 [19] IPR003038--IPR004616 IPR006367--IPR004616 IPR010401--IPR004616 [22] IPR011772--IPR004105 IPR013670--IPR004105 IPR001384--IPR004105 + ... omitted several edges
## 3f) visualise the domain network ### extract edge weight (with 2-digit precision) x <- signif(E(ig)$weight, digits=2) ### rescale into an interval [1,4] as edge width edge.width <- 1 + (x-min(x))/(max(x)-min(x))*3 ### do visualisation dnet::visNet(g=ig, vertex.shape="sphere", edge.width=edge.width, edge.label=x, edge.label.cex=0.7)
########################################################### # 4) Semantic similarity between Rfam RNA families (Rfam) ## 4a) load onto.GOBP (as 'Onto' object) g <- dcRDataLoader('onto.GOBP')
'onto.GOBP' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 4b) load Rfam families annotated by GOBP (as 'Anno' object) Anno <- dcRDataLoader('Rfam2GOBP')
'Rfam2GOBP' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
## 4c) prepare for ontology appended with annotation information dag <- dcDAGannotate(g, annotations=Anno, path.mode="shortest_paths", verbose=FALSE) ## 4d) calculate pair-wise semantic similarity between 8 randomly chosen RNAs alldomains <- unique(unlist(nInfo(dag)$annotations)) domains <- sample(alldomains,8) dnetwork <- dcDAGdomainSim(g=dag, domains=domains, method.domain="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
Start at 2015-07-23 12:50:17 First, extract all annotatable domains (2015-07-23 12:50:17)... there are 8 input domains amongst 1377 annotatable domains Second, pre-compute semantic similarity between 4 terms (forced to be the most specific for each domain) using Resnik method (2015-07-23 12:50:17)... Last, calculate pair-wise semantic similarity between 8 domains using BM.average method (2015-07-23 12:50:17)... 1 out of 8 (2015-07-23 12:50:17) 2 out of 8 (2015-07-23 12:50:17) 3 out of 8 (2015-07-23 12:50:17) 4 out of 8 (2015-07-23 12:50:17) 5 out of 8 (2015-07-23 12:50:17) 6 out of 8 (2015-07-23 12:50:17) 7 out of 8 (2015-07-23 12:50:17) Finish at 2015-07-23 12:50:17 Runtime in total is: 0 secs
dnetwork
An object of S4 class 'Dnetwork' @adjMatrix: a weighted symmetric matrix of 8 domains X 8 domains @nodeInfo (InfoDataFrame) nodeNames: RF00995 RF00031 RF00888 ... RF00612 RF00025 (8 total) nodeAttr: id
## 4e) convert it to an object of class 'igraph' ig <- dcConverter(dnetwork, from='Dnetwork', to='igraph')
Your input object 'dnetwork' of class 'Dnetwork' has been converted into an object of class 'igraph'.
ig
IGRAPH UNW- 8 15 -- + attr: name (v/c), id (v/c), weight (e/n) + edges (vertex names): [1] RF00995--RF00888 RF00995--RF01178 RF00888--RF01178 RF00995--RF00096 [5] RF00888--RF00096 RF01178--RF00096 RF00995--RF01536 RF00888--RF01536 [9] RF01178--RF01536 RF00096--RF01536 RF00995--RF00612 RF00888--RF00612 [13] RF01178--RF00612 RF00096--RF00612 RF01536--RF00612
## 4f) visualise the domain network ### extract edge weight (with 2-digit precision) x <- signif(E(ig)$weight, digits=2) ### rescale into an interval [1,4] as edge width edge.width <- 1 + (x-min(x))/(max(x)-min(x))*3 ### do visualisation dnet::visNet(g=ig, vertex.shape="sphere", edge.width=edge.width, edge.label=x, edge.label.cex=0.7)
########################################################### # 5) Advanced usage: customised data for ontology and annotations # 5a) customise ontology g <- dcBuildOnto(relations.file="http://dcgor.r-forge.r-project.org/data/onto/igraph_GOMF_edges.txt", nodes.file="http://dcgor.r-forge.r-project.org/data/onto/igraph_GOMF_nodes.txt", output.file="ontology.RData")
An object of S4 class 'Onto' has been built and saved into '/Users/hfang/Sites/SUPERFAMILY/dcGO/dcGOR/ontology.RData'.
# 5b) customise Anno Anno <- dcBuildAnno(domain_info.file="http://dcgor.r-forge.r-project.org/data/InterPro/InterPro.txt", term_info.file="http://dcgor.r-forge.r-project.org/data/InterPro/GO.txt", association.file="http://dcgor.r-forge.r-project.org/data/InterPro/Domain2GOMF.txt", output.file="annotations.RData")
An object of S4 class 'Anno' has been built and saved into '/Users/hfang/Sites/SUPERFAMILY/dcGO/dcGOR/annotations.RData'.
## 5c) prepare for ontology appended with annotation information dag <- dcDAGannotate(g, annotations=Anno, path.mode="shortest_paths", verbose=FALSE) ## 5d) calculate pair-wise semantic similarity between 8 randomly chosen domains alldomains <- unique(unlist(nInfo(dag)$annotations)) domains <- sample(alldomains,8) dnetwork <- dcDAGdomainSim(g=dag, domains=domains, method.domain="BM.average", method.term="Resnik", parallel=FALSE, verbose=TRUE)
Start at 2015-07-23 12:50:47 First, extract all annotatable domains (2015-07-23 12:50:47)... there are 8 input domains amongst 8899 annotatable domains Second, pre-compute semantic similarity between 10 terms (forced to be the most specific for each domain) using Resnik method (2015-07-23 12:50:48)... Last, calculate pair-wise semantic similarity between 8 domains using BM.average method (2015-07-23 12:50:49)... 1 out of 8 (2015-07-23 12:50:49) 2 out of 8 (2015-07-23 12:50:49) 3 out of 8 (2015-07-23 12:50:49) 4 out of 8 (2015-07-23 12:50:49) 5 out of 8 (2015-07-23 12:50:49) 6 out of 8 (2015-07-23 12:50:49) 7 out of 8 (2015-07-23 12:50:49) Finish at 2015-07-23 12:50:50 Runtime in total is: 3 secs
dnetwork
An object of S4 class 'Dnetwork' @adjMatrix: a weighted symmetric matrix of 8 domains X 8 domains @nodeInfo (InfoDataFrame) nodeNames: IPR028700 IPR020797 IPR027542 ... IPR010222 IPR002611 (8 total) nodeAttr: id
## 5e) convert it to an object of class 'igraph' ig <- dcConverter(dnetwork, from='Dnetwork', to='igraph')
Your input object 'dnetwork' of class 'Dnetwork' has been converted into an object of class 'igraph'.
ig
IGRAPH UNW- 8 23 -- + attr: name (v/c), id (v/c), weight (e/n) + edges (vertex names): [1] IPR028700--IPR020797 IPR028700--IPR027542 IPR020797--IPR027542 [4] IPR028700--IPR008576 IPR020797--IPR008576 IPR027542--IPR008576 [7] IPR028700--IPR008977 IPR020797--IPR008977 IPR027542--IPR008977 [10] IPR008576--IPR008977 IPR028700--IPR000536 IPR020797--IPR000536 [13] IPR027542--IPR000536 IPR008576--IPR000536 IPR008977--IPR000536 [16] IPR028700--IPR010222 IPR020797--IPR010222 IPR027542--IPR010222 [19] IPR008576--IPR010222 IPR008977--IPR010222 IPR000536--IPR010222 [22] IPR027542--IPR002611 IPR010222--IPR002611
## 5f) visualise the domain network ### extract edge weight (with 2-digit precision) x <- signif(E(ig)$weight, digits=2) ### rescale into an interval [1,4] as edge width edge.width <- 1 + (x-min(x))/(max(x)-min(x))*3 ### do visualisation dnet::visNet(g=ig, vertex.shape="sphere", edge.width=edge.width, edge.label=x, edge.label.cex=0.7)