dcAlgoPredictPR
assesses prediction performance via Precision-Recall (PR) analysis.
It requires two input files: 1) a Gold Standard Positive (GSP) file
containing known annotations between proteins/genes and ontology terms;
2) a prediction file containing predicted terms for proteins/genes.
Note: the known annotations are recursively propagated towards the root
of the ontology.
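The propagation step mentioned above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the toy ontology, the term IDs (`T1` to `T5`), the gene names and the helper `ancestors_of` are all hypothetical, and the ontology is assumed to be given as a simple child-to-parents map.

```r
# Toy ontology as a child -> parents map (hypothetical term IDs)
parents <- list(
  T1 = character(0),   # root term, no parents
  T2 = "T1",
  T3 = "T1",
  T4 = c("T2", "T3"),
  T5 = "T4"
)

# Hypothetical helper: return a term together with all of its ancestors
ancestors_of <- function(term, parents) {
  seen <- character(0)
  queue <- term
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      queue <- c(queue, parents[[current]])
    }
  }
  seen
}

# Direct GSP annotations: gene -> annotated terms (made-up data)
gsp <- list(geneA = "T5", geneB = c("T2", "T3"))

# Propagate every annotation towards the root
gsp_propagated <- lapply(gsp, function(terms) {
  sort(unique(unlist(lapply(terms, ancestors_of, parents = parents))))
})
```

In this toy case a gene annotated only to `T5` ends up annotated to every term on a path from `T5` to the root, which is why propagated GSP sets are usually much larger than the direct annotations.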
dcAlgoPredictPR(GSP.file, prediction.file,
    ontology = c(NA, "GOBP", "GOMF", "GOCC", "DO", "HPPA", "HPMI", "HPON", "MP", "EC", "KW", "UP"),
    num.threshold = 10, bin = c("uniform", "quantile"), verbose = T,
    RData.ontology.customised = NULL,
    RData.location = "https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR")
prediction.file: a prediction file, for example one generated by
dcAlgoPredictMain, containing three columns: 1st column for 'SeqID'
(actually these IDs can be anything), 2nd column for 'Term' (ontology
terms), 3rd column for 'Score' (predictive score). Alternatively,
prediction.file can be a matrix or data frame, assuming that the
prediction file has already been read. Note: the file should use the
tab delimiter as the field separator between columns.
ontology: ... (see RData.ontology.customised below)
RData.ontology.customised: ... See dcBuildOnto for how to create this
object.
RData.location: ... RData.location=".". If the RData to load is already
part of the package itself, this parameter can be ignored (since this
function will first try to load it via the function data). Here is the
UNIX command for downloading all RData files (preserving the directory
structure):
wget -r -l2 -A "*.RData" -np -nH --cut-dirs=0 "http://dcgor.r-forge.r-project.org/data"
A data frame containing two columns: the 1st column 'Precision' for precision and the 2nd column 'Recall' for recall. The row names correspond to the score thresholds.
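To make the shape of this return value concrete, here is a small base-R sketch that computes precision and recall for a single target at a few score thresholds. It is only an illustration: the term IDs, scores and thresholds are made up, and treating a term as predicted when its score is at or above the threshold is an assumption, not something this page states about the package.

```r
# Hypothetical data for one target (term IDs and scores are made up)
true_terms <- c("T1", "T2", "T3", "T4")   # GSP terms after propagation
pred <- data.frame(
  Term  = c("T1", "T2", "T5", "T3"),
  Score = c(0.9, 0.7, 0.6, 0.2),
  stringsAsFactors = FALSE
)

thresholds <- c(0.8, 0.5, 0.1)

# Assumption: a term counts as predicted when Score >= threshold
pr <- t(sapply(thresholds, function(thr) {
  called <- pred$Term[pred$Score >= thr]
  hits <- length(intersect(called, true_terms))
  c(Precision = hits / length(called),
    Recall    = hits / length(true_terms))
}))

# Row names carry the score thresholds, as in the value described above
pr <- data.frame(pr, row.names = as.character(thresholds))
```

According to the log messages in the examples, the function calculates precision and recall per gene/protein and then averages them; the sketch covers only the per-target part, and it does not handle a threshold at which no term is called (precision would be undefined there).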
Prediction coverage: the number of GSP targets that also have predictions, divided by the total number of GSP targets.
F-measure: the maximum of the harmonic mean of precision and recall along the PR curve.
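Both summary quantities can be recomputed by hand. The sketch below uses the precision/recall values and the target counts (1561 predicted-and-GSP targets, 3048 GSP targets) reported in the example output on this page; the variable names are illustrative only.

```r
# Precision and recall copied from the res_PR output in the examples
res <- data.frame(
  Precision = c(0.7183260, 0.6563963, 0.5810718, 0.5048114, 0.4678130,
                0.4550127, 0.4463097, 0.4342131, 0.4242244, 0.4216690,
                0.4006511),
  Recall    = c(0.02993152, 0.03594915, 0.04575739, 0.05653384, 0.06626356,
                0.07348753, 0.08034927, 0.08724687, 0.09054786, 0.09301885,
                0.10253593)
)

# Harmonic mean of precision and recall at each threshold
f_scores <- 2 * res$Precision * res$Recall / (res$Precision + res$Recall)

# F-measure: the maximum along the PR curve
f_measure <- max(f_scores)

# Prediction coverage: targets both predicted and in GSP / targets in GSP
coverage <- 1561 / 3048

round(c(F.measure = f_measure, Coverage = coverage), 2)
```

Rounded to two decimals these agree with the summary line in the example log (coverage 0.51, F-measure 0.16).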
# 1) Generate prediction file with HPPA predictions for human genes
architecture.file <- "http://dcgor.r-forge.r-project.org/data/Algo/SCOP_architecture.txt"
prediction.file <- "SCOP_architecture.HPPA_predicted.txt"
res <- dcAlgoPredictMain(input.file=architecture.file, output.file=prediction.file, RData.HIS="Feature2HPPA.sf", parallel=FALSE)

Start at 2015-07-23 12:33:07
Read the input file 'http://dcgor.r-forge.r-project.org/data/Algo/SCOP_architecture.txt' ...
Predictions for 17467 sequences (with 4351 distinct architectures) using 'Feature2HPPA.sf' RData, 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:33:08) ...
##############################
'dcAlgoPredict' is being called...
##############################
Start at 2015-07-23 12:33:08
Load the HIS object 'Feature2HPPA.sf' (2015-07-23 12:33:08) ...
'Feature2HPPA.sf' (from https://github.com/hfang-bristol/RDataCentre/blob/master/dcGOR/Feature2HPPA.sf.RData?raw=true) has been loaded into the working environment
Predictions for 4351 architectures using 'sum' merge method, 'log' scale method and 'supra' feature mode (2015-07-23 12:33:09)...
1 out of 4351 (2015-07-23 12:33:09)
436 out of 4351 (2015-07-23 12:33:09)
872 out of 4351 (2015-07-23 12:33:10)
1308 out of 4351 (2015-07-23 12:33:10)
1744 out of 4351 (2015-07-23 12:33:11)
2180 out of 4351 (2015-07-23 12:33:11)
2616 out of 4351 (2015-07-23 12:33:12)
3052 out of 4351 (2015-07-23 12:33:13)
3488 out of 4351 (2015-07-23 12:33:13)
3924 out of 4351 (2015-07-23 12:33:14)
4351 out of 4351 (2015-07-23 12:33:14)
End at 2015-07-23 12:33:14
Runtime in total is: 6 secs
##############################
'dcAlgoPredict' has been completed!
##############################
Preparations for output (2015-07-23 12:33:14)...
The predictions have been saved into '/Users/hfang/Sites/SUPERFAMILY/dcGO/dcGOR/SCOP_architecture.HPPA_predicted.txt'.
End at 2015-07-23 12:33:15
Runtime in total is: 8 secs

# 2) Calculate Precision and Recall
GSP.file <- "http://dcgor.r-forge.r-project.org/data/Algo/HP_anno.txt"
res_PR <- dcAlgoPredictPR(GSP.file=GSP.file, prediction.file=prediction.file, ontology="HPPA")

Start at 2015-07-23 12:33:15
First, load the ontology 'HPPA' (2015-07-23 12:33:15) ...
'onto.HPPA' (from package 'dcGOR' version 1.0.5) has been loaded into the working environment
Second, import files for GSP and predictions (2015-07-23 12:33:15) ...
Third, propagate GSP annotations (2015-07-23 12:33:16) ...
At level 16, there are 2 nodes, and 5 incoming neighbors.
At level 15, there are 7 nodes, and 9 incoming neighbors.
At level 14, there are 21 nodes, and 42 incoming neighbors.
At level 13, there are 54 nodes, and 82 incoming neighbors.
At level 12, there are 105 nodes, and 105 incoming neighbors.
At level 11, there are 274 nodes, and 188 incoming neighbors.
At level 10, there are 463 nodes, and 294 incoming neighbors.
At level 9, there are 782 nodes, and 441 incoming neighbors.
At level 8, there are 1004 nodes, and 538 incoming neighbors.
At level 7, there are 1182 nodes, and 581 incoming neighbors.
At level 6, there are 1295 nodes, and 527 incoming neighbors.
At level 5, there are 940 nodes, and 290 incoming neighbors.
At level 4, there are 408 nodes, and 99 incoming neighbors.
At level 3, there are 114 nodes, and 21 incoming neighbors.
At level 2, there are 21 nodes, and 1 incoming neighbors.
At level 1, there are 1 nodes, and 0 incoming neighbors.
There are 3048 genes/proteins in GSP (2015-07-23 12:34:01).
Fourth, process input predictions (2015-07-23 12:34:01) ...
There are 7749 genes/proteins in predictions (2015-07-23 12:34:01).
Fifth, calculate the precision and recall for each of 1561 predicted and GSP genes/proteins (2015-07-23 12:34:01).
Finally, calculate the averaged precision and recall (2015-07-23 12:34:02).
In summary, Prediction coverage: 0.51 (amongst 3048 targets in GSP), and F-measure: 0.16.
End at 2015-07-23 12:34:02
Runtime in total is: 47 secs

res_PR

        Precision     Recall
1       0.7183260 0.02993152
0.90001 0.6563963 0.03594915
0.80002 0.5810718 0.04575739
0.70003 0.5048114 0.05653384
0.60004 0.4678130 0.06626356
0.50005 0.4550127 0.07348753
0.40006 0.4463097 0.08034927
0.30007 0.4342131 0.08724687
0.20008 0.4242244 0.09054786
0.10009 0.4216690 0.09301885
1e-04   0.4006511 0.10253593

# 3) Plot PR-curve
plot(res_PR[,2], res_PR[,1], xlim=c(0,1), ylim=c(0,1), type="b", xlab="Recall", ylab="Precision")
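As a complement to the examples above, the three-column prediction layout described in the arguments can also be prepared by hand. The sketch below is an assumption-laden illustration in base R: the IDs, terms and scores are invented, and writing a header row is an assumption since this page does not say whether the prediction file carries one.

```r
# Hypothetical predictions: IDs, terms and scores are made up
predictions <- data.frame(
  SeqID = c("gene1", "gene1", "gene2"),
  Term  = c("TERM:0001", "TERM:0002", "TERM:0001"),
  Score = c(0.9, 0.4, 0.7),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file (header row assumed)
pred_path <- tempfile(fileext = ".txt")
write.table(predictions, file = pred_path, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

# Read it back to confirm the three-column layout
check <- read.delim(pred_path, header = TRUE, stringsAsFactors = FALSE)
str(check)
```

Since prediction.file may also be a matrix or data frame, an object such as `predictions` could presumably be passed directly instead of being written to disk first.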
dcAlgoPredictPR.r
dcAlgoPredictPR.Rd
dcAlgoPredictPR.pdf
dcRDataLoader, dcConverter, dcDuplicated, dcAlgoPredictMain