SKNN: Reuse of imputed data in microarray analysis increases imputation efficiency
Introduction:
SKNN is an imputation tool, design for the imputation of gene missing entries, the tool is available as an R package.
Algorithms:
Two novel algorithms for the gene microarray data imputation are proposed in SKNN. The proposed algorithms are the extension of k-nearest neighborhood (KNN) algorithm named as sequential KNN (SKNN) and expectation maximization SKNN (EM-SKNN). The EM-SKNN is the recursive application of SKNN algorithm, both methods are compared with KNN, maximum likelihood estimation (MLE) and multiple imputations (MI) algorithms for the imputation accuracy and computation speed on different datasets.
Working:
In SKNN method, 1-The genes are ordered by their missing rates and the imputation was performed sequentially from a gene that had a least missing rate. 2-The sequentially imputed genes were then used in the imputation of other missing genes.
The SKNN utilizes two sets for imputation 1) complete set and 2) incomplete set. Initially, all complete and incomplete genes were placed in set 1 and 2 respectively, once a gene is fully imputed it is moved from incomplete set to complete set.
The process can be executed in parallel for all missing values of a gene. This reduces the execution time compared to the other KNN based methods
Performance:
In this paper, five methods for the imputation of gene microarray data are discussed namely KNN, MLE, MI, SKNN, and EM-SKNN. All mentioned algorithm are compared for imputation accuracy and execution time on a different dataset with missing rates varies from (10% to 60%). The datasets consist of three types of data including Time-series, non-time series and mixed data. For the evaluation of imputation accuracy two performance measures i.e. root mean squared error (RMS) and Pearson correlation coefficients were used.
Conclusion:
For a gene expression with the high missing rates, the EM-SKNN is shown to outperform the SKNN, MLE, MI and KNN methods. In terms of accuracy and computational complexity, the SKNN showed the best performance compared to all other methods and MLE as the worst.
Example
You can download the package from http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz
install.packages("http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz", repos=NULL, type="source")
## Installing package into 'C:/Users/sypark/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
The function included in the package SKNN: SeqKNN
library(SKNN)
SKNN::SeqKNN
## function (data, k)
## {
## x <- as.matrix(data)
## N <- dim(x)
## p <- N[2]
## N <- N[1]
## nas <- is.na(drop(x %*% rep(1, p)))
## xcomplete <- x[!nas, ]
## xbad <- x[nas, , drop = FALSE]
## missing <- c()
## for (i in seq(nrow(xbad))) {
## missing[i] <- sum(is.na(xbad[i, ]))
## }
## missingorder <- order(missing)
## xnas <- is.na(xbad)
## xbadhat <- xbad
## cat(nrow(xbad), fill = TRUE)
## for (i in seq(nrow(xbad))) {
## j <- order(missingorder[i])
## xinas <- xnas[missingorder[i], ]
## xbadhat[missingorder[i], ] <- nnmiss(xcomplete, xbad[missingorder[i],
## ], xinas, K = k)
## xcomplete <- rbind(xcomplete, xbadhat[missingorder[i],
## ])
## }
## x[nas, ] <- xbadhat
## x
## }
## <environment: namespace:SKNN>
data <- read.table("http://bisyn.kaist.ac.kr/software/Example.txt")
library(knitr)
kable(head(data[,1:5]))
V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|
0.7733437 | -0.0781778 | -0.0844692 | 0.9656141 | 0.0756639 |
-2.4384048 | -2.4157538 | -1.6497392 | -2.3805466 | NA |
-0.4825622 | 0.4127717 | -0.2413075 | 0.6252965 | NA |
-2.7211354 | -2.8251460 | -2.8752861 | -1.7412565 | 0.2726953 |
-1.2170580 | -0.6262365 | -0.8894054 | -0.8453664 | -1.8413700 |
0.8278092 | 0.0544882 | -0.0274740 | 0.9496868 | 0.3279359 |
dim(data)
## [1] 2308 63
sum(is.na(data))
## [1] 7245
library(SKNN)
k=10
Imp <- SeqKNN(data,k);
## 2208
kable(head(Imp[,1:5]))
V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|
0.7733437 | -0.0781778 | -0.0844692 | 0.9656141 | 0.0756639 |
-2.4384048 | -2.4157538 | -1.6497392 | -2.3805466 | -1.6033726 |
-0.4825622 | 0.4127717 | -0.2413075 | 0.6252965 | 0.2038428 |
-2.7211354 | -2.8251460 | -2.8752861 | -1.7412565 | 0.2726953 |
-1.2170580 | -0.6262365 | -0.8894054 | -0.8453664 | -1.8413700 |
0.8278092 | 0.0544882 | -0.0274740 | 0.9496868 | 0.3279359 |
dim(Imp)
## [1] 2308 63
sum(is.na(Imp))
## [1] 0
Citation Trend
Duration | X2004.2005 | X2005.2006 | X2006.2007 | X2007.2008 | X2008.2009 | X2009.2010 | X2010.2011 | X2011.2012 | X2012.2013 | X2013.2014 | X2014.2015 | X2015.2016 | X2016.2017 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 13 |
Citations | 0 | 2 | 9 | 10 | 14 | 8 | 6 | 8 | 13 | 11 | 8 | 11 | 6 | 106 |
Methods | 0 | 0 | 5 | 7 | 9 | 5 | 3 | 1 | 6 | 3 | 2 | 7 | 4 | 52 |
Extras
Matlab version https://kr.mathworks.com/matlabcentral/fileexchange/60128-sequential-knn-imputation-method
Comparison of Imputation Algorithms
LSimpute: accurate estimation of missing values in microarray data with least squares methods, Trond Hellem Bo et. al., NAR, 2004 http://nar.oxfordjournals.org/content/32/3/e34
Missing value imputation using a fuzzy clustering-based EM approach, Md. Geaur Rahman et. al., Know. Inf. Syst., 2016 http://link.springer.com/article/10.1007/s10115-015-0822-y
Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments, Magalie et. al., BMC Genomics, 2010 http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-11-15
Missing value imputation for gene expression data: computational techniques to recover missing data from available information, Alan Wee-Chung Liew et. al., Briefings in Bioinformatics, 2010 http://bib.oxfordjournals.org/content/12/5/498.long
An experimental study on the use of nearest neighbor-based imputation algorithms for classification tasks, Jonathan de andrade silva et. al., Data & Know. Engineering, 2013 http://www.sciencedirect.com/science/article/pii/S0169023X12001176
Application of Imputation Algorithms
Imputing gene expression from selectively reduced probe sets, Yoni Donner et. al., Nature Methods, 2012 http://www.nature.com/nmeth/journal/v9/n11/full/nmeth.2207.html