SKNN: Reuse of imputed data in microarray analysis increases imputation efficiency


Introduction:

SKNN is an imputation tool, design for the imputation of gene missing entries, the tool is available as an R package.


Algorithms:

Two novel algorithms for the gene microarray data imputation are proposed in SKNN. The proposed algorithms are the extension of k-nearest neighborhood (KNN) algorithm named as sequential KNN (SKNN) and expectation maximization SKNN (EM-SKNN). The EM-SKNN is the recursive application of SKNN algorithm, both methods are compared with KNN, maximum likelihood estimation (MLE) and multiple imputations (MI) algorithms for the imputation accuracy and computation speed on different datasets.


Working:

In SKNN method, 1-The genes are ordered by their missing rates and the imputation was performed sequentially from a gene that had a least missing rate. 2-The sequentially imputed genes were then used in the imputation of other missing genes.

The SKNN utilizes two sets for imputation 1) complete set and 2) incomplete set. Initially, all complete and incomplete genes were placed in set 1 and 2 respectively, once a gene is fully imputed it is moved from incomplete set to complete set.

The process can be executed in parallel for all missing values of a gene. This reduces the execution time compared to the other KNN based methods


Performance:

In this paper, five methods for the imputation of gene microarray data are discussed namely KNN, MLE, MI, SKNN, and EM-SKNN. All mentioned algorithm are compared for imputation accuracy and execution time on a different dataset with missing rates varies from (10% to 60%). The datasets consist of three types of data including Time-series, non-time series and mixed data. For the evaluation of imputation accuracy two performance measures i.e. root mean squared error (RMS) and Pearson correlation coefficients were used.


Conclusion:

For a gene expression with the high missing rates, the EM-SKNN is shown to outperform the SKNN, MLE, MI and KNN methods. In terms of accuracy and computational complexity, the SKNN showed the best performance compared to all other methods and MLE as the worst.


Example

You can download the package from http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz


  1. Install
install.packages("http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz", repos=NULL, type="source")
## Installing package into 'C:/Users/sypark/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)

The function included in the package SKNN: SeqKNN

library(SKNN)
SKNN::SeqKNN
## function (data, k) 
## {
##     x <- as.matrix(data)
##     N <- dim(x)
##     p <- N[2]
##     N <- N[1]
##     nas <- is.na(drop(x %*% rep(1, p)))
##     xcomplete <- x[!nas, ]
##     xbad <- x[nas, , drop = FALSE]
##     missing <- c()
##     for (i in seq(nrow(xbad))) {
##         missing[i] <- sum(is.na(xbad[i, ]))
##     }
##     missingorder <- order(missing)
##     xnas <- is.na(xbad)
##     xbadhat <- xbad
##     cat(nrow(xbad), fill = TRUE)
##     for (i in seq(nrow(xbad))) {
##         j <- order(missingorder[i])
##         xinas <- xnas[missingorder[i], ]
##         xbadhat[missingorder[i], ] <- nnmiss(xcomplete, xbad[missingorder[i], 
##             ], xinas, K = k)
##         xcomplete <- rbind(xcomplete, xbadhat[missingorder[i], 
##             ])
##     }
##     x[nas, ] <- xbadhat
##     x
## }
## <environment: namespace:SKNN>


  1. Working Example
data <- read.table("http://bisyn.kaist.ac.kr/software/Example.txt")
library(knitr)
kable(head(data[,1:5]))
V2 V3 V4 V5 V6
0.7733437 -0.0781778 -0.0844692 0.9656141 0.0756639
-2.4384048 -2.4157538 -1.6497392 -2.3805466 NA
-0.4825622 0.4127717 -0.2413075 0.6252965 NA
-2.7211354 -2.8251460 -2.8752861 -1.7412565 0.2726953
-1.2170580 -0.6262365 -0.8894054 -0.8453664 -1.8413700
0.8278092 0.0544882 -0.0274740 0.9496868 0.3279359
dim(data)
## [1] 2308   63
sum(is.na(data))
## [1] 7245
library(SKNN)
k=10
Imp <- SeqKNN(data,k);
## 2208
kable(head(Imp[,1:5]))
V2 V3 V4 V5 V6
0.7733437 -0.0781778 -0.0844692 0.9656141 0.0756639
-2.4384048 -2.4157538 -1.6497392 -2.3805466 -1.6033726
-0.4825622 0.4127717 -0.2413075 0.6252965 0.2038428
-2.7211354 -2.8251460 -2.8752861 -1.7412565 0.2726953
-1.2170580 -0.6262365 -0.8894054 -0.8453664 -1.8413700
0.8278092 0.0544882 -0.0274740 0.9496868 0.3279359
dim(Imp)
## [1] 2308   63
sum(is.na(Imp))
## [1] 0


Citation Trend

Duration X2004.2005 X2005.2006 X2006.2007 X2007.2008 X2008.2009 X2009.2010 X2010.2011 X2011.2012 X2012.2013 X2013.2014 X2014.2015 X2015.2016 X2016.2017 Total
Year 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 13
Citations 0 2 9 10 14 8 6 8 13 11 8 11 6 106
Methods 0 0 5 7 9 5 3 1 6 3 2 7 4 52


Extras

Matlab version https://kr.mathworks.com/matlabcentral/fileexchange/60128-sequential-knn-imputation-method


Comparison of Imputation Algorithms

LSimpute: accurate estimation of missing values in microarray data with least squares methods, Trond Hellem Bo et. al., NAR, 2004 http://nar.oxfordjournals.org/content/32/3/e34

Missing value imputation using a fuzzy clustering-based EM approach, Md. Geaur Rahman et. al., Know. Inf. Syst., 2016 http://link.springer.com/article/10.1007/s10115-015-0822-y

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments, Magalie et. al., BMC Genomics, 2010 http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-11-15

Missing value imputation for gene expression data: computational techniques to recover missing data from available information, Alan Wee-Chung Liew et. al., Briefings in Bioinformatics, 2010 http://bib.oxfordjournals.org/content/12/5/498.long

An experimental study on the use of nearest neighbor-based imputation algorithms for classification tasks, Jonathan de andrade silva et. al., Data & Know. Engineering, 2013 http://www.sciencedirect.com/science/article/pii/S0169023X12001176


Application of Imputation Algorithms

Imputing gene expression from selectively reduced probe sets, Yoni Donner et. al., Nature Methods, 2012 http://www.nature.com/nmeth/journal/v9/n11/full/nmeth.2207.html