SKNN

SKNN: Reuse of imputed data in microarray analysis increases imputation efficiency

Introduction:

SKNN is an imputation tool, design for the imputation of gene missing entries, the tool is available as an R package.

Algorithms:

Two novel algorithms for the gene microarray data imputation are proposed in SKNN. The proposed algorithms are the extension of k-nearest neighborhood (KNN) algorithm named as sequential KNN (SKNN) and expectation maximization SKNN (EM-SKNN). The EM-SKNN is the recursive application of SKNN algorithm, both methods are compared with KNN, maximum likelihood estimation (MLE) and multiple imputations (MI) algorithms for the imputation accuracy and computation speed on different datasets.

Working:

In SKNN method, 1-The genes are ordered by their missing rates and the imputation was performed sequentially from a gene that had a least missing rate. 2-The sequentially imputed genes were then used in the imputation of other missing genes.

The SKNN utilizes two sets for imputation 1) complete set and 2) incomplete set. Initially, all complete and incomplete genes were placed in set 1 and 2 respectively, once a gene is fully imputed it is moved from incomplete set to complete set.

The process can be executed in parallel for all missing values of a gene. This reduces the execution time compared to the other KNN based methods

Performance:

In this paper, five methods for the imputation of gene microarray data are discussed namely KNN, MLE, MI, SKNN, and EM-SKNN. All mentioned algorithm are compared for imputation accuracy and execution time on a different dataset with missing rates varies from (10% to 60%). The datasets consist of three types of data including Time-series, non-time series and mixed data. For the evaluation of imputation accuracy two performance measures i.e. root mean squared error (RMS) and Pearson correlation coefficients were used.

Conclusion:

For a gene expression with the high missing rates, the EM-SKNN is shown to outperform the SKNN, MLE, MI and KNN methods. In terms of accuracy and computational complexity, the SKNN showed the best performance compared to all other methods and MLE as the worst.

Example

You can download the package from http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz

Install

install.packages("http://bisyn.kaist.ac.kr/software/SKNN_1.0.0.tar.gz", repos=NULL, type="source")

## Installing package into 'C:/Users/sypark/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)

The function included in the package SKNN: SeqKNN

library(SKNN)
SKNN::SeqKNN

## function (data, k) 
## {
##     x <- as.matrix(data)
##     N <- dim(x)
##     p <- N[2]
##     N <- N[1]
##     nas <- is.na(drop(x %*% rep(1, p)))
##     xcomplete <- x[!nas, ]
##     xbad <- x[nas, , drop = FALSE]
##     missing <- c()
##     for (i in seq(nrow(xbad))) {
##         missing[i] <- sum(is.na(xbad[i, ]))
##     }
##     missingorder <- order(missing)
##     xnas <- is.na(xbad)
##     xbadhat <- xbad
##     cat(nrow(xbad), fill = TRUE)
##     for (i in seq(nrow(xbad))) {
##         j <- order(missingorder[i])
##         xinas <- xnas[missingorder[i], ]
##         xbadhat[missingorder[i], ] <- nnmiss(xcomplete, xbad[missingorder[i], 
##             ], xinas, K = k)
##         xcomplete <- rbind(xcomplete, xbadhat[missingorder[i], 
##             ])
##     }
##     x[nas, ] <- xbadhat
##     x
## }
## <environment: namespace:SKNN>

Working Example

data <- read.table("http://bisyn.kaist.ac.kr/software/Example.txt")
library(knitr)
kable(head(data[,1:5]))

V2	V3	V4	V5	V6
0.7733437	-0.0781778	-0.0844692	0.9656141	0.0756639
-2.4384048	-2.4157538	-1.6497392	-2.3805466	NA
-0.4825622	0.4127717	-0.2413075	0.6252965	NA
-2.7211354	-2.8251460	-2.8752861	-1.7412565	0.2726953
-1.2170580	-0.6262365	-0.8894054	-0.8453664	-1.8413700
0.8278092	0.0544882	-0.0274740	0.9496868	0.3279359

dim(data)

## [1] 2308   63

sum(is.na(data))

## [1] 7245

library(SKNN)
k=10
Imp <- SeqKNN(data,k);

## 2208

kable(head(Imp[,1:5]))

V2	V3	V4	V5	V6
0.7733437	-0.0781778	-0.0844692	0.9656141	0.0756639
-2.4384048	-2.4157538	-1.6497392	-2.3805466	-1.6033726
-0.4825622	0.4127717	-0.2413075	0.6252965	0.2038428
-2.7211354	-2.8251460	-2.8752861	-1.7412565	0.2726953
-1.2170580	-0.6262365	-0.8894054	-0.8453664	-1.8413700
0.8278092	0.0544882	-0.0274740	0.9496868	0.3279359

dim(Imp)

## [1] 2308   63

sum(is.na(Imp))

## [1] 0

Citation Trend

Duration	X2004.2005	X2005.2006	X2006.2007	X2007.2008	X2008.2009	X2009.2010	X2010.2011	X2011.2012	X2012.2013	X2013.2014	X2014.2015	X2015.2016	X2016.2017	Total
Year	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	13
Citations	0	2	9	10	14	8	6	8	13	11	8	11	6	106
Methods	0	0	5	7	9	5	3	1	6	3	2	7	4	52

Extras

Matlab version https://kr.mathworks.com/matlabcentral/fileexchange/60128-sequential-knn-imputation-method

Comparison of Imputation Algorithms

LSimpute: accurate estimation of missing values in microarray data with least squares methods, Trond Hellem Bo et. al., NAR, 2004 http://nar.oxfordjournals.org/content/32/3/e34

Missing value imputation using a fuzzy clustering-based EM approach, Md. Geaur Rahman et. al., Know. Inf. Syst., 2016 http://link.springer.com/article/10.1007/s10115-015-0822-y

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments, Magalie et. al., BMC Genomics, 2010 http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-11-15

Missing value imputation for gene expression data: computational techniques to recover missing data from available information, Alan Wee-Chung Liew et. al., Briefings in Bioinformatics, 2010 http://bib.oxfordjournals.org/content/12/5/498.long

An experimental study on the use of nearest neighbor-based imputation algorithms for classification tasks, Jonathan de andrade silva et. al., Data & Know. Engineering, 2013 http://www.sciencedirect.com/science/article/pii/S0169023X12001176

Application of Imputation Algorithms

Imputing gene expression from selectively reduced probe sets, Yoni Donner et. al., Nature Methods, 2012 http://www.nature.com/nmeth/journal/v9/n11/full/nmeth.2207.html

SKNN

Shujaat Khan, Seongyong Park

2016년 10월 14일