It is important first of all to consider possible reasons for the missing values. If one has prior knowledge about why values are missing, one should use that knowledge when handling them. Values could, for instance, be missing because they fall below a certain detection level. But even without any a priori knowledge about why values are missing, one should try to understand their possible structure: can one find non-random behavior in their occurrence that can be understood from a biological or experimental viewpoint, and that can be used in further analysis of the data or in the future design of experiments?
If the variables measured in an experiment are statistically independent, and the subset of samples with missing values for a specific single variable is completely random, then one can argue that it is reasonable to delete the samples with missing values when analyzing that variable alone, and that this will lead to statistically sound results for this isolated variable. In this idealized case a good strategy for imputing values, if necessary, would be to estimate the distribution of the variable from its non-missing measurements and then draw random numbers from this distribution as reconstructed values.
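The idealized strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular software does it: the distribution is approximated by the empirical distribution of the observed values (a draw with replacement), missing values are encoded as NaN, and the function name `impute_by_random_draw` is hypothetical.

```python
import numpy as np


def impute_by_random_draw(values, rng=None):
    """Fill NaNs in a 1-D array by sampling from the observed values.

    Assumption: the empirical distribution of the non-missing values
    stands in for the estimated distribution of the variable.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    observed = values[~missing]
    if observed.size == 0:
        raise ValueError("no observed values to draw from")

    filled = values.copy()
    # One independent draw (with replacement) per missing entry.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    # Return the mask as well so that imputed entries can be flagged later.
    return filled, missing


measurements = [1.0, np.nan, 3.0, np.nan, 5.0]
filled, imputed_mask = impute_by_random_draw(measurements, np.random.default_rng(0))
```

A parametric alternative would be to fit, say, a normal distribution to the observed values and sample from that; which choice is appropriate depends on the data.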
Normally, though, there are non-random structures in the patterns of missing values and/or nontrivial correlation patterns among the measured variables. In these cases, deleting the samples with missing values when analyzing a selected variable will lead to bias and to unnecessary loss of potentially useful information. One can thus argue that, in the exploratory analysis of real-life data, it is in general statistically sound to try to impute the missing values before further analysis, using the information present in the data set. In multivariate data analysis it is in fact often necessary to impute values if one wants to use statistical methods that take correlation patterns among variables into account, since many such algorithms are only well defined for a complete data matrix.
It is always important to indicate all reconstructed (imputed) values, so that their impact on the biological conclusions drawn from the analyses can be evaluated. In Qlucore Omics Explorer the indication differs between plot types; in the table plot, for example, reconstructed missing values are shown in an italic font.
Two different methods for missing value reconstruction, or imputation, are implemented in Qlucore Omics Explorer (QOE). The default, and the fastest, is the mean value method. It replaces a missing value for a specific variable with the mean over all samples that have non-missing values for that variable. Qlucore Omics Explorer also offers the k-nearest neighbor (kNN) method, which exploits the correlation structure between variables in the data to impute missing values. The kNN method replaces a missing value for a given variable with a weighted mean over the k most similar variables that have non-missing values for the sample in question (see Troyanskaya et al., “Missing value estimation methods for DNA microarrays”, Bioinformatics, Vol. 17, No. 6, pp. 520-525, 2001, for a detailed description and evaluation of this method in connection with DNA microarrays).
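As an illustration of the mean value method, the following Python sketch fills each missing entry with the mean of the non-missing values of the same variable. It is not QOE's implementation; it assumes a matrix with variables as rows and samples as columns, NaN as the missing-value marker, and a hypothetical function name `impute_mean`. It also returns a boolean mask so that imputed values can be flagged in later output.

```python
import numpy as np


def impute_mean(data):
    """Mean value imputation.

    Assumed layout: 2-D array, rows = variables, columns = samples,
    with NaN marking a missing value.
    """
    data = np.asarray(data, dtype=float)
    imputed_mask = np.isnan(data)

    # Per-variable mean over the samples with non-missing values.
    counts = (~imputed_mask).sum(axis=1)
    sums = np.where(imputed_mask, 0.0, data).sum(axis=1)
    means = np.divide(
        sums, counts, out=np.full(data.shape[0], np.nan), where=counts > 0
    )

    # Variables with no observed values at all remain NaN.
    filled = np.where(imputed_mask, means[:, None], data)
    return filled, imputed_mask


data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan]])
filled, imputed_mask = impute_mean(data)
# filled[0, 1] is the mean of 1.0 and 3.0; filled[1, 2] is the mean of 4.0 and 5.0.
```

Keeping `imputed_mask` alongside the filled matrix is one simple way to keep reconstructed values distinguishable from measured ones.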
For data sets with very few variables, the kNN method is in general not recommended; if it is selected in this case, QOE replaces it with the standard mean value method.
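The kNN idea, including a fallback to the mean value method for small data sets, might be sketched as follows in Python. This is a simplified illustration and not QOE's algorithm: the distance (root mean squared difference over shared observed samples), the inverse-distance weights, the function name `impute_knn`, and the threshold `min_variables` are all assumptions made for the example; the source does not state which threshold or weighting QOE uses.

```python
import numpy as np


def impute_knn(data, k=10, min_variables=20):
    """Simplified kNN imputation over variables (rows = variables, NaN = missing).

    Hypothetical parameters: `min_variables` is an illustrative threshold
    below which the mean value method is used instead.
    """
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    n_vars = data.shape[0]

    # Mean value imputation, used as starting point and as fallback.
    counts = (~missing).sum(axis=1)
    sums = np.where(missing, 0.0, data).sum(axis=1)
    row_means = np.divide(sums, counts, out=np.full(n_vars, np.nan), where=counts > 0)
    filled = np.where(missing, row_means[:, None], data)

    if n_vars < min_variables:
        return filled, missing  # too few variables: keep the mean values

    for i, j in zip(*np.nonzero(missing)):
        # Candidate neighbors: variables observed in sample j.
        candidates = np.flatnonzero(~missing[:, j])
        # Samples observed in both variable i and each candidate.
        shared = ~missing[i] & ~missing[candidates]
        n_shared = shared.sum(axis=1)
        usable = n_shared > 0
        if not usable.any():
            continue  # no comparable variable: keep the mean value
        candidates, shared, n_shared = candidates[usable], shared[usable], n_shared[usable]

        # Root mean squared difference over the shared samples.
        diff = np.where(shared, data[candidates] - data[i], 0.0)
        dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)

        nearest = np.argsort(dist)[:k]
        weights = 1.0 / (dist[nearest] + 1e-12)  # closer variables weigh more
        filled[i, j] = np.average(data[candidates[nearest], j], weights=weights)

    return filled, missing


data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [1.0, 2.0, 3.0, np.nan],
                 [10.0, 20.0, 30.0, 40.0]])
knn_filled, mask = impute_knn(data, k=1, min_variables=2)   # uses the nearest variable
mean_filled, _ = impute_knn(data, k=1)                      # falls back to the mean
```

With only three variables, the second call takes the fallback path, which mirrors the behavior described above for data sets with very few variables.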