ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles

Background: Tumour samples containing distinct sub-populations of cancer and normal cells present challenges in the development of reproducible biomarkers, as these biomarkers are based on bulk signals from mixed tumour profiles. ISOpure is the only mRNA computational purification method to date that does not require a paired tumour-normal sample, provides a personalized cancer profile for each patient, and has been tested on clinical data. Replacing mixed tumour profiles with ISOpure-preprocessed cancer profiles led to better prognostic gene signatures for lung and prostate cancer.

Results: To simplify the integration of ISOpure into standard R-based bioinformatics analysis pipelines, the algorithm has been implemented as an R package. The ISOpureR package performs analogously to the original code in estimating the fraction of cancer cells and the patient cancer mRNA abundance profile from tumour samples in four cancer datasets.

Conclusions: The ISOpureR package estimates the fraction of cancer cells and personalized patient cancer mRNA abundance profile from a mixed tumour profile. This open-source R implementation enables integration into existing computational pipelines, as well as easy testing, modification and extension of the model.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0597-x) contains supplementary material, which is available to authorized users.

1. Organize files according to the R package structure from the beginning. MATLAB code can be spread over a complex directory hierarchy, whereas the code of an R package cannot. Flattening this hierarchy is a technically simple and essential first step; attempting to do it later is very time-consuming. The minimum suggestion (even if a package is not immediately created) is to collect all functions in one directory with no sub-directories and to create the corresponding .Rd files in another directory.
2. Arithmetic operations on integer or double data types should be straightforward to implement, as both MATLAB and R align in large part with the IEEE 754 standard according to their online documentation (Section on "Avoiding Common Problems with Floating-Point Arithmetic" at http://www.mathworks.com/help/matlab/matlab_prog/floating-point-numbers.html, and Section C.8 of the "R Installation and Administration" manual at http://www.r-project.org/, where IEEE 754 is also denoted IEC 60559; both accessed 11/11/2014). However, operations with complex numbers, comparisons, and data type changes (especially automatic changes in R affecting the matrix data type) can be problematic. Some examples are included in the next section.
3. At first, numerical accuracy is more important than speed. Attempts to speed up or optimize code can introduce numerical differences that are very difficult to pin down.
4. It is useful to eliminate randomness, for instance by loading values for initial conditions from a saved file in both MATLAB and R. Once the code calls no random functions, changing the random seed should not affect the output in any way. There is no "numerical noise": round-off, truncation, underflow and overflow errors within each language are completely reproducible across different random seeds, provided no random functions are called.
5. Use good coding practices. For instance, do not hard code pathnames in functions comparing MATLAB and R output data, and allow flexibility in the directories where the data is saved; to keep data and results organized, a new directory will need to be created for each test.
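
Recommendations 4 and 5 can be sketched in R as follows. This is a minimal illustration under stated assumptions, not code from ISOpureR: the function names `save_initial_values` and `run_model`, the text file format, and the use of `tempdir()` are all hypothetical. A plain text file is used only because both MATLAB and R can read it; values are written with 17 significant digits to lose as little double precision as possible.

```r
# Hypothetical helper: store initial values as text so that both MATLAB and R
# can load the same starting point.
save_initial_values <- function(values, out_dir,
                                file_name = "initial_values.txt") {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(out_dir, file_name)  # built from arguments, not hard coded
  writeLines(formatC(values, digits = 17, format = "g"), path)
  path
}

# Hypothetical stand-in for a model run: it reads its initial values from the
# file and calls no random functions.
run_model <- function(init_path) {
  init <- as.numeric(readLines(init_path))
  cumsum(log(init)) / seq_along(init)
}

# A new directory per test keeps data and results organized.
test_dir <- file.path(tempdir(), "test_001")

# The initial values are generated and saved once.
set.seed(1)
init_path <- save_initial_values(runif(5, min = 1, max = 2), test_dir)

# With the values loaded from the file, the seed no longer matters.
set.seed(123)
result_a <- run_model(init_path)
set.seed(456)
result_b <- run_model(init_path)

identical(result_a, result_b)
```

Because `run_model` depends only on the contents of the file, the two results are bit-for-bit identical regardless of the seed, which is the property to confirm before comparing R output against MATLAB output.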
Code-level recommendations

1. Basic operators and functions behave differently in MATLAB and R on types other than integer or double, or on "non-standard" input. Therefore, it is necessary to know whether any intermediate operations result in NaN or NA, produce complex numbers, pass negative numbers to a log, etc.
(a) (Greater/less than) Comparisons with NaN values in MATLAB return false (i.e. a logical value of 0), whereas in R they return NA. Including such comparisons in conditional statements will make R stop with an error, but not MATLAB. (Comparisons of complex values also differ: in that case R produces an error.)
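
The R side of this difference can be seen in the short sketch below. The helper name `safe_less_than` is hypothetical, and wrapping a scalar comparison in `isTRUE()` is only one possible way to reproduce the MATLAB result of treating a comparison with NaN as false; it is not necessarily the approach taken in ISOpureR.

```r
x <- NaN

# In R, a comparison involving NaN yields NA rather than FALSE.
cmp <- 4 < x
is.na(cmp)

# Using NA as an if() condition is an error in R; tryCatch() captures it here
# so that the script can continue.
outcome <- tryCatch(
  if (4 < x) "less" else "not less",
  error = function(e) "error"
)
outcome

# Hypothetical helper for scalar comparisons: isTRUE() maps NA to FALSE,
# mimicking the MATLAB result.
safe_less_than <- function(a, b) isTRUE(a < b)

safe_less_than(4, NaN)
safe_less_than(4, 5)
```

Note that `isTRUE()` only applies to a single value, so a vectorized comparison would need a different guard, for example an explicit `is.nan()` check before the comparison.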

MATLAB:
>> 4

4. (Matrices) Use as.matrix() to convert a data frame to a matrix, but use matrix() with nrow = 1 to convert a vector into a row matrix; using as.matrix() on a vector will automatically make it a one-column matrix.

Debugging in R

1. One of the most useful debugging tools in R is options(error=recover), which opens browser mode when an error occurs. When debugging a complicated, iterative function, it is very useful not to have to stop at every iteration of a troublesome function, or to print pages of intermediate numbers, but to be able to stop only at the location of the error.
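
Returning to the matrix recommendation (item 4 above), the conversions can be illustrated with a short sketch. The data frame and vector are made-up inputs. The `nrow = 1` argument is what makes `matrix()` return a row matrix, since `matrix()` called on a vector without dimension arguments also returns a single column. The last two lines show a related pitfall that is added here for illustration: extracting a single row silently drops the matrix structure.

```r
# Made-up inputs for illustration.
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
v  <- c(10, 20, 30)

# A data frame becomes a matrix with the same shape.
m_df <- as.matrix(df)
dim(m_df)    # 3 2

# as.matrix() on a vector gives a one-column matrix.
m_col <- as.matrix(v)
dim(m_col)   # 3 1

# matrix() with nrow = 1 gives a row matrix.
m_row <- matrix(v, nrow = 1)
dim(m_row)   # 1 3

# Related pitfall: extracting a single row returns a plain vector
# unless drop = FALSE is given.
is.matrix(m_df[1, ])                 # FALSE
is.matrix(m_df[1, , drop = FALSE])   # TRUE
```

Checking `dim()` after each conversion is a cheap way to catch shape changes before they reach matrix multiplication, where MATLAB and R would otherwise diverge.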