- Open Access
An improved independent component analysis model for 3D chromatogram separation and its solution by multi-areas genetic algorithm
BMC Bioinformatics volume 15, Article number: S8 (2014)
The 3D chromatogram generated by High Performance Liquid Chromatography-Diode Array Detector (HPLC-DAD) has been researched widely in the field of herbal medicine, grape wine, agriculture, petroleum and so on. Currently, most of the methods used for separating a 3D chromatogram need to know the compounds' number in advance, which could be impossible especially when the compounds are complex or white noise exist. New method which extracts compounds from 3D chromatogram directly is needed.
In this paper, a new separation model named parallel Independent Component Analysis constrained by Reference Curve (pICARC) was proposed to transform the separation problem to a multi-parameter optimization issue. It was not necessary to know the number of compounds in the optimization. In order to find all the solutions, an algorithm named multi-areas Genetic Algorithm (mGA) was proposed, where multiple areas of candidate solutions were constructed according to the fitness and distances among the chromosomes.
Simulations and experiments on a real life HPLC-DAD data set were used to demonstrate our method and its effectiveness. Through simulations, it can be seen that our method can separate 3D chromatogram to chromatogram peaks and spectra successfully even when they severely overlapped. It is also shown by the experiments that our method is effective to solve real HPLC-DAD data set.
Our method can separate 3D chromatogram successfully without knowing the compounds' number in advance, which is fast and effective.
For thousands of years, plants have played a dominant role in the development of sophisticated traditional herbal medicine (HM) systems [1, 2]. And nowadays, HM has also attracted much interest of both patients and scientists . However, herbal medicines are extracted with boiling water during the decoction process, which makes it very difficult to realize quality control . In 1991 , the World Health Organization (WHO) accepted chromatography fingerprint, which reflects the complex chemical composition of the analyzed sample based on spectroscopic, chromatographic or electrophoretic techniques , as a methodology for the assessment of natural products. But, two disadvantages exist for chromatography fingerprint: it relies on retention time which is not stable; it is chosen from only one specific wavelength which misses much information from other wavelength. So, 3D chromatogram generated by High Performance Liquid Chromatography-Diode Array Detector (HPLC-DAD) was researched widely [7–10]. The construction of HPLC-DAD dataset is illustrated in Figure 1, in which there are three compounds contained in the solution for example.
A drop of sample is injected at the top of a column with absorbent. A cup of solvent, same as that in the sample, carries the sample through the column. Different compounds will receive different resistance when they go through the column. Given an ultraviolet detector at the bottom of the column, a chromatogram peak represented by is formed to reflect the concentration for corresponding compound. If using a DAD for detection, which has more than one thousand channels to detect multi-wavelength simultaneously, besides chromatogram peaks of the outflowing compounds, spectra represented by will also be recorded. The matrices of and represent ith compound and the mixture of all the compounds respectively. Their relationship is given as
where, the variable of n is the number of compounds contained in the solution, which equals to 3 for the example in Figure 1.
There are many methods to separate in (1), such as evolving factor analysis (EFA), heuristic evolving latent projections (HELP), window factor analysis (WFA), orthogonal projection resolution (OPR), evolving window orthogonal projections (EWOP), iterative target transformation factor analysis (ITTFA), alternating regression (AR), parallel factor analysis (PARAFAC1/2), multivariate curve resolution-alternating least squares (MCR-ALS) and interactive self-modelling mixture analysis (SIMPLISMA), alternating trilinear decomposition (ATLD) and immune algorithm (IA). However, all these method need the number of the compounds to be known in advance. And the method to obtain the compounds' number is based on Eigenvalue, which will miss small peaks especially when noise is severe. Recently, Independent Component Analysis (ICA) was introduced in this field, which considered compounds and noises as independent components. But two disadvantages existed: 1) noises was considered as independent components, which gave unexpected and useless information in the results; 2) identifying compounds from noises after separating all the independent components was still needed.
In order to extract compounds directly from the data set, this paper proposed a parallel model of Independent Component Analysis constrained by Reference Curves (pICARC) and its solution by multi-areas Genetic Algorithm (mGA). In section 2, the principle of pICARC and mGA were proposed. In section 3, simulations and experiments were provided to show the performance of our method. Finally, conclusions and future works were summarized in section 4.
The principle of our method is illustrated in Figure 2. The left thick box is the mathematical part, and the right thick box are corresponding data sets in the reality. Firstly, we construct a kind of Reference Curve (RC) with different parameter of based on the priori knowledge. Then inputting and into the pICARC model, Calculated Curves (CCs) will be obtained. The distances, , between and are calculated by the Measurement Operator (MO) . Combining all the elements contained in the dash polygon together, it is called generalized pICARC model, which has converted the separation problem to a multi-parameter optimization issue. Next, what is needed is to find the parameters of , which minimizes the value of .
In this paper, the algorithm of mGA, which is proposed with reference to the features of , is used to search . After all the have been found, will be obtained, whose row vectors are the chromatogram peaks for individual compounds. By using an estimator, the spectra of the compounds will be obtained. The priori knowledge has been introduced within our method. What the user needed to do is just to input your data set.
Following, the priori knowledge about chromatogram peaks, the pICARC model, MO, mGA and the estimator will be introduced one by one.
The priori knowledge
According to the physical principle of chromatography, each peak of chromatogram looks like certain curves such as Gaussian curve, Log-normal curve, Gamma curve and Weibull curve, etc. , among which, the Gaussian function was used prevalently for simulating chromatogram peaks in many relevant researches. [25, 26] So, we will use Gaussian curve, which is illustrated in equation (2) and Figure 3, in our research. There are two parameters: and .
where, the factor of is removed to set the maximum value to 1. The amplitude information will be found in the spectra.
The range of the parameter is decided by the data set, whose minimum value is and maximum value is . According to the quantile of 99.73%, we have the following inequality
The model of ICA is represented as (4)
where are the observation vectors; is the mixed matrix; are the source vectors; the subscript of t is the number of the samples in and . The purpose of this model is to obtain and only based on under four assumptions. According to the separability theorem of ICA, we could trust ICA to separate HPLC-DAD data set as long as the number of the wavelength is greater than the number of the compounds. In 1999, the algorithm of fastICA was proposed to solve (4)[29, 30], of which the parallel form is shown as
where, is the first derivative of .
As mentioned in the introduction, two major problems were found when applying the ICA model to HPLC-DAD data set. Therefore, suitable modification about ICA based on the priori knowledge of chromatogram peaks is needed to constrain the shape of the source signals. What should be noted is that the variable used for calculation in ICA model is the , which generates a signal of shown in equation (7). However, the curve that should be constrained is , which is a calculated signal to approximate . The is the whitening matrix. There is a difference between and , which is caused by pre-processing.
In order to avoid introducing a new variable, which should represent the difference between and , the model is constructed as
The purpose of MO is to measure the distance between and . Here, the vector of contains all these distances. We gave the definition of as
where, j represents every element in the vectors of and
Multi-area genetic algorithm
GAs are a family of computational models inspired by evolution. These algorithms encode a potential solution to a specific problem on a simple chromosome-like data structure, and apply recombination operators to these structures in such a way as to preserve critical information. Usually, GAs are used to find one global optimal point in the searching plane. But in our problem, several points in the plane should be found as solutions simultaneously. In order to search multi optimal points, we propose an algorithm named multi-areas Genetic Algorithm (mGA).
Areas are circles which are composed of chromosomes that are closed to one another, where the candidate solution will be found. If the fitness of one solution is much better than the others, the area around it will have better fitness as well. This will lead many elites assemble into this area. In order to balance the number of elites among all the areas, emigrant and immigrant policy are adopted. Except limited chromosomes left in every area, the others will be selected as emigrants, which will keep balance among population in all areas. Immigrant policy under certain criterion introduces new chromosomes into all areas.
The flow chart of mGA is illustrated in Figure 4. Firstly, parameters, which will be depicted in step 1), are initialized. Secondly, initial population is generated. Then, multi-areas have been formed among the population. Each of these areas has a center and a radius. Only in the areas, whose radius are big enough, the elite chromosomes have the right to match with each other and generate children. Children with high fitness values will replace the inferior individuals. Only a certain number of those top ranking chromosomes in an area will be kept for the next generation, the others will be selected as emigrants to avoid too many chromosomes converging in one dominant area. Later, the emigrants will be allocated into different areas under certain law, which will be descripted in step 6).
After the migration process, new generation has been formed. New multi-areas will be separated among the new generation again. The iteration continues until there is no area with the size big enough to generate children. Finally, only several areas left, which contain the candidates for solution. Solutions are found according to the decision as described in step 7).
Parameters initialization: The major parameters include the number of the population, the number of elites, length of the chromosome and the fitness function. The first three are decided according to the size of the search plane. The fitness of every chromosome is given by equation (9). Here, smaller the fitness is, more superior the chromosome is.
Population initialization: The population is equal to the row number of . In order to get initial population with better fitness, two steps were adopted. Firstly, the search space is separated into several sub spaces equally, and maximum number of chromosomes is randomly generated in every sub spaces. Then, top ones according to their fitness value are selected as initial population.
Multi-areas separation: Multi-areas are formed among the population according to following steps: (a) select the chromosome with best fitness as the center of an area; (b) give an user-defined radius and draw a circle; (c) calculate the distance between center and the other chromosomes contained in this area, update the radius of this area with the largest distance; (d) for chromosomes not belonging to a specific area, repeat Step (a) to (d) until no more chromosome left.
Some of these circles (areas) may overlap with each other. For those circles with severe overlapping, they should be merged as a big area for immigration; this will be further described in step 6).
Mating and reproduction: Every elite finds a mate which has the highest hamming distance, i.e. the number of different bits between them, by the sequence ordered of fitness. New chromosomes, the children, are generated by crossing every different bit between the mated pairs. If a child has better fitness value than any one of its parent, it will be classified as excellent child, which will be reserved; otherwise, it will be dropped.
Select migrants: Only limited chromosomes, whose number is user-defined depending on application, according to their fitness will stay in one area, the others are selected as emigrants. In this step, only the number of the migrants is decided. The generation of immigrants will be descripted in step 6).
Migrant quota among areas allocation: For every circle (area), more elites in the previous generation, less immigrants will be allowed, to avoid too many elites in one area. If two areas are severely overlapped, they will be merged as one big area to receive immigrants. Otherwise, the quota will be unfair because of the overlap. Take the experiments in this paper as an example. There is a severe overlap between area 1 and area 3. And there are 23 elites in area 1 and 1 elite in area 3. If the area 1 and area 3 are not combined as one, there will be a small number of migrants for area 1 but many for area 3. The immigrant in area 3 can also be considered as immigrant for area 1 because of the overlap. There is also an overlap between area 1 and area 5, but the center of area 5 is at the boundary of the plane, which is ignored for further evolution.
The chromosomes with the same number as the initial population will be generated randomly for every circle (area), but only top ones decided by the quota are selected as immigrants.
Find solution: As the algorithm proceeds, the radii of the areas will become smaller. When all the radii become small enough, the program ends. The center of each area is the candidate solution. If the center is a local minimum, it will be selected as a solution.
After obtaining all the chromatogram peaks, , by solving equation (8), the spectra can be estimated with and with considering the noise contained in . For simplicity, we ignore the noise in this paper to calculate spectra by
Results and discussion
In this section, a group of simulations were given to explain the principle of our method. Then several experiments on a real HPLC-DAD data set were given to demonstrate the practicability of our method. Two criteria were used to evaluate our method: 1) to see whether the compounds' number found by our method was right; 2) to see the errors between the true spectra and calculated spectra.
Simulations and discussion
As illustrated in Figure 5, five compounds' chromatogram peaks, which are represented by different parameters of respectively, are constructed in the simulation dataset. The parameters for the compounds' chromatogram peaks from 1st to 5th are: (50, 21), (75, 12), (90, 10), (155, 17) and (175, 9) respectively. These initial parameters' distribution on the plane is shown in Figure 6. There are many areas for the initial parameters, but few areas for the final parameters. The program is available from the corresponding author.
From the simulation, we can see:
The method proposed in this paper could separate 3D chromatogram into chromatogram peaks and spectra effectively without know the compounds' number in advance even severe overlap exist. The pICARC model transformed the separation problem to a multi-parameter optimization issue, which could be solved by swarm intelligent algorithm. The algorithm of mGA could find all the solutions simultaneously.
Sometimes, the result given by this method was incorrect. This is because that the chromosome is initialized randomly, which will cause undetermined situation. This problem can be solved by running program multiple times and compare the candidate results to obtain the final results. Ten times of simulation have been done in this simulation and seven times have the correct result, which is list in Table 1.
The implementation of the mGA is fast. Among ten times of simulation, the slowest one took 11 steps and 4.3652 seconds.
Experiments and discussion
The program of experiment is available for free from the corresponding author. The data set of "adataset.mat", which is used in the experiment, can be downloaded from http://www.mcrals.info for free . The data set is illustrated in Figure 7. The data set is a three-compound system with two pesticides identified and one unknown interferent. The three-way data set is formed by one matrix with the three compounds and two matrices of standards with one known compound.
Ten experiments were run totally. As the population initialization was randomly, we just list the initial population and initial multi-areas for the first experiment in Table 2. Among the ten experiments, eight gave the same results: (19, 7), (31, 13) and (60, 10). The initial population distribution and the final population distribution of one experiment are illustrated in Figure 8. The calculated results are illustrated in Figure 9.
From the results, we can see:
The method proposed in this paper can separate true HPLC-DAD data set into chromatogram peaks and spectra without know the compounds' number in advance. In the ten experiments, eight gave correct results. And the maximum step and time cost are 16 and 2.6854 seconds.
In (a) of Figure 9, the Profile is the projection of the data set along the axis of wavelength. And the chromatogram peaks are products of the calculated curves by the maximum value of the corresponding calculated spectra. The reason that the second peak is higher than the profile is that there is some values in the spectra are negative. The reason for the negative will be discussed in next item.
In (b) and (c) of Figure 9, there are still errors between the calculated spectra and true spectra. This could be caused by following two reasons: noise existed in the data set; difference between the reference curve and true chromatogram peaks. For the first reason, the estimator, shown in equation (10) should be improved. For the second reason, more detailed reference curve should be proposed to replace equation (2).
In this paper, the pICARC model and its solution by mGA were proposed. A priori knowledge of chromatogram peaks is introduced into ICA model. And the shape of the chromatogram peaks, which are the source signals in the ICA model, is constrained with certain kind of function with parameters. Because the Gaussian curve is used widely in relevant field, we use Gaussian curve to constrain the shape of the chromatogram peaks. In order to solve this model, which contained several objective points in the plane, we modified the algorithm of GA to propose the algorithm of mGA. This algorithm separated all the population into multi-areas according to their fitness and distances from each other. Immigrant policy was adopted to keep the variety of the population. In order to avoid gathering too many elites into one dominant area, the immigrants were allocated according to the former elite number of the destination area. Finally, simulations and experiments were done to prove the performance of our model and its solution algorithm. In this section, conclusions were summarized firstly, and then future works were prospected.
The pICARC model transformed the separation problem of HPLC-DAD data set to a multi parameters optimization issue, which can be solved by abundant optimization algorithms.
By introducing a priori knowledge of chromatogram into the ICA model, only the useful signals will be picked out from the mixed data set. It is not necessary for us to know the compounds' number in advance or to discriminate the signal from noises after finding out all the independent components. This means that this model improves the accuracy as well as saves the time for calculation.
The algorithm of mGA is a useful method to search multiple objective points in the plane simultaneously. The information of chromosome' fitness and their distance from each other were used to cluster them into multi-areas. Immigrant policy and quota allocation law were made to keep the variety for every areas. Genetic operators were used to keep the evolution among areas.
According to the discussions in section 3, three major works are needed to be done in the future:
Improved estimator should be developed to eliminate the effect caused by noise existing in the data set. The estimator used in this paper, shown by equation (10), ignores the noise. So there are errors existing in the results.
More accurate reference curves should be introduced into the pICARC model. The other functions referred in section II should be used to see whether better results could be obtained. In reality, the chromatogram curve maybe more complex than the functions referred in this paper, new reference curves with more parameters could be proposed based on the experiments for out model.
Facing new reference curves, which could have more parameters than that used in this paper, new optimization algorithm should be developed to fit the new parameters space.
- HPLC-DAD :
High Performance Liquid Chromatography-Diode Array Detector
- ICA :
Independent Component Analysis
- pICARC :
parallel Independent Component Analysis constrained by Reference Curve
- GA :
- mGA :
multi-area Genetic Algorithm.
Cragg GM, Grouthaus PG, Newman DJ: Impact of natural products on developing new anti-cancer agents. Chem Rev. 2009, B: 3012-3043.
Tistaert C, Dejaegher B, Heyden YV: Chromatographic separation techniques and data handling methods for herbal fingerprints: a review. Anal Chim Acta. 2011, 690: 148-161. 10.1016/j.aca.2011.02.023.
WHO: Traditional Medicine strategy 2002-2005. 2002, Geneva: Switzerland
Liang YZ, Xie P, Chan K: Quality control of herbal medicines. J Chromatogr B. 2004, 812: 53-70. 10.1016/j.jchromb.2004.08.041.
WHO: Guidelines for the assessment of herbal medicines. 1991, Geneva: Switzerland
Royal Society of Chemistry: IUPAC Compendium of Chemical Terminology. 1997, Cambridge: UK
Marini F, Aloise AD, Bucci R, Buiarelli F, Magri AL, Magri AD: Fast analysis of 4 phenolic acids in olive oil by HPLC-DAD and chemometrics. Chemom Intell Lab Syst. 2011, 106: 142-149. 10.1016/j.chemolab.2010.05.006.
Vosough M, Mojdehi NR, Salemi A: Chemometrics assisted dispersive liquid-liquid microextraction for quantification of seven UV filters in urine samples by HPLC-DAD. J Sep Sci. 2012, 3575-3585. 35
Liu X, Wu Z, Yang K, Ding H, Wu Y: Quantitative analysis combined with chromatographic fingerprint for comprehensive evaluation of Danhong injection using HPLC-DAD. J Pharm Biomed Anal. 2013, 76: 70-74.
Li G, Bu H, Yang MQ, Zeng X, Yang JY: Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis. BMC Genomics. 2008, 9 (Suppl2): S24-
Maeder M: Evolving factor analysis for the resolution of overlapping chromatographic peaks. Anal Chem. 1987, 59: 527-530. 10.1021/ac00130a035.
Kvalheim OM, Liang Y: Heuristic evolving latent projections: resolving two-way multicomponent data. 1. Selectivity, latent-projective graph, datascope, local rank, and unique resolution. Anal Chem. 1992, 64: 936-946. 10.1021/ac00032a019.
Malinowski ER: Window factor analysis: theoretical derivation and application to flow injection analysis data. J Chemom. 1992, 6: 29-40. 10.1002/cem.1180060104.
Liang Y, Kvalheim OM: Diagnosis and resolution of multiwavelength chromatograms by rank map, orthogonal projections and sequential rank analysis. Anal Chim Acta. 1994, 292: 5-15. 10.1016/0003-2670(94)00064-6.
Xu C, Jiang J, Liang Y: Evolving window orthogonal projections method for two-way data resolution. Analyst. 1999, 124: 1471-1476. 10.1039/a903782i.
Vandeginste B, Essers R, Bosman T, Reijnen J, Kateman G: Three-component curve resolution in liquid chromatography with multiwavelength diode array detection. Anal Chem. 1985, 57: 971-985. 10.1021/ac00283a005.
Karjalainen EJ: The spectrum reconstruction problem use of alternating regression for unexpected spectral components in two-dimensional spectroscopies. Chemom Intell Lab Syst. 1989, 7: 31-38. 10.1016/0169-7439(89)80109-1.
Vosough Maryam, Amir Salemi: Exploiting second-order advantage using PARAFAC2 for fast HPLC-DAD quantification of mixture of aflatoxins in pistachio nuts. Food Chemistry. 2011, 127: 827-833. 10.1016/j.foodchem.2011.01.017.
Tauler R, Barcelo D: Multivariate curve resolution applied to liquid chromatography-diode array detection. Trends Anal Chem. 1993, 12: 319-327. 10.1016/0165-9936(93)88015-W.
Cao L, Harrington PB, Liu J: SIMPLISMA and ALS applied to two-way nonlinear wavelet compressed ion mobility spectra of chemical warfare agent simulants. Anal Chem. 2005, 77: 2575-2586. 10.1021/ac0486286.
Zhao J, Wu H, Niu J, Yu Y, Yu L, Kang C, Li Q, Zhang X, Yu R: Chemometric resolution of coeluting peaks of eleven antihypertensives from multiple classes in high performance liquid chromatography: A comprehensive research in human serum, health product and Chinese patent medicine samples. J chromatogr B. 2012, 902: 96-107.
Shao X, Liu Z, Cai W: Resolving multi-component overlapping GC-MS signals by immune algorithms. Trends Anal Chem. 2009, 28: 1312-1321. 10.1016/j.trac.2009.08.003.
Debrus B, Lebrun P, Ceccato A, Caliaro G, Govaerts B, Olsen BA, Rozet E, Boulanger B, Hubert P: A new statistical method for the automated detection of peaks in UV-DAD chromatograms of a sample mixture. Talanta. 2009, 79: 77-85. 10.1016/j.talanta.2009.03.009.
Grimalt J, Iturriaga H: The resolution of chromatograms with overlapping peaks by means of different statistical functions. Analytica Chimica Acta. 1982, 139: 155-166.
Lin D, Lu X: Resolution of noisy HPLC-DAD data by two-dimensional wavelet transform and subwindow factor analysis. Chem J Chin Univ. 2005, 26: 1039-1042.
Zhang z, Chen S, Liang Y: Peak alignment using wavelet pattern matching and differential evolution. Talanta. 2011, 83: 1108-1117. 10.1016/j.talanta.2010.08.008.
Lathauwer LD, Moor BD, Vandewalle J: An introduction to independent component analysis. J Chemom. 2000, 14: 123-149. 10.1002/1099-128X(200005/06)14:3<123::AID-CEM589>3.0.CO;2-1.
Eriksson J, Koivunen V: Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Process, Lett. 2004, 11: 601-604. 10.1109/LSP.2004.830118.
Hyvarinen A: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans On Neural Networks. 1999, 10: 626-634. 10.1109/72.761722.
Hyvarinen A, Oja E: Independent Component Analysis: Algorithms and Applications. Neural Networks. 2000, 13: 411-430. 10.1016/S0893-6080(00)00026-5.
Belegundu AD, Chandrupatla TR: Constrained minimization. Optimization Concepts and applications in Engineering. 2011, 2nd edition. Edited by Cambridge University Press, 190-214.
Luenberger DG: Iterative methods of optimization. Optimization by vector space methods. 1969, Edited by John Wiley & Sons, Inc, 271-311.
Tauler R, Lacorte S, Barceló D: Application of multivariate curve self-modeling curve resolution for the quantitation of trace levels of organophosphorous pesticides in natural waters from interlaboratory studies. J of Chromatogr A. 1996, 730: 177-1. 10.1016/0021-9673(95)01206-0.
Lizhi Cui thanks school of information technologies, the University of Sydney for providing him with a Ph.D. fellowship. Lizhi Cui thanks China Scholarship Council for providing living expenses during the study in Sydney, and the students No. is 201206740061.
This supplement has not been supported by sponsorship.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 12, 2014: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S12.
The authors declare that they have no competing interests.
LC designed the pICARC model, designed the mGA algorithm and drafted the manuscript. JP and SKP supervised the manuscript, conceived of the whole method and revised the manuscript carefully. HC participated in the revision of this manuscript. JG and PK participated in the design of pICARC model and helped to draft the manuscript. KF supervised the manuscript from pharmacy view. ZL participated in the design of the simulations and experiments. All authors read and approved the final manuscript.
Lizhi Cui, Simon K Poon, Hao Chen, Junbin Gao, Paul Kwan, Kei Fan and Zhihao Ling contributed equally to this work.
About this article
Cite this article
Cui, L., Poon, J., Poon, S.K. et al. An improved independent component analysis model for 3D chromatogram separation and its solution by multi-areas genetic algorithm. BMC Bioinformatics 15, S8 (2014). https://doi.org/10.1186/1471-2105-15-S12-S8
- Independent Component Analysis
- Immigrant Policy
- Reference Curve
- Separation Problem
- Chromatogram Peak