Frequent contiguous pattern mining over biological sequences of protein misfolded diseases

Background: Proteins are an integral part of all living beings and are built from chains of many amino acids. To be functionally active, the amino acid chain folds in a complex way that gives each protein a unique 3D shape, and a minor error in this process may produce a misfolded structure. Genetic disorders such as Alzheimer's disease and Parkinson's disease arise from misfolding of proteins. Thus, identifying patterns of amino acids is important for inferring protein-associated genetic diseases. Recent studies on predicting amino acid patterns focused only on a simple protein misfolded disease, Chromaffin Tumor, using association rule mining. More complex diseases are yet to be attempted. Moreover, the association rules obtained by these studies were not verified with usefulness measures.
Results: In this work, we analyzed protein sequences associated with complex protein misfolded diseases (namely Sickle Cell Anemia, Breast Cancer, Cystic Fibrosis, Nephrogenic Diabetes Insipidus, and Retinitis Pigmentosa 4) using association rule mining and objective interestingness measures. Experimental results show the effectiveness of our method.
Conclusion: By adopting quantitative experimental methods, this work forms more reliable, useful and strong association rules, i.e. dominating patterns of amino acids, for complex protein misfolded diseases. Thus, in addition to the usual applications, the identified patterns can be useful in discovering medicines for protein misfolded diseases and may thereby open up new opportunities in medical science for handling genetic disorders.

To survive, all living beings need proteins, whether in muscles or in cell membranes. Proteins are among the most important constituents of all living organisms [2]. A protein is built from hundreds of amino acids joined together by peptide bonds. To be functionally active, the amino acid chain folds in a complex way to give each protein a unique 3D shape. Protein folding is crucial for living organisms as it affects the gene skeleton. A small error in the folding process results in a misfolded structure, which can sometimes be lethal [3]. Protein misfolding is believed to be one of the primary causes of genetic disorders such as Alzheimer's disease, Parkinson's disease, Huntington's disease, Sickle cell anemia, Cystic fibrosis, Cancer and many other degenerative and neurodegenerative disorders [4]. Protein misfolding may occur due to an unwanted mutation in the amino acid sequence or because of an error in the folding process. Thus, the relationship between these amino acids is vital in the case of protein misfolded diseases.
Frequent pattern mining helps to find recurring relationships, associations and correlations in a given data set [1]. Patterns can be represented as association rules, and an association rule is said to be strong if it satisfies both a minimum support threshold and a minimum confidence threshold. Therefore, frequent pattern mining can be used to form association rules among the most dominating amino acids for different protein misfolded diseases. To the best of our knowledge, three studies [2,5,6] have addressed this issue. However, all of them focused on predicting patterns and association rules of the most dominating amino acids that cause Chromaffin Tumor disease only. Predicting patterns and associations for more complex diseases is yet to be attempted in the literature. Moreover, the association rules obtained by these studies were not verified with usefulness measures.
The aim of this paper was to analyze protein sequences associated with complex protein misfolded diseases (namely Sickle Cell Anemia, Breast Cancer, Cystic Fibrosis, Nephrogenic Diabetes Insipidus and Retinitis Pigmentosa 4) and to identify frequent patterns among their amino acids. Association rule mining was used to predict patterns. An association rule was considered strong if it satisfied a minimum support threshold and a minimum confidence threshold. Only useful rules were then retained, using interestingness measures (i.e. Lift, Bi-lift, Bi-improve and Bi-confidence). By adopting a quantitative experimental method, this work forms more reliable and strong association rules among the most dominating amino acids of the corresponding proteins and identifies the dominating amino acid patterns of complex protein misfolded diseases. Identifying and reporting such amino acid patterns for these five genetic diseases may have wide-ranging implications. An improved capacity to identify relations among the most dominating amino acids in disease-related protein sequences could have an impact on the diagnosis, treatment and prevention of genetic disorders, and may thus open up new opportunities in medical science for handling the genetic disorders concerned.
This paper is organized as follows. The "Theoretical framework" section presents the theoretical background of related issues. The "Literature review" section gives an overview of related work. The experimental design is presented in the "Methodology" section, and the "Experimental results" section presents the data analysis and results. The "Comparison with previous studies" section provides a comparative analysis with previous studies. Potential implications of the findings of this work are discussed in the "Implication of the findings" section. The concluding remarks and future work are presented in the final section.

Theoretical framework
Some of the concepts and issues such as protein structure, protein associated diseases, association rule mining and their interestingness measures which have been considered in this paper are discussed below.

Amino acid and protein
To survive, all living beings need proteins. The biological activity of a protein is determined by the chemical properties of its amino acids. Amino acids are made from carbon, hydrogen, nitrogen and oxygen. Though more than 50 amino acids have been discovered, only 20 are used to make proteins in the human body. These 20 amino acids convey a vast array of chemical versatility within proteins [7]. Proteins are complex molecules, made up of hundreds of amino acids that are attached to one another by peptide bonds (Fig. 1), forming a long chain [8]. The amino acid sequence contains the information that determines how the protein will fold into a 3D structure and how stable the resulting structure will be.

Protein misfolding
A protein becomes functionally active when it acquires a unique 3D conformation through the complicated folding of the polypeptide chain coded from the nuclear genome (Fig. 2). If a protein is not folded properly, its functionality may be adversely affected. Proteins that are not able to achieve their native state, due either to an unwanted mutation in their amino acid sequence or simply to an error in the folding process, are recognized as misfolded.

Protein misfolding diseases
For the last couple of years, protein misfolding and its effects have become a matter of great concern. According to the prion researcher Susan Lindquist, 'protein misfolding could be involved in up to half of all human diseases' [9]. Many cancers and other  [13] and Retinitis Pigmentosa 4 [14]) have been experimented.

Frequent pattern mining in bioinformatics
Frequent patterns are either itemsets or subsequences or substructures which appear in a data set with a frequency that is equal to or higher than a threshold specified by the user. Data mining can be the most active technique to infer structure and principles of biological datasets and to solve biological problems. Pattern mining is useful in bioinformatics for predicting rules of certain elements in genes, for protein function prediction, for gene expression analysis, for protein fold recognition and for motif discovery in DNA sequences [13]. Thus, frequent pattern mining can be used to find recurring relationships, association and correlation between amino acids for protein misfolded diseases.

Association rule mining
Association rule mining is a kind of pattern mining that builds on frequent itemset mining. In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases [15]. Patterns can be represented as association rules, and an association rule is said to be strong if it satisfies both a minimum support threshold and a minimum confidence threshold. Therefore, frequent pattern mining can be used to form association rules among the most dominating amino acids for different protein misfolded diseases. To analyse, predict and manage bulk biological data, numerous computer algorithms and methods have been developed that help to compare and align biological sequences and to predict bio-sequence patterns [1]. In this work, the Apriori algorithm was used as the association rule mining tool to analyse, predict and identify the desired patterns of dominating amino acids in the protein sequences.

Interestingness measures for association rules mining
An association rule mining algorithm can generate a large number of association rules (patterns, or knowledge), but most of them carry redundant information and are of limited value. Therefore, it is essential to evaluate the interestingness (or usefulness) of the association rules before their practical use. In this work, objective measures were used to evaluate the interestingness of the rules. The benefit of objective measures is that they rely on statistical methods and quantitative values to determine the interestingness of rules, which makes them reliable, easy to operate and convincing. Objective measures include Support, Confidence, Lift, Improve, Validity, Influence and Conviction, as well as Bi-lift, Bi-improve and Bi-confidence, which are corrected forms of Lift, Improve and Confidence, respectively [16].
Objective measures support, confidence, lift and improve [17] were used by Islam et al. [18] to generate and detect strong and interesting association rules.

Support:
The support of an itemset X, supp(X), is defined as the proportion of transactions in the data set in which the itemset X appears. It indicates the popularity of an itemset.

Confidence:
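The confidence of a rule (X → Y) is the proportion of the transactions containing X that also contain Y, and is defined as: $\mathrm{conf}(X \to Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$.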

Lift:
The lift of a rule (X → Y) is defined as: $\mathrm{lift}(X \to Y) = \mathrm{supp}(X \cup Y) / (\mathrm{supp}(X)\,\mathrm{supp}(Y))$. The rule (X → Y) is considered a positively correlated rule if its Lift value is greater than 1. Thus, only rules whose Lift value is greater than 1 are useful.
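The three measures above can be illustrated with a short sketch. It assumes the standard textbook definitions of support, confidence and lift; the function names and the four toy subsequences are hypothetical and are not taken from the experiments of this work.

```python
def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= set(t))
    return hits / len(transactions)


def confidence(x, y, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    return supp(set(x) | set(y), transactions) / supp(x, transactions)


def lift(x, y, transactions):
    """lift(X -> Y) = conf(X -> Y) / supp(Y)."""
    return confidence(x, y, transactions) / supp(y, transactions)


# Hypothetical toy data: four short amino acid subsequences (assumption).
transactions = ["GTAN", "GTAK", "ANKL", "KLMP"]

print(supp("GT", transactions))             # 0.5
print(confidence("GT", "A", transactions))  # 1.0
print(lift("GT", "A", transactions))        # greater than 1: positively correlated
```

Each subsequence is treated as an unordered set of residues here, which is an assumption about how transactions are formed.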

Improve:
Improve is a relatively new interestingness measure for association rules, proposed in response to the defects of the traditional interestingness measures. However, Support, Confidence, Lift and Improve each have their own limitations.

Limitation of support and confidence
Because the support threshold is selected subjectively, many infrequent itemsets that are discarded may have potential value. Rules are called strong association rules if their Support and Confidence are larger than the respective minimum support and minimum confidence thresholds. However, strong association rules are not always effective: some are not what users are interested in, and some are even misleading [19].
$$\mathrm{supp}(X) = \frac{\text{No. of transactions in which itemset } X \text{ appeared}}{\text{Total no. of transactions}} \qquad (1)$$

Limitation of lift
Lift treats events A and B symmetrically. According to Lift, (A → B) and (B → A) are the same; that is, if we accept rule (A → B), then (B → A) should also be accepted, but this is not the case in practice [19].

Limitation of improve
[20] Firstly, how much improvement of probability can be called improvement? Secondly, the probability of occurrence of the antecedent seriously affects the Improve evaluation: when it is high, the Improve value will always be very small.
To overcome the shortcomings of Lift, Improve and Confidence, the literature [19] suggests the following corrections to the measures:
1. Bi-lift [19]
2. Bi-improve [19] To eliminate this influence, a correction is applied by multiplying by the ratio of the occurrence probability of the antecedent to the non-occurrence probability of the antecedent. The Bi-improve formula is given in [19]. The higher the Bi-improve (A → B), the better the rule A → B is.
3. Bi-confidence [19] The confidence of an association rule considers only the occurrence probability of "B" when "A" occurs, and does not consider the relationship between "A" and "B" when "A" does not occur. This makes many mined association rules invalid. To address this problem, the concept of Bi-confidence is defined in [19]. The value range of Bi-confidence is [−1, 1]. If the Bi-confidence is greater than 0, then "A" and "B" are positively correlated. If the Bi-confidence is equal to 1, then "A" and "B" always appear together or are absent together in the record set. If the Bi-confidence is equal to 0, then "A" has no relation with "B". If the Bi-confidence is less than 0, then "A" and "B" are negatively correlated. The higher the Bi-confidence (A → B), the better the rule A → B is.
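For orientation, the block below writes out probability forms of Lift, Bi-lift and Bi-confidence. These forms are an assumption on our part: they are the commonly used definitions and are consistent with the properties stated in this section (the [−1, 1] range of Bi-confidence and its sign condition), but they should be checked against [19] before use. Bi-improve is omitted because its exact form cannot be recovered from the text here.

```latex
\begin{align*}
\mathrm{lift}(A \to B) &= \frac{P(AB)}{P(A)\,P(B)} \\
\text{Bi-lift}(A \to B) &= \frac{P(B \mid A)}{P(B \mid \bar{A})} \\
\text{Bi-confidence}(A \to B) &= P(B \mid A) - P(B \mid \bar{A})
  = \frac{P(AB) - P(A)\,P(B)}{P(A)\,\bigl(1 - P(A)\bigr)}
\end{align*}
```

Under these forms, Bi-confidence is positive exactly when $P(AB) > P(A)P(B)$, and, unlike Lift, neither Bi-lift nor Bi-confidence is symmetric in A and B.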

Literature review
Frequent Contiguous Patterns (FCP) are small patterns that occur repeatedly in a database, and they are especially common in bio-sequences. Biological sequences such as DNA and protein sequences consist of long linear chains of chemical components and typically contain a large number of items [21]. Frequent pattern mining helps to find recurring relationships, associations and correlations in a given data set [1]. In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases [15]. The challenging task in pattern finding over biological sequences is to find frequent contiguous patterns [1]. Data mining has gained popularity for classifying biological sequences and structures based on their critical features and functions [2]. Proteins are among the most important constituents of all living organisms [2]. Protein misfolding is believed to be the primary cause of genetic disorders such as Alzheimer's disease, Parkinson's disease, Huntington's disease, Sickle cell anemia, Cystic fibrosis, Cancer and many other degenerative and neurodegenerative disorders [4]. Proteins are made up of smaller building blocks called amino acids, joined together in chains [22]. The chains of amino acids fold up in complex ways, giving each protein a unique 3D shape. Thus, the relationship between these amino acids is vital in the case of protein misfolded diseases. Frequent pattern mining can be used to form association rules among the most dominating amino acids for different protein misfolded diseases. To the best of our knowledge, three studies [2,5,6] have addressed this issue.
Lakshmi and Hariharan [5] aimed to predict patterns by applying strong association rules over the frequent itemsets of the protein sequence Succinate dehydrogenase, which is involved in chromaffin tumor disease. The system generated frequent itemsets from the protein sequence and constructed a frequent pattern tree. Strong association rules were then generated based on a 90% confidence threshold to identify the dominating amino acids.
Lakshmi and Hariharan [2] conducted another similar study to find the most dominating amino acids (in the Succinate dehydrogenase protein) that cause chromaffin tumor disease. Here, the Apriori algorithm was used to find frequent items through candidate generation and then to generate association rules from those frequent itemsets. In predicting the pattern, this work used a minimum Support count of 5 and a Confidence threshold of 90%.
Dhumale carried out similar work [6] to find the dominating amino acids responsible for five diseases, i.e. Epilepsy, Hartnup, Cystinuria, Alzheimer and Chromaffin Tumor. The author claimed five amino acid patterns (association rules), each responsible for an individual disease. This work suffers from serious limitations. Firstly, the protein sequence used in the experiment is not identified. Secondly, the mentioned diseases might not all be associated with a single protein. The author did not provide any evidence for this information, and no authoritative literature was found in this regard. It should be noted that not all diseases are associated with protein changes; some are multifactorial diseases, some are infectious diseases, and so on. Thirdly, the author arbitrarily increased the minimum Support count from 2 to 5, generated association rules with a confidence threshold of 90%, and declared a set of amino acid patterns (association rules) as responsible for each of the diseases. The basis on which this deduction was made was not explained.
The above three works focused on predicting the patterns and association rules of amino acids that cause Chromaffin Tumor disease only. Finding patterns for other protein-associated diseases or more complex protein misfolded diseases is yet to be attempted in the literature. Moreover, it is important to predict interesting association rules for practical use, but the association rules obtained by these studies were not verified with usefulness measures.

Methodology
In this study, five protein misfolded diseases were taken into consideration. The protein sequences associated with each of the diseases were collected from a well-recognised protein data bank. The associative patterns among the amino acids were then identified using a data mining technique. To generate strong association rules from the amino acids of the disease-associated proteins, the minimum support count ranged from 3 to 5 and the minimum confidence was 90%. Based on the strong association rules, the proposed system focused on predicting, from the protein data sets, the amino acids that dominate over the others in causing the disease.

General work flow
The proposed system works in five steps. General work flow of the proposed system is shown in Fig. 3.
(1) The protein sequences (amino acid chains) associated with these diseases were collected in FASTA format from the protein data bank Universal Protein Resource (UniProt, www.uniprot.org/). UniProt is a comprehensive resource for protein sequence and annotation data. Its mission is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Because of its world-wide acceptance and high degree of reliability, the protein sequences for this work were collected from the UniProt protein knowledgebase. Table 1 shows the human diseases studied, their associated proteins and their lengths.
(2) Partitioning data set Each of the protein sequences (amino acid chains) was subdivided into amino acid subsequences of length 10. For example, the Hemoglobin Subunit Beta protein sequence (associated with Sickle Cell Anemia) contains 147 amino acids and was partitioned into 15 subsequences (14 of length 10 and a final shorter one), as shown in Table 2.
(3) Association rule mining The subsequences of amino acids were then used for associative pattern identification with the Apriori algorithm. Association rules were generated based on a minimum support count threshold and a minimum confidence level of 90%. The value of the minimum support count is usually decided subjectively by the researchers. The higher the minimum support count, the fewer and stronger the association rules for a particular confidence level. However, if the support count is too high, many interesting association rules may be discarded. In this work, the lengths of the protein sequences were not uniform. Thus, to generate and analyse a significant number of association rules, the minimum support count was subjectively set to 3 for the Hemoglobin Subunit Beta protein, 5 for the Breast Cancer Type 1 Susceptibility and Cystic Fibrosis Transmembrane Conductance Regulator proteins, and 4 for the Vasopressin V2 Receptor and Rhodopsin proteins.
(4) Measuring interestingness of association rules The association rule algorithm in the previous step generates a significant number of rules, but not all of them may be practically useful. Therefore, the interestingness of these rules was measured and evaluated. Such an evaluation can be conducted with objective or subjective measures. Considering effectiveness and stability of results, improved objective measures (i.e. Bi-lift, Bi-improve and Bi-confidence) were used to evaluate the association rules comprehensively. The Bi-lift, Bi-improve and Bi-confidence values of each association rule were calculated, and the rules were then pruned to retain only the useful ones.
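The partitioning step can be sketched as follows. This is a minimal illustration that assumes consecutive, non-overlapping windows; the function name `partition` and the placeholder sequence are hypothetical (the placeholder only has the same length as Hemoglobin Subunit Beta, not its actual residues).

```python
def partition(sequence, width=10):
    """Cut `sequence` into consecutive, non-overlapping windows of `width`."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Placeholder sequence of 147 residues (NOT the real Hemoglobin Subunit Beta).
demo = "ACDEFGHIKLMNPQRSTVWY" * 7 + "ACDEFGH"
chunks = partition(demo)
print(len(chunks), len(chunks[0]), len(chunks[-1]))  # 15 10 7
```

With 147 residues and a width of 10, the last subsequence holds the remaining 7 residues, which is why 15 subsequences are obtained.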
(5) Identification of patterns Based on the strong and useful association rules, this proposed system focused on predicting the most dominating amino acids, and thus the associative patterns among the amino acids were identified for each protein misfolded disease.
Combining Support and Confidence with Lift, Bi-lift, Bi-improve and Bi-confidence, a reasonable framework for identifying strong and interesting association rules was developed. In this work, the associative patterns among the amino acids were generated and measured in the following sequence:
1. Firstly, the Support and Confidence thresholds were used to filter out frequent itemsets and strong association rules.
2. Secondly, the Lift, Bi-lift, Bi-improve and Bi-confidence values were calculated.
3. Then, according to the Bi-lift, Bi-improve and Bi-confidence values, useful association rules were sorted out.
In practice, the final evaluation results of these three measures are very close to one another.

Algorithm
In this work, the algorithm takes four inputs: (i) the protein sequence of a particular protein misfolded disease, (ii) the minimum support count, (iii) the confidence threshold and (iv) the usefulness measuring parameter. The algorithm returns the strong and useful association rules of the most dominating amino acids for the protein misfolded disease concerned. The pseudocode is as follows: Islam et al. BMC Bioinformatics (2021) The procedure starts with the method generate_association_rules().
Step-1 In this step, the Dataset is generated by calling gen_subseq_dataset(Protein_Seq). This method splits the given misfolded protein sequence after every 10 elements, inserts the resulting subsequences into the Dataset and returns it.
Step-2 In this step, L1, the set of frequent itemsets of length 1, is generated by calling the method find_frequent_itemset_of_length_1(Dataset).
Steps-3, 4 In these steps, a loop runs until L(i−1) becomes empty. Here, Li denotes the set of frequent itemsets of length i. Li is generated by calling find_frequent_itemset(Dataset, L(i−1)). This procedure generates the frequent itemsets of length i from those of length (i − 1). It runs a nested loop that takes each pair of itemsets from L(i−1); if the two itemsets match in all items except the last one, it joins them and checks whether the joined itemset is frequent. If it is frequent, the itemset is inserted into Li. After completing this procedure, it returns Li.
Steps-6, 7 In these steps, a loop starting from L2 runs until L(i−1) becomes empty and finds the association rules by calling find_association_rules(L). In each iteration, it takes an itemset from Li and splits it into two parts at each position from first to last. It then calculates the confidence of each split, keeps the rules whose confidence is at or above the given threshold, and returns the set of rules. Finally, the association rules are stored in Rules.
Step-8 In this step, a loop runs over all items of Rules by calling find_association_measures(Rules). It calculates bi_lift, bi_confidence and bi_improve for each rule in Rules. Finally, the rules, together with their interestingness measures, are stored in R.
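The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function names `frequent_itemsets` and `strong_rules` are hypothetical, each subsequence is treated as an unordered set of residues, itemsets are kept as sorted tuples, and rules are formed by cutting an itemset into a prefix and a suffix. The interestingness measures of Step-8 are left out.

```python
from itertools import combinations


def support_count(itemset, dataset):
    """Number of subsequences that contain every residue of `itemset`."""
    return sum(1 for row in dataset if set(itemset) <= set(row))


def frequent_itemsets(dataset, min_count):
    """Level-wise (Apriori-style) search; returns {sorted tuple: count}."""
    items = sorted({residue for row in dataset for residue in row})
    level = [(r,) for r in items if support_count((r,), dataset) >= min_count]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset, dataset)
        # Join two itemsets that agree on everything except the last item.
        candidates = {a + (b[-1],) for a, b in combinations(level, 2)
                      if a[:-1] == b[:-1]}
        level = sorted(c for c in candidates
                       if support_count(c, dataset) >= min_count)
    return frequent


def strong_rules(frequent, min_conf):
    """Split each itemset at every position and keep confident rules."""
    rules = []
    for itemset, count in frequent.items():
        for cut in range(1, len(itemset)):
            antecedent, consequent = itemset[:cut], itemset[cut:]
            conf = count / frequent[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, consequent, conf))
    return rules


# Hypothetical toy subsequences (assumption), minimum support count 3.
dataset = ["GTAN", "GTAN", "GTAN", "ANKL", "KLMP"]
frequent = frequent_itemsets(dataset, min_count=3)
rules = strong_rules(frequent, min_conf=0.9)
for antecedent, consequent, conf in rules:
    print("".join(antecedent), "->", "".join(consequent), round(conf, 2))
```

Keeping itemsets as sorted tuples makes the join condition (equal prefixes, different last item) a simple slice comparison, and guarantees that every antecedent prefix has already been counted at an earlier level.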

Experimental results
The algorithm was implemented in C++ on a laptop computer with an Intel Core i5-7200U CPU (clock frequency 2.7 GHz) and 4 GB RAM. Experimental results were obtained for each of the protein sequences. During the computation, the number of iterations was not fixed; the algorithm continued until no further successful extensions were found. The work thus followed three basic actions:
a. Frequent itemset generation
b. Generation of strong association rules
c. Identification of interesting/useful association rules
In doing so, the following settings were used:
a. Support count thresholds of 3, 4 and 5 for frequent itemset generation.
b. A minimum confidence level of 90% to obtain strong association rules.
c. Lift, Bi-lift, Bi-improve and Bi-confidence as measuring instruments to find useful strong association rules.

Frequent itemsets generation
Frequent itemset generation means generating the frequent amino acid sets from the transactional protein datasets (subsequences). Frequent itemsets were generated for every protein sequence. The algorithm maintains a list of frequent amino acid sets from which strong association rules are later generated.
(1) Disease-1: Sickle cell anemia For Sickle Cell Anemia, the protein sequence Hemoglobin Subunit Beta was loaded as the input file. Here, 3 was used as the minimum support count. The process continued up to the 5th iteration and produced a total of 135 itemsets (comprising 1-itemsets to 5-itemsets) of amino acids. A few of the frequent itemsets generated for Sickle Cell Anemia are shown graphically in Fig. 4.

Fig. 4 A few frequent 4-itemsets and 5-itemsets obtained from protein sequence for Sickle Cell Anemia (axes: Itemsets (Amino Acids), Support Count)

(2) Disease-2: Breast cancer For Breast Cancer, the protein chain sequence Breast Cancer Type 1 Susceptibility Protein was loaded as the input file. This protein chain sequence consists of a total of 1863 amino acids. Here, due to the long length, 5 was used as the minimum support count. The process satisfied the threshold support count up to the 6th iteration and generated a total of 1806 itemsets (comprising 1-itemsets to 6-itemsets) of amino acids. Among these, there were 20 frequent 1-itemsets, 176 frequent 2-itemsets, 669 frequent 3-itemsets, 744 frequent 4-itemsets, 191 frequent 5-itemsets and 6 frequent 6-itemsets. A concise list of frequent itemsets generated for this disease is shown in Fig. 5.

(3) Disease-3: Cystic fibrosis
For Cystic Fibrosis, the protein chain sequence Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) (length 1480 amino acids) was loaded as the input file. Here, due to the long length, a minimum support count of 5 was used. The process continued up to the 6th iteration and produced a total of 1464 itemsets (comprising 1-itemsets to 6-itemsets) of amino acids. Among these, there were 20 frequent 1-itemsets, 178 frequent 2-itemsets, 607 frequent 3-itemsets, 563 frequent 4-itemsets, 95 frequent 5-itemsets and only 1 frequent 6-itemset. A concise list of frequent itemsets generated for this disease is shown in Fig. 6.

(4) Disease-4: Nephrogenic diabetes insipidus
For Nephrogenic Diabetes Insipidus, the Vasopressin V2 Receptor protein sequence was used as input. Here, due to the moderate length (371), a minimum support count of 4 was used. The process continued up to the 5th iteration and generated a total of 234 itemsets. A few of the frequent itemsets generated for Nephrogenic Diabetes Insipidus are shown in Fig. 7.

(5) Disease-5: Retinitis pigmentosa 4
The protein sequence Rhodopsin (Opsin-2) was loaded as input for Retinitis Pigmentosa 4 (RP4). Here, 4 was used as the minimum support count. The process continued up to the 5th iteration and generated a total of 268 itemsets. A few of the frequent itemsets generated for Retinitis Pigmentosa 4 are shown graphically in Fig. 8.

Strong association rules generation
The algorithm maintains a list of frequent itemsets (amino acid sets) for each protein sequence, and from this list the corresponding strong association rules are generated using a 90% confidence threshold in each case.
(1) Disease-1: Sickle cell anemia The process generated 698 association rules from 135 frequent itemsets. Among these rules, only 95 satisfied the minimum confidence level (90%) and were accepted as strong association rules; the remaining 603 rules were rejected. Examples of a few association rules from this phase are shown in Table 3.
(2) Disease-2: Breast cancer For Breast Cancer, the algorithm processed the protein sequence of the Breast Cancer Type 1 Susceptibility protein and generated a total of 1806 frequent itemsets of amino acids with a minimum support count of 5. A total of 20,884 association rules were generated from the 1806 frequent itemsets. Among these, only 80 rules satisfied the minimum confidence level (90%) and were accepted as strong association rules; the remaining rules were rejected. A few of these accepted rules are shown in Table 4.

(3) Disease-3: Cystic fibrosis
Here, the algorithm processed the protein sequence of the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) protein and generated a total of 1464 frequent itemsets of amino acids with a minimum support count of 5. A total of 14,792 association rules were generated from the 1464 frequent itemsets. Among these, only 96 rules satisfied the minimum confidence level (90%). These rules were accepted as strong association rules, and the remaining rules were rejected. A few of these accepted rules are shown in Table 5.

(4) Disease-4: Nephrogenic diabetes insipidus
Here, a total of 1152 association rules were generated from 234 frequent itemsets. Among these, only 54 rules satisfied the minimum confidence level (90%) and were accepted as strong association rules; the remaining rules were rejected. A few of the accepted rules are shown in Table 6.

(5) Disease-5: Retinitis pigmentosa 4
Here, a total of 1252 association rules were generated from 268 frequent itemsets, of which only 49 satisfied the minimum confidence level (90%) and were accepted as strong association rules; the remaining rules were rejected. A few of the accepted rules are shown in Table 7.

Useful association rules identification
The strong association rules obtained in the previous process had to be evaluated with suitable measures to identify the useful strong association rules. The objective measures Lift and Improve have been used for this purpose [18]. However, Lift and Improve have some limitations, as discussed in the "Interestingness measures for association rules mining" section. Thus, considering effectiveness and stability of results, improved objective measures (i.e. Bi-lift, Bi-improve and Bi-confidence) were used in this work, as mentioned earlier, to evaluate the association rules comprehensively.
The Lift, Bi-lift, Bi-improve and Bi-confidence values of each association rule were calculated, and only useful rules were finally sorted out based on the following criteria:
• The rule (A → B) is considered a positively correlated rule (the emergence of "A" promotes the emergence of "B") if its Lift value is greater than 1. Thus, only rules whose Lift value is greater than 1 are useful.
• If the Bi-confidence value is greater than 0, then P(AB) > P(A)P(B), which shows that "A" and "B" are positively correlated. Thus, only rules whose Bi-confidence value is greater than 0 are useful. The higher the Bi-confidence (A → B) value, the better the rule A → B is.

Disease-1: Sickle cell anemia
In case of Sickle Cell Anemia, 95 rules were considered as accepted strong association rules (as per previous step) which were further evaluated to determine their usefulness. In doing so, Lift, Bi-lift, Bi-improve and Bi-confidence values of each of these association rules were calculated and shorted out based on the criteria stated in the earlier paragraph. Finally 59 rules were selected as useful strong association rules (Table 6) and rest 36 rules were redundant or might be misleading and thus not effective (Table 8).
In this case, the first accepted useful association rule is GT → AN, as it satisfies the required criteria as shown below:
Criteria-1: Lift value should be greater than 1. Test: Here, Lift(GT → AN) = 3.75, which is greater than 1. So, criteria-1 is satisfied.
Criteria-2: The higher the Bi-lift(A → B) value, the better the rule (A → B) is. Test: Here, Bi-lift(GT → AN) = 12, which is a high positive value. So, criteria-2 is satisfied.

Disease-2: Breast cancer
Similarly, in the case of Breast Cancer, the Lift, Bi-lift, Bi-improve and Bi-confidence values of 80 accepted rules were calculated and evaluated. Finally, 19 rules were selected as useful strong association rules; the remaining 61 rules were redundant or potentially misleading and thus not effective (Table 9).

Disease-3: Cystic fibrosis
In the case of Cystic Fibrosis, the algorithm processed the corresponding protein sequence and generated 96 accepted strong association rules. Based on the Lift, Bi-lift, Bi-improve and Bi-confidence values of these rules, 35 rules were finally selected as useful strong association rules; the remaining 61 rules were redundant or potentially misleading and thus not effective (Table 10).

Summary of the result
Considering the limitations of earlier studies, this work designed a uniform method to predict the patterns and association rules of the most dominating amino acids for different protein misfolded diseases. The support thresholds were kept relatively low in order to examine a large number of frequent patterns and their association rules. The rules were then tested using the improved objective measuring tools (Bi-lift, Bi-improve and Bi-confidence) to evaluate them comprehensively. Finally, the following patterns and useful strong association rules of the most dominating amino acids for the studied protein misfolded diseases were found as the outcome:

Comparison with previous studies
As already mentioned, all previous studies in this area focused on predicting the patterns and association rules of the most dominating amino acids associated with Chromaffin Tumor disease only. As per the literature [2,5,6], the following accepted strong association rules were generated for Chromaffin Tumor disease:
• PI → K [2,6]
• I → K [5]
• V → L [5]
In this work, the same protein sequence (involved in Chromaffin Tumor disease) was tested, and the result is shown in Table 13.
From this table it is evident that the PN → L and PI → K rules generated by the literature [2,5,6] are useful strong association rules, whereas I → K and V → L are redundant and should thus be rejected. On the other hand, F → D, DN → L and KLY → P are useful strong association rules that were discarded by the literature.

Implication of the findings
Patterns in protein sequences are important in many respects. Pattern identification can be used for predicting protein functions, protein fold (structure) recognition, protein family detection, multiple sequence alignment, etc. Moreover, protein patterns can be used to predict the functions of newly discovered or unknown proteins or to screen genomic databases for other proteins with similar functionality [23]. This work focuses on predicting the patterns and association rules of the most dominating amino acids in the protein sequences associated with particular protein misfolded diseases.
Thus, the identification and reporting of such amino acid variants for these five genetic diseases may have versatile implications. Some implications of these findings relate to medical science, some to Genetics, Bioinformatics and Biotechnology, and some to Protein Sequencing Research, as highlighted below:
• The findings can be applied to gene studies through DNA sequencing, so that particular mutations can be edited through research.
• With the information obtained from such data mining, prenatal diseases can be identified.
• An improved capacity to identify the relations among the most dominating amino acids in disease-related protein sequences will have an immediate impact on the diagnosis, treatment, and prevention of genetic disorders. As more population-based data are accumulated, amino acid based diagnosis will become more common and the potential for somatic cell gene therapy will increase. Furthermore, the availability of molecular probes for specific gene loci will permit detection of the carriers of disease-associated genes. (G. N. N. Sultana, personal communication, Jun 23, 2019)
• Overall, in addition to guiding treatment, such data enable physicians to provide the necessary genetic counselling. Thereby this work may open up new opportunities in medical science to handle genetic disorder diseases.
• Disease susceptibility can be predicted from changes in the most dominating amino acids.
• Understanding the complex interplay between genes and proteins requires integration of data from a wide variety of sources, i.e. gene expression, genetic linkage, protein interaction, and protein structure, among others. Thus, this database can become critical for the integration, representation and visualization of heterogeneous biomedical data. (G. N. N. Sultana, personal communication, Jun 23, 2019)
• Biotechnologically, such data might allow the development of new drugs for treatment and of tools/biomarkers for disease diagnosis.
• Identifying the relations among the most dominating amino acids in protein sequences can be pursued by focusing on how a protein leads to the heritable form of the respective disease. Research on understanding the normal function of genetically associated proteins in such diseases can thus help clarify the complex roles these proteins play in their respective disorders.
• In our work, we partitioned the whole amino acid sequence into subsequences of length ten to find association rules. This choice has the shortcoming of losing the support count of association rules at the borders of the windows. However, to keep the computational tasks tractable, we adopted partitions of length ten. This partitioning problem can be solved using overlapping windows. Another approach is the use of random partition windows. In this case, for each rule, the bias at the window borders is averaged out via the average support count over many random partitions, so that the bias can be approximately ignored. Due to computational costs, we do not consider these two solutions in this paper. In fact, there is a trade-off between fixed-length partitioning and the other two ways of partitioning. In future, we plan to test the performance under the two scenarios mentioned above.
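The border effect discussed in the last point, and the two suggested remedies, can be illustrated with a minimal sketch. It is not the procedure used in this work: all function names, the step size of the overlapping windows, the way random partitions are drawn (a random starting offset), and the toy sequence are assumptions made for illustration.

```python
import random

def fixed_windows(seq, size=10):
    """Non-overlapping windows of the given size (the last one may be shorter)."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def overlapping_windows(seq, size=10, step=5):
    """Windows of the given size whose start positions advance by `step`.

    Trailing residues that do not fit a step-aligned window are ignored here.
    """
    last_start = max(len(seq) - size, 0)
    return [seq[i:i + size] for i in range(0, last_start + 1, step)]

def support_count(windows, pattern):
    """Number of windows containing the contiguous pattern."""
    return sum(pattern in w for w in windows)

def random_offset_support(seq, pattern, size=10, trials=100, seed=0):
    """Average support over fixed partitions started at random offsets.

    One possible reading of 'random partition windows'; residues before
    the offset are dropped in each trial.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        offset = rng.randrange(size)
        total += support_count(fixed_windows(seq[offset:], size), pattern)
    return total / trials

# Toy sequence: "GT" sits at positions 9-10, straddling the first border.
seq = "AAAAAAAAA" + "GT" + "AAAAAAAAA"
fixed = support_count(fixed_windows(seq), "GT")
overlapped = support_count(overlapping_windows(seq), "GT")
averaged = random_offset_support(seq, "GT")
```

In this toy case the fixed partition misses the pattern entirely, whereas the overlapping windows recover it, at the cost of producing more windows to scan.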

Conclusion
Protein, being an integral part of every living organism, may cause critical genetic diseases if it is not folded properly. As amino acids are the building blocks of protein, the relationship among the dominating amino acids and the identification of their patterns is an important issue. This work focused on recognizing frequent patterns in the protein sequences of five complex protein misfolded genetic disorder diseases in humans, and the relationships among the dominating amino acids, using association rule mining. To do so, itemsets and association rules were generated from the protein sequences. These rules were further evaluated and sorted out with objective measuring tools so that only the strong and interesting patterns were obtained. Moreover, the proposed algorithm may also be used to identify patterns of amino acids from the proteins associated with other diseases. Patterns in protein sequences usually have functional, structural or family classification importance. Pattern identification can be used for predicting protein functions, protein fold (structure) recognition, protein family detection, multiple sequence alignment, etc. The patterns acquired from this work are quite impressive. In addition to the above usual applications, an improved capacity to identify the relations among the most dominating amino acids in disease-related protein sequences will have an immediate impact on the diagnosis, treatment, and prevention of protein misfolded diseases. Thereby this work may open up new opportunities in medical science to handle genetic disorder diseases.

Future work
In this work, only five protein misfolded diseases were studied. Moreover, the protein sequences of some of the diseases were relatively short. In future, more complex protein misfolded diseases associated with longer protein sequences may be considered for experimentation. In addition, this work used the Apriori algorithm as the pattern mining technique for association rule mining. As a newer method, the Fuzzy Association rule mining technique may be adopted to generate more reliable association rules, which can then be tested accordingly. In this work, the protein sequences were partitioned into subsequences of length 10. If the length of the subsequences is changed, the generated rules may also change. As such, rules can be generated with lengths of 10, 15, 20, ... and thereafter only the rules common to all the resulting lists can be retained. Rules generated in this way may have greater potential and validity.
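The idea of keeping only the rules common to several subsequence lengths can be sketched as a set intersection. The function name `common_rules` and the rule lists below are hypothetical placeholders used only to show the mechanics; they are not results of this work.

```python
def common_rules(rules_by_length):
    """Return the rules that appear for every subsequence length.

    `rules_by_length` maps a window length to an iterable of rules,
    each rule being an (antecedent, consequent) pair of strings.
    """
    rule_sets = [set(rules) for rules in rules_by_length.values()]
    return set.intersection(*rule_sets) if rule_sets else set()

# Hypothetical rule lists for three window lengths (illustrative only).
rules_by_length = {
    10: [("GT", "AN"), ("F", "D"), ("PI", "K")],
    15: [("GT", "AN"), ("PI", "K")],
    20: [("GT", "AN"), ("F", "D")],
}
stable = common_rules(rules_by_length)
```

Here only GT → AN survives, since it is the only rule present in all three lists. A stricter or looser policy (for example, requiring a rule to appear in a majority of the lists) would be an equally plausible design choice.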