Method overview
The method described here uses three data sets from the Zebrafish Information Network as input to a linear regression model to predict the number of gene expression experiments per gene. Figure 7 provides a flow chart of the steps taken from data input through model output.
Data files
Three data files were combined to build the predictive model. All three are provided as supplementary files to this manuscript. The MachineLearningReport.txt file (Additional file 2) is a custom report consisting of one row per gene in the ZFIN database, generated on Nov. 29, 2016. Data columns included the ZDB-GENE ID, gene symbol, gene name, count of gene expression experiments, count of journal publications attributed for gene expression annotations, count of Gene Ontology annotations, and count of journal publications attributed for Gene Ontology annotations. The columns related to the Gene Ontology had no value for predicting the number of gene expression experiments, so they were excluded from further analysis.
The GenePublication.txt (Additional file 3) and ConstructComponents.txt (Additional file 4) files are generated daily at ZFIN and made available via the ZFIN downloads page (https://zfin.org/downloads). The GenePublication.txt file was obtained on Nov. 30, 2016. Its columns were gene symbol, ZDB-GENE ID, ZDB-PUB ID, publication type, and PubMed ID when available. The ConstructComponents.txt file was obtained on Dec. 19, 2016 and included columns for the ZFIN construct ID, construct name, construct type, related gene ZDB-GENE ID, related gene symbol, related gene type, the relationship between the gene and the construct, and two ontology term IDs from the Sequence Ontology [16] specifying the type of construct and the type of related marker. For this study, the only data used from the ConstructComponents.txt file were the counts of constructs related to each gene.
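The per-gene construct count can be sketched as follows. This is a minimal illustration, not the procedure used in Azure Machine Learning Studio. It assumes a tab-delimited file with no header row and columns in the order listed above; the column positions, the choice to count each construct once per gene, and the IDs in the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Assumed 0-based column positions, following the column order described above.
CONSTRUCT_ID_COL = 0
GENE_ID_COL = 3


def count_constructs_per_gene(lines):
    """Return {gene ID: number of distinct constructs related to that gene}."""
    constructs_by_gene = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        # Skip rows that are too short or have no related gene ID.
        if len(row) <= GENE_ID_COL or not row[GENE_ID_COL]:
            continue
        constructs_by_gene[row[GENE_ID_COL]].add(row[CONSTRUCT_ID_COL])
    return {gene: len(ids) for gene, ids in constructs_by_gene.items()}


# Made-up rows with placeholder IDs, for illustration only.
sample_file = io.StringIO(
    "CONSTRUCT-A\tname a\ttype\tGENE-1\tg1\n"
    "CONSTRUCT-A\tname a\ttype\tGENE-2\tg2\n"
    "CONSTRUCT-B\tname b\ttype\tGENE-1\tg1\n"
    "CONSTRUCT-B\tname b\ttype\tGENE-1\tg1\n"  # repeated pair, counted once
)
construct_counts = count_constructs_per_gene(sample_file)
```

Genes that never appear in the file get no entry here; in the joined data set they would have a null count, which is later set to 0.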
Data preparation and modeling
Manipulations of input data files, feature selection and engineering, model building, training, evaluation, model selection, and final model scoring were all done using modules provided in Microsoft Azure Machine Learning Studio (https://studio.azureml.net) using a free workspace level account. Features per gene used to train and test the linear regression model included the gene symbol, the number of journal publications attributed for gene expression, the number of gene expression experiments (the label), total number of journal publications, the percentage of journal publications with curated expression data, and the number of transgenic constructs associated with each gene.
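The derived percentage feature listed above can be illustrated with a short sketch. It assumes the percentage is expressed on a 0 to 100 scale and that a gene with no publications is assigned 0; neither detail is stated in the text. The function names, dictionary keys, and example values are hypothetical.

```python
def percent_pubs_with_expression(expression_pubs, total_pubs):
    """Percentage (0-100) of a gene's publications with curated expression data."""
    if total_pubs <= 0:
        # Assumption: avoid division by zero by returning 0 for genes
        # that have no associated publications.
        return 0.0
    return 100.0 * expression_pubs / total_pubs


def build_feature_row(symbol, expression_pubs, expression_experiments,
                      total_pubs, construct_count):
    """Assemble the per-gene features and the label into one record."""
    return {
        "symbol": symbol,
        "expression_pubs": expression_pubs,
        "total_pubs": total_pubs,
        "percent_expression_pubs": percent_pubs_with_expression(
            expression_pubs, total_pubs),
        "construct_count": construct_count,
        # The label the regression model is trained to predict.
        "expression_experiments": expression_experiments,
    }


# Made-up values for a placeholder gene.
example_row = build_feature_row("gene1", expression_pubs=3,
                                expression_experiments=14,
                                total_pubs=12, construct_count=2)
```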
The set of all gene records in the ZFIN database (36,655 genes as of Nov. 29, 2016) was filtered to exclude genes that were unlikely to be useful in this analysis, including withdrawn genes, microRNA genes, genes with a colon in the name (typically not yet studied), genes with symbols starting with “unm_” (typically not yet studied), and genes with no associated journal publications as determined by data from the GenePublication.txt file. Genes with more than 200 existing expression experiments were also excluded for three reasons: they are already heavily annotated for gene expression; many were found to be anatomical marker genes of less interest for the purposes of this work (e.g., egr2b); and their heavy annotation may give them undesirable leverage that could negatively affect model performance for genes of interest, which may have few annotations. The excluded genes with more than 200 expression experiments are shown with red symbols in Fig. 3. The resulting gene set used as input for model training and testing included 9870 genes.
Any null numeric values generated in the data during file joining were set to 0 using the Azure Machine Learning (AML) Clean Missing Data module, and no duplicate rows were present. A stratified split keyed on the expression experiment count was used in the Split Data module to select 25% of the genes (2483 genes) for training the model and 75% (7387 genes) for scoring the model. The Linear Regression, Train Model, and Score Model modules were used to train and score the model. The Linear Regression module used the following parameters: Solution method: ordinary least squares; L2 regularization weight: 10; Include intercept term: unchecked; Allow unknown categorical levels: checked; Random number seed: 112. Model performance was assessed using the AML Evaluate Model module. The trained model was used to predict the number of expression experiments for the 7387 genes that were not used in model training. The resulting prediction was appended as a new column to the input data set.
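The exclusion rules and the regression settings described above can be sketched in plain Python. This is an illustration under stated assumptions, not a reproduction of the Azure modules: the `is_withdrawn` and `is_microrna` flags are hypothetical inputs, and the fit uses the textbook least-squares objective with an L2 penalty and no intercept, solved through the normal equations. How Azure Machine Learning Studio scales its L2 regularization weight internally is not confirmed here, so the coefficients need not match.

```python
def keep_gene(name, symbol, is_withdrawn, is_microrna,
              pub_count, expression_experiments):
    """Return True if a gene passes the exclusion rules described in the text."""
    if is_withdrawn or is_microrna:  # hypothetical flags
        return False
    if ":" in name or symbol.startswith("unm_"):
        return False
    if pub_count == 0:
        return False
    return expression_experiments <= 200


def fit_ridge_no_intercept(X, y, l2_weight):
    """Solve (X^T X + l2_weight * I) w = X^T y; assumes l2_weight > 0."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (l2_weight if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / A[i][i]
    return w


def predict(weights, row):
    """Predicted label for one feature row (no intercept term)."""
    return sum(w * x for w, x in zip(weights, row))


# Toy data, not taken from the study.
X_train = [[1.0, 0.0], [0.0, 1.0]]
y_train = [2.0, 4.0]
weights = fit_ridge_no_intercept(X_train, y_train, l2_weight=1.0)
```

In this toy case the penalty shrinks the unregularized solution (2, 4) toward zero, which is the expected effect of an L2 term.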
Analysis and data visualizations
Model results for the 7387 genes, including the input data plus the predicted number of expression experiments, were exported from Azure Machine Learning Studio as a tab-delimited file and imported into Microsoft Excel for Mac v16.27 for data validation and analyses (Additional file 5). Residuals were calculated as the actual expression experiment count minus the number of expression experiments predicted by the model. The 95% confidence interval (CI) of the model, computed as 2 times the root mean squared error (RMSE), was used to establish significance of the residuals. Genes with residuals outside or inside the 95% CI were then considered as being predicted to be missing or not missing expression annotation, respectively. One hundred genes inside and outside the negative 95% CI were randomly selected for manual testing by sorting the genes in the Excel spreadsheet based on a randomly generated number column and copying the first genes from each set into a new Excel sheet. To remain blinded during the evaluation step, those genes were randomized again as a set by sorting based on a randomly generated number column. That gene selection process was repeated for 50 genes inside and outside the positive 95% CI.
A manual evaluation was then done for each journal publication not already curated for expression data that was associated with each of the selected genes. The publications for each gene were sorted oldest to newest based on publication date and were then evaluated in order, starting with the oldest publications. Publication assessment for each gene continued until either all the publications were examined for a gene or a publication with missing expression data for that gene was identified, whichever came first. The result was recorded along with the assessment date and the ZDB-PUB ID for the publication that was missing the expression data, if one was found. The results of this data validation were used to produce a confusion matrix describing model precision and recall around the upper or lower 95% CI.
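The residual thresholding and random selection steps described above were carried out in Excel; the same logic can be sketched in Python. This is a minimal sketch with assumptions: the RMSE is computed directly from the supplied values, whereas the study took it from the model evaluation, and the function names, the seed, and the toy numbers are hypothetical.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def classify_residuals(actual, predicted):
    """Label each gene by where its residual falls relative to +/- 2 * RMSE."""
    limit = 2 * rmse(actual, predicted)
    labels = []
    for a, p in zip(actual, predicted):
        residual = a - p  # actual minus predicted, as in the text
        if residual < -limit:
            labels.append("below")   # outside the negative bound
        elif residual > limit:
            labels.append("above")   # outside the positive bound
        else:
            labels.append("inside")
    return limit, labels


def random_subset(genes, n, seed):
    """Mimic sorting on a random-number column and taking the first n rows."""
    rng = random.Random(seed)
    keyed = [(rng.random(), gene) for gene in genes]
    keyed.sort(key=lambda pair: pair[0])
    return [gene for _, gene in keyed[:n]]


# Toy values: nine genes predicted exactly, one over-predicted by 10.
actual_counts = [5] * 9 + [0]
predicted_counts = [5] * 9 + [10]
limit, labels = classify_residuals(actual_counts, predicted_counts)
picked = random_subset(["g1", "g2", "g3", "g4", "g5", "g6"], n=3, seed=1)
```

A negative residual means the model predicted more experiments than are recorded, which is the case the study treats as a candidate for missing annotation.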
Publication records in ZFIN each have a unique ZDB-PUB ID, for example ZDB-PUB-161203-17. The first six digits indicate the date the record was created in YYMMDD format. Those data were parsed out of the list of IDs for publications that were recorded as containing uncurated gene expression data. The year component was then used to group those data to get a count of the number of genes per year that were found to have uncurated gene expression data. Even though it was the year of publication entry into ZFIN that was being counted, only the first paper encountered with uncurated expression data was recorded per gene, so the count is equal to the number of genes in the sample having uncurated expression data from each year.
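Extracting the entry year from a ZDB-PUB ID and counting genes per year can be sketched as below. The ID layout follows the example given above. The century rule (two-digit years of 90 or greater are read as 19YY, all others as 20YY) is an assumption made for this illustration, and every ID other than ZDB-PUB-161203-17 is a made-up placeholder.

```python
import re
from collections import Counter

# ZDB-PUB-YYMMDD-N, as in the example ID given in the text.
PUB_ID_PATTERN = re.compile(r"^ZDB-PUB-(\d{2})(\d{2})(\d{2})-\d+$")


def entry_year(pub_id, pivot=90):
    """Return the four-digit year in which the publication record was created."""
    match = PUB_ID_PATTERN.match(pub_id)
    if match is None:
        raise ValueError(f"not a ZDB-PUB ID: {pub_id!r}")
    yy = int(match.group(1))
    # Assumed century rule: yy >= pivot means 19yy, otherwise 20yy.
    return (1900 if yy >= pivot else 2000) + yy


def genes_per_year(pub_ids):
    """Count IDs per entry year; with one recorded ID per gene, this is genes per year."""
    return Counter(entry_year(pub_id) for pub_id in pub_ids)


# One recorded publication per gene; placeholder IDs except the first.
recorded_ids = ["ZDB-PUB-161203-17", "ZDB-PUB-160101-1", "ZDB-PUB-990507-3"]
year_counts = genes_per_year(recorded_ids)
```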
Data visualizations were created using both Excel and Tableau Desktop Professional Edition v10.1.4.