### Datasets

Four datasets were used in this work, including a training dataset, a validation dataset, and two test datasets. To construct the datasets, we first downloaded the *Human training dataset* and the *Human test dataset* from work of [11] as our training dataset and validation dataset separately. The noncoding RNAs in these two datasets were from Ensembl release 90 [25] (ftp://ftp.ensembl.org/pub/release-90/fasta/homo_sapiens/ncrna/) and the mRNAs were from NCBI RefSeq [26]. Next we downloaded the *New human test dataset* from work of [16] as our first independent test dataset, denoted as test dataset I. The noncoding RNAs in test dataset I were from Ensembl release 97 [25] but not overlapped with release 90, and the mRNAs were from NCBI RefSeq [26] released after 18th May 2018. Besides, we constructed the second independent test dataset, namely test dataset II. We collected the noncoding RNAs from Ensembl release 101 [25] and removed the samples which were overlapped with release 97. And we collected the mRNAs from NCBI RefSeq [26] released between 20th Jun. 2019 and 1st Oct. 2020. The search limitations for all mRNAs in above datasets remained the same as: *“Homo sapiens”[Organism] AND srcdb_refseq_known[prop] AND biomol_mrna[prop]*. We then filtered the above noncoding RNA datasets by only keeping the lncRNAs according to the transcript biotype annotation provided by the Ensembl database. Moreover, the sequences which contain letter other than *’ATGC’* were also removed. The final data amount for each dataset is listed in Table 4 and the dataset construction process is briefly demonstrated in Fig. 4a.

### Features

The feature representation methods we used in this work including: maximum ORF (open reading frame) length and coverage [9], nucleotide bias [16], k-mer, and k-merscore [27].

The longest ORF is often considered to be the coding region [28], hence the features related to the longest ORF are popular to be used in distinguishing mRNAs and lncRNAs. Maximum ORF length feature represents the length of the longest ORF in each RNA sequence, which is denoted as \(l_{maxORF}\). Maximum ORF coverage feature is obtained by dividing \(l_{maxORF}\) by the length of whole sequence \(l_{RNA}\). Because some lncRNAs do not have complete ORF, we represent the maximum ORF coverage for such samples as \([0,\frac{l_{maxORF}}{l_{RNA}}]\) while represent those with complete ORF as \([\frac{l_{maxORF}}{l_{RNA}},0]\). Nucleotide bias feature measures the information around start codon, as studies show that nucleotide around start codon can affect the regulation of translation initiation [29, 30]. Nucleotide bias feature is calculated as

$$\begin{aligned} \begin{aligned} nucleotide \ bias = \frac{1}{6}\sum _{i\in \{-3,-2,-1,4,5,6\}}log\frac{p_{mRNA}(x_{i})}{p_{lncRNA}(x_{i})},\\ x\in \{A,C,G,T\} \end{aligned} \end{aligned}$$

(1)

where \(p(x_{i})\) denotes the frequency of nucleotide *x* at position *i* in training data, and the set of \(\{-3,-2,-1,4,5,6\}\) refers to the positions of the first codon before and after the start codon. k-mer feature counts the frequency of *k* neighboring nucleotide into a vector of length \(4^{k}\). Here we count the 1-mer feature of the whole sequence and the 3-mer feature of the longest ORF sequence. k-merscore feature represents a relative k-mer bias, its calculation is similar to nucleotide bias:

$$\begin{aligned} kmer \,\, score=\frac{1}{4^{k}}\sum _{i=1}^{4^{k}}log\frac{M_{mRNA}(K_{i})}{M_{lncRNA}(K_{i})} \end{aligned}$$

(2)

where \(M_{mRNA}(K_{i})\) and \(M_{lncRNA}(K_{i})\) represents the mean value of k-mer composition for mRNA and lncRNA training data. Here we calculate the k-merscore feature with \(k=1-6\) for both the whole sequence and the longest ORF sequence. The feature extraction process is illustrated in Fig. 4b. After applying the above feature representation methods, each RNA sequence would be converted to a fix-length vector with length of 84 (Maximum ORF length (1) and coverage (2), nucleotide bias (1), 1-mer for whole sequence (\(4^{1}\)), 3-mer for the longest ORF sequence (\(4^{3}\)), and k-merscore feature with \(k=1-6\) for both whole and the longest ORF sequences (6+6)).

### Model structure

We propose Class Similarity Network to classify mRNAs and lncRNAs. The Class Similarity Network is developed on the basis of CNN but considers the relationships among input samples in a direct way. It employs the similar concept of SNN [17] to measure the similarity among input samples. But different from SNN, Class Similarity Network introduces loss function to measure the similarities in each channel, which specifically targets to a two-class classification problem: one for positive class and the other one for negative class. In this way, there are two inputs as references but one input to be contrast, hence it is not suitable to train the same parameters and weights for all subnetworks like what SNN does. To solve this, we design the network to learn the proper parameters itself to encode the input samples from different channels. Besides, different dense branches are integrated to measure the similarity to each class simultaneously. As shown in Fig. 4c, the Class Similarity Network comprises three modules: Class Similarity Measurement module, Fully Connected module, and Decision module. The details are described as follows.

#### Class similarity measurement module

We first measure the similarities between each input sample and a random sample from each class via the high-level features learned by the network, hence we name this module as the Class Similarity Measurement Module. Let \(\mathbf {X}_{i}\in \mathbb {R}^{d}, i=1,2,...,k\) denotes the input dataset whose sample size is *k* and feature size is *d*. In the training step of a two classes classification problem, we have a positive dataset \(\mathbf{X} _{pos}\) and a negative dataset \(\mathbf{X} _{neg}\), where \(\mathbf{X} _{pos}\cup \mathbf{X} _{neg}=\mathbf{X}\) and \(\mathbf{X} _{pos}\cap \mathbf{X} _{neg}=\emptyset\). We randomly resample from \(\mathbf{X} _{pos}\) and \(\mathbf{X} _{neg}\) with size of *k* separately and obtain two new sets \(\mathbf{X} _{pos}^{'}\) and \(\mathbf{X} _{neg}^{'}\). As shown in Fig. 4c, the Class Similarity Measurement Block has three inputs: \(\mathbf{X} _{pos}^{'}\), \(\mathbf{X}\), and \(\mathbf{X} _{neg}^{'}\), the samples in each of them will go though *l* 1D convolutional layers and output as \(Conv_{l}(\mathbf{x} _{pos}^{'})\), \(Conv_{l}(\mathbf{x} )\), and \(Conv_{l}(\mathbf{x} _{neg}^{'})\), respectively. The convolutional layers here convert the raw features to high-level features. With the high-level features, the similarities between input samples and positive samples, and the similarities between input samples and negative samples can be represented as:

$$\begin{aligned} f_{ps}(\mathbf{x} )= & {} Conv_{l}(\mathbf{x} )-Conv_{l}(\mathbf{x} _{pos}^{'}) \end{aligned}$$

(3)

$$\begin{aligned} f_{ns}(\mathbf{x} )= & {} Conv_{l}(\mathbf{x} )-Conv_{l}(\mathbf{x} _{neg}^{'}) \end{aligned}$$

(4)

The selections of \(\mathbf{X} _{pos}^{'}\) and \(\mathbf{X} _{neg}^{'}\) in the test process are different from that of the training. Rather than using the randomly resampled vectors, we use the mean value of \(\mathbf{X} _{pos}\) and \(\mathbf{X} _{neg}\) from training. Such practice can avoid the potential slightly different prediction results caused by using different resampled datasets.

#### Fully connected module

Taken the \(\mathbf{x} _{ps}=f_{ps}(\mathbf{x })\) and \(\mathbf{x} _{ns}=f_{ns}(\mathbf{x })\) obtained from Class Similarity Measurement module as the inputs, the Fully Connected module converts each of them to a single value which represents the similarity with positive or negative samples. We represent \(\mathbf{x} _{ps}\) with dropout ratio \(r_{1}\) as \(\mathbf {x}_{ps}^{r_{1}}\) and with dropout ratio \(r_{2}\) as \(\mathbf {x}_{ps}^{r_{2}}\). From Fig. 4c, we have three dense layers in the Fully Connected module, where \(z_{2}\) is obtained from \(\mathbf {x}_{ps}^{r_{1}}\) going through two dense layers via \(\mathbf {z_{1}}\), and \(z_{3}\) is obtained from \(\mathbf {x}_{ps}^{r_{2}}\) with one dense layer. The \(\mathbf {z_{1}}\), \(z_{2}\) and \(z_{3}\) are obtained as follows:

$$\begin{aligned} \mathbf {z_{1}}= & {} \sigma _{1}\left( \mathbf {W}_{1j_{1}}^{T}\mathbf {x}_{ps}^{r_{1}}+\mathbf {b}_{1j_{1}}\right) \end{aligned}$$

(5)

$$\begin{aligned} z_{2}= & {} \sigma _{2}\left( \mathbf {W}_{2j_{2}}^{T}\mathbf {z_{1}}+\mathbf {b}_{2j_{2}}\right) \end{aligned}$$

(6)

$$\begin{aligned} z_{3}= & {} \sigma _{3}\left( \mathbf {W}_{3j_{3}}^{T}\mathbf {x}_{ps}^{r_{2}}+\mathbf {b}_{3j_{3}}\right) \end{aligned}$$

(7)

where \(\mathbf {W}_{j}\) denotes the weight matrix, \(\mathbf {b}_{j}\) denotes the bias vector, and \(\sigma (\cdot )\) denotes the activation function. Here we take \(\sigma _{1}\) as the ReLU activation function, and take \(\sigma _{2}\) and \(\sigma _{3}\) as the Sigmoid activation function, and we set the dropout ratio \(r_{1}\) and \(r_{2}\) as 0.2 and 0.5 separately. Therefore, the positive similarity node is represented as:

$$\begin{aligned} y_{ps}=z_{2}+z_{3} \end{aligned}$$

(8)

Similarly, the negative similarity node \(y_{ns}\) can be obtained by taking \(\mathbf{x} _{ns}\) as the input.

#### Decision module

For each input sample, the predicted target \(\hat{\mathbf{y }}\) is obtained by concatenating the positive similarity node and the negative similarity node as \(\hat{\mathbf{y }}=[y_{ps},y_{ns}]\). Given the target \(\mathbf{y}\), our goal is to minimize \(\left\| \mathbf {y}-\mathbf {\hat{y}} \right\|\). If the positive target is represented as \(\mathbf{y} =[1,0]\) and the negative target is represented as \(\mathbf{y} =[0,1]\), the values of \(y_{ps}\) and \(y_{ns}\) represent the similarity to positive and negative samples separately; if the positive target is represented as \(\mathbf{y} =[0,1]\) and the negative target is represented as \(\mathbf{y} =[1,0]\), the values of \(y_{ps}\) and \(y_{ns}\) represent the difference to positive and negative samples separately. Here we use Mean squared error as the loss function, hence our goal is to learn parameters \(\theta\) such that:

$$\begin{aligned} arg \ minimise_{\theta } \ \frac{\sum _{i}^{N}\left( y_{i}-\hat{y}_{\theta | i}\right) ^{2}}{N} \end{aligned}$$

(9)

#### Why class similarity network?

Let *C* denotes the cost function, *z* denotes the output of the dense layer, and *a* denotes the input of the dense layer. In neural networks, the forward pass will calculate the output of the *l*th dense layer as:

$$\begin{aligned} z^{l}=\sigma \left( \mathbf {\omega } ^{l}\mathbf {a}^{l-1}+\mathbf {b}^{l}\right) \end{aligned}$$

(10)

The weight \({\varvec{\omega }}\) and bias *b* will be updated at the backpropagation step as:

$$\begin{aligned} \omega _{jk}^{l}= & {} \omega _{jk}^{l}-\eta \frac{\partial C}{\partial \omega _{jk}^{l}} \end{aligned}$$

(11)

$$\begin{aligned} b _{jk}^{l}= & {} b _{jk}^{l}-\eta \frac{\partial C}{\partial b_{jk}^{l}} \end{aligned}$$

(12)

with learning rate \(\eta\). Therefore, the input samples are only related to each other through the cost function and their relationships are only considered in an indirect way. However, in our newly proposed Class Similarity network, the input of the dense layer actually is the differences of high-level features between two samples, i.e. \(\Delta \mathbf {a}\), and \(C \sim \mathbf {\omega } \Delta \mathbf {a}+\mathbf {b}\). In this way, the differences between two input samples are amplified, and the relationships between different input samples are taken into consideration in a direct way when training the network.

Besides, Class Similarity network introduces filters specified to each class and subtracts the high-level features rather than training the differences of raw features directly. Such practice could help the network to find the proper input values for parameter adjustment in Fully Connected module.

### Evaluation metrics

We use the criteria of Accuracy (Acc), Sensitivity (Sn), Specificity (Sp), Precision (Pre), and F1-score to evaluate different prediction methods, which are calculated as follows:

$$\begin{aligned} Sp= & {} \frac{TN}{TN+FP} \end{aligned}$$

(13)

$$\begin{aligned} Sn= & {} \frac{TP}{TP+FN} \end{aligned}$$

(14)

$$\begin{aligned} Acc= & {} \frac{TP+TN}{TP+FN+TN+FP} \end{aligned}$$

(15)

$$\begin{aligned} Pre= & {} \frac{TP}{TP+FP} \end{aligned}$$

(16)

$$\begin{aligned} F1\text {-}score= & {} \frac{2\times Pre\times Sn}{Pre+Sn} \end{aligned}$$

(17)

where TP, FN, TN and FP denote the numbers of true positive, false negative, true negative, and false positive, respectively. Besides, the McNemar test is further adopted to compare the models, which is implemented by using the *mcnemar* function in python package *statsmodels.stats.contingency_tables* with continuity corrected \(\chi ^{2}\) distribution.