Effect of the number of removed lymph nodes on prostate cancer recurrence and survival: evidence from an observational study

Background The aim of this article is to analyze the effect on biochemical recurrence and on overall survival of removing an extensive number of pelvic lymph nodes during prostate cancer surgery. The lack of evidence from randomized clinical trials to address this specific question has hampered the ability to determine the true effect of the number of nodes removed. Results Our analysis is based on a large observational study, and this can lead unadjusted estimates to be very sensitive to confounding bias due to the different prognosis of individuals. We assess the effect of the number of lymph nodes removed by means of an Inverse Probability Weighting adjustment based on a Poisson regression model, and by a Doubly-robust adjustment. Conclusions Our findings suggest that a large number of nodes removed is associated with a significant improvement in time to biochemical recurrence. However, it appears to have no impact on overall survival.


Background
In prostate cancer (PCa) studies it is still debatable whether more extensive pelvic lymph node dissections are associated with better oncological outcomes. Nowadays, no prospective randomized study aimed at assessing the benefit on cancer control of anatomically defined extended pelvic lymph node dissection (PLND) as compared to limited or no PLND is indeed currently available.
Recent literature [1] reports that extended PLND is associated with higher rates of node positive findings, as the probability of detecting lymph node invaded by prostate cancer invasion is directly related to the extent of PLND. Another retrospective trial found a statistically significant association between the extent of PLND and cancer specific survival. The hazard of cancer-related death was found to be significantly lower for higher numbers *Correspondence: chiara.gigliarano@uninsubria.it 1 University of Insubria, Varese, Italy Full list of author information is available at the end of the article of nodes removed [2]. The main limitation of this multicentre study is the lack of a homogeneous and standardized pathologic assessment of the removed lymph nodes. Given the long natural history of treated PCa, it is likely that if a potential beneficial effect of extended PLND existed, this is more likely to be detected in men with more aggressive cancer characteristics. Such hypothesis has been tested in more recent retrospective studies focusing on men with adverse pathological features. For example, Moschini et al. analyzed a large cohort of 1586 patients with locally advanced PCa treated with RP plus ePLND and found that a higher number of removed LNs was an independent predictor of increased cancer-specific survival, suggesting a potential therapeutic role of a more extensive PLNDs in this subgroup of patients [3].
Taken together, all of these data show that the impact of PLND as a curative treatment remains an open question. After the withdrawal of the only available prospective study reported on this topic [4], the scientific community is eagerly awaiting the results from the two ongoing randomized clinical trials comparing extended and limited PLND during RP in intermediate and highrisk PCa to provide definitive evidence to this highly controversial topic.
Overall, a general agreement exists that whenever PLND is indicated, this should be anatomically extended at least for diagnostic purposes [5,6]. Nevertheless, whether this approach improves patient outcomes is still unknown. In case of detection of positive nodes at PLND, patients may be offered different post-operative adjuvant treatments, such as: (i) radiotherapy, (ii) hormone therapy, (iii) a combination of both. Correct patient staging may help in tailoring the optimal post-operative patient management and thus indirectly improve patient outcomes. Therefore, the correct identification of men potentially harboring prostate cancer nodal metastases is crucial to optimize patient outcome. Moreover, assessing whether such extensive dissections are directly associated with improved cancer control rates using methodologically sound approaches, which can account for the effect of patient selection biases, is still an unmet need.

Methods
The aim of this article is to examine the effect on BCR and on all-causes t-month survival of removing different numbers of nodes during prostate cancer surgery. Our analysis is based on non-randomized observational study including 3046 PCa patients. In randomized studies the distributions of the patients characteristics are balanced across groups, so that groups are similar expect for the treatment. This constitutes a critical issue, since in nonrandomized (or observational) studies treatment exposure may be associated with covariates that are also associated with the survival function, and thus "treated" patients may differ from "untreated" ones. We specify that in the paper we refer to "treatment" meaning the removal of a given number of lymph nodes. In this framework the usual analysis based on unweighted Kaplan-Meier survival curves may be misleading due to confounding [7].
Interpreting differences in groups or effects of covariates in observational studies is a major debatable point, since it is tempting to attribute the differences to the lack of random treatment assignment. Indeed, the lack of randomization can lead observational effect estimates to be very sensitive to confounding bias due to the different prognosis of individuals between treatment groups. As a consequence, observational randomized discrepancies cannot be automatically attributed to randomization itself.
Thus, observational studies need adjustments for baseline confounders, for time varying confounding an for selection bias. Guidelines that rely on observational data due to the absence of randomized trials benefit when the analysis mimics the analysis of a hypothetical target trial. To this extent inverse probability weight (IPW) can be considered as an adjustment for pre-or postrandomization variables in observational studies. In particular, in this article we use IPW to emulate the random allocation of patients to treatment groups (defined by the number of nodes removed), within a prostate cancer observational cohort study. We consider only pretreatment variables for the adjustment.
The IPW approach, although being one of the most popular methods used for adjusting for confounding, is subject to some criticism. In particular, it tends to be sensitive to misspecification of the treatment assignment model. To overcome this problem we also exploit the Doubly robust method proposed by [8,9], which tends to be more robust against misspecification either of the treatment assignment model or of the survival model. In the Appendix we report on a simulation study aimed at illustrating and exploring such properties for large and moderate sample sizes.
The rest of this paper is structured as follows. We first describe the data, and show by means of some preliminary analysis how the number of nodes removed clearly affects the probability of missing a positive-node patient modelled as beta binomial distribution. We then analyze whether the removal of more lymph nodes during prostatectomy may improve patient time from surgery to BCR and OS through better treatment assignment due to a more precise nodal staging assessment. For that reason, our analysis is split into two parts: the first part of our "Results" section focuses on time to BCR, while the second part reports the results of the analyses on overall survival (OS). Concluding remarks are reported in the "Discussion and conclusion" section.

Data and patient classification
Our analysis is based on a prostate cancer dataset collected at San Raffaele Hospital (HSR) in Milan, Italy, between years 1991 and 2012. We consider a sample of 3046 patients with localized prostate cancer treated with radical prostatectomy and pelvic lymph node dissection.
The median follow-up is 38 months. During the entire follow-up period, 359 patients (12%) experienced BCR, while 84 patients (3%) died. Patients are aggregated in three groups according to the number of nodes removed during surgery: (i) 1-10 nodes, (ii) 11-20 nodes and (iii) more than 20 nodes removed. Preliminary descriptive statistics are shown in Table 1.
We note that almost half (45%) of the patients in the study belong to the second nodes group, with number of removed nodes between 11 and 20, and that the overall average number of nodes removed is 16.6. also reveals that patients with larger numbers of nodes removed are more likely to have higher T stage, to be slightly younger, to have higher PSA values as well as higher Gleason scores than patients with fewer nodes removed. Therefore the distribution of the covariates does not seem to be independent from the treatment (the number of nodes removed). Moreover, the average number of positive nodes for all patients is 0.4, a value that is doubled for the groups of patients with more than 20 nodes removed. Hence, there seems to exist a positive association between the number of nodes removed and the number of positive nodes detected: the higher the number of nodes removed the higher is the probability of detecting a positive node. For each i-th node-positive patient with number of removed nodes Nex i , let X i be the number of observed positive nodes among the given number Nex i of removed nodes, and p i be the unknown probability that an examined node is positive.
The idea is to estimate the probability of missing a positive node (thus classifying incorrectly a patient -false negative) for a node-positive patient, based on the total number of removed nodes (Nex) and the number of nodes found positive (x).
The model assumes that X follows, conditionally on p, a binomial distribution with Nex (assumed independent) trials and success probability p. To allow for individual heterogeneity, it also assumes that p comes from a beta distribution with parameters α and β. The marginal distribution of X will be, therefore, the Beta-binomial distribution with parameters α, β, Nex, with probability mass function with x = 0, 1, ..., Nex and where B(·, ·) is the Beta function. For each patient with Nex examined nodes it is then possible to easily estimate the probability of observing no positive nodes as This model relies on several assumptions which, although strong, are needed to ensure mathematical tractability. Among others, we are here assuming that our data do not contain incorrectly staged positive nodes, and that, within the same patient, all nodes have the same probability of being positive. For an extended discussion of the method and the reliability of those assumptions see Gönen et al. (2009).
We fitted a Beta-Binomial model to our data in R by maximization of the likelihood function, using the VGAM package v.1.0-4 [10]. The estimated parameters α and β led to the results showed in Fig. 1, together with a confidence band based on the endpoints of the marginal confidence intervals of the two parameters. The probability of not detecting the positive nodal status of a patient is plotted as a function of the number of nodes examined (Nex). The graph clearly shows that the probability of misclassifying node-positive patients decreases as the number of nodes removed increases. In particular, the probability of a false negative decreases quickly as Nex increases from 1 to 10, reaching a value around 0.3 for Nex=10. Starting from Nex=11, the curve becomes flatter. It reaches a value of 0.21 for Nex=15, and it remains between 0.2 and 0.1 with more than 15 removed nodes, with no further notable drop beyond that level. Considering also a clinical point of view, in terms of both feasibility of the procedure and quality of life of the patient that undergoes surgery, a removal of about 15 lymph nodes represents a good trade off to obtain a fair estimate about the real nodal status of a patient, while removing less than 10 lymph nodes may cause an excessively high false negative rate.

Time to biochemical recurrence analysis Unweighted survival analysis
We first perform an unweighted survival analysis, based on the Kaplan Meier estimation of the survival functions for each of the three node groups. Figure 2 plots the estimated survival curves for each group, while Table 2  tests for the difference in the survival functions between each pair of nodes groups. Both the graph and the tests indicate no differences among the three survival curves, thus suggesting that the number of nodes removed (Nex) does not seem to affect time to BCR. However, since the study is observational as we have mentioned, the unweighted Kaplan-Meier survival curves may provide misleading results due to confounding: the apparent absence of differences across the groups might be in part explained by differences in the composition of the patient groups. If the experiment were randomized, then the number of nodes removed (that is, the treatment) would have been distributed in the same way across the patients, regardless of their covariates (PSA, Gleason score, age). In the previous section we have seen that this is actually not the case, since the surgeon's decision was likely affected by these covariates. As more nodes are removed from patients with higher T stage, PSA and Gleason score, who also might be at higher risk of recurrence, the time to event distributions need to be adjusted for the effect of such confounding.

Inverse probability weighted analysis
A first possible method to adjust the analysis is based on the Inverse Probability Weights (IPW) adjustment, according to which each patient receives a weight that is inversely proportional to the estimated probability of having his number of nodes removed equal to the observed number; see, among others, [7,11]. The IPW is a particular type of the propensity score method, which is aimed at adjusting for confounding by weighting observations by the inverse of the estimated propensity scores, that are probabilities of exposure to the treatment received conditionally on the covariates; see, among others, [12][13][14][15]. We have estimated the weights by fitting a Poisson regression model to estimate each patient's probability of receiving any number of nodes removed. As covariates (Z) we have included age, PSA score, T-stage and Gleason score. We have also truncated the weights to a maximum of 35 (similarly to [16]). More in detail, the IPW adjustment consists of the following steps: • Step 1. Fit a Poisson regression model and compute the predicted probabilities of having the number of removed nodes be equal to the observed number given the covariates, P(Nex = x|Z i ), i = 1, . . . , n, so that n is the total number of patients. Then, sum over the three nodes groups to obtain the predicted probability of each node group g = 1, 2, or 3 and each covariate value: Step 2. Generate the Inverse Probability Weights (IPWs) for each patient as the inverse function of the corresponding predicted probability, which depends both on the number of nodes removed (g) and on the covariates: where g indicates the treatment group (1, 2, or 3). • Step 3. Fit three weighted Kaplan Meier estimators with weights w g (Z i ), separately for each nodes group: Step 4. Estimate the treatment effects in terms of difference in survival probabilities at time t among the three groups: • Step 5. Estimate confidence intervals for the treatment differences using a bootstrap technique, for each time t.
The estimates from the Poisson regression are shown in Table 3, from which we notice that all covariates have a significant impact on the number of nodes removed.
Using the Inverse Probability Weights, whose distribution is shown in Fig. 3, we obtain the weighted Kaplan-Meier survival estimates shown in Fig. 4.
The graph reveals how the time to BCR curves would have looked like, had each of the node groups had the same covariates distribution. In particular, time to BCR seems to improve as the number of removed nodes increases.
Comparing Figs. 4 to 2 in the previous section we note that the estimated survival function of the group of patients with more than 20 nodes removed is higher in the IPW-adjusted Kaplan Meier than in the unweighted analysis. Moreover, no difference emerges between the survival distribution of patients with fewer than 10 nodes and the distribution for patients with 11-20 nodes removed.
Hence, once adjusting for confounding, we can conclude that there is a significant positive effect on time to BCR of removing a higher number of nodes.

Doubly robust analysis
An alternative way for adjusting for the non-randomized nature of treatment is provided by the Doubly robust method; see, among others, [8,9,17]. The IPW method tends to be possibly sensitive to misspecification of the treatment assignment model, while the Doubly robust method tends to be more robust against misspecification either of the treatment assignment model P(Nex = x|Z) or of the survival estimate model S(t|Z, Nex, x).
Indeed, while one will never know whether the weights are correctly specified, if that is the case then the survival probabilities are estimated consistently by IPW, as long as the model for the outcome is also correctly specified. The parameter estimators of the model for outcome, on the other hand, are not guaranteed to be consistent if that corresponding model is not correctly specified. So, IPW methods require that both models be correctly specified. Doubly robust methods, on the other hand, are theoretically guaranteed to consistently estimate the true survival  probabilities even if one of the two models is wrongly specified (hence the name "robust"). In our setting, the survival estimates based on the Doubly robust method are consistent as long as either the Cox model or the weight (or both) models are correctly specified. As stated in [9], "In a causal inference model, an estimator is Doubly Robust (DR) if it remains consistent when either (but not necessarily both) a model for treatment assignment mechanism or a model for the distribution of the counterfactual data is correctly specified. " "With observational data one can never be sure that a model for the treatment assignment mechanism or a model for the counterfactual data is correct. " "DR estimators, in contrast with inverse probability weightedestimators, give the analyst two chances, instead of one, to make valid inference. " In that article, the authors also present the results of a simulation study that demonstrates the impressive finite sample performance of DR estimators in terms of finite sample efficiency and robustness. In [18], additional simulation based evidence is provided to quantify the theoretical properties of the Doubly Robust approach. The Doubly robust method can be summarized in the following steps: • Steps 1. and 2. Proceed analogously as in Step 1 and Step 2 of the IPW method. • Step 3. Fit three weighted Cox models, with weights w g (Z i ). In particular, we fit one model for each treatment group g = 1, 2, 3, where group 1 refers to 1-10 nodes, group 2 refers to 11-20 nodes, and group 3 to 21+ nodes. For detailed notation see also Section 2.2 of Bang and Robins [9]. • Step 4. Compute three predicted survival functions at month t for every patient one for each treatment group:S i (t|Group = g, Z). In this way each patient is assigned three estimated survivals, one factual and the other two counterfactual. • Step 5. For each group g, compute the average survival function at month t over all n patients: • Step 6. Estimate the effects in terms of differences in survival at time t among groups: Step 7. Estimate confidence intervals using bootstrap, for each time t.
Results are shown in Fig. 5. The graph confirms the results obtained with the IPW method, showing also in this case that the survival curve of the group of patients with highest number of removed nodes lies always above the survival curves of the other two groups, which are again identified as not being significantly different from each other.
Finally, in Table 4 we compare the three analyses that we have performed for estimating the group-specific time to BCR, focusing on t = 60, 90 and 120 months from surgery. The table shows that there is no difference across the three nodes groups in terms of probability of surviving more than 5 years (60 months) according to all estimation methods. This is also confirmed by the survival functions depicted in Figs. 2, 4 and 5. Differences between the high nodes group (21+ nodes) and the other groups start emerging from time t = 90, that corresponds to the probability of surviving more than 7.5 years from the surgery. Consistently with earlier results, these differences emerge only when considering the two weighted models.

Overall survival analysis
We now move from time to BCR to time to OS, and repeat the same analyses as for time to BCR. Results of the naive unweighted analysis are shown in Fig. 6 and Table 5. Figure 6 plots the estimated survival curves for each group, while Table 5 shows the p-values of the unadjusted log-rank tests for the difference in the survival functions among each pair of nodes groups. Also for time to OS, both the graph and the tests show that there are no differences across the three survival curves, thus revealing that the number of nodes removed (Nex) does not seem to affect OS.
We performed the weighted Kaplan-Meier estimation based on IPW as well as on the Doubly robust estimation; results are shown in Figs. 7 and 8, respectively. Differently from the analysis on time to BCR, in case of OS when we adjust for the presence of non-randomized study we do not see any significant differences; hence an increased number of nodes removed does not seem to significantly influence OS. This is also confirmed by looking at Table 6, which compares the three analysis performed for the OS at t = 60, 90 and 110 months from surgery.
After adjusting, no significant impact of the number of removed nodes on OS emerges for the three time points.

Discussion and conclusion
Extended pelvic lymph node dissection (PLND) has certainly a key staging role in Prostate Cancer (PCa). Assessing the relative benefits and burden of PLND for oncological and non-oncological outcomes in patients undergoing radical prostatectomy for PCa is still The table contains, for each estimation method: (i) the estimated survival probabilities at 60, 90, and 120 months for each of the three treatment groups; (ii) the pairwise differences between the estimated survival probabilities across the three treatment groups. All estimated quantities come with bootstrap 95% confidence intervals debatable. The issue of whether PLND may affect prostate cancer outcomes such as progression and survival has been an argument of extreme interest in the urologic community over the last decades. Indeed, the impact of PLND on cancer outcomes remains controversial [1] due to the lack of prospective, randomized trials. The authors of [19] found a statistically significant negative association between the number of removed lymph nodes and BCRfree survival in patients with no positive nodes found. However, there is currently no available study supporting its role with regards to oncological outcomes in men with clinically localized disease. Thus, our approach represents a novel perspective to evaluate -within an observational setting -the effect on biochemical recurrence and on allcauses t-month survival of removing different numbers of nodes, providing important evidence also from non randomized trials. We have shown how it is possible to assess the effect of the number of lymph nodes removed by means of an Inverse Probability Weighting adjustment (here, based on Poisson regression) and by a Doubly-robust adjustment method.
As a first analysis we had run traditional Cox models both for BCR and OS, with the number of nodes  Our use of a Poisson regression model was motivated by the very nature of the treatment of interest here, clearly a counting variable. Using a multinomial distribution would have forced (at most ordinal) categories onto that variable. That would be a natural choice in presence of multiple (unordered) treatments, when a fully categorical analysis is appropriate (see, e.g., Feng et al., [20]). In our setting, such a classification would also have led  to an increase in the number of parameters to be able to model all odds ratios with respect to the covariates potentially influencing the number of nodes extracted. Discretizing the problem to just a binary treatment outcome would have forced an even stronger classification of the treatment variable. So our choice was to allow for a parsimonious model for the number of nodes through the Poisson regression model, and to then consider (ordinal) groups for the estimation process. The reasons for such a two-step approach are the need to allow for easy interpretability and description of the results, while still keeping a moderate number of patients in each group for the estimation, while maintaining a modeling approach coherent with the nature of the underlying treatment variable. Here, "moderate number" is meant as a number of patients that allowed us to obtain informative 95% confidence intervals for the reported differences in survival probabilities across treatment groups. As a last comment, note that IPW could be very sensitive to the presence of extreme weights [16]. We truncated the weights to a maximum of 35, but we also performed the estimation with a truncation value of 20, with no apparent changes in the results. Our findings suggest that a large number of nodes removed may be associated with a significant improvement in the time to BCR but there is no detectable impact on OS. However, the lack of any benefit of the extent of pelvic lymph node dissection on patient OS could also be related to the relatively short median follow-up of our population.

Simulation study
In this Appendix we report on a simulation experiment designed to illustrate and explore the finite-sample and the large-sample properties of the estimation procedures that we have used.
The vector z i indicates the covariate vector for subject i, excluding the treatment group Tx i , and λ PM (z i ) indicates the mean parameter defined as log (λ PM (z i )) = γ 0 + γ 1 PSA i + γ 2 Age i +γ 31 1(Stage i = 2) + γ 32 1(Stage i = 3).
The propensity score models PM were defined as follows (in parentheses the values of the regression parameters):    Here, too, covariates absent from a specific model were multiplied by a coefficient equal to zero in the generation step, and not estimated in the estimation process. The generated survival times were right censored with an independent censoring variable distributed as a negative exponential random variable with parameter 1/500. Propensity scores for the three treatments (1-10, 11-20, and > 20 Nodes examined) were computed as described in the main text, i.e. by adding the appropriate terms of the Poisson probability mass function. The theoretical survival probabilities were computed as the average over the observed covariate distribution (fixed for each simulation) of the quantities e −t * λ(t|z i ) exp(−β 41 1(Tx i =2)−β 42 1(Tx i =3)) .
We estimated the survival probability at t * = 500, 1000, and 1500 days. For simplicity, below we only report the results for t * = 1000 days. The conclusions drawn from for the other two time points were analogous. Tables 7, 8 and 9 show the empirical root-MSE and Bias in estimating S(1000) with the naive Kaplan-Meier Both the propensity score model and the survival model used in the estimation process differ from the corresponding models used for data generation. Results are based on 3000 simulated samples. PM stands for Propensity Score Model, and SM stands for Survival Model. Please refer to the text for the description of the models estimator ("Unadjusted"), with the weighted nonparametric estimator ("Weighted"), and with the Doubly robust estimator ("Doubly Robust"). Simulations were based on 3000 samples, and for sample sizes 1000, 3000, and 10000. Data were simulated and estimated by using different combinations of models PM and SM for the generation ("Generation") and for the estimation ("Estimation") processes.
Results show that the naive estimator is always biased. On the other hand, when both models used for estimation match the models used to estimate the data (i.e. they are the correct models), then both the weighted and the Doubly robust methods estimate the survival probabilities properly, with Bias and MSEs that tend to zero as the sample size increases. When only the propensity score is correctly specified, then both estimators are known to be consistent, and this is confirmed by the results. If, on the other hand, only the survival model is correctly specified, then the weighted estimator is clearly biased (as expected), while the Doubly robust estimator still shows its consistency. Lastly, when both models are wrongly specified, both estimators are clearly shown to be biased. Interestingly, for the smaller sample size (1000) the MSE of the (biased) naive estimator is quite similar to that of the adjusted estimators, even when the latter two are known to be consistent. This implies that the former estimator has misleadingly smaller variance that the adjusted estimators. As sample size increases, as expected, the unadjusted estimator maintains a larger MSE than the other two estimators. Note that in the limit the MSE should become equal to the squared bias term (since the variance term tends to zero -recall that the MSE is equal to the sum of the variance of the estimator and the squared bias term). Indeed, for the n=10000 cases one can see that the absolute value of the Bias term of the unadjusted estimator approaches the value of the square root of the MSE.
For completeness, Table 10 shows the results of the three estimation procedures when they are applied, for each combination of propensity score and survival models, to just one sample of size n=10 million. That table  again confirms the impressions gained from Tables 7, 8 and 9, and described above.
Note: Simulations were performed using the seed 13794 in R version 3.3.2.