Introduction

Breast Cancer is one of the main causes of death in Europe, the USA, and China. The number of new cases each year in Europe is about 92.2 women every 100,000 women. The mortality rate in Europe is 23.1 women every 100,0001 (https://encr.eu/sites/default/files/factsheets/ENCR_Factsheet_Breast_2014.pdf).

It is estimated that for 2020 in the US the expected number of new cases of Breast Cancer (BC) in female patients is about 276,000 (30% of all new tumor cases in female patients) and the expected number of deaths caused by Breast Cancer in female patients is about 42,000 (15% of all deaths due to tumors in female patients), thus making BC the first type of cancer for the number of new cases, and the second type of cancer as the cause of death2 in female patients. Similar rankings are observed in Europe3 and China4.

Primary cancer treatment for new cases of BC is surgery (of various types), followed by adjuvant therapies (see e.g.: https://www.gov.uk/government/publications/chemotherapy-radiotherapy-and-surgical-tumour-resections-in-england/chemotherapy-radiotherapy-and-surgical-tumour-resections-in-england). For a patient affected by breast cancer, after tumor removal, it is necessary to decide which adjuvant therapy can prevent the tumor relapse and the formation of metastases. To this effect, a series of measurements of several parameters (clinical, histological, molecular) are collected and evaluated by experts with the help of guidelines.

Conventional clinical–pathological parameters have been used since the definition of the first cancer staging systems in 19465 up to the recent St. Gallen Consensus6 to select patients eligible for adjuvant treatment following BC surgery, thus helping in avoidance of unnecessary cytotoxic treatments. The high social and personal cost of chemotherapy and the evidence of over-prescription with the standard methodologies7, fueled the search for scientific and technological advances in this area, that could impact clinical practice.

The need for better prognosis and prediction of therapy results has led to substantial research in alternative bio-markers based on BC molecular profiling, and novel prediction models and algorithms, that could overcome intrinsic limitations of previous approaches. In particular high-throughput sequencing technologies have been key enablers for the success of this new approach, as well as the efforts for systematic collection of molecular data.

At this moment prognostic tools based on molecular biomarkers are considered valid clinical decision support tools, complementing traditional histopathology (see e.g. the Mammaprint and Oncotype DX tests)8.

Prognostic molecular tests are cost-effective versus the cost of chemotherapy for patients who would not eventually benefit from it. They are considered complementary to histology-based more traditional methods (e.g. TNM staging).

Van’t Veer and her co-authors9 describe a panel of 70 mRNA biomarkers for breast cancer predicting survival after 5 years from breast cancer surgery. This panel is the basis for the Mammaprint test, which after several clinical trials, has been approved by regulatory agencies in the USA and Europe for clinical use.

Paik et al.10 proposed a panel of 16 genes (plus 5 control genes) whose expression level is the basis for computing a score that allows classifying patients into low, medium, and high risk of relapse within 5 years after surgery. This panel is commercialized as Oncotype DX and it has been validated in the clinical trial TAILORx11. In published data, the intermediate class, which is rather neutral for clinical decisions, covers 30% of the patients in the testing cohort. Other methods for multigene based prognosis of breast cancer are covered in a survey by Győrffy et al.12.

In this paper, we describe a novel machine learning (ML) supervised classification method and we apply it to the task of producing prognostic predictions of survival at 5 years for BC patients using gene expression levels measured from the samples of the tumor surgically removed. The prediction method is conditional on the type of post-operative adjunct therapy selected for the patient. Data from a cohort of about 2000 patients available through the Metabric consortium13 are used to train, validate and test the prognostic predictor, and they indicate competitive performances compared to state-of-the-art methods. See “Methods” section (Tables 10, 11) for basic statistics of the main features of the populations used for training, validation, and testing.

Survival analysis aims at modeling and estimate complex objects like the survival function, or the hazard function that give deep insight into the expected survival time (as a continuous function). Here we aim at a more restricted type of result, where survival is dichotomized into low-risk and high-risk classes, with a threshold set at 5 years. The 5 years threshold is a common benchmark in Breast Cancer studies, however, as BC patients often experience a long survival time, also a 10 years benchmark is commonly found in the clinical and epidemiological literature.

Coherent Voting Network (CVN) is a supervised learning paradigm designed explicitly to uncover non-linear, combinatorial patterns in complex data, within a statistically robust framework. Breast Cancer patients after surgery may receive several types of post-surgery adjuvant therapeutic regimen (endocrine, radiotherapy or chemotherapy, and combinations thereof) aiming at reducing relapse and the formation of metastases, and thus favoring long term survival. We wish to predict the outcome of adjuvant therapy using just small molecular fingerprints (mRNA) of the patient’s transcriptome. We aim at simultaneous high scores for PPV (positive predictive value) and NPV (negative predictive value) as these are important indices for the final clinical applications of the predictor. A Training-validate-test protocol is applied onto CVN built on patient data from the Metabric Consortium (about 2000 patients).

The performance in tests is at the state-of-the-art for several BC cancer sub-types and it is remarkable for the subclasses: TNBC, Her2+, and Luminal B. The effectiveness of the selected fingerprints is confirmed also on several independent data sets (for a total of 601 patients) from the NCBI Gene Expression Omnibus (GEO).

This article is organized as follows. In “Results”, “Therapy classes” and “Secondary stratifications”, we describe the main results in the application of the CVN-based prognostic predictor on Metabric data. In “Comparison of CVN with other ML classification methods” we compare the CVN-based prognostic predictor against other state-of-the-art ML methods using the Autoweka package. In “Performance of CVN on independent cohorts of patients”, we apply the molecular fingerprints derived for Metabric to several independent cohorts of patients. In “Discussion”, we place our results in the context of the currently known results and we comment on strong and weak points of the proposed method, as well as on possible extensions. In “Methods” we give a high-level description of the CVN method and of the data preprocessing, while more details are in the Supplementary Materials.

Results

Therapy classes

Patients after surgery may or may not follow one of the following adjuvant therapies: chemotherapy, radiation therapy, and hormone therapy (also called endocrine therapy), which are reported in Metabric annotations. There are thus 8 possible combinations of three therapies. For each therapy profile, we repeat the training-validate-testing procedure to obtain 8 therapy-specific gene sub-panels and prediction performance estimates (primary stratification) (see Supplementary materials 1 for a self-contained recollection of the performance measures used in this context). Table 1 reports 5 therapy classes for which Metabric data are sufficiently numerous to estimate the statistical significance of the predicted performance indices, and the automatic hyper-parameter/feature selection optimization converges.

Table 1 Performance of therapy-based stratification.

The number of genes in each fingerprint for the therapy classes ranges from a minimum of 5 to a maximum of 17, with an average of 9.875. Overall 78 distinct genes are used. The selected fingerprints hardy overlap with previously known fingerprints (see Supplementary materials 1).

Secondary stratifications

Starting from the 5 sub-panels based on the therapy classes (primary stratification), it is possible to define stratifications based on different features (secondary stratification) of the patient. The secondary stratifications do not change the prediction of any single patient but provide a different evaluation of the quality of the prediction. We take into consideration ER status as measured by IHC (Table 2), Intrinsic Type (Table 3), ER/HER2 classification (Table 4), Tumor stage (Table 5), Tumor grade (Table 6), and Lymph node state (Table 7).

Table 2 Secondary stratification by ER status.
Table 3 Secondary stratification by intrinsic status.
Table 4 Secondary stratification by 3 genes status.
Table 5 Secondary stratification by tumor stage.
Table 6 Secondary stratification by tumor grade.
Table 7 Secondary stratification by lymph node status.
Figure 1
figure 1

Stratification by hormonal type: ER−/Her2−.

Figure 2
figure 2

Stratification by hormonal type: Her2+.

Figure 3
figure 3

Stratification by intrinsic type: Luminal B.

Figure 4
figure 4

Stratification lymph node status: positive.

Figure 5
figure 5

Stratification lymph node status: negative.

Here we highlight some of the tabled results. For the testing pool of 82 lymph node positive patients, we obtain PPV 0.77 and NPV 0.78 (odds ratio 11.50); for the pool of 61 lymph node negative patients, we obtain PPV 0.68 and NPV 0.88 (odds ratio 16.07). Improved results are obtained on some specific subtypes of BC. For the testing pool of 16 TNBC patients, we obtain PPV 1.0 and NPV 0.83 (odds ratio 45.00). For the testing pool of 18 HER2+ patients, we obtain PPV 0.91 and NPV 1.0 (odds ratio 40.00). For the testing pool of 41 Luminal B patients, we obtain PPV 0.75 and NPV 0.95 (odds ratio 60.00). The PPV and NPV results should be seen in the context of the prevalence (the proportion of the population with high vs low risk) in the test sets used. Kaplan–Meier plots for notable subclasses are shown in Figs. 1, 2, 3, 4 and 5. Kaplan–Meier plots for all the secondary stratifications are shown in the Supplementary Materials 1.

Comparison of CVN with other ML classification methods

To compare our algorithmic solution with the state-of-the-art in machine learning, we performed experiments with the Autoweka package14,15 within the Weka workbench environment16,17. Autoweka performs automatically feature selection and hyper-parameter optimization of 27 base classification methods, 10 meta-methods, and two ensemble methods, moreover it uses several feature selection search methods along with 8 feature evaluation functions. The hyper-parameters are optimized in Autoweka using a Bayesian optimization strategy to explore the space of parameters. The Autoweka package includes also the Ridge regression method (a form of simple penalized logistic regression ) often used in survival analysis. The ridge parameter is optimized by Autoweka in the range of values from 1e\(-7\) to 10.

As we noticed that the initial feature selection phase is onerous when applied to the input of roughly 24,000 genes, we also applied explicitly several Weka feature selection pre-filters so to reduce the number of features in the input to Autoweka. Autoweka uses ten-fold cross-validation over the training set to select the best configuration of hyper-parameters. We fixed the kappa statistics as the objective function to be maximized in the learning phase (see Supplementary Materials 1). The reported kappa statistics are computed on the trained predictor for the test data set.

Table 8 reports the kappa statistics for the best Autoweka trained classifiers (in round brackets) along with the result we obtain with the coherent voting networks over the test set. For CVN next to the kappa statistics we report the lookahead number or, in two cases, the manual selection of the best configuration. Ignoring for the moment the two manually selected configurations, we notice that we can get the highest kappa values in three of the six remaining cases. We also notice for CVN a robust uniform behavior with consistent high positive values of kappa. In all columns (except corr-ranker) there are negative entries indicating that the best ML method for that input has performed worse than a random classifier. For the corr-ranker feature selection the ML methods have all positive values, but generally lower than those of CVN. Moreover, the best Autoweka results are attained by 15 different methods thus making it hard to pinpoint a single winner algorithm in the Autoweka suite.

Overall the setup experimental conditions for the Autoweka and CVN differ in some aspects, therefore the findings must be considered with care. Keeping these differences in mind, we can conclude that CVN has a level of performance at least comparable with existing ML methods. Moreover, CVN is a single easy-to-explain method that allows for a more uniform approach to the BC prognosis problem over a wide spectrum of clinical conditions.

Table 8 Kappa statistics for training data sets for various Autoweka/Weka feature selection settings.

Performance of CVN on independent cohorts of patients

After a screening of the breast cancer data sets in the NCBI GEO (Gene Expression Omnibus) repository we have identified a few BC data sets with characteristics compatible with the Metabric data set regarding the recorded therapy, endpoint survival (preferentially overall survival). The prediction performance is tested in a leave-one-out evaluation framework in which the multi-gene fingerprint is the optimal fingerprint defined on Metabric data. Greedy hyper-parameter optimization is applied and the best result in terms of OR subject to slackness below 15% is the selected configuration reported in Table 9. Due to different microarray technologies, we have mapped the genes onto the probes for the target technology (using all mapping probes, if multiple probes map onto the same HUGO gene ID).

The GSE45255 data set holds information on three different therapeutic classes. The numbers of patients in each class are however rather small. While for the GEO45255 chemotherapy (ch) subset there is perfect performance, the p-value for OR is too high for claiming statistical significance on this measure, but, in contrast, the AUC value is statistically significant. For the GEO45255 endocrine (ho) and the GEO45255 endocrine plus chemotherapy (chho) subsets, we attain high values in kappa and OR, with significant OR and AUC p values.

Data set GSE37181 holds a large number of patients (119), and it is perfectly balanced among the two classes (60 vs 59), but the endpoint is disease-free survival (dfs), rather than overall survival (os). We notice a loss in terms of OR although the kappa statistics and AUC are still in an acceptable range, with good statistical significance.

Data set GSE7390 holds a larger number of patients (181) but is unbalanced among the two classes (157 vs 24). This has the effect of inducing relatively low kappa statistics, however, the odds ratio (20.86), sensitivity (0.7), specificity (0.89), and the p values indicate a good performance on these indices.

Data set GSE2034 is the largest independent cohort (264) in this table and is roughly balanced (95 vs 169) within a factor 2. Although the kappa statistics is low, the odds ratio OR is high (20.12) even if the endpoint is relapse-free survival (rfs) rather than overall survival (os).

Overall these experiments show that the selected multi-gene fingerprints may be effective across different microarray platform and different patient cohorts, while some loss of performance can be expected when a different endpoint is used. This suggests that when we change the endpoint of the prediction (e.g. disease-free survival) we should recalibrate the fingerprints in the chosen setting.

Table 9 Independent cohorts.

Discussion

We have developed a new ML supervised classification method called Coherent Voting Networks (CVN) which is suitable for handling highly non-linear phenomena such as those prevalent in biological systems. We have applied CVN to the problem of predicting the prognosis of BC patients depending on the chosen post-surgery adjuvant therapy selected. After surgery, a breast cancer patient must follow a therapeutic regime aimed at preventing relapse and formation of metastases. The CVN-based prognostic tool can predict, with good accuracy for a large percentage of the patients, whether the patient will survive more or less than 5 years following current the state of the art adjuvant therapeutic protocols (based on chemotherapy, radiation therapy, and hormone therapy). Such prognostic tool helps the clinician and the patient by validating the chosen therapeutic path (in case of predicted good prognosis), or by suggesting, in combination with other elements, the need for further investigations, or the application of newer, possibly experimental, alternative protocols (in case of predicted poor prognosis). The advantage for the patient is the possibility to personalize the therapeutic choices by using her molecular prognostic profile, with a higher chance of an effective cure and survival. The advantage for the clinician is a tool to validate baseline therapeutic choices (or suggest the need for alternatives). The advantage for the health system at large is better discrimination among those patients requiring expensive and invasive cures (e.g. chemotherapy), and those that would benefit from less expensive and invasive ones (e.g. hormonal therapy). The CVN-based prognostic tool uses a small molecular profile of a few dozen genes that can be measured for each patient’s tumor biopsy with standard technologies like RNA-seq or RT-PCR.

The fingerprint gene panel has been identified using public data of the project Metabric (Molecular Taxonomy of Breast Cancer International Consortium) and tested using other publicly available data of independent cohorts. Thus the results in this paper rely rather heavily on the quality of the Metabric protocols for collecting molecular and clinical data. An interesting line of research to be developed is to assess the robustness of the CVN-based prediction when different technologies and different data processing protocols are used. Preliminary tests on independent cohorts (see Table 9) suggest that the devised gene fingerprint is rather robust with respect to changes in the gene expression measurement technology and are even capable of operating with endpoints different from the default one chosen in this study (overall survival). However the hyper-parameter optimization phase during predictor’s training is likely to be rather more data and technology-dependent, and thus probably the adoption of different technology/protocol in data collection may entail a re-training of the predictor. A second limitation of the method in its training phase is that it relies on knowledge of the adjuvant therapy chosen for the patients. There is an implicit assumption that over the time frame of the data collection no drastic changes in the clinical practice and criteria would take place. As this cannot be guaranteed over a long period (and indeed changing current clinical protocols is the final aim of this tool) there is the practical need of continuous monitoring to ensure consistency between the patient population used in training and the population for which the tool is applied.

The CVN methodology is a general ML supervised classification tool, and, for prognostic purposes, it can be in principle applied to many variants of this problem.

The CVN-based prognostic tool is currently optimized to maximize and balance the kappa statistics (alternatively the odds ratio) across training, validation, and test data while limiting the number of patients for which no answer is given. This strategy produces also often a balancing of PPV and NPV. It is possible to obtain alternative gene panels for a specific situation (or different predictors on the same panels) that may optimize directly PPV and NPV, say by maximizing PPV subject to a lower bound on NPV (or vice versa).

Also, when a higher rate of no-answers is allowed we can increase the PPV and NPV for the given answers. Preliminary data for certain therapy classes give an NPV and PPV close to 95% for 50% of the patients. Thus with the same data, it is possible to devise a cascade of predictors having higher guarantees for the easier cases, as to cover a given population by several stratified predictors (from the easiest to the most complex cases to predict).

It is possible in principle to apply this CVN methodology to derive a prognostic panel at 10 years (this information has also clinical relevance in long-term follow-ups).

In general, it should be possible to derive similar gene panels for other tumors, provided that Metabric-like high quality data is available on a sufficiently large cohort of patients.

Finally, since we have used only gene expression data (and knowledge on the patients 5-year survival) to build the predictors, one may think that feeding other clinical or molecular indices as additional input to the CVN may improve the predictive powers. Preliminary experiments in this direction however show that a straightforward integration of known single clinical measurements does not improve predictions significantly. It remains thus open the question whether more sophisticated heterogeneous data integration strategies taking several indices at once may be beneficial within the CVN approach to prognosis predictions. A promising line of future research involves integrating mRNA and miRNA to produce mixed prognostic signatures18,19. Data on miRNA expression in Metabric patient’s samples have been produced recently within the Metabric miRNA landscape project (https://ega-archive.org/studies/EGAS00000000122). Preliminary results from this project indicate that “breast cancer miRNAs appear to act as modulators of mRNA-mRNA interactions rather than molecular switches”. Thus while it is likely that mixed miRNA-mRNA fingerprints may sharpen some of our results, within the CVN framework, we expect that mRNA will continue to be key elements of the predictors, even in this extended setting. Certainly, a better appreciation of miRNA-mRNA interactions in BC may shed more light on the causative elements of BC progression. A second promising direction of research integrates biomedical imaging and molecular profiling for prognostic purposes20,21.

Triple-negative breast cancer (TNBC) is an aggressive type of breast cancer affecting about 15% of the cases, and it is known to be quite non-homogeneous from a clinical and molecular point of view22,23,24. Research on devising prognostic molecular fingerprints for TNBC has thus been directed mainly at subclasses of of TNBC25,26,27,28,29,30,31. In our results, we are able to attain good performance in terms of PPV, NPV, and OR on the full pool of Metabric TNBC patients. The good overall performance may be explained with the intuition that the initial therapy-based stratification of the patients is able to capture implicitly the TNBC molecular and clinical heterogeneity.

HER2 positive BC covers about 25% of the BC cases. It is considered an aggressive tumoral form, and while it responds well to recent therapeutics, it is known to develop drug resistance in time for about 50% of the cases distant metastases occur32,33,34,35. Molecular signatures for HER2-positive BC prognosis have been found for certain subtypes of the disease or for predicting the response to specific drugs36,37,38,39. Also for this important type of BC, we could attain high PPV, NPV, and OR results.

Lumina-B BC is one of the intrinsic types of BC discovered by Perou et al.40, based on clustering of BC gene expression profiles. Prognostic properties of this subtype have been investigated in particular compared to the other intrinsic types41,42. In general, however, less is known about discriminating prognosis within the type43,44. Here we show that the CVN-based classifier is effective in discriminating good and poor prognosis patients with high PPV, NVP, and OR.

van de Vijver et al.45 report the performance of a 70-genes prognostic gene fingerprint: for lymph node negative patients OR is 15.0 (3.3–56, p val < 0.001) with PPV 0.63 and NPV 0.89. For lymph node positive patients OR is 13.7 (3.1–61 , p val < 0.001), with PPV = 0.4 and NPV 0.95. Overall our results for lymph node positive and negative are similar in terms of OR but, in our case, we have a better balancing between the PPV and NPV measures.

Paik et al.10 developed a 21-gene signature (16 predictive and 5 control genes) to predict recurrence in lymph node negative breast cancer treated with Tamoxifen, which was later incorporated in the Oncotype DX prognostic kit. Taking into account only the low and high-risk classification of the patients we obtain an OR 5.67 (3.39–9.46, p value 9.6e−12) with NPV 0.90 and PPV 0.38. Again, our results show a better balancing of PPV and NPV values.

Our work has focussed on selecting relatively small fingerprints that can be used to build predictive CVN, by maximizing the kappa statistic (or the odds ratio) in testing sets of patient data, subject to an upper bound on the slackness of the method (percentage of no responses). In this research, we did not aim at uncovering causative fingerprints (i.e. a pattern of gene expression level measures that explain the future survival in combination with a therapeutic regime46. Although we cannot rule out that the uncovered genes may indeed be involved in the causation of the disease, two orders of considerations advise caution. One consideration is that several just slightly sub-optimal fingerprints may also be found (a phenomenon compatible also with the findings by Venet et al.47). Thus causative genes may be present outside a predictive fingerprint of minimal size, with an explanatory role as important as that of those present in the fingerprint. The second consideration is that we have used one mRNA data set from protein-coding genes as our feature space. It is known that BC involves several layers of biological regulation (e.g. genetic aberrations, actions of non-coding RNA, epigenetic signals, multi-cell signaling, metabolic and environmental conditions), thus a causative explanation might involve a more complex interplay of several layers. Finally, we did not touch yet on the topic of whether such fingerprints contain directly actionable targets for therapeutic agents (either for administered drugs or for new drugs tailored to the personal molecular profile of the patients). These related problems are of interest and may entail the collection and fusion of additional relevant ‘omic’ data, as well as the refinement of the algorithms introduced in this study.

Methods

Here we give an overview of the Coherent Voting Network (CVN) methodology at a high level. For details, we refer to the Supplementary materials 1 (“Methods” in detail). The description is in two parts. The first part introduces the CVN and its use for prognosis prediction. The second part describes the feature-selection and hyper-parameter optimization procedure that is performed in a train-validate-test protocol aiming at optimizing the gene fingerprint, the CVN configuration, and estimating the performance of the method on a testing set of patients.

Construction of a CVN

As working with a complete gene set is a computational burden and may introduce too much noise from the experimental measurements, we apply a mild initial statistical filter to preserve in the computation only genes able to discriminate the two categories of patients (high-risk or low-risk) that correspond to bad and good prognosis, using thresholds for fold change, t test, ks-test (Kolmogorov–Smirnov), and mwu-test (Mann–Whitney U). Thus the gene set we use in the further CVN construction is composed of genes passing a combination of these statistical discrimination tests.

We build a bipartite graph G in which we have patient nodes P and Gene-Interval nodes GI where each node of the GI class is labeled with a gene and an interval of values for the expression of that gene. This graph is built in a straightforward manner from the input data matrix of gene expression for a pool of patients, by using quantization methods48. We build a partial dense cover of this bipartite graph (see definition in Pellegrini et al.49) which is a collection \({\mathcal {C}}\) of dense subgraphs of G, where each subgraph is also called a community. Each community will have both patient and gene nodes, and the communities may overlap. Let us for the moment concentrate only on the patient nodes. Each patient may belong to many communities. Each patient has a category (high-risk or low-risk) that corresponds to a bad or good prognosis. Each community expresses a vote (high-risk, low-risk, or null) by a voting scheme (say, for the moment, simple majority, but more schemes are described in the Supplementary Materials 1). Each patient receives a prediction that is the majority category expressed by the communities it belongs to. Finally, the voting is coherent for a given patient p if the vote received by p is equal to her category. The degree of coherence of the voting network is the fraction of patients for which it is coherent. Ideally, the higher the degree of coherence of a CVN the better such CVN is as a basis for a predictor. The key point is that in such a construction the partial dense cover does not depend on the category of the patients, thus we may have in input non-classified patients, for which the vote of the network represents their category prediction. The intuition is that a network that is coherent for the classified patients, even if built without knowing their category, is a good predictor also for the unclassified ones.

We can see a CVN as a generalization of the notion of guilt by association (GbA) in biological networks. In a typical application, some nodes in a biological network will have labels and some will be unlabeled. We make a prediction for an unlabeled node by using the labeled nodes within a neighborhood of the unlabeled node in the biological graph. Note that in GbA each node receives a vote from a single subset of the nodes.

So far each community in a CVN may have a large number of genes, and one of our aims is to find a minimal set of genes that leave the communities (of patients) unchanged since the reduction of the number of genes would not change much their density. To achieve this goal we consider now only the genes belonging to any community. We look for a minimal set M of genes so that each community (of genes) includes at least k genes in M. The set M can be well approximated by using a greedy set multi-cover algorithm (see e.g.50).

After computing the minimal set M of genes we can rebuild the CVN using only the patient set P and the genes in M obtaining a CVN’, measure the coherence of CVN’, and use CVN’ for prediction of the category of unclassified patients.

Train-validate-test protocol

Each phase of the construction described above depends on the choice of values for hyper-parameters, and we will have a CVN for each such choice (which we call a parameter-vector v of the parameter-space V). While sophisticated strategies for searching this discrete parameter-space exist (in ML they are termed hyper-parameter optimization strategies) in our application the construction of a single CVN is in practice very efficient thus we will use greedy search and compute a CVN for each \(v \in V\), as |V| is in the range of only a few hundreds.

A further aim, besides finding an optimal v, and a small gene set M is to have high performance for the testing phase in a train-validation-test set-up.

We begin by splitting the initial set of patients into three sets: the training set T0, the validation set T1, and the test set T2. In a standard ML setting information leaking is avoided by finding the optimal \((v^*,M^*)\) pair only on (T0, T1) and then applying such optimal predictor to (T0, T2). The performance is measured on this unique predictor for (T0, T2). We relax such an all/nothing schema by allowing the use of T2 in the choice of \((v^*,M^*)\) in a very limited and controlled way, by use of the concept of lookahead. Instead of producing a single predictor on (T0, T2) we produce a ranking of all predictors on (T0, T1) that we can build by choosing a \(v \in V\). We then lookup vectors v in this ranked list, and we stop when the corresponding predictor for (T0, T2) satisfies a stopping criterion. The number of vectors v we visit in this lookahead process is the lookahead number (lh). For lh=1 we have the standard ML set up. In Table 1 we report the lh values observed for the therapy classes: 1 once, 2 twice, 3 once, and 4 once.

Computation of p values

In this study, we use three different p values associated with the statistics: odds ratio, ROC AUC, and log-rank test. The log-rank test statistic and the associated p value are computed with the API of the lifelines package (https://lifelines.readthedocs.io). The log-rank test statistic is a chi-squared test under the null hypothesis of the two series having the same hazard ratio. The ROC AUC statistic and the associated p value are computed via equivalence to the Wilcoxon Mann Whitney test using the API in the scipy.stats package (https://scipy.org/). This p value is one-sided and assumes an asymptotic normal distribution. The Odds Ratio statistic and the associated p-value are computed with the exact Fisher test API in the scipy.stats package, with the ‘two-sided’ option. In this test, the distribution of odds ratio values follows the hypergeometric distribution.

Metabric patients sample selection

The Metabric collection used in this paper is described in Curtie et al.13 and is made of 1992 clinically annotated primary fresh-frozen breast cancer specimens from tumor banks in the UK and Canada. The Metabric collection is the union of two main cohorts: the first cohort of 997 female patients and the second cohort of 995 female patients. The differences among the two main cohorts do not impact our research, as the relevant transcriptomic methodology adopted is the same in both cases, thus we do not distinguish the cohort of origin in subsequent steps. Note also that we do not use data on the normal tissue specimens.

Nearly all estrogen receptor (ER)-positive and/or lymph node (LN)-negative patients did not receive chemotherapy, whereas ER-negative and LN-positive patients did. Moreover, none of the HER2+ patients received trastuzumab. Thus the treatments were homogeneous within clinically relevant groupings, which is an important feature of this collection supporting our therapy-based stratification. Although many assays are available for this collection of specimens, here we exploit the clinical data only in conjunction with the normalized gene expression matrix.

All patient specimens were obtained with appropriate consent from the relevant institutional review boards13. RNA was isolated from samples and hybridized to Illumina HT-12 v3 platforms for transcriptional profiling. Illumina HT-12 v3 technology targets more than 25,000 annotated genes with more than 48,000 probes. Probes were designed using the RefSeq (Build 36.2, Rel 22) and the UniGene (Build 199) databases. Illumina HT-12 v3 raw data is then preprocessed in steps: spatial artifact correction, summarization, normalization of Log2 intensities with beadarray (https://bioconductor.org/) and bash51 (see Suppl. Mat. in Curtis et al.13).

Each patient is annotated with her risk class, taking censoring into consideration, setting survival below 60 months (5 years and below) as high-risk, and survival above 72 months (6 years and above) as low-risk. We take the full collection of 1992 Metabric patients in a vector and apply a random permutation (function random.shuffle in python). Next, we assign the first half of the positions of the vector to the training (1000 patients), the subsequent quarter of the positions to validation (500 patients), and the last quarter to testing (492 patients). Note that due to the properties of random permutations every subset of patients of the corresponding size has the same chance of showing up as a training/validation/testing set. We perform also on the three sets an equalization step. Within the majority risk-class, we take a random sample (via random permutation) of the same size as the minority risk-class. The patients in the majority risk class not selected are discarded. Afterward, each of the three sets (training/validation/testing) in the two variants (unequalized/equalized) is stratified according to the eight possible therapy classes, as reported in the clinical annotations. When for a therapy class the two risk-classes groups are sufficiently balanced in the unequalized case for the three sets (with a ratio below 2.5:1 of the largest class to the smallest) we use the unequalized sets. Otherwise (unbalanced case) we use the equalized versions of the sets.

Handling missing data

When data is used in matrix form missing data need to be taken into account before numerical computation may start, as a fully dense matrix is usually assumed by most numerical methods. In our setting data is presented in a matrix, and re-mapped to a graph G after the application of the initial statistical filters and discretization. Initial statistical filters and discretization are done in a gene-by-gene fashion and they are well-defined operations even when some matrix entries are missing. Here we just apply loose filtering, excluding by default genes having more than 50% of missing entries. For the phase of graph generation missing entries in the matrix are simply mapped to missing edges in the graph. Note that, from a formal point of view, graph algorithms do not suffer from this. More specifically in our context, missing edges may result in lower density for the communities associated with the incident nodes, and this situation is handled in full generality by the algorithm for building the dense communities. Our approach thus avoids altogether any potential bias or noise introduced by the standard missing data handling methodologies that rely on interpolation.

Handling censored data

Our approach to dichotomization and censoring can be classified as an “uncensoring” technique52, a transformation of the input data so that standard classification algorithms can be applied effectively to censored data. Our approach has similarities with the method described in Zupan et al.53, where the instances in the given data are split into three categories: (1) instances that experience the event of interest (death by BC related causes) during the observation period will be labeled as eventful and assigned to the risk class according to the event time; (2) instances whose censored time is later than the predefined time point (5 years) are assigned to the low-risk class; (3) instances whose censored time is earlier than the predefined time point are removed from further consideration. In the context of Metabric data, which is of high quality only 20 patients out of 1992 are censored due to loss-in-follow-up. All other censored cases of type (3) are censored due to their entry the observation program later than 5 years before the end of the observation period (end-of-study). Note that our choice of removing patients of type (3) does not introduce any bias, as the time-of-entry of a patient in a study is considered independent of any other feature of the patient. As Metabric data are sufficiently numerous we can afford to neglect the patient of type (3) and still attain statistically significant results in most cases. In situations where data is not sufficient, one might want to adopt the full approach in Zupan et al. 2000 and assign a marginal probability of event occurrence estimated by the Kaplan–Meier method to patients of type (3).

Statistics on patients features

Here we report the distribution of 17 patient features (continuous and categorical) for training, validation, and testing sets (after possible equalization in a therapy class). Categorical features in Table 10 are: intrinsic subtypes, type of breast surgery, histological subtypes, inferred menopausal status, HER2 SNP6 copy number gain/loss, laterality, intrinsic clustering, cohort of origin, ER status by immunohistochemical analysis, hormone receptors status, cellularity, lymph node status, tumor stage and tumor grade. Continuous features in Table 11 are age at diagnosis (in years), NPI (Nottingham prognostic index), and overall survival (in months). The repeated use of uniform random sampling ensures a high similarity in the distribution of the feature values across the train, validate and test sets.

Table 10 Distribution of patient categorical features over training, validation, and testing sets.
Table 11 Distribution of patient continuous features over training, validation, and testing sets.

Ethical statement

Patients were not directly involved in the study.