Review Article

Win Ratio in Real-world Heart Failure: Integrating Propensity Scores and Causal Inference for Asian Registries


Abstract

Composite endpoints in heart failure (HF) trials enhance power, but often obscure clinical relevance. The win ratio (WR) offers a hierarchical, patient-centred alternative. Applying this method to real-world data demands methodological adaptations to address confounding and selection bias. The objective of this paper is to review the WR methodology and present a practical framework for its application to observational HF registries by integrating propensity scores and causal inference to approximate randomisation. This review outlines the traditional limitations of composites and the fundamentals of WRs. We integrate propensity score approaches, including probability weighting, matching and stratification, with WR analyses, emphasising balance diagnostics and weight truncation. A conceptual example based on Asian HF registries illustrates the workflow, comparing WR estimates with traditional time-to-first-event composites. In conclusion, when embedded in a proper causal inference framework, the WR bridges randomised trials and routine care by prioritising patient-centred outcomes while maintaining statistical rigour. Future integration with target-trial emulation will advance real-world evidence in HF research.


Disclosure: The authors have no conflicts of interest to declare.

Acknowledgements: The authors thank the Asian Pacific Society of Cardiology for fostering regional collaboration in cardiovascular research and data science. They acknowledge the use of ChatGPT (DALL-E 3) and Google Gemini in the generation and conceptual design of the illustrations. All AI-assisted outputs were subsequently reviewed, edited and verified by the authors to ensure scientific accuracy and technical relevance.

Correspondence: Amarit Tansawet, Research Affairs, Faculty of Medicine, Chulalongkorn University, 10th Floor Mongkut-Phetcharat Research and Service Innovation Building, King Chulalongkorn Memorial Hospital, 1873 Ratchadamri Rd, Pathum Wan, Bangkok 10330, Thailand. E: amarit.t@chula.ac.th

Copyright:

© The Author(s). This work is open access and is licensed under CC-BY-NC 4.0. Users may copy, redistribute and make derivative works for non-commercial purposes, provided the original work is cited correctly.

Heart failure (HF) remains a significant cause of morbidity and mortality worldwide, with a particularly high burden in Asia, where demographic diversity, comorbidities and variable access to advanced therapies complicate disease management.1

Randomised controlled trials (RCTs) have played a crucial role in establishing evidence-based therapies; however, their eligibility criteria often exclude elderly patients, those with multimorbidity and under-represented ethnic groups.2 Moreover, RCT findings may not translate into effective clinical approaches.3 Consequently, real-world data (RWD) have become increasingly valuable for complementing RCT evidence, as they reflect diverse patient populations and routine clinical conditions.4

However, evaluating HF outcomes typically relies on composite endpoints to increase statistical efficiency. Despite improving power, this approach has two major limitations. First, it assigns equal weight to events of differing severity, such as death and hospitalisation, potentially obscuring clinical relevance if the result is driven by less severe outcomes. Second, traditional ‘time-to-first-event’ analyses ignore recurrent events, thereby underestimating the total disease burden in chronic HF.5

To address these issues, researchers have proposed the win ratio (WR) as a hierarchical, patient-centred alternative that prioritises clinically meaningful outcomes. This approach compares all possible patient pairs between groups according to a predefined hierarchy, such as cardiovascular death (CVD) > HF hospitalisation > symptom improvement, and calculates the ratio of ‘wins’ to ‘losses’.6 The WR has since been adopted in major HF trials, including EMPEROR-Preserved and DELIVER, providing a more nuanced interpretation of treatment benefits.7,8 Nevertheless, recent methodological critiques highlight pitfalls, such as sensitivity to arbitrary hierarchy ordering and potential misinterpretation under high event rates, emphasising the need for transparent reporting and sensitivity analyses.9,10

Translating the WR to observational RWD introduces additional complexity, including confounding, missing data and variable follow-up. In Asian registries, heterogeneity in coding standards, incomplete mortality linkage, and unequal access to evidence-based care compound these challenges. Applying causal inference frameworks, particularly propensity score–based weighting and matching, can help approximate randomisation and enable fairer pairwise comparisons.11,12

Graphical Abstract: Win Ratio in Real-world Heart Failure: Integrating Propensity Scores and Causal Inference for Asian Registries


This article is structured as a narrative methodological review designed to provide a practical tutorial. It aims to: summarise methodological foundations of the WR and recent critiques; demonstrate how propensity score methods can emulate randomisation in observational WR analyses; and propose a practical framework for applying the WR in Asian HF registries. The review is based on the assertion that integrating causal inference principles with hierarchical outcome assessment may advance patient-centred evidence generation from real-world HF data.

Literature Search Strategy

To develop this practical framework, we conducted a targeted literature search across PubMed and Google Scholar for articles published up to 2025 in the English language. Search terms included combinations of win ratio, hierarchical endpoints, heart failure, propensity score and causal inference. We prioritised seminal methodological papers, critical reviews and recent clinical trials or observational studies illustrating the application of these methods in cardiovascular research.

Real-world Data on Heart Failure: Opportunities and Limitations

Large-scale real-world registries have become pivotal in advancing HF research in Asia (Table 1), offering pragmatic evidence that complements randomised trials. These registries capture diverse phenotypes, treatment patterns and outcomes that are often under-represented in conventional RCTs.13

Table 1: Asian Heart Failure Registry Characteristics


The ASIAN-HF Registry, which includes data from more than 6,000 patients, revealed striking regional heterogeneity: patients from south-east Asia were younger, had higher rates of ischaemic heart disease and diabetes, lower use of guideline-directed medical therapy (GDMT) and worse survival compared with east Asian cohorts and cohorts from Western countries.14

In Japan, the JROAD-HF Registry (>400,000 hospitalisations), for which the Ministry of Health provides nationwide administrative data, supports longitudinal analyses of trends and the burden of rehospitalisation.15 In Korea, the KorAHF Registry (>5,000 patients) showed a reduction in short-term mortality but persistently high readmission rates, highlighting the recurrence of events.16 The China-HF Registry (>40,000 patients) demonstrated wide inter-hospital variation in uptake of GDMT and persistent underuse of evidence-based medications outside tertiary centres.17

In south-east Asia, the Thai HF Registry (THFR; 3,000 patients) reported a younger age at presentation and lower GDMT usage.18 It also highlighted that non-cardiovascular deaths, such as those due to infection or sepsis, predominated in HF with mildly reduced ejection fraction (HFmrEF) and HF with preserved ejection fraction (HFpEF), reflecting the complex multimorbidity patterns typical of south-east Asian populations. These studies collectively emphasise the importance of longitudinal RWD to monitor evolving trends.

Unlike registries from developed nations, such as SwedeHF and GWTG-HF, that often benefit from unified national healthcare systems and robust links to national statistics, Asian registries operate across highly diverse and fragmented healthcare infrastructures.19,20 Consequently, substantial heterogeneity in data exists across Asian countries, complicating cross-national synthesis. First, ICD-10 coding practices vary considerably: some registries rely on clinician-entered diagnoses, whereas others derive them from administrative billing codes, leading to phenotypic misclassification. Second, the completeness of mortality linkage differs; Japan and Korea maintain robust vital statistics systems, whereas linkage in south-east Asia may be incomplete, potentially leading to an underestimation of death rates. Third, follow-up duration and completeness vary across hospitals and regions. These differences affect endpoint ascertainment and must be carefully considered when defining hierarchical outcomes for analysis.

Beyond data quality, several methodological challenges persist. Missing data – particularly for laboratory parameters, echocardiographic measures and quality-of-life assessments – can lead to biased event rates and limit the ability to adjust for multiple variables. Confounding by indication, where treatment decisions are based on clinical judgement rather than random allocation, poses a significant threat to the causal validity of the results. Last, recurrent events such as repeated HF hospitalisations are common but often inadequately captured using conventional time-to-first-event methods, resulting in underestimation of total disease burden and the effects of treatment.5

To overcome these biases and derive reliable inferences from observational RWD, advanced causal-inference frameworks are essential. Propensity-score weighting, matching and stratification can adjust for baseline imbalances and approximate randomisation, enabling fairer comparisons of therapeutic strategies. Integrating these methods with hierarchical analytic frameworks, such as the WR, offers a promising pathway toward generating robust, patient-centred real-world evidence for HF across Asia.

Problems with Traditional Composite Endpoints

Composite endpoints are frequently used in cardiovascular trials to enhance statistical efficiency by capturing multiple clinically relevant events within a single analysis. In HF research, composites typically combine CVD and HF hospitalisation to increase event rates and reduce sample size requirements. While this approach improves power, it often compromises interpretability and may obscure clinically meaningful distinctions between outcomes of differing severity.6

A significant limitation arises from the equal weighting problem, wherein each component event contributes equally to the overall endpoint regardless of its clinical importance. For example, an emergency department visit for transient dyspnoea and a CVD are both counted as single events in a time-to-first composite analysis, despite their vastly different implications for patients and clinicians. A treatment that reduces minor events but has no effect on mortality may thus appear beneficial, potentially leading to misleading clinical conclusions.6 Further, when a composite is driven predominantly by the least severe component, such as unscheduled clinic visits or emergency evaluations, the clinical relevance of the observed treatment effect diminishes even if statistical significance is achieved.21

Equally problematic is the time-to-first-event framework commonly applied to composite outcomes. This method considers only the first event per patient and censors subsequent occurrences, disregarding the recurrent nature of HF exacerbations and hospitalisations.5 Since HF is characterised by multiple readmissions over time, focusing solely on the first event underestimates the total disease burden and the actual treatment effect. Analyses from trials such as PARADIGM-HF and DAPA-HF have shown that recurrent hospitalisations account for up to 60–70% of total HF events, substantially influencing patients’ quality of life and healthcare usage.22,23 Ignoring these recurrent events not only underestimates the full therapeutic benefit but also fails to capture the lived experience of patients with chronic HF.

A third challenge involves competing risks, particularly the interplay between non-fatal and fatal outcomes. When death occurs before hospitalisation, patients are no longer at risk for subsequent non-fatal events, violating the assumption of independent censoring inherent in standard survival analyses. Failure to account for these dependencies can distort estimates of treatment effects, particularly when mortality rates differ substantially between groups.24 In populations with high competing mortality, such as elderly patients or those with HFpEF, standard composite analyses may overestimate the treatment effect on hospitalisations by not properly adjusting for differential survival.

Several high-profile HF trials have illustrated how these limitations can lead to ambiguous or even contradictory interpretations. In TOPCAT, regional heterogeneity in event adjudication contributed to conflicting conclusions about spironolactone’s benefit, partly due to reliance on a composite driven by soft endpoints. Similarly, in PARAGON-HF, the composite of total HF hospitalisations and CV death narrowly missed statistical significance; yet, subgroup analyses suggested differential treatment effects when outcomes were hierarchically prioritised.25 These examples emphasise that when composite components have divergent treatment effects or occur at vastly different rates, traditional analyses may yield statistically significant but clinically ambiguous results.

These methodological pitfalls have prompted growing interest in hierarchical endpoint frameworks, which aim to preserve clinical interpretability while retaining statistical efficiency. Instead of treating all events equally, hierarchical approaches, such as the WR, compare outcomes in order of clinical importance – death before hospitalisation, for example. This design respects the patient-centred hierarchy of severity and aligns more closely with clinical decision-making. By integrating such approaches with causal inference methods in RWD, researchers can strike a balance between statistical rigour and clinical relevance, particularly in diverse Asian populations where endpoint heterogeneity and competing risks may differ from those in trial cohorts from high-income countries.

The Win Ratio: Concept, Interpretation and Evidence from Trials

The WR was introduced as a hierarchical alternative to traditional composite endpoints, designed to preserve clinical interpretability while maintaining statistical efficiency.6 Unlike standard time-to-first-event analyses, which treat all outcomes equally, the WR compares pairs of patients across treatment groups according to a pre-specified hierarchy of clinical importance, such as death, HF hospitalisation and symptom improvement.

Concept and Methodology

In its simplest form, each patient in the treatment group is compared with every patient in the control group. For each pair, the outcome is classified as a win, loss or tie.

Comparisons proceed sequentially in rounds according to outcome hierarchy (Figure 1A). In the first round, all pairs are compared on the highest-priority outcome, typically mortality. Pairs classified as wins or losses are finalised and excluded from further comparison. Only pairs that remain tied then proceed to the second round, where they are compared on the next priority outcome, such as HF hospitalisation. This sequential process continues through subsequent outcome tiers until all pairs are either resolved or remain tied across all outcomes. The win ratio is ultimately calculated as the total number of wins divided by the total losses.

  • A win occurs when the treated patient has a better outcome than the control patient on the highest-priority endpoint that separates the pair, for example, by living longer or, if survival is tied, by avoiding hospitalisation.
  • A loss occurs when the control patient fares better.
  • A tie occurs when both patients have identical outcomes on every tier, for example, when both are alive without hospitalisation.

The WR is calculated as the total number of wins divided by the total number of losses:

WR = Number of wins/Number of losses

A WR greater than 1 indicates that the treatment is favourable. Alternatively, the win difference (WD), which is the difference in proportions of wins and losses, can be reported to provide an absolute measure of effect.
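The pairwise procedure described above can be sketched in a few lines of Python. This is a minimal illustration on fully observed outcomes, not a reference implementation: the function names, the tuple layout (one score per tier, most important first, higher is better) and the toy patients are all assumptions made for the example.

```python
from itertools import product

def compare_pair(treated, control):
    """Return 'win', 'loss' or 'tie' for one treated-control pair.

    Each patient is a tuple of outcome scores ordered by priority
    (most important first); a higher score means a better outcome.
    """
    for t_value, c_value in zip(treated, control):
        if t_value > c_value:
            return "win"
        if t_value < c_value:
            return "loss"
        # equal on this tier: fall through to the next tier
    return "tie"  # identical on every tier

def win_ratio(treated_group, control_group):
    """Compare every treated patient with every control patient."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for treated, control in product(treated_group, control_group):
        counts[compare_pair(treated, control)] += 1
    total_pairs = len(treated_group) * len(control_group)
    wr = counts["win"] / counts["loss"] if counts["loss"] else float("inf")
    wd = (counts["win"] - counts["loss"]) / total_pairs  # win difference
    return counts, wr, wd

# Hypothetical toy data: (alive, free of HF hospitalisation), 1 = better
treated_group = [(1, 1), (1, 0), (0, 0)]
control_group = [(1, 0), (0, 0)]

counts, wr, wd = win_ratio(treated_group, control_group)
print(counts, wr, round(wd, 3))
```

In this toy data set, the six pairs split into three wins, one loss and two ties, so the WR is 3.0 while the WD is only about 0.33; reporting both makes clear how many pairs were left undecided.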

This hierarchical framework ensures that comparisons begin with the most clinically meaningful outcome, typically mortality. Only tied pairs (for example, when both patients survive) proceed to the next outcome tier (Figure 1A).6 The method uses event timing for ordinal ranking, determining who experienced an event first, rather than quantifying temporal differences (Figures 1B and 1C). For instance, when both treatment and control patients die, the WR classifies the earlier death as a loss regardless of the time interval, unlike Cox regression, which estimates the hazard ratio (HR) based on event timing. Detailed scenarios illustrating differential follow-up, censoring and fixed time windows are shown in Figure 1B.

For time-to-event outcomes, pairs with differing follow-up durations are classified as tied if both remain event-free during the overlapping observation period, and then advance to the next outcome tier. Events occurring after the comparator’s censoring time typically result in ties. When a fixed time window is specified, for example 90-day mortality, the analysis becomes a binary outcome assessment rather than a true time-to-event comparison.
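The handling of censoring can be made concrete with a small sketch. The rule coded here is one common convention and is an assumption of this example: a pair is decided only if one patient has an observed event while the other is still under observation and event-free. The `Patient` layout and function names are hypothetical.

```python
from typing import NamedTuple

class Patient(NamedTuple):
    death_time: float    # days to death, or to end of follow-up if alive
    died: bool
    hosp_time: float     # days to first HF hospitalisation, or to end of follow-up
    hospitalised: bool

def compare_tier(t_time, t_event, c_time, c_event):
    """Compare one time-to-event tier within the shared follow-up window."""
    if c_event and c_time < t_time:
        return "win"    # control had the event while the treated patient was still event-free
    if t_event and t_time < c_time:
        return "loss"   # treated had the event while the control patient was still event-free
    return "tie"        # no event, simultaneous events, or event after the other's censoring

def compare_patients(treated, control):
    """Death first; hospitalisation only if the death tier is tied."""
    result = compare_tier(treated.death_time, treated.died,
                          control.death_time, control.died)
    if result != "tie":
        return result
    return compare_tier(treated.hosp_time, treated.hospitalised,
                        control.hosp_time, control.hospitalised)

# Hypothetical scenarios
both_observed = compare_patients(Patient(400, False, 400, False),
                                 Patient(250, True, 100, True))
early_censoring = compare_patients(Patient(120, False, 120, False),
                                   Patient(365, True, 200, True))
second_tier = compare_patients(Patient(365, False, 90, True),
                               Patient(365, False, 365, False))
print(both_observed, early_censoring, second_tier)
```

The second scenario shows why short follow-up matters: the treated patient is censored at day 120, before the control patient's hospitalisation (day 200) and death (day 365), so the pair is recorded as a tie even though the control patient had both events.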

Figure 1: Win Ratio Methodology and Time-to-event Handling


Clinical Interpretation and Visualisation

Clinically, the WR provides an intuitive interpretation: a WR of 1.5 implies that, among pairs with a decisive result, the treated patient ‘wins’ 1.5 times as often as the control patient when outcomes are ranked by importance. Visualisation tools, such as win–loss plots or hierarchical bar charts, can display the cumulative distribution of wins and losses across endpoints, making the hierarchy explicit and enhancing interpretability for clinicians, regulators and policy-makers.26
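The arithmetic behind this interpretation can be written out explicitly. The symbols W, L and N (numbers of wins, losses and total pairs) are notation introduced here for illustration only.

```latex
\mathrm{WR} = \frac{W}{L},
\qquad
P(\text{win} \mid \text{pair not tied}) = \frac{W}{W + L} = \frac{\mathrm{WR}}{1 + \mathrm{WR}},
\qquad
\mathrm{WD} = \frac{W - L}{N}

% Worked example for WR = 1.5:
\frac{1.5}{1 + 1.5} = 0.60
```

A WR of 1.5 therefore corresponds to the treated patient winning 60% of the decisive pairs. Because ties are excluded from the ratio, the same WR can accompany very different win differences depending on how many pairs remain tied.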

Evidence from Clinical Trials

The WR has been increasingly applied in major cardiovascular and HF trials. In the EMPULSE trial, empagliflozin significantly improved 90-day outcomes among patients hospitalised for acute HF, yielding a WR of 1.36 (95% CI [1.09–1.68]; p=0.0054).27 The hierarchical endpoint – death, number of HF events and ≥5-point improvement in the Kansas City Cardiomyopathy Questionnaire (KCCQ) – allowed simultaneous assessment of survival and quality-of-life domains.

In the EMPEROR-Preserved trial, which evaluated empagliflozin in HFpEF, the primary composite (CVD or HF hospitalisation) was re-analysed using WR to clarify the relative contributions of fatal versus non-fatal events.5 The analysis confirmed a consistent treatment benefit while highlighting the mortality-neutral but hospitalisation-reducing effect of sodium-glucose co-transporter 2 inhibitors (SGLT2i).

Similarly, in the DELIVER trial, dapagliflozin in HFpEF yielded a WR of 1.25 (95% CI [1.10–1.42]), reinforcing the robustness of hierarchical comparisons even when overall event rates were modest.7 Collectively, these studies validated the WR as a clinically meaningful and statistically rigorous endpoint that captures the multidimensional impact of therapy, encompassing mortality, morbidity and patient-reported outcomes.

Advantages and Limitations

The WR offers several advantages:

  • It respects clinical priorities by ranking outcomes according to severity.
  • It enhances interpretability through patient-centred comparisons.
  • It remains valid even when event timing varies or follow-up is incomplete, since the analysis depends on outcome order rather than exact timing.

Additionally, it can integrate binary, ordinal and continuous endpoints, such as symptom scores or biomarkers, within a single analytic framework.6

Nonetheless, several limitations warrant consideration. Results depend heavily on the chosen hierarchy: different orderings of outcomes may yield different WR estimates, necessitating sensitivity analyses.8,9 The WR also becomes less stable when event frequencies are low, as a few events may disproportionately influence the results. Missing data, censoring and unequal follow-up are further sources of bias. While early implementations were computationally intensive, modern statistical software has largely mitigated these challenges, facilitating broader application in clinical trials and real-world studies.

Overall, the WR represents a significant methodological advance that aligns statistical analyses with clinical reasoning. Its adoption in recent HF trials underscores its potential to bridge efficacy and real-world relevance, especially when combined with causal-inference approaches to address confounding and selection bias in observational datasets. The following section examines circumstances in which the WR may be misleading, along with strategies to ensure its appropriate use.

When the Win Ratio Can Mislead: Lessons from Recent Critiques

Despite its conceptual appeal and increasing adoption in cardiovascular trials, the WR is not immune to methodological pitfalls. Recent critical appraisals have emphasised that while the approach appears intuitively patient-centred, inappropriate design choices, incomplete data or lack of transparent reporting can produce misleading interpretations. Butler et al. cautioned that the WR may be ‘seductive but misleading’ if applied without careful consideration of endpoint hierarchy, data structure and analytical assumptions.10 Understanding when and why the WR may fail is essential, particularly when extending the method from controlled trial settings to heterogeneous RWD.

Sensitivity to Endpoint Hierarchy and Event Rates

The validity of the WR fundamentally depends on the predefined outcome hierarchy, yet this hierarchy is often chosen based on clinical consensus rather than empirical evidence. Reordering endpoints can dramatically alter conclusions. For example, prioritising HF hospitalisation before death (rather than the conventional death-first hierarchy) may reverse the direction of treatment effect, particularly when the intervention differentially affects fatal versus non-fatal outcomes.28

This sensitivity is especially problematic when event frequencies differ substantially between components. When hospitalisations are common but mortality is rare, pairwise comparisons become dominated by the more frequent, but less severe, endpoint. In such scenarios, a modest reduction in hospitalisations may produce an apparently favourable WR, even if mortality trends in the opposite direction, potentially obscuring clinically critical signals.9
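The sensitivity to hierarchy ordering described above can be illustrated with a deliberately extreme toy example. The patients below are invented so that the hypothetical treatment prevents hospitalisations but is associated with more deaths; the function name and data layout are assumptions made for this sketch.

```python
from itertools import product

def win_ratio(treated_group, control_group, order):
    """Win ratio when tiers are compared in the given order of tuple indices.

    Assumes at least one loss, otherwise the ratio is undefined.
    """
    wins = losses = 0
    for treated, control in product(treated_group, control_group):
        for tier in order:
            if treated[tier] != control[tier]:
                if treated[tier] > control[tier]:
                    wins += 1
                else:
                    losses += 1
                break  # pair resolved; lower tiers are ignored
    return wins / losses

# Hypothetical patients: (alive at end of follow-up, free of HF hospitalisation)
treated_group = [(1, 1), (0, 1), (0, 1)]
control_group = [(1, 0), (1, 0), (1, 1)]

death_first = win_ratio(treated_group, control_group, order=(0, 1))
hosp_first = win_ratio(treated_group, control_group, order=(1, 0))
print(round(death_first, 2), round(hosp_first, 2))
```

With death ranked first the WR is about 0.33 (two wins, six losses); with hospitalisation ranked first it becomes 3.0 (six wins, two losses). The data are identical in both runs, and only the hierarchy has changed.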

In RWD, the problem becomes even more pronounced. Delayed death ascertainment, incomplete linkage to vital statistics or differential loss to follow-up can result in a systematic underestimation of survival-related wins in one exposure group. If administrative databases capture hospitalisations promptly but death notifications lag by weeks or months, treatment effects may appear larger than they really are. These biases are particularly concerning in Asian registries, where mortality linkage is not always complete and varies across regions.29

Moreover, while the WR is promoted as patient-centred, the choice of hierarchy is often driven by investigator preference or statistical considerations rather than patient values. For example, elderly patients with HF may prioritise symptom relief over marginal survival gains, while younger patients may value longevity above all, and yet most analyses impose a single, fixed hierarchy.30 Furthermore, the method overlooks the timing and severity within hierarchical tiers: a patient dying after 5 years is counted identically to a patient dying after 5 days, and a brief overnight observation is equated with a prolonged admission to intensive care.

Dependence on Data Completeness and Follow-up

The WR assumes that outcome ascertainment is accurate, timely and consistent across all participants. In randomised trials with protocolised follow-up, this assumption often holds. In observational RWD, however, it frequently fails.

Variable observation windows pose a significant challenge. Patients with shorter follow-up or those who transfer care to non-participating sites may be incorrectly classified as ‘ties’ when events occurred but were not captured, pulling the WR toward unity and attenuating apparent treatment differences.31 The bias worsens when follow-up duration correlates with exposure: for instance, if patients on newer therapies are monitored more intensively, they may accrue more documented events despite receiving better care.

Differential censoring introduces additional complexity. When death rates differ between exposure groups, subsequent non-fatal comparisons are no longer balanced: patients in the higher mortality group are systematically removed from the risk set, leaving survivors who may differ prognostically. Standard WR calculations do not account for this informative censoring, which can bias estimates in an unpredictable direction.

Missing hospitalisation records compound these issues. In fragmented healthcare systems across south-east Asia, patients may be hospitalised at multiple institutions without centralised data linkage. If one treatment group systematically uses hospital networks with better data capture, apparent differences in WRs may reflect surveillance bias rather than actual clinical benefit.

Importance of Transparency and Reporting

Given these vulnerabilities, transparent pre-specification and comprehensive reporting are crucial for the assessment of the validity of WR analyses. Unfortunately, many published applications fall short of expectations. Key reporting elements often omitted include: exact endpoint definitions and hierarchical ordering, including tie adjudication rules; handling of missing data and loss to follow-up, particularly whether this differed between groups; sensitivity analyses using alternative hierarchies or adjustments to competing risks; distribution of wins, losses and ties by endpoint category; and assessment of underlying statistical assumptions.30 Without these details, reproducibility is limited and external validity – especially when extending trial findings to real-world populations – remains uncertain.

When the Win Ratio Is Appropriate versus Misleading

The WR is most appropriate when: outcomes can be meaningfully and unambiguously ordered by clinical severity; event ascertainment is complete, accurate and equally reliable across exposure groups; follow-up duration is adequate and comparable between groups; the hierarchy reflects genuine patient preferences or clinical consensus; and multiple clinically meaningful outcomes occur at sufficient frequency to inform pairwise comparisons.

The WR may be misleading when: endpoint ordering is arbitrary or controversial; event rates are highly imbalanced, with one endpoint dominating comparisons; mortality differences are overwhelming, rendering non-fatal comparisons clinically irrelevant; data completeness varies systematically by exposure, region or patient characteristics; follow-up is highly incomplete or censoring is informative and differential between groups; or competing risks are substantial, requiring formal competing risk regression.24

In real-world Asian registries, characterised by variable coding practices, incomplete mortality linkage, heterogeneous follow-up and fragmented care systems, these limitations are not hypothetical. Therefore, the WR should be applied cautiously, with extensive sensitivity analyses, transparent reporting and explicit acknowledgement of data-quality constraints. When these conditions cannot be met, alternative approaches such as recurrent event models or competing-risk regression may provide more reliable estimates. The following section demonstrates how integrating causal inference methods, particularly propensity score techniques, can mitigate some of these challenges by approximating randomisation and enabling fairer pairwise comparisons in observational data.

Propensity Score Methods for Observational Win Ratio Analyses

Applying the WR to observational RWD requires careful adjustment for confounding and selection bias, since treatment allocation is not random. Propensity score (PS) methods, widely used in comparative effectiveness research, offer a principled framework to approximate randomisation by balancing observed covariates between exposure groups.11 When integrated with hierarchical outcomes, such as the WR, PS methods can enable fairer pairwise comparisons and enhance the causal interpretability of findings from real-world HF registries.

Conceptual Framework

The PS is defined as the conditional probability of receiving a treatment given observed baseline covariates:

PSi = P(Ti = 1 | Xi)

where Ti denotes treatment assignment and Xi the vector of confounders.

By balancing these covariates, comparisons between treated and untreated patients approximate those of a randomised trial, conditional on measured variables.
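As a rough sketch of how PS values might be estimated, the example below fits a logistic regression for P(T = 1 | X) by plain gradient ascent using only the Python standard library. In practice an established statistical package would normally be used; the function names, the two covariates (standardised age and diabetes) and the ten patients are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_propensity_model(X, T, learning_rate=0.5, n_iter=2000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood.

    Returns [intercept, coefficient_1, ..., coefficient_p].
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, t in zip(X, T):
            row = [1.0] + list(x)
            residual = t - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += residual * v
        beta = [b + learning_rate * g / n for b, g in zip(beta, grad)]
    return beta

def propensity_scores(X, beta):
    """PS_i = P(T_i = 1 | X_i) under the fitted model."""
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x))) for x in X]

# Hypothetical covariates: (standardised age, diabetes yes/no)
X = [(-1.2, 0), (-0.8, 0), (-0.3, 1), (0.1, 0), (0.4, 1),
     (0.9, 1), (1.3, 1), (-0.5, 0), (0.7, 0), (0.2, 1)]
T = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]  # 1 = treated

beta = fit_propensity_model(X, T)
ps = propensity_scores(X, beta)
mean_ps_treated = sum(p for p, t in zip(ps, T) if t == 1) / sum(T)
mean_ps_control = sum(p for p, t in zip(ps, T) if t == 0) / (len(T) - sum(T))
print(round(mean_ps_treated, 2), round(mean_ps_control, 2))
```

In this invented cohort younger patients are treated more often, so the fitted PS is higher on average among treated patients. The estimated scores only balance the covariates that enter the model, which is why covariate selection should rest on clinical rationale.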

Several approaches exist for incorporating PS into WR analyses:

  • Matching: each treated patient is matched with one or more control patients who have similar PS values. WR comparisons are then restricted to matched pairs, which reduces extrapolation but potentially excludes unmatched patients.
  • Stratification: patients are grouped into strata (often quintiles) based on PS distribution and WRs are computed within each stratum and pooled across strata.
  • Inverse probability of treatment weighting (IPTW): each individual is weighted by the inverse of their probability of receiving the treatment they actually received:

wi = Ti/PSi + (1 − Ti)/(1 − PSi)

These weights create a pseudo-population in which treatment assignment is independent of observed confounders.32
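A small numerical sketch may help to show what the pseudo-population means. The cohort below is invented: a single binary confounder (labelled diabetes purely for illustration) makes treatment more likely, and the PS is taken as known rather than estimated. Function names are assumptions of this example.

```python
def iptw_weights(treatment, ps):
    """Inverse probability of the treatment actually received:
    1 / PS for treated patients, 1 / (1 - PS) for controls."""
    return [t / e + (1 - t) / (1 - e) for t, e in zip(treatment, ps)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical cohort: x = 1 marks a confounder that makes treatment more likely
x         = [1, 1, 1, 1, 0, 0, 0, 0]
treatment = [1, 1, 1, 0, 1, 0, 0, 0]
ps        = [0.75 if xi == 1 else 0.25 for xi in x]  # true PS in this toy cohort

w = iptw_weights(treatment, ps)

def prevalence(arm, weights):
    """Weighted prevalence of the confounder within one treatment arm."""
    idx = [i for i, t in enumerate(treatment) if t == arm]
    return weighted_mean([x[i] for i in idx], [weights[i] for i in idx])

unit = [1.0] * len(x)
raw_treated, raw_control = prevalence(1, unit), prevalence(0, unit)
adj_treated, adj_control = prevalence(1, w), prevalence(0, w)
print(raw_treated, raw_control, round(adj_treated, 3), round(adj_control, 3))
```

Before weighting, the confounder is present in 75% of treated and 25% of control patients; after weighting, its prevalence is 50% in both arms, which is what independence of treatment and the observed confounder looks like in this simple case.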

Among these, IPTW offers the greatest flexibility for RWD applications, as it preserves sample size and facilitates variance estimation using robust sandwich methods.

A concise comparison of these approaches is presented in Table 2.

Table 2: Comparison of Propensity Score Methods


Diagnostics and Sensitivity Considerations

Regardless of the approach, balance diagnostics are essential to verify that covariate distributions are similar between treatment groups after PS adjustment. The standardised mean difference (SMD) is commonly used, and an absolute SMD <0.1 is conventionally taken to indicate adequate balance. Visualisation via Love plots, which display pre- and post-adjustment SMDs, provides an intuitive assessment of balance quality.33
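The SMD check can be sketched as follows. The pooled-variance denominator and the weighted variance formula used here are one of several conventions in use and should be read as assumptions of this example, as should the function names, the toy covariate and the weights.

```python
import math

def weighted_mean_var(values, weights):
    """Weighted mean and weighted variance (reduces to the sample variance for unit weights)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    sum_sq = sum(w * w for w in weights)
    var = total / (total ** 2 - sum_sq) * sum(
        w * (v - mean) ** 2 for v, w in zip(values, weights))
    return mean, var

def smd(values, treatment, weights=None):
    """Standardised mean difference between treated (1) and control (0) patients."""
    if weights is None:
        weights = [1.0] * len(values)
    stats = {}
    for arm in (1, 0):
        idx = [i for i, t in enumerate(treatment) if t == arm]
        stats[arm] = weighted_mean_var([values[i] for i in idx],
                                       [weights[i] for i in idx])
    (m1, v1), (m0, v0) = stats[1], stats[0]
    return (m1 - m0) / math.sqrt((v1 + v0) / 2)

# Hypothetical binary covariate, treatment indicator and IPTW weights
x         = [1, 1, 1, 1, 0, 0, 0, 0]
treatment = [1, 1, 1, 0, 1, 0, 0, 0]
weights   = [4 / 3, 4 / 3, 4 / 3, 4.0, 4.0, 4 / 3, 4 / 3, 4 / 3]

smd_before = smd(x, treatment)
smd_after = smd(x, treatment, weights)
print(round(smd_before, 3), round(abs(smd_after), 3))
```

Here the unweighted SMD is 1.0, far above the 0.1 threshold, and the weighted SMD is essentially zero. Computing both values for every covariate provides the numbers that a Love plot displays.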

Extreme weights may arise when PS values approach 0 or 1, reflecting a lack of overlap between groups. Weight truncation (for example, capping weights at the 1st and 99th percentiles) or stabilisation can mitigate variance inflation and prevent undue influence from outlier patients.34 Sensitivity analyses that exclude poorly overlapping regions or apply doubly robust estimation, which combines weighting and outcome modelling, can further enhance robustness.
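Truncation and stabilisation are simple to express in code. The sketch below uses the standard-library `statistics.quantiles` function to find the percentile cut-offs; the helper names, the default percentiles and the example weights are assumptions, and the stabilisation shown (multiplying by the marginal probability of the treatment received) is one common form.

```python
import statistics

def stabilise(weights, treatment):
    """Multiply each weight by the marginal probability of the treatment received."""
    p_treated = sum(treatment) / len(treatment)
    return [w * (p_treated if t == 1 else 1.0 - p_treated)
            for w, t in zip(weights, treatment)]

def truncate(weights, lower_pct=1, upper_pct=99):
    """Cap weights at the chosen lower and upper percentiles."""
    cuts = statistics.quantiles(weights, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(w, low), high) for w in weights]

# Hypothetical weights: 100 unremarkable values and one extreme outlier
weights = [1.0 + 0.01 * i for i in range(100)] + [40.0]
truncated = truncate(weights)
print(max(weights), round(max(truncated), 2), round(min(truncated), 2))

stabilised = stabilise([2.0, 2.0], [1, 0])
print(stabilised)
```

The single weight of 40 is pulled down to the 99th percentile (about 1.99), so one patient can no longer dominate the weighted comparisons. Truncation trades a small amount of bias for lower variance, which is why the chosen thresholds should be reported.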

Integration with the Win Ratio

When combining IPTW with the WR, each pairwise comparison can be weighted by the product of the two individuals’ inverse probabilities, ensuring that pairs representing underrepresented covariate profiles contribute appropriately.35 This weighted WR maintains the intuitive interpretation of wins and losses while aligning the pseudo-population toward causal estimands, such as the average treatment effect. Statistical inference can be performed via bootstrap resampling or asymptotic variance estimates.
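A minimal sketch of the pair-weighting idea is shown below, assuming each patient already carries an IPTW weight. The function names, the data layout and the toy comparison rule are hypothetical; in particular, the single-tier rule ignores censoring and stands in for a full hierarchical comparison.

```python
def weighted_win_ratio(treated, controls, compare):
    """Weighted win ratio over all treated-control pairs.

    treated, controls: lists of (outcome, weight) tuples
    compare(outcome_t, outcome_c): returns >0 if the treated patient wins,
        <0 if the treated patient loses, 0 for a tie
    """
    wins = losses = ties = 0.0
    for outcome_t, weight_t in treated:
        for outcome_c, weight_c in controls:
            # Each pair contributes the product of the two patients' weights
            pair_weight = weight_t * weight_c
            result = compare(outcome_t, outcome_c)
            if result > 0:
                wins += pair_weight
            elif result < 0:
                losses += pair_weight
            else:
                ties += pair_weight
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no weighted losses")
    return wins / losses, wins, losses, ties


def longer_survival_wins(days_t, days_c):
    # Toy single-tier rule that ignores censoring (illustration only)
    return (days_t > days_c) - (days_t < days_c)


# Toy data: (days alive, IPTW weight)
toy_treated = [(10, 2.0), (5, 1.0)]
toy_controls = [(7, 1.0), (5, 3.0)]
wr, w_wins, w_losses, w_ties = weighted_win_ratio(
    toy_treated, toy_controls, longer_survival_wins
)
```

Setting every weight to 1.0 recovers the unweighted WR, which makes it straightforward to compare weighted and unweighted results side by side.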

To illustrate, consider an Asian HF registry evaluating SGLT2i use versus standard therapy. After estimating PS using demographic, comorbidity and laboratory covariates, IPTW is applied to balance treatment groups. A hierarchical outcome (death > HF hospitalisation > emergency visits) is then analysed using a weighted WR framework. Comparing weighted and unweighted results can reveal whether apparent benefits persist after adjusting for confounding factors, highlighting the incremental value of PS integration.

Challenges and Practical Recommendations

While PS-adjusted WR analyses enhance causal validity, several caveats remain. First, unmeasured confounding persists, particularly for socioeconomic or behavioural factors not captured in registries. Second, time-varying treatment exposure, such as therapy initiation, discontinuation, or switching, violates the static treatment assumption of conventional PS methods, requiring extensions such as marginal structural models or dynamic PS updating.36 Third, missing data on covariates used for PS estimation can bias weights; multiple imputation before PS modelling is recommended when missingness is moderate and random.

Practical steps for implementation include: pre-specifying covariates based on clinical rationale, not statistical significance; reporting covariate balance before and after adjustment; conducting sensitivity analyses with alternative PS models; and transparently documenting how weights were derived and truncated. These steps align with emerging reporting standards for causal analyses in RWD studies.37

In summary, integrating propensity score methods within WR analyses provides a robust pathway for translating hierarchical outcome frameworks into real-world causal inference. When applied transparently and accompanied by rigorous balance assessment, these methods can extend the interpretive clarity of clinical trials to diverse, unselected HF populations across Asia.

Practical Framework for Real-World Application

The integration of PS methods with WR analyses offers a principled approach for generating causal evidence from observational HF data. Drawing on recent applications in oncology, we propose a conceptual workflow tailored to Asian HF registries.38

Define the Causal Question and Treatment Strategies

Any causal analysis begins with a clearly articulated research question that specifies: who (the target population), what (specific treatment strategies), when (index time and follow-up), and compared to what (the reference strategy).12 For example: Among patients with HFrEF hospitalised in Asian registries between 2020 and 2023, does initiation of SGLT2i therapy within 30 days of discharge reduce death and recurrent hospitalisations compared with standard GDMT alone?

This specificity is essential for defining eligibility criteria, excluding prevalent users to avoid immortal time bias, and establishing a coherent time zero for outcomes to be ascertained.39 The causal contrast should align with clinically actionable decisions rather than comparing heterogeneous treatment patterns.

Specify the Endpoint Hierarchy

Hierarchical outcomes must reflect clinical priorities and be justified a priori. A typical hierarchy for HF registries might look like this: CVD > HF hospitalisation > emergency department visit > functional status change (if measured). This ordering respects that survival takes precedence over morbidity.

However, alternative hierarchies should be explored in sensitivity analyses. For instance, CVD might be prioritised over non-CVD in populations with high competing mortality from infections or malignancy, which is common in Asian HF cohorts.29 Investigators should explicitly justify their primary hierarchy and report how the results change under alternative scenarios.35

Apply Propensity Score Balancing and Assess Overlap

The PS is estimated using baseline covariates that influence both treatment selection and outcomes, selected based on clinical knowledge rather than automated procedures.11 Typical confounders include age, sex, ejection fraction, comorbidities, prior HF hospitalisations and baseline medications.

Balance diagnostics are critical. Standardised mean differences (SMD<0.1) should be verified before and after adjustment, with Love plots visualising balance improvement.39 Positivity violations – regions where treated or control patients have no counterparts – should be identified through PS distribution plots. Extreme weights should be truncated to prevent undue influence from outliers.34
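One simple way to flag possible positivity violations is to compute the region of common support, meaning the range of PS values observed in both groups. The sketch below is an assumption-laden illustration: the function name and the min/max definition of common support are choices made here, and density plots remain the more informative diagnostic.

```python
def common_support(ps, treated):
    """Return (lower, upper, outside) for the overlapping PS range.

    lower/upper: bounds of the PS range observed in both groups
    outside: indices of patients whose PS falls outside that range
    """
    ps_t = [p for p, t in zip(ps, treated) if t == 1]
    ps_c = [p for p, t in zip(ps, treated) if t == 0]
    if not ps_t or not ps_c:
        raise ValueError("both treatment groups must contain patients")
    lower = max(min(ps_t), min(ps_c))
    upper = min(max(ps_t), max(ps_c))
    outside = [i for i, p in enumerate(ps) if p < lower or p > upper]
    return lower, upper, outside


# Toy propensity scores: first three patients treated, last three untreated
toy_ps = [0.9, 0.6, 0.4, 0.5, 0.3, 0.05]
toy_group = [1, 1, 1, 0, 0, 0]
support_lower, support_upper, no_counterpart = common_support(toy_ps, toy_group)
```

In this toy example the groups overlap only between 0.4 and 0.5, so four of the six patients have no counterpart in the other group, a pattern that would call for restricting the analysis or reconsidering the comparison.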

Compute Win Ratio and Compare with Traditional Endpoints

With balanced groups, the WR is computed by comparing all patient pairs according to the predefined hierarchy. Each treated patient is compared with each control patient (with weights applied in IPTW settings), yielding win, loss or tie classifications.35
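The pairwise procedure can be sketched as follows for an unweighted two-tier hierarchy. The dictionary layout, field names and the rule for handling unequal follow-up (comparing only within the follow-up shared by both patients) are assumptions made for illustration rather than details taken from any specific registry analysis.

```python
def compare_tier(event_a, event_b, follow_a, follow_b):
    """Return +1 if patient A wins on this tier, -1 if A loses, 0 if undecided.

    Event times are days from time zero, or None if no event was observed.
    Only events within the follow-up shared by both patients are counted.
    """
    horizon = min(follow_a, follow_b)
    a_time = event_a if event_a is not None and event_a <= horizon else None
    b_time = event_b if event_b is not None and event_b <= horizon else None
    if a_time is None and b_time is None:
        return 0
    if a_time is None:
        return 1   # only B had the event
    if b_time is None:
        return -1  # only A had the event
    return (a_time > b_time) - (a_time < b_time)  # later event wins


def win_ratio_by_tier(treated, controls, hierarchy):
    """Compare every treated-control pair, stopping at the first deciding tier."""
    counts = {tier: {"wins": 0, "losses": 0} for tier in hierarchy}
    ties = 0
    for t in treated:
        for c in controls:
            for tier in hierarchy:
                result = compare_tier(t[tier], c[tier], t["follow_up"], c["follow_up"])
                if result > 0:
                    counts[tier]["wins"] += 1
                    break
                if result < 0:
                    counts[tier]["losses"] += 1
                    break
            else:
                ties += 1  # no tier separated the pair
    wins = sum(v["wins"] for v in counts.values())
    losses = sum(v["losses"] for v in counts.values())
    if losses == 0:
        raise ValueError("win ratio is undefined when there are no losses")
    return wins / losses, counts, ties


# Toy cohort: times in days; None means the event was not observed
toy_hierarchy = ("death", "hf_hosp")
toy_treated = [
    {"follow_up": 365, "death": None, "hf_hosp": 200},
    {"follow_up": 365, "death": None, "hf_hosp": None},
    {"follow_up": 30, "death": 30, "hf_hosp": None},
]
toy_controls = [
    {"follow_up": 100, "death": 100, "hf_hosp": 50},
    {"follow_up": 365, "death": None, "hf_hosp": 100},
    {"follow_up": 365, "death": None, "hf_hosp": None},
]
toy_wr, tier_counts, tie_count = win_ratio_by_tier(toy_treated, toy_controls, toy_hierarchy)
```

Returning the wins and losses per tier, rather than only the final ratio, makes it visible which level of the hierarchy drives the result.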

Results should be compared with traditional regression analyses. Discrepancies between methods can be informative. For example, if the WR indicates a benefit driven primarily by reduced hospitalisations while Cox models show no survival advantage, the treatment effect is concentrated in non-fatal outcomes, which is a clinically significant distinction.7,8

Conduct Sensitivity Analyses and Assess Feasibility

Robust inference requires sensitivity analyses addressing: alternative endpoint hierarchies; different propensity score specifications; weight truncation strategies; subgroup analyses by region, age, or phenotype; and restriction to well-defined follow-up windows.24,35 Further, we recommend calculating the E-value to quantify the potential impact of unmeasured confounding; this metric indicates the minimum strength of association an unobserved confounder would need to explain away the observed treatment effect. Documenting how results vary provides transparency about robustness.
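For reference, the E-value for a risk ratio RR greater than 1 is RR + sqrt(RR × (RR − 1)), with protective estimates inverted first. The sketch below implements that formula; the function name is hypothetical, and treating a WR like a risk ratio for this purpose is an assumption that would need to be justified in any given analysis.

```python
import math


def e_value(ratio):
    """E-value for a ratio-scale effect estimate (risk-ratio formula).

    Estimates below 1 are inverted first, so protective and harmful
    effects of the same magnitude give the same E-value.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio < 1:
        ratio = 1.0 / ratio
    return ratio + math.sqrt(ratio * (ratio - 1.0))


e_harmful = e_value(2.0)     # about 3.41
e_protective = e_value(0.5)  # same value after inversion
e_null = e_value(1.0)        # 1.0: no confounding needed to explain a null effect
```

An E-value of about 3.41 means that an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of at least that size, beyond the measured covariates, to fully explain away the estimate.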

This framework is feasible for established registries, such as ASIAN-HF, JROADHF, KorAHF and China-HF, as well as emerging cohorts, such as the THFR.14–17 Key requirements include complete baseline data, reliable outcome ascertainment (particularly mortality linkage), and adequate sample size. Regional heterogeneity can be addressed through stratified analyses or region-specific propensity scores.

Figure 2 illustrates this conceptual workflow, from defining the causal question to applying PS balancing and interpreting the WR results.

Figure 2: Conceptual Workflow Diagram


Application in Asian Registries: Recent Case Studies

To demonstrate the feasibility of the proposed framework, we examine two recent studies using the WR in Asian HF populations.

Case Study 1: The SAVIOR-L Subanalysis

Hidaka et al. applied the WR to evaluate adaptive servo-ventilation (ASV) in patients with worsening HF.40

  • Causal method: they used propensity score matching (1:3) to balance baseline characteristics between ASV and non-ASV users, ensuring comparability in a heterogeneous elderly cohort.
  • Hierarchy: the analysis prioritised CVD followed by cardiovascular hospitalisation.
  • Result: while standard composites showed disadvantageous outcomes for ASV, the matched WR (0.838; 95% CI [0.657–1.069]) clarified a neutral effect rather than definitive harm, illustrating how the method provides nuanced safety signals using complex RWD.

Case Study 2: Angiotensin Receptor-neprilysin Inhibitor in Dialysis Patients

Yang et al. used the WR to compare angiotensin receptor-neprilysin inhibitor (ARNI) with angiotensin-converting enzyme inhibitors (ACEI)/angiotensin receptor blockers (ARB) in HFrEF patients with end-stage renal disease – a population typically excluded from RCTs.41

  • Hierarchy innovation: they integrated safety outcomes into the hierarchy: mortality > HF hospitalisation > safety events (hypotension, hyperkalaemia).
  • Causal method: although the primary analysis was unmatched, PS matching was rigorously applied in sensitivity analyses to validate robustness against confounding.
  • Result: the WR (0.73; 95% CI [0.38–1.44]) confirmed comparable efficacy but highlighted a trend toward a better safety profile for ARNI.

These examples validate that integrating PS methods with hierarchical endpoints is not only feasible but essential for interpreting complex real-world Asian datasets.

Methodological Challenges and Limitations

Despite its promise, applying the WR to real-world Asian HF data involves several methodological challenges.

Unmeasured confounding remains the foremost limitation. PS methods balance only observed covariates; however, factors such as patient preferences, physician behaviour, socioeconomic barriers and unmeasured clinical severity cannot be directly adjusted for.11 E-value calculations can quantify robustness to unmeasured confounding.39

Missing data and incomplete follow-up are pervasive. Missing baseline covariates bias treatment estimates if missingness relates to both treatment and outcomes; multiple imputation is recommended when the missing-at-random assumption is plausible.42 Differential loss to follow-up or incomplete outcome ascertainment, common in fragmented healthcare systems, may distort WR estimates.

Generalisability across Asian countries is limited by heterogeneity in healthcare infrastructure, coding practices and population characteristics. Findings from well-resourced settings may not translate to lower-resource environments. Pooled analyses require careful consideration of effect modification.13

Model-dependent interpretations introduce additional complexity. The specification of the PS model affects the balance quality and WR estimates. Different analysts may reach different conclusions depending on modelling choices, underscoring the importance of transparent reporting and pre-specification of the analytical models and endpoint hierarchies.37

Finally, time-varying confounding and treatment switching, which are common in chronic HF management, violate static treatment assumptions. Patients may initiate, discontinue or switch therapies over time, creating dynamic feedback. Addressing these complexities requires advanced methods such as marginal structural models or target trial emulation frameworks.36

Future Directions

The integration of WR methodology with causal inference frameworks represents an evolving frontier in cardiovascular research.

Target trial emulation offers a rigorous approach by designing analyses to emulate randomised trials.12 By specifying eligibility, treatment strategies, follow-up, outcomes and analytical plans as if conducting an RCT, this framework minimises biases and ensures alignment between causal questions and statistical estimands. Combining target trial emulation with WR endpoints could enhance the credibility of real-world evidence.

Digital health and wearable data offer opportunities to enrich hierarchical outcomes. Continuous monitoring could enable more granular endpoints, such as sustained functional deterioration, beyond traditional binary outcomes.43

To truly realise the patient-centred promise of the WR, future studies should consider incorporating patient preferences directly into the hierarchy definition. Methods such as discrete choice experiments can quantify how patients make trade-offs between longevity and quality of life, thereby validating the hierarchical order used in the analysis.

Harmonised reporting standards are needed. Consensus guidelines should specify minimum reporting elements for PS methods, WR calculations, sensitivity analyses and data quality assessments, thereby enhancing reproducibility and facilitating cross-registry comparisons.37 To assist investigators in this process, we have summarised key recommendations in Box 1.

Box 1: Proposed Best Practice Checklist for Win Ratio Analysis in Real-world Heart Failure Studies

  • Pre-specify the hierarchy: define the outcome order based on clinical priority (for example: death > hospitalisation > quality of life) rather than statistical convenience.
  • Validate data completeness: ensure robust mortality linkage is available as differential loss to follow-up can critically bias win/loss counts.
  • Adjust for confounding: integrate propensity score methods (such as inverse probability of treatment weighting or matching) to balance baseline characteristics and approximate randomisation.
  • Assess overlap: verify covariate balance using Love plots and truncate extreme weights to prevent instability.
  • Report transparently: publish the absolute counts of wins, losses and ties for each specific outcome tier, not just the final ratio.
  • Conduct sensitivity analyses: test whether results remain consistent under alternative hierarchies, for example, prioritising total HF events, or different statistical assumptions.

Conclusion

The WR provides a hierarchical approach that aligns with clinical decision-making. This methodological review and practical framework show that, when embedded within a rigorous causal inference framework that uses PS methods, transparent reporting and sensitivity analyses, the WR can bridge the gap between randomised trials and real-world care. For Asian HF registries, this integration enables the generation of actionable evidence from diverse populations. As methods evolve and data quality improves, the WR may emerge as a cornerstone of real-world comparative effectiveness research in HF across the Asia-Pacific region.

Clinical Perspective

  • The win ratio (WR) provides a hierarchical, patient-centred alternative to conventional composite endpoints by prioritising mortality over non-fatal outcomes in heart failure analyses.
  • Integration of propensity score methodology with WR analysis strengthens causal inference when applied to observational Asian heart failure registries characterised by confounding, missing data and heterogeneous follow-up.
  • A structured best practice framework, including pre-specified outcome hierarchies, mortality linkage, covariate balance assessment and sensitivity analyses, is essential to ensure valid and transparent generation of WR-based real-world evidence in heart failure.

References

  1. Feng J, Zhang Y, Zhang J. Epidemiology and burden of heart failure in Asia. JACC Asia 2024;4:249–64. 
  2. Kennedy-Martin T, Curtis S, Faries D, et al. A literature review on the representativeness of randomized controlled trial samples and implications for the external validity of trial results. Trials 2015;16:495. 
  3. Rothwell PM. External validity of randomised controlled trials: ‘to whom do the results of this trial apply?’ Lancet 2005;365:82–93. 
  4. Blonde L, Khunti K, Harris SB, et al. Interpretation and impact of real-world clinical data for the practicing clinician. Adv Ther 2018;35:1763–74. 
  5. Rogers JK, Pocock SJ, McMurray JJV, et al. Analysing recurrent hospitalizations in heart failure: a review of statistical methodology, with application to CHARM-Preserved. Eur J Heart Fail 2014;16:33–40. 
  6. Pocock SJ, Ariti CA, Collier TJ, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J 2012;33:176–82. 
  7. Anker SD, Butler J, Filippatos G, et al. Empagliflozin in heart failure with a preserved ejection fraction. N Engl J Med 2021;385:1451–61. 
  8. Solomon SD, McMurray JJV, Claggett B, et al. Dapagliflozin in heart failure with mildly reduced or preserved ejection fraction. N Engl J Med 2022;387:1089–98. 
  9. Ajufo E, Nayak A, Mehra MR. Fallacies of using the win ratio in cardiovascular trials: challenges and solutions. JACC Basic Transl Sci 2023;8:720–7. 
  10. Butler J, Stockbridge N, Packer M. Win ratio: a seductive but potentially misleading method for evaluating evidence from clinical trials. Circulation 2024;149:1546–8. 
  11. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res 2011;46:399–424. 
  12. Hernán MA, Sauer BC, Hernández-Díaz S, et al. Specifying a target trial prevents immortal time bias and other self-inflicted injuries in observational analyses. J Clin Epidemiol 2016;79:70–5. 
  13. Tromp J, Tay WT, Ouwerkerk W, et al. Multimorbidity in patients with heart failure from 11 Asian regions: a prospective cohort study using the ASIAN-HF registry. PLOS Med 2018;15:e1002541. 
  14. Lam CSP, Teng TK, Tay WT, et al. Regional and ethnic differences among patients with heart failure in Asia: the Asian sudden cardiac death in heart failure registry. Eur Heart J 2016;37:3141–53. 
  15. Ide T, Kaku H, Matsushima S, et al. Clinical characteristics and outcomes of hospitalized patients with heart failure from the large-scale Japanese Registry of Acute Decompensated Heart Failure (JROADHF). Circ J 2021;85:1438–50. 
  16. Lee SE, Lee H-Y, Cho H-J, et al. Clinical characteristics and outcome of acute heart failure in Korea: results from the Korean Acute Heart Failure Registry (KorAHF). Korean Circ J 2017;47:341–53. 
  17. Wang H, Li Y, Chai K, et al. Mortality in patients admitted to hospital with heart failure in China: a nationwide cardiovascular Association Database-Heart Failure Centre Registry cohort study. Lancet Glob Health 2024;12:e611–22. 
  18. Laothavorn P, Hengrussamee K, Kanjanavanit R, et al. Thai acute decompensated heart failure registry (Thai ADHERE). CVD Prev Control 2010;5:89–95. 
  19. Savarese G, Vasko P, Jonsson Å, et al. The Swedish Heart Failure Registry: a living, ongoing quality assurance and research in heart failure. Ups J Med Sci 2019;124:65–9. 
  20. Peterson PN, Rumsfeld JS, Liang L, et al. A validated risk score for in-hospital mortality in patients with heart failure from the American Heart Association get with the guidelines program. Circ Cardiovasc Qual Outcomes 2010;3:25–32. 
  21. Freemantle N, Calvert M, Wood J, et al. Composite outcomes in randomized trials: greater precision but with greater uncertainty? JAMA 2003;289:2554–9. 
  22. McMurray JJV, Packer M, Desai AS, et al. Angiotensin–neprilysin inhibition versus enalapril in heart failure. N Engl J Med 2014;371:993–1004. 
  23. McMurray JJV, Solomon SD, Inzucchi SE, et al. Dapagliflozin in patients with heart failure and reduced ejection fraction. N Engl J Med 2019;381:1995–2008. 
  24. Austin PC, Fine JP. Accounting for competing risks in randomized controlled trials: a review and recommendations for improvement. Stat Med 2017;36:1203–9. 
  25. Solomon SD, McMurray JJV, Anand IS, et al. Angiotensin–neprilysin inhibition in heart failure with preserved ejection fraction. N Engl J Med 2019;381:1609–20. 
  26. Mentz RJ, Cotter G, Cleland JGF, et al. International differences in clinical characteristics, management, and outcomes in acute heart failure patients: better short-term outcomes in patients enrolled in Eastern Europe and Russia in the PROTECT trial. Eur J Heart Fail 2014;16:614–24. 
  27. Voors AA, Angermann CE, Teerlink JR, et al. The SGLT2 inhibitor empagliflozin in patients hospitalized for acute heart failure: a multinational randomized trial. Nat Med 2022;28:568–74. 
  28. Redfors B, Gregson J, Crowley A, et al. The win ratio approach for composite endpoints: practical guidance based on previous experience. Eur Heart J 2020;41:4391–9. 
  29. Yingchoncharoen T, Phrommintikul A, Buakhamsri A, et al. Nationwide analysis of mortality and causes of death in heart failure: updated insights from the Thai Heart Failure Registry (THFR). Eur Heart J 2024;45(Suppl 1):ehae666. 
  30. Pocock SJ, Gregson J, Collier TJ, et al. The win ratio in cardiology trials: lessons learnt, new developments, and wise future use. Eur Heart J 2024;45:4684–99. 
  31. Dong G, Huang B, Chang YW, et al. The win ratio: impact of censoring and follow-up time and use with nonproportional hazards. Pharm Stat 2020;19:168–77. 
  32. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med 2015;34:3661–79. 
  33. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med 2009;28:3083–107. 
  34. Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol 2008;168:656–64. 
  35. Dong G, Mao L, Huang B, et al. The inverse-probability-of-censoring weighting (IPCW) adjusted win ratio statistic: an unbiased estimator in the presence of independent censoring. J Biopharm Stat 2020;30:882–99. 
  36. Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000;11:561–70. 
  37. Garrido MM, Kelley AS, Paris J, et al. Methods for constructing and assessing propensity scores. Health Serv Res 2014;49:1701–20. 
  38. Chiaruttini MV, Lorenzoni G, Spolverato G, Gregori D. Win statistics in observational cancer research: integrating clinical and quality-of-life outcomes. J Clin Med 2024;13:3272. 
  39. VanderWeele TJ, Ding P. Sensitivity analysis in observational research: introducing the E-value. Ann Intern Med 2017;167:268–74. 
  40. Hidaka T, Tada T, Suzuki H, et al. Efficacy of adaptive servo-ventilation in worsening heart failure. Int Heart J 2025;66:585–92. 
  41. Yang I-N, Huang C-Y, Yang C-T, et al. Real-world experience of angiotensin receptor-neprilysin inhibitors in patients with heart failure and dialysis. Front Cardiovasc Med 2024;11:1393440. 
  42. Sterne JAC, White IR, Carlin JB, et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393. 
  43. Patel MS, Asch DA, Volpp KG. Wearable devices as facilitators, not drivers, of health behavior change. JAMA 2015;313:459–60. 