Negative sampling strategies impact the prediction of scale-free biomolecular network interactions with machine learning

Scritto il 09/05/2025
da Pengpai Li

BMC Biol. 2025 May 9;23(1):123. doi: 10.1186/s12915-025-02231-w.

ABSTRACT

BACKGROUND: Understanding protein-molecular interaction is crucial for unraveling the mechanisms underlying diverse biological processes. Machine learning (ML) techniques have been extensively employed in predicting these interactions and have garnered substantial research focus. Previous studies have predominantly centered on improving model performance through novel and efficient ML approaches, often resulting in overoptimistic predictive estimates. However, these advancements frequently neglect the inherent biases stemming from network properties, particularly in biological contexts.

RESULTS: In this study, we examined the biases inherent in ML models during the learning and prediction of protein-molecular interactions, particularly those arising from the scale-free property of biological networks-a characteristic where in a few nodes have many connections while most have very few. Our comprehensive analysis across diverse tasks, datasets, and ML methods provides compelling evidence of these biases. We discovered that the training and evaluation of ML models are profoundly influenced by network topology, potentially distorting model performance assessments. To mitigate this issue, we propose the degree distribution balanced (DDB) sampling strategy, a straightforward yet potent approach that alleviates biases stemming from network properties. This method further underscores the limitations of certain ML models in learning protein-molecular interactions solely from intrinsic molecular features.

CONCLUSIONS: Our findings present a novel perspective for assessing the performance of ML models in inferring protein-molecular interactions with greater fairness. By addressing biases introduced by network properties, the DDB sampling approach provides a more balanced and precise assessment of model capabilities. These insights hold the potential to bolster the reliability of ML models in bioinformatics, fostering a more stringent evaluation framework for predicting protein-molecular interactions.

PMID:40346567 | DOI:10.1186/s12915-025-02231-w