paper-trackr newsletter

Saturday, 26 April 2025

Hello science reader,

Welcome to your daily dose of research. Dive into the latest discoveries from the world of science!

This newsletter was built with paper-trackr and is tailored to my interests.
Customize your own filters and receive personalized updates.


Popcorn: prediction of short coding and noncoding genomic sequences in prokaryotes.

Alison Kyrouz, Lian Liu, Lixin Qin, Brian Tjaden

Source: PubMed Published: 2025-04-25

tl;dr: We present Popcorn, a novel machine learning method for determining whether prokaryotic sequences are coding or noncoding, including coding sORFs and noncoding RNAs.

Abstract: The most challenging prokaryotic genes to identify often correspond to short ORFs (sORFs) encoding small proteins or to noncoding RNAs. RNA-seq experiments commonly evince small transcripts that do not correspond to annotated genes and are candidates for novel coding sORFs or small regulatory RNAs, but it can be difficult to accurately assess whether the numerous small transcripts are coding or not. We present Popcorn (PrOkaryotic Prediction of Coding OR Noncoding), a novel machine learning method for determining whether prokaryotic sequences are coding or noncoding. We find that Popcorn is effective in distinguishing coding from noncoding sequences, including coding sORFs and noncoding RNAs.

Read full paper


Mapping the rapid growth of multi-omics in tumor immunotherapy: Bibliometric evidence of technology convergence and paradigm shifts.

Huijing Dong, Xinmeng Wang, Yumin Zheng, Jia Li, Zhening Liu, Aolin Wang, Yulei Shen, Daixi Wu, Huijuan Cui

Source: PubMed Published: 2025-04-24

tl;dr: A systematic analysis of omics-driven tumor immunotherapy research through a bibliometric lens.

Abstract: This study aims to fill the knowledge gap in systematically mapping the evolution of omics-driven tumor immunotherapy research through a bibliometric lens. While omics technologies (genomics, transcriptomics, proteomics, metabolomics) provide multidimensional molecular profiling, their synergistic potential with immunotherapy remains underexplored in large-scale trend analyses. A comprehensive search was conducted using the Web of Science Core Collection for literature related to omics in tumor immunotherapy, up to August 2024. Bibliometric analyses, conducted using R version 4.3.3, VOSviewer 1.6.20, and CiteSpace 6.2, examined publication trends, country and institutional contributions, journal distributions, keyword co-occurrence, and citation bursts. This analysis of 9,494 publications demonstrates rapid growth in omics-driven tumor immunotherapy research since 2019, with China leading in output (63% of articles) yet exhibiting limited multinational collaboration (7.9% vs. the UK's 61.8%). Keyword co-occurrence and citation burst analyses reveal evolving frontiers: early emphasis on "PD-1/CTLA-4 blockade" has transitioned toward "machine learning," "multi-omics," and "lncRNA," reflecting a shift to predictive modeling and biomarker discovery. Multi-omics integration has facilitated the development of immune infiltration-based prognostic models, such as TIME subtypes, which have been validated across multiple tumor types and inform clinical trial design (e.g., NCT06833723). Additionally, proteomic analysis of melanoma patients suggests that metabolic biomarkers, particularly oxidative phosphorylation and lipid metabolism, may stratify responders to PD-1 blockade therapy. Moreover, spatial omics has confirmed ENPP1 as a potential novel therapeutic target in Ewing sarcoma. Citation trends underscore clinical translation, particularly mutation-guided therapies.
Omics technologies are transforming tumor immunotherapy by enhancing biomarker discovery and improving therapeutic predictions. Future advancements will necessitate longitudinal omics monitoring, AI-driven multi-omics integration, and international collaboration to accelerate clinical translation. This study presents a systematic framework for exploring emerging research frontiers and offers insights for optimizing precision-driven immunotherapy.

Read full paper


A hybrid machine learning model with attention mechanism and multidimensional multivariate feature coding for essential gene prediction.

Wu Yan, Fu Yu, Li Tan, Li Mengshan, Xie Xiaojun, Zhou Weihong, Sheng Sheng, Wang Jun, Wu Fu-An

Source: PubMed Published: 2025-04-24

tl;dr: We propose a new approach for essential gene prediction based on machine learning, which can help to identify the most important genes in the genome.

Abstract: Essential genes are crucial for the development, inheritance, and survival of species. The exploration of these genes can unravel the complex mechanisms and fundamental life processes and identify potential therapeutic targets for various diseases. Therefore, the identification of essential genes is significant. Machine learning has become the mainstream approach for essential gene prediction. However, some key challenges in machine learning need to be addressed, such as the extraction of genetic features, the impact of imbalanced data, and the cross-species generalization ability.

Read full paper


Identification of pivotal genes and regulatory networks associated with SAH based on multi-omics analysis and machine learning.

Haoran Lu, Teng Xie, Xiaohong Qin, Shanshan Wei, Zilong Zhao, Xizhi Liu, Liquan Wu, Rui Ding, Zhibiao Chen

Source: PubMed Published: 2025-04-24

tl;dr: We identified five SAH-related feature genes and 1336 differentially expressed proteins in the immune microenvironment, which could serve as potential therapeutic targets and provide clues for exploring therapeutic options.

Abstract: Subarachnoid hemorrhage (SAH) is a disease with high mortality and morbidity, and its pathophysiology is complex but poorly understood. To investigate potential therapeutic targets post-SAH, SAH-related feature genes were screened by combined analysis of transcriptomics and metabolomics of rat cortical tissues following SAH and proteomics of cerebrospinal fluid from SAH patients, together with WGCNA and machine learning. The competitive endogenous RNA (ceRNA) and transcription factor (TF) regulatory networks of the feature genes were constructed and further validated by molecular biology experiments. A total of 1336 differentially expressed proteins were identified, including 729 downregulated and 607 upregulated proteins. The immune microenvironment changed after SAH, and the change persisted at 7 days post-SAH. Through multi-omics and bioinformatics techniques, five SAH-related feature genes (A2M, GFAP, GLIPR2, GPNMB, and LCN2) were identified, all closely related to the immune microenvironment. In addition, ceRNA and TF regulatory networks of the feature genes were constructed. The increased expression levels of A2M and GLIPR2 following SAH were verified, and co-localization of A2M with intravascular microthrombus was demonstrated. In summary, multi-omics and bioinformatics tools were used to predict SAH-associated feature genes, which were further supported by construction of the ceRNA and TF regulatory networks. These molecules might play a key role in SAH, may serve as potential biological markers, and provide clues for exploring therapeutic options.

Read full paper


Identification and verification of mitochondria-related genes biomarkers associated with immune infiltration for COPD using WGCNA and machine learning algorithms.

Meijuan Peng, Chen Jiang, Ziyu Dai, Bin Xie, Qiong Chen, Jianing Lin

Source: PubMed Published: 2025-04-24

tl;dr: We identified five mitochondrial-related genes associated with COPD and its immune microenvironment, and identified androstenol as a potential therapeutic candidate.

Abstract: Mitochondrial dysfunction plays a pivotal role in the pathogenesis of chronic obstructive pulmonary disease (COPD). This study combines bioinformatics analysis with machine learning to elucidate potential key mitochondrial-related genes associated with COPD and its immune microenvironment. We utilized the limma package and Weighted Gene Co-expression Network Analysis (WGCNA) to analyze datasets from the Gene Expression Omnibus (GEO) database (GSE57148), identifying 12 key differentially expressed mitochondrial genes (MitoDEGs). Using 12 distinct machine learning algorithms (comprising 143 predictive models), we identified the optimal diagnostic model, which includes five pivotal MitoDEGs: ERN1, FASTK, HIGD1B, NDUFA7 and NDUFB7. The diagnostic specificity and sensitivity of each gene, as well as the diagnostic model itself, were evaluated using Receiver operating characteristic (ROC) curves. This model demonstrated high specificity in the validation cohorts (GSE76925, GSE151052, GSE239897). Expression analysis revealed upregulation of ERN1 and downregulation of FASTK, HIGD1B, NDUFA7 and NDUFB7 in COPD patients. Spearman's correlation analysis indicated a significant association between MitoDEGs and immune cell infiltration, with ERN1 expression positively correlated with neutrophil infiltration and the other genes negatively correlated. The GABA receptor modulator androstenol was identified as a potential therapeutic candidate. In vivo studies confirmed reduced mRNA expression of HIGD1B and NDUFB7 in COPD mice. These findings elucidate mitochondrial-immune interactions in COPD and highlight novel diagnostic and therapeutic targets.

Read full paper


Proteomics uncovers ICAM2 (CD102) as a novel serum biomarker of proliferative lupus nephritis.

Zhengyong Li, Yifang Sun, Yixue Wang, Fengxun Liu, Shaokang Pan, Songwei Li, Zuishuang Guo, Dan Gao, Jinghua Yang, Zhangsuo Liu, Dongwei Liu

Source: PubMed Published: 2025-04-23

tl;dr: This study aimed to identify novel, non-invasive biomarkers for lupus nephritis (LN) through serum proteomics.

Abstract: This study aimed to identify novel, non-invasive biomarkers for lupus nephritis (LN) through serum proteomics.

Read full paper


Definer: A computational method for accurate identification of RNA pseudouridine sites based on deep learning.

Bo Han, Sudan Bai, Yang Liu, Jiezhang Wu, Xin Feng, Ruihao Xin

Source: PubMed Published: 2025-04-24

tl;dr: We propose a deep learning-based computational method, Definer, to accurately identify RNA pseudouridine modification sites from high-throughput RNA sequence data.

Abstract: Pseudouridine is an important modification site, which is widely present in a variety of non-coding RNAs and is involved in a variety of important biological processes. Studies have shown that pseudouridine is important in many biological functions such as gene expression, RNA structural stability, and various diseases. Therefore, accurate identification of pseudouridine sites can effectively explain the functional mechanism of this modification site. Due to the rapid increase of genomics data, traditional biological experimental methods to identify RNA modification sites can no longer meet the practical needs, and it is necessary to accurately identify pseudouridine sites from high-throughput RNA sequence data by computational methods. In this study, we propose a deep learning-based computational method, Definer, to accurately identify RNA pseudouridine loci in three species, Homo sapiens, Saccharomyces cerevisiae and Mus musculus. The method incorporates two sequence coding schemes, including NCP and One-hot, and then feeds the extracted RNA sequence features into a deep learning model constructed from CNN, GRU and Attention. The benchmark dataset contains data from three species, H. sapiens, S. cerevisiae and M. musculus, and the results using 10-fold cross-validation show that Definer significantly outperforms other existing methods. Meanwhile, the data sets of two species, H. sapiens and S. cerevisiae, were tested independently to further demonstrate the predictive ability of the model. In summary, our method, Definer, can accurately identify pseudouridine modification sites in RNA.
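
The two sequence encodings named in the abstract can be illustrated with a minimal sketch. The NCP table below is the mapping commonly used in the RNA-modification literature (ring structure, functional group, hydrogen bonding); whether Definer uses exactly these values, and whether it concatenates the two encodings as done here, is an assumption. The function name `encode_rna` is hypothetical.

```python
# Illustrative only: one-hot and NCP (nucleotide chemical property) encodings
# for an RNA sequence. Concatenating them per position is an assumption, not
# Definer's documented pipeline.

ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "U": (0, 0, 0, 1),
}

# (ring structure, functional group, hydrogen bonding):
# purine = 1, amino group = 1, weak hydrogen bonding = 1.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def encode_rna(seq):
    """Return one 7-feature row (4 one-hot + 3 NCP) per nucleotide."""
    rows = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        rows.append(ONE_HOT[base] + NCP[base])
    return rows


features = encode_rna("GAUC")
```

Each row could then be stacked into a matrix and passed to a sequence model such as the CNN, GRU and attention layers the abstract mentions.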

Read full paper


Benchmarking foundation cell models for post-perturbation RNA-seq prediction.

Gerold Csendes, Gema Sanz, Kristóf Z Szalay, Bence Szalai

Source: PubMed Published: 2025-04-23

tl;dr: We found that even the simplest baseline model (taking the mean of training examples) outperformed scGPT and scFoundation, and that basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin.

Abstract: Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge. In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model (taking the mean of training examples) outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models. Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
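
To make the "mean of training examples" baseline concrete, here is a small sketch. It predicts the same gene-wise mean profile for every held-out perturbation. The toy numbers are made up, and scoring with Pearson correlation is an assumption for illustration; the abstract does not state which metric the authors used.

```python
# Illustrative mean baseline: predict every held-out perturbation's expression
# profile as the gene-wise mean of the training profiles.
from statistics import mean


def mean_baseline(train_profiles):
    """Gene-wise mean over a list of equal-length expression profiles."""
    return [mean(gene_values) for gene_values in zip(*train_profiles)]


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Toy data (made up): 3 training perturbations x 4 genes.
train = [
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 2.0, 1.0, 4.0],
    [2.0, 2.0, 2.0, 4.0],
]
held_out = [2.5, 2.0, 1.5, 4.5]

prediction = mean_baseline(train)  # identical for every test perturbation
score = pearson(prediction, held_out)
```

Because the prediction ignores which perturbation was applied, a high score for this baseline suggests that most of the variance in the data is not perturbation-specific, which matches the abstract's remark about the benchmark datasets.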

Read full paper


The Signed Two-Space Proximity Model for Learning Representations in Protein-Protein Interaction Networks.

Nikolaos Nakis, Chrysoula Kosma, Anastasia Brativnyk, Michail Chatzianastasis, Iakovos Evdaimon, Michalis Vazirgiannis

Source: PubMed Published: 2025-04-23

tl;dr: We present the Signed Two-Space Proximity Model (S2-SPM) for signed PPI networks, which explicitly incorporates both positive and negative interactions, reflecting the complex regulatory mechanisms within biological systems.

Abstract: Accurately predicting complex protein-protein interactions (PPIs) is crucial for decoding biological processes, from cellular functioning to disease mechanisms. However, experimental methods for determining PPIs are computationally expensive. Thus, attention has been recently drawn to machine learning approaches. Furthermore, insufficient effort has been made toward analyzing signed PPI networks, which capture both activating (positive) and inhibitory (negative) interactions. To accurately represent biological relationships, we present the Signed Two-Space Proximity Model (S2-SPM) for signed PPI networks, which explicitly incorporates both types of interactions, reflecting the complex regulatory mechanisms within biological systems. This is achieved by leveraging two independent latent spaces to differentiate between positive and negative interactions while representing protein similarity through proximity in these spaces. Our approach also enables the identification of archetypes representing extreme protein profiles.

Read full paper


Azurify integrates cancer genomics with machine learning to classify the clinical significance of somatic variants

Bigdeli A, Chandrashekar DS, Chitturi A, Rushton C, Mackinnon AC, Segal J, Harada S, Sacan A, Faryabi RB.

Source: PPR Published: 2025-04-23

tl;dr: We introduce Azurify - a computational tool that integrates machine learning, public resources recommended by professional societies, and clinically annotated data to classify the pathogenicity of variations in precision cancer medicine.

Abstract: Accurate classification of somatic variations from high-throughput sequencing data has become integral to diagnostics and prognostics across various cancers. However, the classification of these variations remains highly manual, inherently variable, and largely inaccessible outside specialized laboratories. Here, we introduce Azurify - a computational tool that integrates machine learning, public resources recommended by professional societies, and clinically annotated data to classify the pathogenicity of variations in precision cancer medicine. Trained on over 15,000 clinically classified variants from 8,202 patients across 138 cancer phenotypes, Azurify achieves 99.1% classification accuracy for concordant pathogenic variants in data from two external clinical laboratories. Additionally, Azurify reliably performs precise molecular profiling in leukemia cases. Azurify’s unified, scalable, and modular framework can be easily deployed within bioinformatics pipelines and retrained as new data emerges. In addition to supporting clinical workflows, Azurify offers a high-throughput screening solution for research, enabling genomic studies to identify meaningful variant-disease associations with greater efficiency and consistency.

Read full paper


An updated comparison of microarray and RNA-seq for concentration response transcriptomic study: case studies with two cannabinoids, cannabichromene and cannabinol.

Gao X, Yourick MR, Campasino K, Zhao Y, Sepehr E, Vaught C, Sprando RL, Yourick JJ.

Source: MED Published: 2025-04-23

tl;dr: We provide an updated comparison between microarray and RNA-seq using two cannabinoids, cannabichromene (CBC) and cannabinol (CBN), as case studies.

Background

Transcriptomic benchmark concentration (BMC) modeling provides quantitative toxicogenomic information that is increasingly being used in regulatory risk assessment of data-poor chemicals. Over the past decade, RNA sequencing (RNA-seq) has gradually been replacing microarray as the major platform for transcriptomic applications due to its higher precision, wider dynamic range, and capability of detecting novel transcripts. However, it is unclear whether RNA-seq offers substantial advantages over microarray for concentration response transcriptomic studies.

Results

We provide an updated comparison between microarray and RNA-seq using two cannabinoids, cannabichromene (CBC) and cannabinol (CBN), as case studies. The two platforms revealed similar overall gene expression patterns with regard to concentration for both CBC and CBN. However, in spite of the many varieties of non-coding RNA transcripts and larger numbers of differentially expressed genes (DEGs) with wider dynamic ranges identified by RNA-seq, the two platforms displayed equivalent performance in identifying functions and pathways impacted by compound exposure through gene set enrichment analysis (GSEA). Furthermore, transcriptomic point of departure (tPoD) values derived by the two platforms through BMC modeling were on the same levels for both CBC and CBN.

Conclusions

Considering the relatively low cost, smaller data size, and better availability of software and public databases for data analysis and interpretation, microarray is still a viable method of choice for traditional transcriptomic applications such as mechanistic pathway identification and concentration response modeling.

Read full paper


High-throughput screening data generation, scoring and FAIRification: a case study on nanomaterials.

Gergana Tancheva, Vesa Hongisto, Konrad Patyra, Luchesar Iliev, Nikolay Kochev, Penny Nymark, Pekka Kohonen, Nina Jeliazkova, Roland Grafström

Source: PubMed Published: 2025-04-23

tl;dr: We present an HTS-driven FAIRified computational assessment tool for rapid, simultaneous hazard analysis of multiple materials, aligning with regulatory recommendations and addressing industry needs.

Abstract: In vitro-based high-throughput screening (HTS) technology is applicable to hazard-based ranking and grouping of diverse agents, including nanomaterials (NMs). We present a standardized HTS-derived human cell-based testing protocol which combines the analysis of five assays into a broad toxic mode-of-action-based hazard value, termed Tox5-score. The overall protocol includes automated data FAIRification, preprocessing and score calculation. A newly developed Python module ToxFAIRy can be used independently or within an Orange Data Mining workflow that has custom widgets for fine-tuning, included in the custom-developed Orange add-on Orange3-ToxFAIRy. The created data-handling workflow has the advantage of facilitated conversion of the FAIR HTS data into the NeXus format, capable of integrating all data and metadata into a single file and multidimensional matrix amenable to interactive visualizations and selection of data subsets. The resulting FAIR HTS data includes both raw and interpreted data (scores) in machine-readable formats distributable as a data archive, including into the eNanoMapper database and Nanosafety Data Interface. Overall, we present an HTS-driven FAIRified computational assessment tool for hazard analysis of multiple agents simultaneously, with broad potential applicability across diverse scientific communities.

Scientific Contribution

Our study represents significant tool development for analyzing multiple materials hazards rapidly and simultaneously, aligning with regulatory recommendations and addressing industry needs. The innovative integration of in vitro-based toxicity scoring with automated data preprocessing within FAIRification workflows enhances the applicability of HTS-derived data in the materials development community. The protocols described increase the effectiveness of materials toxicity testing and mode-of-action research by offering an alternative to manual data processing, enriching HTS data with metadata, and refining testing methodologies (such as for bioactivity-based grouping); overall, they demonstrate the value of reusing existing data.

Read full paper


OSaMPle workflow for salivary metaproteomics analysis reveals dysbiosis in inflammatory bowel disease patients.

Jinhui Yuan, Boyan Sun, Murong Li, Congyi Yang, Lingqiang Zhang, Ning Chen, Feng Chen, Leyuan Li

Source: PubMed Published: 2025-04-23

tl;dr: We present an Optimized Salivary MetaProteomic sample analysis workflow (OSaMPle) to enrich salivary bacteria and reduce host-derived interferences for in-depth analysis of the oral metaproteome.

Abstract: The human oral microbiome has been associated with multiple inflammatory conditions including inflammatory bowel disease (IBD). Identifying functional changes in the oral microbiome by metaproteomics helps in understanding the factors driving dysbiosis related to intestinal diseases. However, enriching bacterial cells from oral samples (such as saliva and mouth rinse) rich in host proteins is challenging. Here, we present an Optimized Salivary MetaProteomic sample analysis workflow (OSaMPle) to enrich salivary bacteria and reduce host-derived interferences for in-depth analysis of the oral metaproteome. Compared to a conventional approach, OSaMPle improved the identification of bacterial peptides and proteins by 3.2-fold and 1.7-fold, respectively. Furthermore, applying OSaMPle to analyze mouth rinse samples from IBD patients revealed significant alterations in bacterial protein expression under disease conditions. Specifically, proteins involved in the fatty acid elongation pathway in Peptostreptococcus were significantly less abundant in IBD patients, whereas proteins associated with the TCA cycle in Neisseria were significantly more abundant. The OSaMPle workflow is capable of processing small-volume oral samples and adaptable to high-throughput automation. It holds promise as a strategy for investigating the functional responses of oral microbiomes under disease conditions and identifying disease-associated microbes with their proteins, providing critical insights for detecting disease-related biomarkers within the oral microbiome.

Read full paper


Rapid assay development for low input targeted proteomics using a versatile linear ion trap.

Ariana E Shannon, Rachael N Teodorescu, No Joon Song, Lilian R Heil, Cristina C Jacob, Philip M Remes, Zihai Li, Mark P Rubinstein, Brian C Searle

Source: PubMed Published: 2025-04-23

tl;dr: We show consistent quantification across three orders of magnitude in a matched-matrix background of low-level proteins such as transcription factors and cytokines in a 1 ng sample without requiring stable isotope-labeled standards.

Abstract: Advances in proteomics and mass spectrometry enable the study of limited cell populations, where high-mass accuracy instruments are typically required. While triple quadrupoles offer fast and sensitive low-mass specificity measurements, these instruments are effectively restricted to targeted proteomics. Linear ion traps (LITs) offer a versatile, cost-effective alternative capable of both targeted and global proteomics. Here, we describe a workflow using a hybrid quadrupole-LIT instrument that rapidly develops targeted proteomics assays from global data-independent acquisition (DIA) measurements without high-mass accuracy. Using an automated software approach for scheduling parallel reaction monitoring assays (PRM), we show consistent quantification across three orders of magnitude in a matched-matrix background. We demonstrate measuring low-level proteins such as transcription factors and cytokines with quantitative linearity below two orders of magnitude in a 1 ng background proteome without requiring stable isotope-labeled standards. From a 1 ng sample, we found clear consistency between proteins in subsets of CD4

Read full paper


Prioritization of novel anti-infective stilbene derivatives by combining metabolomic data organization and a stringent 3R-infection model in a knowledge graph.

Kirchhoffer OA, Quirós-Guerrero L, Nitschke J, Nothias LF, Burdet F, Marcourt L, Hanna N, Mehl F, David B, Grondin A, Queiroz EF, Pagni M, Soldati T, Wolfender JL.

Source: MED Published: 2025-04-23

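tl;dr: A knowledge graph combining metabolomic and bioactivity data was used to prioritize stilbene-rich plant NEs; among 11 stilbene oligomers isolated from Gnetum edule, (-)-gnetuhainin M showed the highest anti-infective activity (IC50 of 22.22 μM).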

Abstract: The rising threat of multidrug-resistant tuberculosis, caused by Mycobacterium tuberculosis, underscores the urgent need for new therapeutic solutions to tackle the challenge of antibiotic resistance. The current study utilized an innovative 3R infection model featuring the amoeba Dictyostelium discoideum infected with Mycobacterium marinum, serving as stand-ins for macrophages and M. tuberculosis, respectively. This high-throughput phenotypic assay allowed for the evaluation of more specific anti-infective activities that may be less prone to resistance mechanisms. To discover novel anti-infective compounds, a diverse collection of 1600 plant NEs from the Pierre Fabre Library was screened using the latter assay. Concurrently, these NEs underwent untargeted UHPLC-HRMS/MS analysis. The biological screening flagged the NE from Stauntonia brunoniana as one of the anti-infective hit NEs. High-resolution HPLC micro-fractionation coupled with bioactivity profiling was employed to highlight the natural products driving this bioactivity. Stilbenes were eventually identified as the primary active compounds in the bioactive fractions. A knowledge graph was then used to leverage the heterogeneous data integrated into it to make a rational selection of stilbene-rich NEs. Using both CANOPUS chemical classes and Jaccard similarity indices to compare features within the metabolome of the 1600 plant NEs collection, 14 NEs rich in stilbenes were retrieved. Among those, the roots of Gnetum edule were flagged as possessing broader chemo-diversity in their stilbene content, along with the corresponding NE also being a strict anti-infective. Eventually, a total of 11 stilbene oligomers were isolated from G. edule and fully characterized by NMR with their absolute stereochemistry established through electronic circular dichroism. Six of these compounds are new since they possess a stereochemistry which was never described in the literature to the best of our knowledge. 
All of them were assessed for their anti-infective activity and (-)-gnetuhainin M was reported as having the highest anti-infective activity with an IC50 of 22.22 μM.
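
The Jaccard similarity index mentioned in the abstract can be sketched as follows. This is a generic illustration of the index on sets of detected features; the extract names and feature identifiers are hypothetical, and how the authors actually define features and combine the index with CANOPUS chemical classes is not specified here.

```python
# Illustrative Jaccard similarity between the feature sets of two extracts.
# Extract names and feature IDs below are made up for the example.

def jaccard(a, b):
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


extracts = {
    "NE_query": {"f1", "f2", "f3", "f4"},
    "NE_a": {"f2", "f3", "f4", "f5"},
    "NE_b": {"f7", "f8"},
}

# Rank the other extracts by similarity to the query extract.
query = extracts["NE_query"]
ranked = sorted(
    ((name, jaccard(query, feats)) for name, feats in extracts.items() if name != "NE_query"),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Here `NE_a` shares three of five distinct features with the query and ranks first, while `NE_b` shares none.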

Read full paper


Directing stem cell differentiation by chromatin state approximation

Montano-Gutierrez, L. F., Mueller, S., Kutschat, A. P., Seruggia, D., Halbritter, F.

Source: bioRxiv Published: 2025-04-25

tl;dr: We found that greedy selection by chromatin approximation can be a viable optimisation strategy for the generation of erythroblasts from haematopoietic stem cells, and discovered transcriptional regulators linked to roadblocks in differentiation.

Abstract: A prime goal of regenerative medicine is to replace dysfunctional cells in the body. To design protocols for producing target cells in the laboratory, one may need to consider exponentially large combinations of culture components. Here, we investigated the potential of iteratively approximating the target phenotype by quantifying the distance between chromatin profiles (ATAC-seq) of differentiating cells in vitro and their in-vivo counterparts. We tested this approach on the well-studied generation of erythroblasts from haematopoietic stem cells, evaluating a fixed number of components over two sequential differentiation rounds (8x8 protocols). We found that the most erythroblast-like cells upon the first round yielded the most erythroblast-like cells at the second round, suggesting that greedy selection by chromatin approximation can be a viable optimisation strategy. Furthermore, by analysing regulatory sequences in incompletely reprogrammed chromatin regions, we uncovered transcriptional regulators linked to roadblocks in differentiation and made a data-driven selection of ligands that further improved erythropoiesis. In future, our methodology can help craft notoriously difficult cells in vitro, such as B cells.
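
The greedy strategy described in the abstract (keep the condition whose result is closest to the target, then repeat from there) can be sketched in a toy setting. Everything here is hypothetical: profiles are short numeric vectors, Euclidean distance stands in for the chromatin-profile distance, and `apply_condition` simulates what in the study is a wet-lab differentiation round measured by ATAC-seq.

```python
# Toy sketch of greedy selection by distance to a target profile.
# All names, the distance measure and the simulation are illustrative assumptions.
import math


def greedy_protocol(start, target, conditions, apply_condition, rounds=2):
    """At each round, keep the condition whose outcome is closest to the target."""
    state = start
    chosen = []
    for _ in range(rounds):
        best = min(
            conditions,
            key=lambda name: math.dist(apply_condition(state, name), target),
        )
        state = apply_condition(state, best)
        chosen.append(best)
    return chosen, state


# Hypothetical culture conditions, each shifting the profile by a fixed vector.
EFFECTS = {"A": (1, 1), "B": (1, 0), "C": (-1, 0)}


def apply_condition(state, name):
    """Simulated outcome of one differentiation round (stand-in for an experiment)."""
    return tuple(s + d for s, d in zip(state, EFFECTS[name]))


chosen, final_state = greedy_protocol(
    start=(0, 0), target=(2, 2), conditions=list(EFFECTS), apply_condition=apply_condition
)
```

A greedy search of this kind evaluates one round of conditions at a time rather than every two-round combination, which is only justified when the best first-round outcome also leads to the best second-round outcome, as the abstract reports for erythroblast generation.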

Read full paper


DNA Methylation Dynamics of Dose-dependent Acute Exercise, Training Adaptation, and Detraining

Hariharan, M., Patel, S., Song, H., Rehman, A., Barragan, C., Bartlett, A., Castanon, R., Nery, J., Rothenberg, V., Chen, H., Tian, W., Ding, W., Wang, W., McAdam, J., Graham, Z., Lavin, K., Bamman, M., Broderick, T., Ecker, J.

Source: bioRxiv Published: 2025-04-25

tl;dr: We study the dynamics of DNA methylation in blood and muscle derived from males and females, and show that exercise- and training-induced epigenetic changes alter immune surveillance, mitochondrial function, and inflammatory response, and underscore the relevance of epigenetic plasticity to health monitoring and wellness.

Abstract: Exercise and diet are direct physical contributors to human health, wellness, resilience, and performance [1-5]. Endurance and resistance training are known to improve healthspan through various biological processes such as mitochondrial function [6-8], telomere maintenance [9], and inflammaging [10]. Although several training prescriptions have been defined with specific merits [1,10-20], the long-term effects of these in terms of their molecular alterations have not yet been well explored. In this study, we focus on two combined endurance and resistance training programs: (1) traditional moderate-intensity continuous endurance and resistance exercise (TRAD) and (2) a variation of high-intensity interval training (HIIT) we refer to as high intensity tactical training (HITT), to assess the dynamics of DNA methylation (DNAm) in blood and muscle derived from males (N=23) and females (N=31), over a 12-week period of training followed by a 4-week period of detraining, sampled at pre-exercise and acute time points, totaling 528 samples. Due to its rapid responsiveness to stimuli and its stability, DNAm has been known to facilitate regulatory cascades that significantly affect various physiological processes and pathways. We find several thousand differentially methylated regions (DMRs) associated with acute exercise in blood, many of which are shared across males and females. This trend is reversed when comparing the baseline (pre-exercise) time points or post-exercise time points at the untrained state with those at the post-conditioned state. Here, muscle shows the majority of DNAm changes, with most of those being unique. We also find several hundred memory DMRs in muscle that maintain the gain or loss of methylation after four weeks of inactivity. Comparing phenotypic measurements, we find specific DMRs that correlate significantly with mitochondrial function and myofiber switching.
Using machine learning, we select a subset of DMRs that are most characteristic of training modalities, sex and timepoint. Most of the DMRs are enriched in pathways associated with immune function, cell differentiation, and exercise adaptation. These findings reveal mechanisms by which exercise- and training-induced epigenetic changes alter immune surveillance, mitochondrial function, and inflammatory response, and underscore the relevance of epigenetic plasticity to health monitoring and wellness.

Read full paper


MiT4SL: multi-omics triplet representation learning for cancer cell line-adapted prediction of synthetic lethality

Tao, S., Feng, Y., Yang, Y., Wu, M., Zheng, J.

Source: bioRxiv Published: 2025-04-25

tl;dr: We propose MiT4SL, a multi-omics triplet representation learning model and, to the best of our knowledge, the first deep learning model designed specifically for cancer cell line-adapted SL prediction.

Abstract: Synthetic lethality (SL) offers a promising approach for targeted cancer therapies. Current SL prediction models heavily rely on extensive labeled data for specific cell lines to accurately identify SL pairs. However, a major limitation is the scarcity of SL labels across most cell lines, which makes it challenging to predict SL pairs for target cell lines with limited or even no available labels in real-world scenarios. Furthermore, gene interactions could be opposite between training and test cell lines, i.e., SL vs. non-SL, which further aggravates the challenge of generalization among cell lines. A promising strategy is to transfer knowledge learned from cell lines with relatively abundant SL labels to those with limited SL labels for the discovery of novel SL pairs, i.e., cell line-adapted SL prediction. Here, we propose MiT4SL, a multi-omics triplet representation learning model for cell line-adapted SL prediction. The core idea of MiT4SL is to model cell lineage information as embeddings, which are generated by combining a protein-protein interaction network representation tailored to each cell line with the corresponding protein sequence embeddings. We then combine these cell line embeddings with gene pair representations derived from a biomedical knowledge graph and protein sequences. This triplet representation learning strategy enables MiT4SL to capture both shared biological mechanisms across cell lines and those unique to each cell line, effectively mitigating distribution shift and improving generalization to target cell lines. Additionally, explicit cell line embeddings provide the necessary signals for MiT4SL to effectively differentiate between cell line contexts, enabling it to adjust predictions and mitigate possible label conflicts for the same gene pair across different cell lines. Experimental results across various cell line-adapted scenarios show that MiT4SL outperforms six state-of-the-art models. To the best of our knowledge, MiT4SL is the first deep learning model designed specifically for cancer cell line-adapted SL prediction.

Read full paper


Reference-guided genome assembly at scale using ultra-low-coverage high-fidelity long-reads with HiFiCCL

Jiang, Z., Pan, W., Gao, R., Hu, H., Gao, W., Zhou, M., Yin, Y.-H., Qian, Z., Jin, S., Wang, G.

Source: bioRxiv Published: 2025-04-24

tl;dr: We propose HiFiCCL, the first assembly framework specifically designed for ultra-low-coverage high-fidelity reads, using a reference-guided, chromosome-by-chromosome assembly approach.

Abstract: Population genomics using short-read resequencing captures single nucleotide polymorphisms and small insertions and deletions but struggles with structural variants (SVs), leading to a loss of heritability in genome-wide association studies. In recent years, long-read sequencing has improved pangenome construction for key eukaryotic species, addressing this issue to some extent. Sufficient-coverage high-fidelity (HiFi) data for population genomics is often prohibitively expensive, limiting its use in large-scale populations and broader eukaryotic species and creating an urgent need for robust ultra-low coverage assemblies. However, current assemblers underperform in such conditions. To address this, we propose HiFiCCL, the first assembly framework specifically designed for ultra-low-coverage high-fidelity reads, using a reference-guided, chromosome-by-chromosome assembly approach. We demonstrate that HiFiCCL improves ultra-low-coverage assembly performance of existing assemblers and outperforms the state-of-the-art assemblers on human and plant datasets. Tested on 45 human datasets (~5x coverage), HiFiCCL combined with hifiasm reduces the length of misassembled contigs relative to hifiasm by an average of 21.19% and up to 38.58%. These improved assemblies enhance germline structural variant detection, reduce chromosome-level mis-scaffolding, enable more accurate pangenome graph construction, and improve the detection of rare and somatic structural variants based on the pangenome graph under ultra-low-coverage conditions.

Read full paper


Attention-Based Solution for Synergistic Virus Combination Therapy

Majidifar, S., Hooshmand, M.

Source: bioRxiv Published: 2025-04-24

tl;dr: This paper proposes AI-based models to predict novel drug combinations that can synergistically treat viral diseases.

Abstract: Computational drug repurposing is vital in drug discovery research because it significantly reduces both the cost and time involved in the drug development process. Additionally, combination therapy--using more than one drug for treatment--can enhance efficacy and minimize the side effects associated with individual drugs. However, there is currently limited research focused on computational approaches to combination therapy for viral diseases. This paper proposes AI-based models to predict novel drug combinations that can synergistically treat viral diseases. To achieve this, we have compiled a comprehensive dataset containing information on viruses, drug compounds, and their approved interactions. We introduce two attention-based models and compare their performance with traditional machine learning and deep learning models in predicting synergistic drug pairs for treating viral diseases. Among all the methods tested, the random forest algorithm and one of the attention-based models utilizing a customized dot product as a predictor showed the highest performance. Notably, two predicted combinations--acyclovir + ribavirin and acyclovir + Pranobex Inosine--have been experimentally validated to produce a synergistic antiviral effect against the herpes simplex virus type 1, as reported in existing literature.

Read full paper


Task Splitting and Prompt Engineering for Cypher Query Generation in Domain-Specific Knowledge Graphs

Soleymani, S., Gravel, N. M., Kochut, K., Kannan, N.

Source: bioRxiv Published: 2025-04-24

tl;dr: We propose a novel approach, Prompt2Cypher (P2C), which leverages task splitting and prompt engineering to decompose user queries into manageable subtasks, enhancing LLMs' ability to generate accurate Cypher queries that align with the underlying graph database schema.

Abstract: The integration of large language models (LLMs) with knowledge graphs (KGs) holds significant potential for simplifying the process of querying graph databases, especially for non-technical users. KGs provide a structured representation of domain-specific data, enabling rich and precise information retrieval. However, the complexity of graph query languages, such as Cypher, presents a barrier to their effective use by non-experts. This research addresses the challenge by proposing a novel approach, Prompt2Cypher (P2C), which leverages task splitting and prompt engineering to decompose user queries into manageable subtasks, enhancing LLMs' ability to generate accurate Cypher queries that align with the underlying graph database schema. We demonstrate the effectiveness of P2C in two biological KGs (protein kinase and ion-channel) that differ in size, schema and complexity. Compared to a baseline approach, our method improves query accuracy, as demonstrated by higher Precision, Recall, F1-score, and Jaccard similarity metrics. This work contributes to the ongoing efforts to bridge the gap between domain-specific knowledge graphs and user-friendly graph database query interfaces.
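
The four metrics named in the abstract can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that a generated Cypher query is scored by comparing the set of rows it returns against the rows returned by a reference query. The function name `score_results` and the example gene names are hypothetical and are not taken from P2C.

```python
def score_results(predicted, reference):
    """Compare two query result sets and return set-based agreement metrics.

    Assumption: each result row is hashable (e.g. a string or a tuple).
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        # Two empty result sets agree perfectly by convention.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0, "jaccard": 1.0}
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    jaccard = overlap / len(pred | ref)
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


# Hypothetical rows returned by a generated query and by a reference query.
generated_rows = ["CDK1", "CDK2", "MAPK1"]
reference_rows = ["CDK1", "CDK2", "CDK4", "CDK6"]
print(score_results(generated_rows, reference_rows))
```

Here two of the three generated rows are correct and two of the four reference rows are recovered, so precision is 2/3, recall is 1/2, and the Jaccard similarity is 2/5.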

Read full paper


Comparative analysis of genomic prediction approaches for multiple time-resolved traits in maize

Hobby, D., Lindner, R., Mbebi, A. J., Tong, H., Nikoloski, Z.

Source: bioRxiv Published: 2025-04-24

tl;dr: We compared and contrasted the performance of MegaLMM and dynamicGP as well as their hybrid variants that can handle high-dimensional temporal data for multi-trait genomic prediction.

Abstract: The ability to accurately predict multiple growth-related traits over plant developmental trajectories has the potential to revolutionize crop breeding and precision agriculture. Despite increased availability of time-resolved data for multiple traits from high-throughput phenotyping platforms of model plants and crops, genomic prediction is largely applied to a small number of traits, often neglecting their dynamics. Here, we compared and contrasted the performance of MegaLMM and dynamicGP as well as their hybrid variants that can handle high-dimensional temporal data for multi-trait genomic prediction. The comparative analysis made use of time series for 50 geometric, colour, and texture traits in a maize multi-parent advanced generation inter-cross (MAGIC) population. The performance of the approaches was assessed using snapshot accuracy and longitudinal accuracy, providing insight into the ability to predict multiple traits at a single time point or the dynamics of individual traits over the considered time domain, respectively. We found that MegaLMM outperforms dynamicGP in terms of snapshot accuracy, while dynamicGP proved superior in terms of longitudinal accuracy. This study paves the way for careful investigation of factors that affect the capacity to predict dynamics of multiple traits from genetic markers alone.
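
The difference between the two accuracy measures can be illustrated with a small sketch. It is one possible reading of the abstract, not the authors' definition: it assumes that accuracy is a Pearson correlation between observed and predicted values, taken across traits at one time point (snapshot) or across time points for one trait (longitudinal). The function names and the toy numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def snapshot_accuracy(observed, predicted, time_point):
    """Assumed reading: correlation across all traits at a single time point."""
    return pearson(observed[time_point], predicted[time_point])


def longitudinal_accuracy(observed, predicted, trait):
    """Assumed reading: correlation across all time points for a single trait."""
    obs_series = [row[trait] for row in observed]
    pred_series = [row[trait] for row in predicted]
    return pearson(obs_series, pred_series)


# Toy data for one genotype: rows are time points, columns are traits.
observed = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 5.0, 9.0]]
predicted = [[1.1, 2.1, 2.9], [2.2, 3.9, 6.1], [2.8, 5.2, 9.1]]
print(snapshot_accuracy(observed, predicted, time_point=0))
print(longitudinal_accuracy(observed, predicted, trait=0))
```

Under this reading, a method can score well on one measure and poorly on the other, because the two correlations are taken along different axes of the same time-by-trait table.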

Read full paper


BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Yan, J., Zhu, J., Yang, Y., Liu, Q., Zhang, K., Zhang, Z., Liu, X., Zhang, B., Gao, K., Xiao, J., Chen, E.

Source: bioRxiv Published: 2025-04-24

tl;dr: We introduce BioMiner, a multi-modal system designed to automatically extract protein-ligand bioactivity data from thousands to potentially millions of publications.

Abstract: Protein-ligand bioactivity data published in literature are essential for drug discovery, yet manual curation struggles to keep pace with rapidly growing literature. Automated bioactivity extraction is challenging due to the multi-modal distribution of information (text, tables, figures, structures) and the complexity of chemical representations (e.g., Markush structures). Furthermore, the lack of standardized benchmarks impedes the evaluation and development of extraction methods. In this work, we introduce BioMiner, a multi-modal system designed to automatically extract protein-ligand bioactivity data from thousands to potentially millions of publications. BioMiner employs a modular, agent-based architecture, leveraging a synergistic combination of multi-modal large language models, domain-specific models, and domain tools to navigate this complex extraction task. To address the benchmark gap and support method development, we establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries and 8,735 chemical structures curated from 500 publications. On BioVista, BioMiner validates its extraction ability and provides a quantitative baseline, achieving F1 scores of 0.22, 0.45, and 0.53 for bioactivity triplets, chemical structures, and bioactivity measurements, respectively, with high throughput (14 s/paper on 8 V100 GPUs). We further demonstrate BioMiner's practical utility via three applications: (1) extracting 67,953 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.1%; (2) enabling a human-in-the-loop workflow that doubles the number of high-quality NLRP3 bioactivity data points, supporting a 38.6% improvement over 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating the annotation of the protein-ligand structures in the PoseBusters benchmark with reported bioactivity, achieving a 5-fold speed increase and 10% accuracy improvement over manual methods. BioMiner and BioVista provide a scalable extraction methodology and a rigorous benchmark, paving the way to unlock vast amounts of previously inaccessible bioactivity data and accelerate data-driven drug discovery.

Read full paper


Assemblies of long-read metagenomes suffer from diverse errors

Trigodet, F., Sachdeva, R., Banfield, J. F., Eren, A. M.

Source: bioRxiv Published: 2025-04-24

tl;dr: We show that erroneous reporting is pervasive among long-read assemblers and can take many forms, including multi-domain chimeras, prematurely circularized sequences, haplotyping errors, excessive repeats, and phantom sequences.

Abstract: Genomes from metagenomes have revolutionised our understanding of microbial diversity, ecology, and evolution, propelling advances in basic science, biomedicine, and biotechnology. Assembly algorithms that take advantage of increasingly available long-read sequencing technologies bring the recovery of complete genomes directly from metagenomes within reach. However, assessing the accuracy of the assembled long reads, especially from complex environments that often include poorly studied organisms, poses remarkable challenges. Here we show that erroneous reporting is pervasive among long-read assemblers and can take many forms, including multi-domain chimeras, prematurely circularized sequences, haplotyping errors, excessive repeats, and phantom sequences. Our study highlights the need for rigorous evaluation of the algorithms while they are in development, and for options for users who would choose more accurate reads over shorter runtimes.

Read full paper


High-quality metagenome assembly from nanopore reads with nanoMDBG

Benoit, G., James, R., Raguideau, S., Alabone, G., Goodall, T., Chikhi, R., Quince, C.

Source: bioRxiv Published: 2025-04-24

tl;dr: We present nanoMDBG, an evolution of the metaMDBG HiFi assembler, designed to support newer ONT sequencing data through a novel pre-processing step that performs fast and accurate error correction in minimizer-space.

Abstract: Third-generation long-read sequencing technologies have been shown to significantly enhance the quality of metagenome assemblies. The results obtained using the highly accurate reads generated by PacBio HiFi have been particularly notable, yielding hundreds of circularized, complete genomes as metagenome-assembled genomes (MAGs) without manual intervention. Oxford Nanopore Technologies (ONT) has recently improved the accuracy of its sequencing reads, achieving a per-base error rate of approximately 1-2%. Given the high throughput, convenience, and low cost of ONT sequencing, this could accelerate the uptake of long-read metagenomics. However, current metagenome assemblers are optimized for PacBio HiFi data; they underperform on the latest ONT data and do not scale to the large data sets that it enables. We present nanoMDBG, an evolution of the metaMDBG HiFi assembler, designed to support newer ONT sequencing data through a novel pre-processing step that performs fast and accurate error correction in minimizer-space. Across a range of ONT datasets, including a large 400 Gbp soil sample sequenced specifically for this study, nanoMDBG reconstructs up to twice as many high-quality MAGs as the next best ONT assembler, metaFlye, while requiring a third of the CPU time and memory. As a result of these advances, we show that the latest ONT technology can now produce results comparable to those obtained using PacBio HiFi sequencing at the same sequencing depth.
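
For readers unfamiliar with the term, "minimizer-space" means representing each read as an ordered list of selected k-mers rather than as a string of bases. The sketch below shows one common content-based selection scheme, keeping every k-mer whose hash falls below a density threshold. The abstract does not say which scheme, hash, or parameters nanoMDBG uses, so the function names, `k`, and `density` here are illustrative assumptions only.

```python
import hashlib


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a reproducible 64-bit integer."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def to_minimizer_space(read: str, k: int = 5, density: float = 0.2) -> list[str]:
    """Return, in read order, the k-mers whose hash is below the density cutoff.

    Selection depends only on the k-mer itself, so the same k-mer is chosen
    (or skipped) in every read that contains it.
    """
    cutoff = int(density * 2**64)
    selected = []
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        if kmer_hash(kmer) < cutoff:
            selected.append(kmer)
    return selected


# Two hypothetical overlapping reads drawn from the same made-up sequence.
genome = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
read_a, read_b = genome[:20], genome[8:]
print(to_minimizer_space(read_a, k=5, density=0.5))
print(to_minimizer_space(read_b, k=5, density=0.5))
```

Because selection is content-based, the overlapping part of two reads yields the same run of selected k-mers, which is what makes it possible to compare and correct reads on these much shorter lists.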

Read full paper


Engineering microbial consortia for distributed signal processing

Duncker, K. E., Shende, A. R., Shyti, I., Ruan, A., D'Cunha, R., Ma, H. R., Venugopal Lavanya, H., Liu, S., Gottel, N., Anderson, D. J., Gunsch, C. K., You, L.

Source: bioRxiv Published: 2025-04-24

tl;dr: A new, generalizable approach that leverages microbial consortia to distribute sensory functions and computational methods to disentangle signals.

Abstract: A critical goal in biology is deducing input signals from measurable readouts. Genetic circuits have successfully been designed to respond to specific analytes and produce quantifiable outputs; however, multiplexed signal processing remains challenging. This limitation is partially due to crosstalk, or sensors' non-specific responses to unintended signals. While strategies to achieve orthogonality are promising, they are time-intensive and context-dependent. Here, we introduce a new, generalizable approach that leverages microbial consortia to distribute sensory functions and computational methods to disentangle signals. Compartmentalizing sensor circuits within distinct populations simplifies experimental optimization by allowing individual populations to be exchanged without requiring genetic modifications. Our computational pipeline combines mechanistic modeling with machine learning to decode microbial communities' unique temporal responses and predict multiple input concentrations. We demonstrated this platform's versatility in a variety of contexts: measuring signals with high crosstalk, detecting antibiotics with natural microbial communities, and quantifying chemicals in hospital sink wastewater. Our findings highlight how combining microbial engineering with computational strategies can produce robust, scalable biosensors for diverse applications.

Read full paper


AI-Based Antibody Design Targeting Recent H5N1 Avian Influenza Strains

Santolla, N. F., Ford, C. T.

Source: bioRxiv Published: 2025-04-24

tl;dr: We show the utility of artificial intelligence in the discovery of novel antibodies against specific H5N1 strains of interest, which bind similarly to known therapeutic and elicited antibodies.

Abstract: In 2025 alone, H5N1 avian influenza has been responsible for thousands of infections across various animal species, including avian and mammalian livestock such as chickens and cows, and poses a threat to human health due to avian-to-mammalian transmission. There have been 70 human cases of H5N1 influenza in the United States since April 2024 and, as shown in recent studies, our current antibody defenses are waning. Thus, it is imperative to discover new therapeutics in the fight against more recent strains of the virus. In this study, we present the Frankies framework for automated antibody diffusion and assessment. This pipeline was used to automate the generation of 30 novel anti-HA1 Fv antibody fragment sequences, fold them into 3-dimensional structures, and then dock them against a recent H5N1 HA1 antigen structure for binding evaluation. Here we show the utility of artificial intelligence in the discovery of novel antibodies against specific H5N1 strains of interest, which bind similarly to known therapeutic and elicited antibodies.

Read full paper


Lipid flip flop regulates the shape of growing and dividing synthetic cells

Lira, R. B., Dekker, C.

Source: bioRxiv Published: 2025-04-24

tl;dr: We demonstrate that lipid flip flop relaxes curvature stresses and yields more symmetrically sized buds, and that further increasing the neck curvature leads to bud scission.

Abstract: Cells grow their boundaries by incorporating newly synthesized lipids into their membranes as well as through fusion of intracellular vesicles. As these processes yield trans-bilayer imbalances in lipid numbers, cells must actively flip lipid molecules across the bilayer to enable growth. Using giant and small unilamellar vesicles (GUVs and SUVs, respectively), we here recapitulate cellular growth and division under various conditions of transmembrane flip flop of lipids. By dynamically monitoring the changes in reduced volume and spontaneous curvature of GUVs that grow by fusion of many small SUVs, the morphology of these growing synthetic cells is quantified. We demonstrate that lipid flip flop relaxes curvature stresses and yields more symmetrically sized buds. Further increasing the neck curvature is shown to lead to bud scission. The mechanisms presented here offer fundamental insights into cell growth and division, which are important for understanding early protocells and designing synthetic cells that are able to grow and divide.

Read full paper


High-throughput engineering and modification of non-ribosomal peptide synthetases based on Golden Gate assembly

Podolski, A., Lindeboom, T. A., Praeve, L., Kranz, J., Schindler, D., Bode, H. B.

Source: bioRxiv Published: 2025-04-24

tl;dr: A novel method for efficient high-throughput generation of novel and engineered NRPS libraries utilising the XUT concept, which allows the efficient modular assembly of different natural NRPS fragments to form hybrid NRPS that produce defined peptides.

Abstract: Non-ribosomal peptide synthetases (NRPS) are multimodular enzymes that produce complex peptides with diverse biological activities and potential use as clinical drugs. However, the pharmaceutical applications of such natural peptides often require further derivatisation and modification of the peptide backbone, mainly performed by chemical synthesis. A sustainable alternative is the in vivo engineering of NRPS to rationally change and modify the enzyme properties and, thus, the resulting products. The novel NRPS engineering concept, the eXchange Unit Thiolation domain (XUT), allows the efficient modular assembly of different natural NRPS fragments to form hybrid NRPS that produce defined peptides. In this study, we describe a Golden Gate assembly (GGA) method for efficient high-throughput generation of novel and engineered NRPS libraries utilising the XUT concept. This method was applied to generate over 100 novel NRPS with the possibility of changing starter, elongation, and termination modules. Additionally, we applied this method for targeted modification of the xenoamicin biosynthetic gene cluster (BGC) XabABCD from Xenorhabdus doucetiae, resulting in the generation of 25 novel xenoamicin derivatives.

Read full paper


Conformation-specific Design: a New Benchmark and Algorithm with Application to Engineer a Constitutively Active MAP Kinase

Stern, J., Alharbi, S., Sandholu, A., Arold, S. T., Della Corte, D.

Source: bioRxiv Published: 2025-04-24

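tl;dr: We introduce MotifDiv, a benchmark of 200 conformational specificity design challenges, and CSDesign, an algorithm for designing proteins that prefer a target conformation over an alternate one, and use it to redesign human MAP kinase ERK2 so that it retains activity without activating phosphorylations.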

Abstract: A general method for designing proteins with high conformational specificity is desirable for a variety of applications, including enzyme design and drug target redesign. To assess the ability of algorithms to design for conformational specificity, we introduce MotifDiv, a benchmark dataset of 200 conformational specificity design challenges. We also introduce CSDesign, an algorithm for designing proteins with high preference for a target conformation over an alternate conformation. On the MotifDiv benchmark, CSDesign designs protein sequences that are predicted to prefer the target conformation. We apply this method in vitro to redesign human MAP kinase ERK2, an enzyme with active and inactive conformations. Out of two designs for the active conformation, one increased activity sufficiently to retain activity in the absence of activating phosphorylations, a property not present in the wild type protein.

Read full paper


Harnessing Coding Sequence Cleavage: Theophylline Aptazymes as Portable Gene Regulators in Bacteria

Yurdusev, E., Nicole, Y., Pan Du, F., Suslu, N. E., Perreault, J.

Source: bioRxiv Published: 2025-04-24

tl;dr: We present a novel "semi-trans" aptazyme-based system for gene regulation in bacteria, expanding the range of gene regulation tools, and demonstrate ligand-controlled regulation of tetracycline resistance and nickel sensitivity.

Abstract: Nucleic acid-based regulatory elements capable of modulating gene expression in response to specific molecular cues have gained increasing attention in synthetic biology. These systems, which include riboswitches, allosteric DNAzymes, and aptazymes, function as Gene Expression Nucleic Allosteric actuators (GENAs) by coupling molecular recognition with genetic regulation. Their versatility can enable applications in diagnostics, therapeutics, and metabolic engineering. This study presents a novel "semi-trans" aptazyme-based system for gene regulation in bacteria, expanding the range of GENAs. The system employs theophylline-responsive hammerhead ribozyme aptazymes positioned in the 5' UnTranslated Region (UTR), designed to cleave within the coding sequence of the target gene, thereby modulating gene expression in a ligand-dependent manner. Using the tetA gene in Escherichia coli (E. coli) as a proof of concept, we demonstrate ligand-controlled regulation of tetracycline resistance and nickel sensitivity. The system's effectiveness is validated through in vitro cleavage assays and in vivo phenotypic studies in two E. coli strains, highlighting its portability across genetic backgrounds. Furthermore, the ability to design multiple aptazymes targeting different coding regions enables complex and fine-tuned regulation. This work broadens the landscape of synthetic gene regulation tools, facilitating the development of new aptazymes based on this approach.

Read full paper


You’re receiving this email because you're using paper-trackr. Stay curious!