LATEST IN SCIENCE

Papers published yesterday, filtered for the topics I follow.

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Bacteriophage host prediction using a genome language model

Wang, Z., Arsuaga, J.

Abstract: Computational bacteriophage host prediction from genomic sequences remains challenging because host range depends on diverse, rapidly evolving genomic determinants--from receptor-binding proteins to anti-defense systems and downstream infection compatibility--and because the signals available to predictors, including sequence homology, CRISPR spacer matches, nucleotide composition, and mobile genetic elements, are sparse, unevenly distributed across taxa, and constrained by incomplete host annotations. Here, we frame host prediction as an unsupervised retrieval problem. We asked whether embeddings from the pretrained genome language model Evo2 captured a reliable host-range signal without training on phage-host labels. We generated whole-genome embeddings for phages and candidate bacterial hosts with the Evo2-7B model, applied normalization, and ranked hosts by cosine similarity. Using the Virus-Host Database, we selected embedding and fusion choices on a Gram-positive validation cohort and then evaluated the approach on a held-out Gram-negative test cohort to minimize data leakage. We found that Evo2 was strongest at retrieving multiple plausible hosts, with the recorded host in the top 10 for 55.4% of phages. However, it did not maximize species-level top-1 accuracy (19.4% vs. 23.2% for the best baseline). At higher taxonomic ranks, Evo2 captured a coarser host-range signal: top-1 accuracy reached 43.4% at the genus level and 51.6% at the family level. Reciprocal rank fusion of Evo2 with BLASTN, VirHostMatcher, and PHIST improved all retrieval metrics. Top-10 retrieval rose to 58.5% and top-1 accuracy to 26.9%. Stratified analyses by phage genome length, host clade, and host mobile genetic element coverage revealed scenario-dependent performance. Evo2 embeddings excelled for intermediate-length phages and when host mobile element content was low, whereas alignment and k-mer methods dominated when local homology was abundant. 
These results suggest that pretrained genome embeddings complement established alignment- and k-mer/composition-based methods and that context-aware hybrid pipelines may help improve phage host prediction.
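
Note on the retrieval step: the abstract describes ranking hosts by cosine similarity and then combining rankers with reciprocal rank fusion. A minimal sketch of both ideas follows, assuming toy two-dimensional vectors in place of Evo2 embeddings and the commonly used fusion constant k = 60 (the abstract does not state the value used). The host names and the two baseline rankings are hypothetical placeholders, not data from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_hosts(phage_vec, host_vecs):
    """Return host names ordered from most to least similar to the phage."""
    return sorted(
        host_vecs,
        key=lambda name: cosine_similarity(phage_vec, host_vecs[name]),
        reverse=True,
    )


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each host scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, host in enumerate(ranking, start=1):
            scores[host] = scores.get(host, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda host: (-scores[host], host))


# Toy vectors (assumption: stand-ins for whole-genome embeddings).
phage = [1.0, 0.0]
hosts = {"host_A": [0.9, 0.1], "host_B": [0.0, 1.0], "host_C": [0.5, 0.5]}

embedding_ranking = rank_hosts(phage, hosts)
# Hypothetical rankings from two other tools (e.g. alignment, k-mer based).
alignment_ranking = ["host_C", "host_A", "host_B"]
kmer_ranking = ["host_C", "host_B", "host_A"]

fused = reciprocal_rank_fusion(
    [embedding_ranking, alignment_ranking, kmer_ranking]
)
print(embedding_ranking)
print(fused)
```

In this toy case the embedding ranker alone puts host_A first, while the fused list puts host_C first because two of the three rankers agree on it. That is the kind of complementarity the abstract reports, though the real gains depend on the actual methods and data.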

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

PanXpress: Gene expression quantification with a pan-transcriptomic gapped k-mer index

Alves Ferreira, I., Zentgraf, J., +1 author, Rahmann, S.

Abstract: Motivation: Most existing workflows for quantifying bacterial gene expression from RNA-seq data rely on mapping reads to a (single) reference transcriptome, typically ignoring strain-level variation. When samples contain unknown or mixed strains, these workflows may introduce reference bias and fail to accurately capture strain-specific gene expression. Pan-transcriptomic approaches address this issue by using pan-transcriptomes as references, but existing solutions require multiple steps for pan-transcriptome construction, indexing, and expression quantification. Results: We introduce PanXpress, a unified framework for bacterial pan-transcriptomics that performs pan-transcriptome construction and indexing directly from genomic FASTA and GFF annotation files, alignment-free mapping of reads to genes from FASTQ samples, and gene expression quantification. The index, a multi-way Cuckoo hash table storing gapped k-mers with associated genes, preserves diversity on the k-mer level. Using simulated RNA-seq data from a mixture of Pseudomonas aeruginosa strains, PanXpress achieves mapping recall comparable to alignment-based methods such as Bowtie2 with higher precision and obtains accurate gene expression and log fold change estimates. On real P. aeruginosa RNA-seq data, using PanXpress' pan-transcriptomic reference increases the proportion of mapped reads and discovered expressed genes. The index of PanXpress is smaller than that of other tools and it provides faster analysis with consistent results, compared to other tools (Salmon, Kallisto, Bowtie2). PanXpress is thus an accurate and efficient method for bacterial gene expression analysis in complex samples.
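
Note on gapped k-mers: a gapped k-mer keeps only some positions of a sequence window, so a mismatch that falls on an ignored position does not break the match. The sketch below illustrates the idea under stated assumptions: the mask "##_##", the gene sequences, and the voting rule are all hypothetical, and a plain Python dict stands in for the multi-way Cuckoo hash table described in the abstract.

```python
def gapped_kmers(seq, mask):
    """Yield gapped k-mers: keep bases under '#', skip bases under '_'."""
    keep = [i for i, c in enumerate(mask) if c == "#"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def build_index(genes, mask):
    """Map each gapped k-mer to the set of genes that contain it."""
    index = {}
    for name, seq in genes.items():
        for kmer in gapped_kmers(seq, mask):
            index.setdefault(kmer, set()).add(name)
    return index


def assign_read(read, index, mask):
    """Assign a read to the gene with the most gapped k-mer hits, or None."""
    votes = {}
    for kmer in gapped_kmers(read, mask):
        for gene in index.get(kmer, ()):
            votes[gene] = votes.get(gene, 0) + 1
    if not votes:
        return None
    # Iterating over sorted names makes ties resolve alphabetically.
    return max(sorted(votes), key=votes.get)


MASK = "##_##"  # hypothetical mask: 4 informative positions over a span of 5
genes = {"geneA": "ACGTACGTTG", "geneB": "TTGCAAGCTA"}  # toy sequences
index = build_index(genes, MASK)

# The read differs from geneA at one base (T -> A at read position 3).
print(assign_read("ACGAACGT", index, MASK))
```

The read still maps to geneA because one window places the mismatch on the ignored position. A contiguous 5-mer index would find no exact hit for this read.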

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

ISdetector: precise mapping of insertion sequences and associated structural variations from short-read sequencing data

Zhou, Y., Lu, B.

Abstract: Motivation: Insertion sequences (ISs) are key drivers of genomic plasticity in bacteria and archaea. Determining their exact insertion coordinates is critical for understanding drug resistance, virulence, and pathogen epidemiology. However, accurately mapping ISs from high-throughput short-read sequencing data remains challenging due to the repetitive nature of these elements and accompanying structural variations, which frequently confound standard alignment-based algorithms. As whole-genome sequencing becomes the standard for population-level studies, there is a need for robust, scalable, and specialized pipelines to detect ISs. Results: We present ISdetector, a bioinformatics pipeline that detects precise insertion sites of specific ISs using an IS-clean reference strategy combined with clustering of IS-relevant signals from soft-clipped reads. Compared with existing tools, including ISMapper and MGEFinder, ISdetector demonstrates higher accuracy and robustness, achieving high F1 scores in both high-GC-content genomes (e.g., Mycobacterium tuberculosis, F1=0.91) and high-IS-burden genomes (e.g., Shigella sonnei, F1=0.85). Furthermore, ISdetector identifies IS movements accompanied by structural variations, such as large-scale deletions, which are often missed by existing methods. Implemented with multi-threading, ISdetector shows near-linear decreases in running time with increasing thread counts, making it highly scalable and efficient for processing large numbers of samples in population-level studies.
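
Note on soft-clip clustering: the abstract mentions clustering signals from soft-clipped reads but gives no algorithmic detail. The sketch below is a generic illustration rather than ISdetector's method. It assumes SAM-style input (1-based leftmost position plus CIGAR string), a hypothetical gap threshold of 10 bp, and a hypothetical minimum support of 2 reads. The read records are made up.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def clip_breakpoints(pos, cigar):
    """Reference coordinates of the aligned bases adjacent to soft clips.

    pos is the 1-based leftmost aligned position, as in SAM.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ops = [o for o in ops if o[1] != "H"]  # hard clips may flank soft clips
    points = []
    if ops and ops[0][1] == "S":
        points.append(pos)  # first aligned base after a leading clip
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        points.append(pos + ref_len - 1)  # last aligned base before the clip
    return points


def cluster_positions(positions, max_gap=10):
    """Group sorted positions; start a new cluster when the gap is too large."""
    clusters = []
    for p in sorted(positions):
        if clusters and p - clusters[-1][-1] <= max_gap:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


# Made-up (position, CIGAR) records for illustration only.
reads = [(100, "20S80M"), (21, "80M20S"), (102, "15S85M"),
         (5000, "100M"), (480, "60M40S")]

positions = [bp for pos, cigar in reads for bp in clip_breakpoints(pos, cigar)]
clusters = cluster_positions(positions)

# Keep clusters supported by at least 2 clipped reads; report the median.
candidate_sites = [c[len(c) // 2] for c in clusters if len(c) >= 2]
print(clusters, candidate_sites)
```

A real pipeline would also check that the clipped bases match the IS of interest and would handle both strands. Those steps are omitted here.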

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Differentiable Gene Set Enrichment Analysis for Pathway-Level Supervision in Transcriptomic Learning

Li, S., Ruan, Y., +2 authors, Saigo, H.

Abstract: In transcriptomics-driven drug discovery, upstream predictors of chemical-induced transcriptional profiles (CTPs) are typically trained with gene-wise objectives, whereas downstream interpretation relies on pathway-level, rank-based statistics such as Gene Set Enrichment Analysis (GSEA). This objective mismatch destabilizes pathway conclusions under prediction errors: small ranking perturbations can flip enrichment direction or distort pathway ordering. To bridge this gap, we present differentiable GSEA (dGSEA), a training-compatible surrogate that maps predicted gene-level scores to pathway enrichment with well-behaved gradients. Technically, dGSEA replaces discrete ranking operations with temperature-controlled soft sorting, smooth prefix accumulation, and differentiable extremum aggregation. Critically, to preserve the statistical semantics of classical GSEA, we introduce sign-specific robust permutation normalization (dNES) with optional k-calibration. For computational efficiency, a scalable Nystrom-window approximation (nyswin) reduces the quadratic bottleneck to near-linear complexity, enabling genome-scale evaluation. Empirically, across synthetic benchmarks and LINCS L1000 signatures, dGSEA matches classical GSEA accuracy with improved numerical stability. When incorporated as an auxiliary objective for SMILES-to-transcriptome prediction, dGSEA improves pathway-level agreement (macro correlation 0.257 -> 0.306; sign accuracy 0.620 -> 0.641) without compromising gene-level performance, providing a practical mechanism for pathway-aware optimization in transcriptomic prediction pipelines.
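
Note on soft sorting: ranking is a discrete operation, so it has no useful gradient. One common relaxation replaces each pairwise comparison with a sigmoid scaled by a temperature. The sketch below shows that generic pairwise-sigmoid soft rank. It is an assumption chosen for illustration and may differ from the soft-sorting operator actually used in dGSEA. The function name soft_rank and the toy scores are hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_rank(scores, tau):
    """Smooth descending rank: 1 + sum over j != i of sigmoid((s_j - s_i) / tau).

    As tau approaches 0 this approaches the hard rank (1 = highest score);
    larger tau gives smoother, more blended ranks.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        r = 1.0
        for j, s_j in enumerate(scores):
            if j != i:
                r += _sigmoid((s_j - s_i) / tau)
        ranks.append(r)
    return ranks


scores = [2.0, 0.5, 1.0]  # toy gene-level scores
print(soft_rank(scores, tau=0.01))  # close to the hard ranks 1, 3, 2
print(soft_rank(scores, tau=1.0))   # smoother values between 1 and 3
```

Because sigmoid(x) + sigmoid(-x) = 1, the soft ranks always sum to n(n + 1)/2, the same total as hard ranks. The pairwise form costs O(n^2) per evaluation, which is the kind of quadratic bottleneck the abstract's approximation is meant to avoid.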

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Disagreement among variant effect predictors guides experimental prioritization of target proteins

Jonsson, N. F., Marsh, J. A., Lindorff-Larsen, K.

Abstract: Interpreting the functional consequences of genetic variation, especially rare missense variants, remains a significant challenge in human genetics. Computational variant effect predictors (VEPs) and multiplexed assays of variant effects (MAVEs) provide complementary approaches, with VEPs offering scalable predictions and MAVEs delivering detailed empirical measurements. However, MAVEs are resource intensive and cannot yet be applied broadly across the proteome, making it important to identify proteins where experimental mapping will be most informative. We hypothesised that MAVEs should be particularly valuable for proteins where computational predictors disagree, as such disagreement may highlight mechanistic blind spots. To test this, we analysed predictions from ten distinct VEPs across more than 13,000 human proteins and quantified inter-predictor concordance. We observed substantial variability across proteins in the degree of agreement across predictors and investigated structural, functional and gene-level features associated with this variation. We find that inter-VEP concordance showed no relationship with agreement to experimental MAVE data. If predictor agreement reflected how intrinsically predictable a protein is, these quantities would be expected to correlate. Their decoupling instead suggests that MAVEs may provide orthogonal information to VEPs, supporting the use of inter-VEP disagreement to prioritise proteins where experimental data will be most informative. We therefore propose using inter-VEP disagreement as a practical strategy to prioritise proteins for experimental characterization. Focusing on proteins with low predictor concordance should maximise the informational value of new MAVEs, and improve variant interpretation in both research and clinical contexts.

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Substrate transport limits phenylalanine ammonia-lyase activity in engineered Lacticaseibacillus rhamnosus GG

Choudhury, D., Mays, Z. J., Nair, N. U.

Abstract: Probiotic-based encapsulation offers unique advantages over purified enzymes, such as increased protection from thermal-, pH-, and protease-mediated degradation, for oral therapeutic delivery applications. However, one of the major disadvantages of whole-cell systems is lower reaction rate due to substrate-product transport limitations imposed by the cell membrane and/or wall. In this work, we explore the potential of different lactic acid bacteria (LAB) - Lacticaseibacillus rhamnosus GG (LGG), Lactococcus lactis (Ll), and Lactiplantibacillus plantarum (Lp) - as expression hosts for recombinant Anabaena variabilis phenylalanine ammonia-lyase (AvPAL*). AvPAL* is used as a therapeutic to treat Phenylketonuria (PKU), a rare autosomal recessive metabolic disorder. Among the three species tested, LGG showed the highest PAL activity followed by L. lactis. Next, we attempted to overcome mass transfer limitation in whole-cell biocatalysts in two ways - expression of heterologous transporters and treatment with different chemical surfactants. Engineered strains expressing heterologous transporters exhibited approximately 3-4-fold increased PAL activity, while chemical treatment did not improve reaction rates. This work highlights the challenges and advances in realizing the potential of LAB as biotherapeutics.

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism

Davuluri, R. V., Dutta, P., +5 authors, Liu, H.

Abstract: Motivation: The emergence of novel viral pathogens poses critical threats to global health, yet current computational approaches for viral risk assessment are predominantly virus-specific and require extensive retraining for each new threat. Computational methods for rapid characterization of emerging viruses across multiple epidemiologically relevant dimensions--pathogenicity, host tropism, and transmissibility--are urgently needed to inform public health responses and guide experimental prioritization. Results: We present HViLM (Human Virome Language Model), the first foundation model for pan-viral genomic analysis through continued pre-training of DNABERT-2 on 5 million non-redundant viral sequences (MMseqs2-clustered from 25 million chunks at 80% identity) spanning 9,000 species across 45+ viral families from the VIRION database. We introduce the Human Virome Understanding Evaluation (HVUE) benchmark comprising seven curated datasets across three prediction tasks: pathogenicity classification, host tropism prediction, and transmissibility assessment. Through parameter-efficient fine-tuning with LoRA, HViLM achieves state-of-the-art performance with average accuracies of 95.32% for pathogenicity, 96.25% for host tropism, and 97.36% for transmissibility assessment. The model demonstrates robust cross-family generalization, substantially outperforming sequence-similarity baselines and general genomic foundation models. Attention-based interpretability analysis reveals that HViLM captures biologically meaningful pathogenicity determinants through molecular mimicry of host regulatory elements, including convergent evolution of eight independent sequences targeting Interferon Regulatory Factor 1 (Irf1) for immune evasion. Availability: The HVUE benchmark datasets, training scripts, and complete implementation are publicly available at https://github.com/duttaprat/HViLM . 
Pre-trained HViLM-base model weights and fine-tuned task-specific variants are available on Hugging Face at https://huggingface.co/duttaprat/HViLM-base .

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Designing mRNA coding sequence via multimodal reverse translation language modeling with Pro2RNA

Bian, B., Zhang, Y., +2 authors, Saito, Y.

Abstract: mRNA coding sequence design is a critical component in the development of mRNA vaccines, nucleic acid therapeutics, and heterologous gene expression systems. While large language models have recently been successfully applied to protein design and RNA modeling, designing optimal mRNA coding sequences for a given protein, particularly in a species-specific manner, remains a major challenge. Here, we present Pro2RNA, a multimodal reverse-translation language model that generates mRNA coding sequences from their corresponding protein sequences while explicitly conditioning on host organism taxonomy information. Pro2RNA integrates multiple pretrained language models across different modalities, including ESM2 for protein representation, SciBERT for taxonomy understanding, and a generative RNA language model for mRNA codon-level sequence generation. By training on mRNA-protein pairs from eukaryote and bacteria datasets respectively, Pro2RNA learns species-dependent genetic codes and codon usage patterns, enabling the generation of host-adapted and natural-like mRNA coding sequences. Across multiple benchmark evaluations, Pro2RNA matches or surpasses existing optimization methods, demonstrating its potential as a powerful and flexible framework for species-aware mRNA coding sequence design.

bioinformatics, synthetic biology Mar 20, 2026 bioRxiv

Systematic assessment of machine learning-based variant annotation methods for rare variant association testing

Aguirre, M., Irudayanathan, F. J., +5 authors, Fletez-Brant, K.

Abstract: Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods (CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA) using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8- to 5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
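
Note on Wasserstein distances and calibration: the abstract does not say how the distances are computed. One plausible reading, offered only as an assumption, is to compare a test's p-value distribution with the uniform distribution expected under the null. For two one-dimensional samples of equal size, the 1-Wasserstein distance is the mean absolute difference between their sorted values. The function names and p-values below are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


def uniform_quantiles(n):
    """Midpoint quantiles of Uniform(0, 1): the ideal null p-value sample."""
    return [(i + 0.5) / n for i in range(n)]


def calibration_distance(pvalues):
    """Distance of observed p-values from the uniform reference (0 = ideal)."""
    return wasserstein_1d(pvalues, uniform_quantiles(len(pvalues)))


# Made-up p-values: one well-calibrated set, one skewed toward small values.
calibrated = uniform_quantiles(4)
inflated = [p ** 2 for p in calibrated]

print(calibration_distance(calibrated))
print(calibration_distance(inflated))
```

A larger distance from the uniform reference indicates stronger departure from the null expectation. Whether that departure reflects miscalibration or true signal depends on which genes are included, so this sketch is not a substitute for the paper's framework.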

bioinformatics, cancer biology Mar 20, 2026 bioRxiv

Targeting wild type NTRK decreases brain metastases of lung cancers non-driven by NTRK fusions

Contreras-Zarate, M. J., Jaramillo-Gomez, J. A., +7 authors, Cittelly, D. M.

Abstract: The central nervous system (CNS) is a common site of metastatic spread for both non-small cell and small cell lung cancer, yet the therapeutic strategies to prevent and decrease lung cancer brain metastases remain limited. Tyrosine kinase inhibitors have shown promising results in increasing the overall response in brain metastases, owing to their brain penetrance and increased effectiveness; however, their use is limited to the small group of tumors carrying specific oncogenic drivers. Among these, inhibitors with activity against neurotrophic tyrosine receptor kinases (NTRKs) are showing promising effects in reducing CNS metastases in cancers driven by gene rearrangements of these drugs' targets. However, wild-type NTRKs are susceptible to activation by their canonical ligands, which are expressed throughout the brain metastatic niche and can, in a paracrine manner, activate NTRK function in cancer cells. Here we show that NTRKs are expressed in primary tumors, brain metastases, and lung cancer cells with various driver mutations expressing wild-type NTRK2 (WT-TrkB). We demonstrate that WT-TrkB activates downstream signaling and proliferation in response to exogenous BDNF and conditioned media from reactive astrocytes known to secrete BDNF in the brain niche. Importantly, the FDA-approved NTRK inhibitor entrectinib blocked BDNF and astrocyte-induced survival pathways in multiple lung cancer cell lines, decreased their proliferation in vitro, and effectively prevented brain metastatic colonization and progression in vivo without significant effects on extracranial disease. Thus, these studies suggest that brain-dependent activation of NTRK is critical for brain metastases of WT-NTRK+ lung cancers, and therefore, NTRK inhibitors can be used to target non-fusion NTRK function to prevent or decrease brain metastases.

bioinformatics, cancer biology Mar 20, 2026 bioRxiv

Targeting MTHFD2 disrupts mitochondrial redox homeostasis and restores venetoclax sensitivity in acute myeloid leukemia

Sokei, J. O., di Martino, O., Basse, M., +7 authors, Sykes, S. M.

Abstract: One-carbon metabolism is frequently dysregulated in human cancer including acute myeloid leukemia. However, the mitochondrial mechanisms by which one-carbon enzymes support leukemia survival and therapeutic response remain incompletely defined. Here, we report that the one-carbon metabolism enzyme MTHFD2 is a critical regulator of acute myeloid leukemia nucleotide metabolism, redox homeostasis, and disease progression. We show that genetic ablation of MTHFD2 suppresses acute myeloid leukemia cell proliferation in vitro and significantly delays leukemia onset in a genetically engineered mouse model, while sparing healthy hematopoietic stem and progenitor cell function. Stable isotope tracing demonstrates that MTHFD2 supports de novo purine synthesis and sustains mitochondrial NADH and NADPH production. Consistent with this role, MTHFD2 inhibition increases mitochondrial superoxide levels, and combined purine supplementation and mitochondrial reactive oxygen species neutralization rescues acute myeloid leukemia cell viability. We also demonstrate that the small-molecule inhibitor DS18561882 directly inhibits mitochondrial MTHFD2 activity and phenocopies genetic deletion. DS18561882 exhibits activity across a cohort of 60 primary AML patient samples, synergizes with venetoclax in treatment-naive acute myeloid leukemia, and restores venetoclax sensitivity in resistant AML models. These findings establish mitochondrial MTHFD2 as a genetically validated, therapeutically targetable metabolic vulnerability in acute myeloid leukemia and support targeting mitochondrial one-carbon metabolism to enhance and restore venetoclax response.

bioinformatics, cancer biology Mar 20, 2026 bioRxiv

Tumor Cell Death Drives Tumor-Promoting IL-6+ iCAF formation via P2X7-activation

McDonnell, C., Zinina, V., +7 authors, Schmitt, M.

Abstract: Chemotherapy resistance in pancreatic ductal adenocarcinoma (PDAC) is commonly attributed to tumor cell-intrinsic mechanisms, yet how cytotoxic therapy reshapes the tumor microenvironment remains incompletely understood. Here we show that PDAC cells exposed to cytotoxic agents reprogram pancreatic stellate cells (PSCs) toward an inflammatory cancer-associated fibroblast phenotype. Mechanistically, chemotherapy triggers the release of ATP from dying PDAC cells, which activates P2X7 signaling in PSCs in a paracrine manner, leading to ERK activation and inflammatory polarization. In turn, therapy-educated PSCs promote tumor cell proliferation, induce resistance-associated transcriptional programs and impair CD8+ T cell-mediated cytotoxicity in an IL-6-dependent manner. Pharmacological inhibition of P2X7 suppressed stromal IL-6 induction and enhanced gemcitabine efficacy in vivo. These findings identify a therapy-induced ATP-P2X7-IL-6 axis that links tumor cell death to stromal reprogramming and adaptive resistance in PDAC.

machine learning, cancer biology Mar 20, 2026 MED

Tai Chi for treating cancer-related fatigue: A meta-analysis of randomized controlled trials.

Qiao C, Zhao XH, +6 authors, Li DH.

Abstract: Cancer-related fatigue (CRF) lacks effective pharmacological treatment, with the available options in Western medicine often having limited efficacy and adverse effects. Tai Chi, a traditional Chinese exercise, shows promise in improving CRF. The aim was to evaluate the clinical efficacy of Tai Chi in alleviating CRF. In this meta-analysis, we reviewed 9 randomized controlled trials (RCTs) retrieved from databases such as PubMed, EMBASE, the Cochrane Library, China National Knowledge Infrastructure, Wanfang Database, and the Chinese Biomedical Literature Database, and published before March 31, 2025. The experimental groups received conventional treatment plus Tai Chi, and the control groups received conventional treatment only, with varying durations. Using random-effects models, we calculated standardized mean differences (SMD) and mean differences with 95% confidence intervals (CI) to assess the effects on CRF. Heterogeneity was evaluated through I² statistics. To assess the robustness of the pooled results, we performed leave-one-out sensitivity analysis by sequentially excluding each study and reconducting the meta-analysis. Publication bias was evaluated through funnel plot inspection, supplemented by quantitative assessments using the trim-and-fill method and Egger's test. This study conducted a systematic review and meta-analysis of 9 RCTs (n = 659 cancer patients) and found that Tai Chi significantly improved CRF, enhanced sleep quality, and increased quality of life, with a favorable safety profile. The research provides evidence-based medical support for promoting Tai Chi as an adjunctive therapy for CRF. Results analysis based on the GRADE assessment indicated that Tai Chi significantly alleviated fatigue symptoms in cancer patients (moderate-certainty evidence, SMD = -1.29, 95% CI: -1.72 to -0.85, P [...] P = 0.007), and enhanced quality of life (low-certainty evidence, SMD = 0.70, 95% CI: 0.23 to 1.16, P = 0.003), suggesting that Tai Chi can serve as an effective adjuvant intervention for CRF.
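
Note on the pooling method: the abstract reports random-effects pooling and I² but does not name the between-study variance estimator. The sketch below assumes the widely used DerSimonian-Laird estimator and a normal-approximation 95% interval. The two study effects and variances are made-up toy values, not numbers from the included trials.

```python
import math


def random_effects_meta(effects, variances):
    """DerSimonian-Laird random-effects pooling with I² (assumed estimator)."""
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0       # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0  # % heterogeneity
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Toy standardized mean differences and their variances (illustration only).
result = random_effects_meta(effects=[-1.0, -1.5], variances=[0.04, 0.04])
print(result)
```

For these toy inputs the pooled effect is -1.25 with I² of 68%. The two studies disagree by more than their sampling variance would explain, so the random-effects interval is wider than a fixed-effect interval would be.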
