Latest in Science

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

eRNAformer enables genome-wide de novo mapping of enhancer-derived RNA loci

Yu, H., Li, W., +10 authors, Lu, L.

Abstract: Enhancer-derived RNAs (eRNAs) are critical regulators of gene transcription, yet their genome-wide annotation remains challenging. Here, we present eRNAformer, a multi-modal deep learning framework that integrates convolutional neural networks with transformers, specifically designed to capture long-range genetic features associated with bidirectional transcription. This approach enables de novo mapping of eRNA loci using DNA sequence and aggregated conventional RNA-seq data. When evaluated on ENCODE datasets, eRNAformer demonstrated high sensitivity and specificity in discriminating known eRNA loci from non-eRNA loci. Notably, the newly identified eRNA loci were enriched for evolutionarily constrained variants and genetic risk factors for complex diseases, and exhibited potential relevance for cancer therapy. Applied to GEO datasets, eRNAformer identified between 14,219 and 56,451 eRNA loci across multiple hematologic malignancies, facilitating the construction of a comprehensive eRNA database for blood cancers. We further identified and experimentally validated FOXO1e, a cluster of eRNAs located approximately 120 kb upstream of FOXO1, a known oncogene that drives the t(8;21) acute myeloid leukemia (AML) preleukemic program. Together, these findings establish eRNAformer as a powerful tool for genome-wide eRNA annotation, provide a valuable resource for eRNA studies in hematologic cancers, and underscore the functional importance of eRNAs in AML pathogenesis.

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

BoltzProt-1: Towards Efficient De Novo Binder Design with Good Developability

Ucar, T., Bates, J., +7 authors, Passaro, S.

Abstract: Designing binders against novel protein targets remains a central challenge in computational drug discovery. Here we introduce BoltzProt-1, a pipeline for generating protein binders, including nanobodies, with improved hit rates and favorable developability properties. At its core lie a refined iteration of BoltzGen's generative model and a novel protein-protein interaction prediction model, BoltzPPI. Employing BoltzPPI instead of BoltzGen's standard structure-prediction confidence metrics to rank nanobody (VHH) designs increases the confirmed-binder hit rate from 3.3% to 8.0% across 10 novel targets. Assessed on 10 additional targets used in prior literature, the BoltzProt-1 pipeline obtains nanobody screening hits for 7 of 10 targets, surpassing the 6 of 10 previously reported by Chai-2. Finally, evaluating the developability of BoltzProt-1-designed nanobodies in terms of stability, aggregation, purity, polyspecificity and hydrophobicity reveals that 58% of its confirmed binders pass every criterion, exceeding both BoltzGen (40%) and clinical-stage VHH controls (21%).

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

Stability-driven multi-omics integration for reproducible latent structure

Guan, H., Gerwen, M. v., +3 authors, Petrick, L.

Abstract: High-dimensional multi-omics data integration offers novel opportunities to characterize complex biological systems. Although sampling variability frequently compromises findings, particularly in small cohorts, the reproducibility and generalizability of the derived latent structures are rarely evaluated adequately. We propose a stability-driven framework for multi-omics integration that combines sparse generalized canonical correlation analysis with repeated cross-validation, out-of-sample projection, and systematic evaluation of both component-level and feature-level stability. We apply this framework to untargeted metabolomic and Olink targeted inflammation proteomic profiles in a thyroid cancer case-control cohort (n = 162). Our stability-driven integration identified reproducible metabolomic and proteomic latent components that showed consistent out-of-sample disease associations and tracked temporally structured changes relative to time to diagnosis. The proposed framework provides a generalizable strategy for identifying reproducible latent structures that improve the robustness of biological inference in multi-omics studies.

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

UTRGen: A unified framework for full-spectrum design of mRNA 5' UTRs

Wang, Z., Chen, M., +6 authors, Li, X.

Abstract: The 5' untranslated region (5' UTR) is a key regulatory element that governs mRNA translation and protein output. However, existing computational methods typically address isolated tasks such as functional prediction or sequence optimization, limiting their ability to support rational design across the full 5' UTR engineering workflow. Here, we present UTRGen, a unified modeling framework for 5' UTRs that integrates sequence generation, multi-property prediction, and constrained function-guided design. UTRGen is pre-trained autoregressively on large-scale 5' UTR datasets from multiple species and subsequently adapted to diverse downstream regulatory tasks. Across systematic evaluations, UTRGen generates novel and diverse 5' UTRs while preserving sequence, structural, and functional characteristics of natural UTRs. After task-specific fine-tuning, UTRGen achieves state-of-the-art performance across 14 benchmark datasets, improving translation efficiency prediction by up to 11.1%, expression level prediction by up to 13.2%, and mean ribosome load prediction by up to 3.0% relative to the strongest baselines. It also achieves the best overall performance for internal ribosome entry site identification. To enable controllable design, we formulate function-guided 5' UTR design as a GRPO-based refinement process over a pre-trained autoregressive sequence prior, using composite rewards to encode functional objectives and biological constraints while regularizing toward the natural 5' UTR distribution. The resulting sequences show consistently improved predicted translation efficiency and expression levels across cellular contexts, and reveal interpretable sequence features associated with high activity, including reduced C content, fewer upstream AUGs, and depletion of inhibitory motifs. Together, our results establish a unified modeling strategy for 5' UTR design and lay a foundation for programmable control of translation.

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

The Hidden Disorder Divide: Reconciling Benchmark Inconsistencies in Intrinsically Disordered Protein Binding Site Prediction

Malhis, N., Mehdiabadi, M., +5 authors, Piovesan, D.

Abstract: Computational predictors of protein-binding sites within intrinsically disordered regions (IDRs) show highly inconsistent performance across high-quality benchmark datasets. To understand the origins of these discrepancies, we systematically compared predictors across three independent test sets: two CAID datasets updated with the latest DisProt annotations and a composite dataset (DBs) assembled from DIBS, FuzDB, IDEAL, and MFIB. Predictors trained predominantly on DisProt data achieved substantially higher AUCs on the CAID sets but performed poorly on the DBs. In contrast, predictors trained on older, low-quality PDB-based datasets showed balanced performance across all sets, with a slight preference for DBs. Predictors with mixed training exposure displayed intermediate behavior. Through controlled experiments using identical CNN architectures and feature analysis, we demonstrate that the dominant factor driving these performance differences is the intrinsic disorder propensity of the binding sites themselves. Binding residues in DisProt-based datasets exhibit markedly higher average disorder propensity scores than those in PDB-derived datasets. This previously unrecognized selection bias (literature studies preferentially characterizing more disordered binding sites, while PDB-derived annotations capture less disordered ones) effectively splits IDR-protein binding sites into two distinct categories. Predictors optimized on one category therefore generalize poorly to the other. Binding-site length and sequence conservation play only minor or negligible roles in explaining the observed inconsistencies. These findings highlight a critical limitation in current benchmarking practices and training strategies for IDR-binding site prediction, underscoring the need for more balanced and disorder-aware reference datasets. Finally, the diagnostic techniques introduced here could prove valuable beyond the specific application examined in this study.

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

Glitch genes: embedding geometry predicts functional fragility in single-cell foundation models

Whalley, J. P.

Abstract:

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

ProLoc: Text-guided Localization of Protein Functional Regions

Liu, P., Fan, J., +1 author, Zhang, J.

Abstract: Motivation: Protein function is often mediated by specific sequence regions, such as domains, motifs and functional sites. Identifying these regions is important for understanding protein mechanisms, annotating newly sequenced proteins and prioritizing residues for experimental validation. However, existing protein function prediction and protein-text models mainly capture global protein-level associations, making it difficult to determine which residues support a given textual functional description. This limits their use for mechanistic interpretation and residue-level experimental prioritization. Results: We introduce text-guided protein functional region localization, a span-level grounding task that identifies residue regions corresponding to natural-language functional descriptions. We construct an InterPro-derived localization benchmark of explicit protein-text-region examples, covering both domain-level and functional-site annotations with sequence-similarity-aware splits and a unified span-level evaluation protocol. We further propose ProLoc, a text-conditioned localization model built on raw ESM2-650M and PubMedBERT with direct residue-level localization and anchor-free span proposal generation. On the held-out test set, ProLoc substantially outperforms window-based adaptations of representative protein and protein-text models. Its direct output achieves the strongest single-region localization performance, reaching 0.7730 IoU@1, while its anchor-free proposal output improves visible multi-site recovery, reaching 0.9671 VM R@10 IoU50 and 0.9489 VM All-Hit@50.
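The span-level metrics quoted above rest on intersection-over-union between a predicted residue span and an annotated one. A minimal sketch of that idea follows, assuming inclusive 0-based residue indices; the function names and the recall definition are illustrative assumptions, and the paper's exact definitions of IoU@1, VM R@10 IoU50 and VM All-Hit@50 are not given in the abstract.

```python
from typing import List, Tuple

# Hypothetical convention: (start, end) residue indices, 0-based and inclusive.
Span = Tuple[int, int]


def span_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two inclusive residue spans."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gold[1] - gold[0] + 1) - inter
    return inter / union


def recall_at_k(proposals: List[Span], golds: List[Span],
                k: int, thresh: float = 0.5) -> float:
    """Fraction of gold spans matched by any of the top-k proposals.

    A gold span counts as recovered when some proposal reaches
    IoU >= thresh. This is a generic recall, not the paper's metric.
    """
    top = proposals[:k]
    hits = sum(1 for g in golds if any(span_iou(p, g) >= thresh for p in top))
    return hits / len(golds)
```

For example, `span_iou((10, 19), (15, 24))` overlaps on 5 of 15 covered residues, giving one third.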

bioinformatics, synthetic biology Jun 27, 2026 bioRxiv

Real Science Is Harder Than Benchmarks: Evaluating Advanced AI Frameworks on Published Studies. I. Uncertainty Quantification, ML on Therapeutic Data Commons, and Agent-Based Modeling

Ahmed, M. O., Amale, S. A., +19 authors, Sinitskiy, A.

Abstract: Artificial Intelligence (AI) frameworks for automating scientific research have shown strong performance on benchmarks, but their capacity to routinely reproduce results from multiple real-life published studies remains largely untested. We evaluated five advanced AI research frameworks (Kosmos, K-Dense, ToolUniverse, BioAgents from bio.xyz, and the AI Scientist-v2 from Sakana AI) on three real-life tasks (including two recently published papers) spanning uncertainty quantification for molecular property predictions, machine learning on Therapeutic Data Commons benchmarks, and agent-based modeling. AI frameworks demonstrated genuine strengths: generating original hypotheses, competently executing routine data acquisition and coding tasks, providing statistical measures of confidence often absent from the original papers, and producing well-formatted final reports. At the same time, our experiments revealed that real-world scientific tasks remain considerably harder than current benchmarks suggest. No AI framework matched the scope or depth of the original studies, results varied across multiple runs of the same framework with the same prompt, and we documented cases of severe hallucinations in final reports, gaps in literature coverage, and overconfident conclusions. Verification of AI outputs required substantial domain expertise. While these three tasks are only partially representative of the broader scientific landscape, they offer a starting point for developing a more rigorous methodology for evaluation of AI performance than what is currently practiced. We conclude that AI frameworks are already valuable for prototyping research directions and stress-testing completed studies, and some of the limitations documented here appear largely tractable through infrastructure improvements and continued development.

bioinformatics, synthetic biology Jun 26, 2026 bioRxiv

vDeepInsight: an injective three-dimensional voxel carrier for tabular-feature neighborhood learning

Jia, S., Lysenko, A., +2 authors, Tsunoda, T.

Abstract: DeepInsight-style methods make tabular feature relationships accessible to convolutional networks by placing each feature at a fixed position on an image carrier. An open design question is how the carrier geometry should be constructed when feature neighborhoods themselves carry part of the signal. We introduce vDeepInsight, an injective three-dimensional (3D) voxel carrier that preserves feature neighborhoods more faithfully than matched two-dimensional (2D) carriers while keeping a one-to-one mapping from each feature to a single voxel. Genes are embedded with t-SNE or UMAP, assigned one-to-one to a sparse voxel grid by linear-sum assignment, and processed by a submanifold sparse 3D convolutional network. We evaluate the carrier on gene expression through four linked analyses. First, representation-quality metrics show that 3D layouts reduce gene-neighborhood distortion relative to matched 2D layouts before any model is trained. Second, controlled synthetic tasks show that a sparse 3D convolution can exploit this preserved locality, but only when the supervised signal is constructed to depend on co-located genes and the receptive field spans adjacent voxels. Third, on real omics tasks the 3D carrier matches or exceeds tuned tabular baselines and consistently exceeds matched 2D carriers; the margin is small on marker-type classification, where individual genes already carry much of the label (tissue, lineage and cancer-type classification), and larger on program-type tasks, where the target depends on coordinated, pathway-level multi-gene activity (drug-response regression, TCGA immunogenomic-context regression and mechanism-of-action classification). Fourth, because the assignment is injective, voxel attribution maps directly back to genes, enabling gene-level attribution and pathway-level functional interpretation without voxel-to-gene deconvolution. 
Overall, the added carrier dimension improves the fidelity of feature-neighborhood representation and translates this improvement into prediction gains that are largest when the signal is distributed across local gene programs rather than dominated by individual marker genes.
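The one-to-one placement of features on a voxel grid described above can be sketched with SciPy's `linear_sum_assignment`. This is an illustration under stated assumptions, not the authors' implementation: the function name `assign_to_voxels`, the dense cubic grid, the min-max rescaling and the squared-distance cost are all hypothetical choices, and the input is assumed to be an already computed 3D embedding (for example from t-SNE or UMAP).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def assign_to_voxels(embedding: np.ndarray, grid_size: int) -> np.ndarray:
    """Map each feature's 3D embedding to a distinct voxel (injective)."""
    n = embedding.shape[0]
    if n > grid_size ** 3:
        raise ValueError("grid has fewer voxels than features")
    # Rescale each axis of the embedding into [0, grid_size - 1].
    lo, hi = embedding.min(axis=0), embedding.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (embedding - lo) / span * (grid_size - 1)
    # Enumerate all voxel coordinates, one row per voxel.
    axes = np.arange(grid_size)
    voxels = np.stack(
        np.meshgrid(axes, axes, axes, indexing="ij"), axis=-1
    ).reshape(-1, 3)
    # Cost: squared distance from every feature to every voxel.
    # Note: this dense matrix is only practical for small examples.
    cost = cdist(scaled, voxels, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    out = np.empty((n, 3), dtype=int)
    out[rows] = voxels[cols]
    return out
```

Because every feature receives its own voxel, an attribution score on a voxel can be read back as a score for exactly one feature, which is the property the abstract highlights.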

bioinformatics, synthetic biology Jun 26, 2026 bioRxiv

Acquiring Improved Protein Variants With Probabilistic Preferential Learning

van der Flier, F. J., de Ridder, D., Probst, D., Redestig, H.

Abstract: Variant effect prediction (VEP) models can be used to select promising novel enzymes from a pool of candidates. Most supervised VEP models are framed as regression tasks, placing more emphasis on getting the predicted quantities correct than on the relative comparison of individual candidates. Preferential or contrastive models may better align with the goal of selection, or acquisition, especially when informed by predictive uncertainty. Here, we introduce a probabilistic preferential learning model based on the Kermut Gaussian process (PKermut) that we designed with the ambition to increase the hit rate among selected variants. We benchmark PKermut against established models, including the original Kermut, the RITA regressor, and an augmented Potts model, on 69 curated ProteinGym datasets across various assay categories. To evaluate acquisition performance, we propose a novel quantile cross-validation scheme that ensures the evaluation of a model's ability to extrapolate by reserving high-performing variants exclusively for the test set. We assess models using Spearman correlation and evaluate their acquisition performance using five different acquisition functions, encompassing both uncertainty-aware and unaware strategies. Our experimental results indicate that uncertainty estimates improve the acquisition ability of our models, and that strategies that reward uncertainty generally result in better outcomes than those that do not on single-mutation variant datasets. We observe that PKermut's Spearman scores and ability to acquire improved variants are greatly affected by the number of variant comparisons sampled in the training set. Kermut achieves the highest Spearman correlation in 54/69 datasets (78%), compared to 12/69 (17%) for PKermut. For acquisition performance, Kermut leads in 44/69 datasets (64%), while PKermut leads in 15/69 (22%). 
While at this stage PKermut is not a recommended alternative to Kermut, its contrastive nature offers several conceptual opportunities. We share our findings to inspire further development aimed at improving the alignment between training objectives of VEP models and their downstream application in protein engineering.
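The idea of reserving high-performing variants for the test set can be illustrated with a simple quantile split. The sketch below is a single holdout under stated assumptions, not the authors' cross-validation scheme: the names `quantile_holdout` and `hit_rate`, the `top_fraction` parameter and the baseline threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def quantile_holdout(scores: Dict[str, float],
                     top_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Split variants so the top-scoring fraction appears only in the test set.

    Returns (train, test). A model fit on `train` must extrapolate
    beyond the scores it has seen to rank `test` variants well.
    """
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be in (0, 1)")
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_test = max(1, round(len(ranked) * top_fraction))
    return ranked[n_test:], ranked[:n_test]


def hit_rate(selected: List[str], scores: Dict[str, float],
             baseline: float) -> float:
    """Fraction of selected variants whose measured score beats a baseline."""
    return sum(scores[v] > baseline for v in selected) / len(selected)
```

With this kind of split, every training score lies at or below every test score, so acquisition is evaluated on variants better than anything the model was trained on.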
