paper-trackr newsletter


Wednesday, 13 August 2025

Hello science reader,

Welcome to your daily dose of research. Dive into the latest discoveries from the world of science!

This newsletter was built with paper-trackr and is tailored to my interests.
Customize your own filters and receive personalized updates.


ViralQuest: A user-friendly interactive pipeline for viral-sequences analysis and curation

Rodrigues, G. V. P., Ferreira, L. Y. M., Aguiar, E. R. G. R.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We developed ViralQuest, a bioinformatics tool that automates the in-depth characterization of viral sequences from pre-assembled contigs.

Background: High-throughput sequencing (HTS) has become an essential, unbiased tool in virology for identifying known and novel viruses. However, analyzing the large and complex datasets generated by HTS presents significant bioinformatics challenges. The process of accurately identifying and characterizing viral sequences from assembled contigs remains a bottleneck, often requiring specialized expertise and involving non-standardized parameters. There is a pressing need for robust, user-friendly, and reproducible pipelines to streamline this post-assembly analysis. Results: To address these challenges, we developed ViralQuest, a bioinformatics tool that automates the in-depth characterization of viral sequences from pre-assembled contigs. The pipeline integrates multiple lines of evidence for robust identification, using Diamond BLASTx against the Viral RefSeq database and pyHMMER searches against the RVDB, Vfam, and eggNOG profile HMM databases. For detailed characterization, ViralQuest performs taxonomic classification based on the ICTV nomenclature and functional annotation via Pfam domain analysis. Novel features of ViralQuest include an AI-powered summarization module that uses a Large Language Model (LLM) to generate contextual narratives for key viral findings and a comprehensive confidence score to rank putative viral contigs. All results are consolidated into a single, interactive HTML report that includes dynamic visualizations of contigs, ORFs, and protein domains, alongside detailed data tables that are exportable in TSV and SVG formats. Conclusion: ViralQuest provides an accessible and comprehensive solution for the post-assembly analysis of viral metagenomic data. By combining rigorous bioinformatics methods with novel AI-driven features and an intuitive reporting interface, it streamlines the complex process of viral identification and characterization. 
The tool enhances the interpretability and reliability of results, making in-depth virome analysis more accessible to the broader research community. ViralQuest is available on GitHub at https://github.com/gabrielvpina/viralquest/.

Read full paper


Ultrafast and Ultralarge Distance-Based Phylogenetics Using DIPPER

Walia, S., Chen, Z., Tseng, Y.-H., Turakhia, Y.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: DIPPER is a distance-based phylogenetic method for ultrafast and ultralarge phylogenetic reconstruction on GPUs, designed to maintain high accuracy and a small memory footprint.

Abstract: Motivation: Distance-based methods are commonly used to reconstruct phylogenies for a variety of applications, owing to their excellent speed, scalability, and theoretical guarantees. However, classical de novo algorithms are hindered by cubic time and quadratic memory complexity, which makes them impractical for emerging datasets containing millions of sequences. Recent placement-based alternatives provide better algorithmic scalability, but they also face practical scaling challenges due to the high cost of computing evolutionary distances and significant memory usage. Current tools also do not fully utilize the parallel processing capabilities of modern CPU and GPU architectures. Results: We present DIPPER, a novel distance-based phylogenetic tool for ultrafast and ultralarge phylogenetic reconstruction on GPUs, designed to maintain high accuracy and a small memory footprint. DIPPER introduces several innovations, including a divide-and-conquer strategy, a placement strategy, and an on-the-fly distance calculator that greatly improve the runtime and memory complexity. These allow DIPPER to achieve runtime and space complexity of O(N log N) and O(N), respectively, with N taxa. With divide-and-conquer, DIPPER is also able to maintain a low memory footprint on the GPU, independent of the number of taxa. DIPPER consistently outperforms existing methods in speed, accuracy, and memory efficiency, and scales to tree sizes 1-2 orders of magnitude beyond the limits of existing tools. With the help of a single NVIDIA RTX A6000 GPU, DIPPER is able to reconstruct a phylogeny from 10 million unaligned sequences in under 7 hours, making it the only distance-based method to operate at this scale and efficiency. Availability: DIPPER's code is freely available under the MIT license at https://github.com/TurakhiaLab/DIPPER, and the documentation for DIPPER is available at https://turakhia.ucsd.edu/DIPPER.
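
To make the quadratic-memory problem concrete, here is a back-of-the-envelope sketch of what a full pairwise distance matrix would cost at the 10-million-sequence scale mentioned in the abstract. The 4 bytes per entry (one 32-bit float) and the one-value-per-taxon linear case are assumptions chosen for illustration; they are not figures from the DIPPER paper, and real O(N) structures carry larger constants.

```python
def distance_matrix_bytes(n_taxa: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to hold a full n x n distance matrix (quadratic memory)."""
    return n_taxa * n_taxa * bytes_per_entry


def linear_bytes(n_taxa: int, bytes_per_taxon: int = 4) -> int:
    """Bytes needed when storage grows linearly with the number of taxa."""
    return n_taxa * bytes_per_taxon


if __name__ == "__main__":
    n = 10_000_000  # sequence count quoted in the abstract
    # bytes_per_entry=4 is an assumed 32-bit float, not a value from the paper
    print(f"quadratic: {distance_matrix_bytes(n) / 1e12:.0f} TB")
    print(f"linear:    {linear_bytes(n) / 1e6:.0f} MB")
```

Under these assumptions the full matrix needs hundreds of terabytes, which is why computing distances on the fly instead of storing them matters at this scale.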

Read full paper


When Cells Rebel: a comparative genomics investigation into marsupial cancer susceptibility

Petrohilos, C., Peel, E., Silver, L. W., Grady, P. G. S., O'Neill, R. J., Hogg, C. J., Belov, K.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We identified large expansions in Ras genes, a family of oncogenes, in the Dasyuridae, a carnivorous marsupial family with high cancer prevalence, and a similar expansion in the bandicoot and bilby.

Abstract: Cancer is ubiquitous in multicellular life, yet susceptibility varies significantly between species. Previous studies have shown a genetic basis for cancer resistance in many species, but few studies have investigated the inverse: why some species are particularly susceptible to cancer. The Dasyuridae are a family of carnivorous marsupials that are frequently reported as having high rates of cancer prevalence. We hypothesised that this high susceptibility also has a genetic basis. To investigate this, we generated reference genomes for the kowari (Dasyuroides byrnei), a dasyurid species with one of the highest rates of reported cancer prevalence among mammals, and a non-dasyurid marsupial, the eastern barred bandicoot (Perameles gunnii). We used these to perform a comparative genomics analysis alongside nine previously assembled reference genomes: four dasyurid species and five non-dasyurid marsupial species. Genomes were annotated using FGENESH++ and assigned to orthogroups for input to CAFE (Computational Analysis of gene Family Evolution) analysis to identify gene families that had undergone significant expansions or contractions in each lineage. In the dasyurids, we identified large expansions in Ras genes, a family of oncogenes. Interestingly, a similar expansion of Ras genes was also identified in the bandicoot and bilby. These genes were primarily expressed in tissues such as testes, ovaries and yolk sac, so we hypothesise they serve a reproductive role. Future work is required to identify the potential roles of oncogene expansions in cancer susceptibility in these marsupial species.

Read full paper


MLL2 facilitates long-range gene regulation through LINE1 elements

Zorro Shahidian, L., Di Filippo, L., Robert, S. M., Rada-Iglesias, A.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We identify a previously unrecognized regulatory function for MLL2 at the CG-rich 5' untranslated regions (5'UTR) of evolutionarily young LINE-1 (L1) transposable elements (TE).

Abstract: Transcriptional regulation is tightly linked to chromatin organization, with H3K4me3 commonly marking both active and bivalent promoters. In embryonic stem cells (ESC), MLL2 is essential for H3K4me3 deposition at bivalent promoters, which has been proposed to facilitate the induction of major developmental genes during pluripotent cell differentiation. However, prior studies point to a functional discrepancy between the loss of H3K4me3 at bivalent promoters and the largely unaltered transcription of major developmental genes in Mll2-/- cells. In this study, we investigated MLL2-dependent gene regulation in mouse ESC and during their differentiation. Contrary to the prevailing view, we show that MLL2's primary role is not to oppose Polycomb-mediated repression at the bivalent promoters of developmental genes. Instead, we identify a previously unrecognized regulatory function for MLL2 at the CG-rich 5' untranslated regions (5'UTR) of evolutionarily young LINE-1 (L1) transposable elements (TE). We found that MLL2 binds to the 5'UTR of L1 elements and is critical for maintaining their active state (H3K4me3 and H3K27ac), while preventing the accumulation of repressive H3K9me3. Using both global genomic approaches (i.e. RNA-seq, ChIP-seq and Micro-C) as well as targeted L1 deletions, we demonstrate that these MLL2-bound L1 elements act as enhancers, modulating the expression of neighboring genes in ESC and, more prominently, during differentiation. Together, our findings illuminate novel aspects of MLL2 regulatory function during early developmental transitions and highlight the emerging role of TE as key components of long-range gene expression control.

Read full paper


GFFx: A Rust-based suite of utilities for ultra-fast genomic feature extraction

Chen, B., Dongya, W., Zhang, G.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We present GFFx, a Rust-based toolkit for ultra-fast and scalable genome annotation access, with significant improvements in runtime and scalability over existing tools.

Abstract: Genome annotations are becoming increasingly comprehensive due to the discovery of diverse regulatory elements and transcript variants. However, this improvement in annotation resolution poses major challenges for efficient querying, especially across large genomes and pangenomes. Existing tools often exhibit performance bottlenecks when handling large-scale genome annotation files, particularly for region-based queries and hierarchical model extraction. Here, we present GFFx, a Rust-based toolkit for ultra-fast and scalable genome annotation access. GFFx introduces a compact, model-aware indexing system inspired by binning strategies and leverages Rust's strengths in execution speed, memory safety, and multithreading. It supports both feature- and region-based extraction with significant improvements in runtime and scalability over existing tools. Distributed via Cargo, GFFx provides a cross-platform command-line interface and a reusable library with a clean API, enabling seamless integration into custom pipelines. Benchmark results demonstrate that GFFx offers substantial speedups and makes a practical, extensible solution for genome annotation workflows.
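
The abstract says the GFFx index is "inspired by binning strategies" without giving details. As a point of reference only, the sketch below shows the classic hierarchical binning scheme used by the UCSC genome browser and the SAM/BAM specification, where each interval is assigned to the smallest bin that fully contains it. This is not GFFx's actual index layout; the function name and the 0-based half-open coordinates follow the SAM convention and are assumptions here.

```python
def reg2bin(beg: int, end: int) -> int:
    """Return the UCSC/SAM bin for the 0-based half-open interval [beg, end).

    Bins are nested: 16 kb, 128 kb, 1 Mb, 8 Mb, 64 Mb, and one 512 Mb root bin.
    """
    end -= 1  # switch to an inclusive end coordinate
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kb bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kb bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mb bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mb bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mb bins start at 1
    return 0                                       # root bin covers everything


if __name__ == "__main__":
    # A GFF feature at 1-based inclusive 1001..2000 becomes [1000, 2000) here.
    print(reg2bin(1000, 2000))
```

A region query then only needs to inspect the handful of bins that can overlap the query interval, rather than scanning every feature on the chromosome.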

Read full paper


Scaling Molecular Representation Learning with Hierarchical Mixture-of-Experts

Zhang, X., Yu, S., Xia, J., Yang, F.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We introduce H-MoE, a hierarchical Mixture-of-Experts network for molecular representation learning.

Abstract: Recent advancements in large-scale self-supervised pretraining have significantly improved molecular representation learning, yet challenges persist, particularly when addressing distributional shifts (e.g., under scaffold-split). Drawing inspiration from the success of Mixture-of-Experts (MoE) networks in NLP, we introduce H-MoE, a hierarchical MoE model tailored for molecular representation learning. Since conventional routing strategies struggle to capture global molecular information such as scaffold structures, which are crucial for enhancing generalization, we propose a hierarchical routing mechanism. This mechanism first utilizes scaffold-level structural guidance before refining molecular characteristics at the atomic level. To optimize expert assignment, we incorporate scaffold routing contrastive loss, ensuring scaffold-consistent routing while preserving discriminability across molecular categories. Furthermore, a curriculum learning approach and dynamic expert allocation strategy are employed to enhance adaptability. Extensive experiments on molecular property prediction tasks demonstrate the effectiveness of our method in capturing molecular diversity and improving generalization across different tasks.

Read full paper


Reinforcement Learning for Antibody Sequence Infilling

Lee, C. S., Hayes, C. F., Vashchenko, D., Landajuela, M.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We introduce a flexible framework for antibody sequence design that combines an infilling language model with reinforcement learning to optimize functional properties.

Abstract: We introduce a flexible framework for antibody sequence design that combines an infilling language model with reinforcement learning to optimize functional properties. Our approach leverages a pretrained infilling language model to generate specific antibody regions within full sequences, guided by reinforcement learning to improve desired biophysical characteristics. We implement a range of online learning strategies, exploring both vanilla REINFORCE and Proximal Policy Optimization with Kullback-Leibler (KL) regularization, and demonstrate that KL regularization is essential for maintaining a balance between score optimization and sequence plausibility. We also adapt Direct Reward Optimization to the protein domain by adding a value head to the infilling model, allowing it to learn directly from static (prompt, response, feedback) datasets using a mean-squared error objective. This formulation is particularly useful when only single-trajectory data is available, which is commonly the case for historically collected experimental assays. We evaluate both the online and offline methods across multiple antibody design tasks (including binding affinity, immunogenicity, and expression) and show that our framework improves alignment with measured biophysical properties while outperforming likelihood-only baselines. This integrated online/offline approach enables functionally driven antibody design and provides a scalable toolkit for therapeutic sequence engineering. Code and data are available at https://github.com/LLNL/protein_tune_rl.
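
The role of KL regularization can be illustrated with a small sketch. The abstract does not give the exact objective, so the form below (task reward minus a weighted KL divergence between the fine-tuned policy and the pretrained reference model) is the common textbook formulation and an assumption here. The function names, the `beta` weight, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_reward(reward, policy_probs, reference_probs, beta=0.1):
    """Task reward penalized by how far the policy drifts from the reference.

    beta is a hypothetical penalty weight, not a value from the paper.
    """
    return reward - beta * kl_divergence(policy_probs, reference_probs)


if __name__ == "__main__":
    reference = [0.5, 0.5]  # toy next-residue distribution of the pretrained model
    drifted = [0.9, 0.1]    # toy distribution after aggressive score optimization
    print(kl_regularized_reward(1.0, reference, reference))  # no penalty applied
    print(kl_regularized_reward(1.0, drifted, reference))    # penalized for drift
```

A larger `beta` keeps generated sequences closer to what the pretrained model considers plausible, at the cost of slower score improvement.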

Read full paper


LIPTER, a cardiomyocyte-enriched long noncoding RNA, controls cardiac cytoskeletal maturation and is regulated by a cardiomyocyte-specific enhancer.

Nzelu, G. A., Lee, M., Koslowski, S., Zheng, W., Benzaki, M., Mak, M., Xiao, W., Tan, L. W., Dashi, A., Zhu, Y., Fawaz, T., Ng, K., Pham, D., LeBlanc, F., Lettre, G., Hussin, J., Foo, R.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We have unravelled a novel role of LIPTER in the cytoskeletal maturation of adult cardiomyocytes and have identified a CM-specific regulatory enhancer that regulates the expression of LIPTER in CMs.

Abstract: Cardiac development is characterized by a complex series of molecular, cytoskeletal and electrophysiological changes that guarantee the proper functioning of adult cardiomyocytes (CMs). These changes are defined by cell-type-specific transcriptional rewiring of progenitor cells to form CMs, and are regulated by various epigenetic elements, such as long noncoding RNAs (lncRNAs). LncRNAs are versatile epigenetic regulators as they may act in cis or in trans to orchestrate important gene programs during cardiac development and may concurrently encode micropeptides. LIPTER is one such lncRNA, previously shown to regulate lipid droplet transport in cardiomyocytes and thus an important regulator of cardiomyocyte metabolism. Here we show that LIPTER also plays a role in the cytoskeletal maturation of CMs, as loss of LIPTER leads to persistent expression of fetal genes, changes in chromatin accessibility, disorganized sarcomeres and impaired calcium homeostasis in CMs. Furthermore, we have identified a cardiomyocyte-specific regulatory enhancer that regulates the expression of LIPTER in CMs. CRISPR-mediated inhibition of this enhancer led to reduced LIPTER expression in CMs and increased expression of fetal genes. This CM-specific enhancer could therefore be manipulated to control the expression of LIPTER for therapeutic benefit. In summary, we have unravelled a novel role of LIPTER in CM cytoskeletal maturation and have identified a CM-specific enhancer for LIPTER.

Read full paper


Estrogen Receptor Enhancers Sensitive to Low Doses of Hormone Specify Distinct Molecular and Biological Outcomes

Kim, H. B., Nandu, T., Camacho, C. V., Kraus, W. L.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We show that low and high doses of estradiol can have different biological effects on breast cancer cells.

Abstract: Adult women are typically exposed to estradiol (E2) concentrations of ~100-200 pM, yet most cell-based studies use 100 nM. We determined the molecular effects of E2 concentrations spanning six orders of magnitude (1 pM to 100 nM) in breast cancer cells. Estrogen receptor alpha (ER) enhancers formed at low physiological doses of E2 (1-100 pM) are mechanistically distinct from those that form at high pharmacological doses (10-100 nM). They (1) form in open chromatin bound by FOXA1, (2) produce enhancer RNAs enriched with functional eRNA regulatory motifs (FERMs), and (3) drive expression of cell proliferation genes with promoter-proximal paused RNA polymerase II. Importantly, low dose ER enhancer usage is elevated in breast cancer patients with poor responses to aromatase inhibitors, likely as a continued response to low circulating levels of E2. Collectively, our results identify mechanistic differences between low and high dose ER enhancers that specify distinct biological outcomes.

Read full paper


Gene co-expression network reveals key hub genes associated with endometriosis using bulk RNA-seq

Hashemi, A., Ghahramani, N., Eroglu, S., Kordlar, E. E.

Source: bioRxiv Published: 2025-08-12

bioinformatics, genomics

tl;dr: We identified hundreds of key genes involved in the molecular mechanisms of endometriosis and identified key regulatory networks in several KEGG pathways related to EMs.

Abstract: Endometriosis (EMs) is a complex and prevalent gynecological disorder with a significant genetic component, posing a major clinical challenge in reproductive medicine due to multifactorial inheritance patterns and the involvement of gene-environment interactions in pathophysiology. However, despite extensive research, reliable diagnostic biomarkers for EMs have yet to be identified. We utilized bulk transcriptome sequencing data obtained from the Gene Expression Omnibus to identify hub genes involved in EMs. This study was conducted using a systems biology analysis, incorporating differential gene expression, meta-analysis of transcriptomic data, functional enrichment analysis, construction of gene co-expression networks, and comprehensive topological analysis to identify key regulatory genes. Bulk RNA-seq analysis revealed significant differential gene expression between healthy and EMs groups. Overall, 603 and 443 meta-genes were discovered using the Fisher and inverse normal P-value combination methods, respectively. A total of 427 meta-genes were subjected to functional enrichment analysis, which revealed significant enrichment in several KEGG pathways related to EMs including "Adherens junction," "p53 signaling pathway," and "AMPK signaling pathway." Additionally, Gene Ontology analysis revealed key processes including "Regulation of Anatomical Structure Morphogenesis," "Acetylglucosaminyltransferase Activity" and "Positive Regulation of Intracellular Signal". Co-expression network analysis identified the turquoise module as a critical functional module. Within this module, the genes IGFBP7, IGFBP3, and NKAP were identified as EMs hub genes based on high connectivity and central roles in the network. The constructed protein-protein interaction network further highlighted STAR, PLCD3, RPAP2, MSI2, MAS1, TBX1, LIPT1, and SVIL as key genes.
These genes represented high centrality within the network, suggesting potential regulatory and functional significance in the molecular mechanisms underlying EMs. Notably, miR-143-3p, miR-340-5p, miR-410-3p, and miR-302b-5p were implicated in EMs-associated regulatory networks. This integrative approach significantly enhances our understanding of the molecular mechanisms underlying EMs and provides a robust foundation for the development of diagnostic biomarkers.
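
For readers unfamiliar with the Fisher P-value combination method named in the abstract, here is a minimal sketch of the standard technique. For k independent p-values, the statistic X = -2 Σ ln(p_i) follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. Because the degrees of freedom are always even, the tail probability has a closed form and needs no external library. The function name and example p-values are hypothetical; the study's actual pipeline and data are not shown in the abstract.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values (each in (0, 1]) with Fisher's method.

    Uses the closed-form chi-square survival function for 2k degrees of
    freedom: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical p-values for one gene across three independent studies.
    print(fisher_combine([0.04, 0.10, 0.07]))
```

Several moderately small p-values can combine into a much smaller one, which is how a gene with consistent but modest evidence in each dataset can emerge as a meta-gene.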

Read full paper


GeneF: A High-Performance Processing-in-Memory Accelerator for Efficient DNA Alignment

Verandani, K.

Source: bioRxiv Published: 2025-08-12

bioinformatics, synthetic biology

tl;dr: We propose GeneF, a Processing-in-Memory (PIM) accelerator designed specifically for DNA alignment tasks, leveraging 3D-stacked memory to enhance memory bandwidth and computing parallelism.

Abstract: In this paper, we explore the compute and memory characteristics of the FM-index and identify data movement as a significant contributor to overall energy consumption in genomic processing. We propose GeneF, a Processing-in-Memory (PIM) accelerator designed specifically for DNA alignment tasks, leveraging 3D-stacked memory to enhance memory bandwidth and computing parallelism. Our architecture features a custom RISC-V-based processing element (PE) array, a lightweight messaging mechanism to mitigate remote access latency, and specialized prefetchers for improved efficiency. Experimental results demonstrate that GeneF achieves substantial speedups - 1820x for counting and 1728x for determining stages - over traditional CPU implementations and offers remarkable energy efficiency, consuming only 25 percent of the energy compared to conventional CPU-DDR3 systems. The findings highlight the potential of PIM architectures in minimizing data movement and enhancing performance for genomic workloads, paving the way for more energy-efficient computing solutions in the field of bioinformatics.
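
As background on the FM-index workload the abstract refers to, the sketch below counts pattern occurrences with backward search over a Burrows-Wheeler transform. It is a deliberately naive software illustration (sorted rotations, linear-time rank queries), not the GeneF hardware design, and all function names are hypothetical. The point to notice is that each step jumps to a data-dependent position in the index, the kind of irregular memory access that makes data movement costly.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for tiny inputs)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text: str, pattern: str) -> int:
    """Count exact matches of pattern in text using FM-index backward search."""
    last = bwt(text)
    counts = Counter(last)
    first_row, total = {}, 0
    for ch in sorted(counts):  # first_row[ch] = number of characters smaller than ch
        first_row[ch] = total
        total += counts[ch]

    def rank(ch: str, i: int) -> int:
        return last[:i].count(ch)  # occurrences of ch in last[:i]

    lo, hi = 0, len(last)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(ch, lo)
        hi = first_row[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences("GATTACATTA", "TTA"))
```

Production indexes replace the linear `rank` scan with sampled occurrence tables, but the access pattern stays data-dependent.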

Read full paper


GIN-CRC-Pareto: A graph-based Pareto-optimal multi-task learning framework to identify miRNA-target interactions in colorectal cancer

Li, L., Yang, Q., Li, L., Zhao, H., Xu, J., Xie, M., Yin, R.

Source: bioRxiv Published: 2025-08-12

bioinformatics, synthetic biology

tl;dr: We propose GIN-CRC-Pareto, a graph-based, Pareto-optimal multi-task learning framework that simultaneously predicts miRNA-mRNA binding pairs, identifies seed match pairings, and classifies seed match subtypes.

Abstract: Colorectal cancer (CRC) has the third highest incidence among malignancies in humans and is the second most common cause of cancer-related mortality in the United States. Accumulating evidence has established microRNAs (miRNAs) as critical regulators of cancer development and therapeutic response. Understanding miRNA-mRNA interactions is critical for elucidating the molecular mechanisms driving CRC and other malignancies. In this study, we proposed GIN-CRC-Pareto, a graph-based, Pareto-optimal multi-task learning framework that simultaneously predicts miRNA-mRNA binding pairs, identifies seed match pairings, and classifies seed match subtypes. By leveraging the power of graph neural networks and a Pareto-optimal gradient balancing strategy, GIN-CRC-Pareto dynamically adjusted the task weights during training to optimize each task without compromising the others. Experimental results demonstrated that our framework consistently outperforms traditional deep learning models and existing state-of-the-art tools across multiple evaluation metrics, with 0.909 in accuracy, 0.909 in precision and 0.968 in AUC in the miRNA-mRNA binding pairs prediction task. We further validated the generalizability of the framework in combination with transfer learning techniques to identify miRNA-target interactions across other cancers. These findings highlight the effectiveness of the proposed framework to comprehensively identify the miRNA-target interactions in CRC, with the potential to serve as a scalable and generalizable tool across diverse cancer types, ultimately facilitating the development of miRNA-based therapeutics for cancer treatment.

Read full paper


Multi-Modal Protein Representation Learning with CLASP

Bolouri, N., Szymborski, J., Emad, A.

Source: bioRxiv Published: 2025-08-12

bioinformatics, synthetic biology

tl;dr: We introduce CLASP, a unified tri-modal framework that combines the strengths of geometric deep learning, natural large language models (LLMs), protein LLMs, and contrastive learning to learn informative protein representations based on their structure, amino acid sequence, and description.

Abstract: Effectively integrating data modalities pertaining to proteins' amino acid sequences, three-dimensional structures, and curated biological descriptions can lead to informative representations capturing different views of proteins. Here, we introduce CLASP, a unified tri-modal framework that combines the strengths of geometric deep learning, natural large language models (LLMs), protein LLMs, and contrastive learning to learn informative protein representations based on their structure, amino acid sequence, and description. We show that CLASP enables accurate zero-shot classification and retrieval tasks, such as matching a protein structure to its sequence or description, outperforming state-of-the-art baselines. CLASP embeddings also exhibit superior clustering by protein family, and ablation studies confirm that all three modalities contribute synergistically to performance. Our results highlight the power of integrating structural, sequential, and textual signals in a single model, establishing CLASP as a general-purpose embedding framework for protein understanding.

Read full paper


b-move: faster lossless approximate pattern matching in a run-length compressed index.

Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier

Source: PubMed Published: 2025-08-12

bioinformatics, genomics, pan-genomics

tl;dr: Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.

Abstract: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.
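
To see why run-length compressed indices suit pan-genomes, the sketch below run-length encodes the Burrows-Wheeler transform of a small text. Repeated content tends to place identical characters next to each other in the BWT, so the number of runs can grow much more slowly than the text length. This is only an illustration of that underlying idea, not the r-index, the move structure, or b-move itself; the function names and toy sequences are hypothetical.

```python
from itertools import groupby


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for tiny inputs)."""
    text += "$"  # sentinel that sorts before the DNA alphabet
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str):
    """Collapse consecutive equal characters into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


if __name__ == "__main__":
    genome = "GATTACAGATTACA"
    collection = genome * 4  # toy stand-in for several near-identical genomes
    for name, text in (("single", genome), ("collection", collection)):
        runs = run_length_encode(bwt(text))
        print(name, "length:", len(text) + 1, "runs:", len(runs))
```

Index size in this family of structures scales with the number of runs rather than with the total sequence length.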

Read full paper


You're receiving this email because you're using paper-trackr. Stay curious!