paper-trackr newsletter

bioinformatics, synthetic biology | Nov 24, 2025 | bioRxiv

General-Purpose Large Language Models, such as DeepSeek V3.2, Have Evolved Protein Design Capabilities

Li, J., Dong, X.

Abstract: General-Purpose Large Language Models (GLLMs), although primarily developed for natural language processing, are increasingly demonstrating emergent capabilities in specialized scientific domains. In this study, we explored the potential of GLLMs, specifically DeepSeek V3.2 Exp in reasoning mode, to perform practical protein engineering tasks without domain-specific biological training. Two representative design problems were addressed: generation of amino acid sequences predicted to adopt the canonical 4-helix bundle topology, and targeted mutation design to improve protein solubility while preserving core structural integrity. Across 49 generated 4-helix bundle candidates, 40 adopted the desired geometry, with 36 achieving pLDDT scores above 70. Solubility optimization on 50 representative proteins yielded 46 mutants with an average predicted score increase of 0.178, and 29 maintained structural deviations below 3 Å RMSD. These results indicate that general-purpose LLMs such as DeepSeek V3.2 can integrate sequence-structure-property relationships sufficiently to produce viable protein designs. We propose a hybrid workflow that couples GLLM-based mutation generation with established computational validation, offering an accessible route for protein and peptide engineering.
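For readers who want the reported counts as rates, the arithmetic can be sketched as below. This is only an illustration: the abstract does not say which denominator applies to the pLDDT and RMSD counts, so the sketch assumes each count is taken over the full set (49 candidates, 50 proteins). The names `REPORTED` and `percent` are made up for this example.

```python
# Counts quoted in the abstract, as (count, total).
# ASSUMPTION: every count is measured against the full set; the abstract
# does not state whether 36 and 29 refer to the full set or to a subset.
REPORTED = {
    "4-helix bundle, desired geometry": (40, 49),
    "4-helix bundle, pLDDT above 70": (36, 49),
    "solubility, mutants reported": (46, 50),
    "solubility, RMSD below 3 Angstrom": (29, 50),
}


def percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * count / total


if __name__ == "__main__":
    for label, (count, total) in REPORTED.items():
        print(f"{label}: {count}/{total} = {percent(count, total):.1f}%")
```

If the pLDDT or RMSD counts were instead measured only over the designs that passed the preceding filter, the corresponding totals in `REPORTED` would need to change.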

Latest in Science

General-Purpose Large Language Models, such as DeepSeek V3.2, Have Evolved Protein Design Capabilities

A programmable mRNA platform for miRNA detection via miRNA-mRNA2 triplex-mediated ribosomal frameshifting

Creating an energy efficient central metabolism for boosting biosynthesis without compromising cell growth of yeast

Sequence-to-graph alignment based copy number calling using a network flow formulation

MiRformer: a dual-transformer-encoder framework for predicting microRNA-mRNA interactions from paired sequences

Inferring virtual cell environments using multi-agent reinforcement learning

Accurate Probabilistic Reconstruction of Cell Lineage Trees from SNVs and CNAs with ScisTreeCNA

Sample-specific haplotype-resolved protein isoform characterization via long-read RNA-seq-based proteogenomics

Deconvolution of Sparse-count RNA Sequencing Data for Tumor Cells Using Embedded Negative Binomial Distributions

Reinforcement learning for adaptive control of phenotypically heterogeneous bacterial populations

TAp73 mediates anti-tumor immunity through regulation of lipid metabolism in the lung tumor microenvironment

In vivo assessment of differential toxicity of cancer treatment drugs in Fanconi Anemia

Multimodal Approach for Identification and Validation of Hepatocellular Carcinoma Targets for Radiotheranostics

Sphingosine-1-phosphate receptor modulators resensitize FLT3-ITD acute myeloid leukemia cells with NRAS mutations to FLT3 inhibitors

Prediction of graft loss in living donor liver transplantation during the early postoperative period

OmniCLIC: A Unified Omics Contrastive Learning Framework for Effective Integration and Classification of Multiomics Data