By Ezzah, M.Phil. Research Scholar (Pharmaceuticals)
Introduction
You can measure the progress of biology by how well we turn noise into signal. Single-cell RNA sequencing pushed the field into a new regime: millions of cells, thousands of genes per cell, and a combinatorial storm of context. Reading that by hand is like trying to understand a city by staring at every brick. The good news: we finally have a model that reads the city map.
Cell2Sentence turns raw gene expression into something language models can read, reason over, and write about. In practical terms, it converts a cell into a sentence, then asks an LLM to answer biological questions. The flagship result is not just a nice benchmark. Using this framework with the C2S-Scale 27B model, researchers predicted a drug context where antigen presentation jumps in tumor cells, then confirmed the effect in living human cell models. That is a credible step toward making “cold” tumors visible to the immune system, and it was born from text-like representations of cells, not from a bespoke algorithmic one-off.
1. What Is Cell2Sentence, And Why Should You Care
At its core, Cell2Sentence reframes single-cell data as language. Think of a cell as a ranked list of gene tokens, most expressed to least expressed. The model learns to read that list, then predicts, explains, and even generates biologically plausible cells. The promise here is pragmatic. You get to use modern LLMs, and their rapidly improving toolchain, on transcriptomic data without inventing a new architecture for every task. The paper benchmarks across cell type annotation, dataset interpretation, perturbation prediction, and question answering, with consistent gains as the model scales.
There is a bigger story too. Cell2Sentence pretrains with a next-token objective, like mainstream LLMs, which fits generative biology use cases. You are not guessing masked tokens in a non-linguistic sequence. You are learning to continue the “sentence” in a way that respects the biology behind it. The authors explicitly contrast this with masked-token setups like Geneformer.
2. How The “Sentence” Trick Actually Works

The representation is the point. A “cell sentence” is a space-separated list of gene names ordered by expression level. That simple move unlocks two benefits.
- Use what already works. Because the data now looks like text, you can fine-tune capable, well-supported LLMs, including Gemma 2 variants, with the same training pipeline that powers open language models. Cell2Sentence leans on next-token prediction to capture gene program hierarchies and context.
- Bridge text and transcriptomes. You can pair cell sentences with biological text, lab metadata, and question prompts, then train jointly. That turns the model into a bilingual interpreter, fluent in both literature and gene expression, which shows up in downstream tasks like sc-specific QA.
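To make the representation concrete, here is a toy sketch in plain Python, with made-up expression values and a handful of common genes, of how a ranked expression profile becomes a cell sentence. The Cell2Sentence package ships the maintained conversion utilities; this is only an illustration of the format.

```python
# Toy illustration of the cell-sentence idea: rank genes by expression,
# keep the top N, and join the gene names into a space-separated string.
# Expression values here are made up.
expression = {
    "MALAT1": 412.0,
    "ACTB": 187.0,
    "B2M": 175.0,
    "TMSB4X": 96.0,
    "EEF1A1": 88.0,
    "GAPDH": 41.0,
}

top_n = 5
ranked_genes = sorted(expression, key=expression.get, reverse=True)[:top_n]
cell_sentence = " ".join(ranked_genes)

print(cell_sentence)  # "MALAT1 ACTB B2M TMSB4X EEF1A1"
```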
2.1 Why This Is More Than A Clever Encoding
Two signals matter. First, the model’s representation captures higher-order structure, which the authors quantify using an adapted single-cell Fréchet Inception Distance. scFID measures similarity between distributions of real and generated cell embeddings, using a single-cell foundation model backend, rather than individual gene-by-gene errors. That aligns evaluation with the biology you care about, cell states and programs.
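For reference, the sketch below computes the standard Fréchet distance between two embedding distributions, the quantity that FID-style metrics are built on. The paper's scFID applies this kind of comparison on top of embeddings from a single-cell foundation model; the placeholder arrays and exact details here are illustrative, not the study's implementation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """FID-style Fréchet distance between Gaussian fits of two embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    # matrix square root of the covariance product; discard tiny imaginary parts
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# placeholder embeddings standing in for real vs. generated cells
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 64))
generated = rng.normal(loc=0.3, size=(200, 64))
print(frechet_distance(real, generated))
```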
Second, reinforcement learning on biologically meaningful rewards tightens the loop. Group Relative Policy Optimization improves fidelity on immune pathways and helps the model generalize to unseen perturbations. It also uses scGPT as the embedding backbone.
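To make the group-relative idea concrete, here is a minimal sketch of how GRPO-style training turns the rewards of a group of sampled completions into normalized advantages. The reward values are placeholders standing in for a biology-aware scorer such as pathway fidelity; this illustrates the general method, not the paper's training code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Turn one prompt's group of completion rewards into group-relative advantages.

    GRPO-style training samples several completions per prompt and scores each one
    against the group mean and spread, avoiding a separate learned value function.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. rewards from a biology-aware scorer for four generated responses to one prompt
print(group_relative_advantages([0.82, 0.41, 0.77, 0.30]))
```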
3. What Is C2S-Scale 27B

The 27-billion-parameter model is a Gemma 2 based decoder-only transformer, trained within the Cell2Sentence framework. Pretraining runs on a corpus of more than 50 million single-cell transcriptomes with metadata and text, and the study evaluates models from 410M up to 27B parameters across single-cell tasks. Training uses GPUs for smaller variants and TPUs for the larger ones. The key takeaway: bigger models deliver consistently better biological predictions and explanations, which aligns with the scaling behavior we see in language.
3.1 What It Does Well
C2S-Scale scores top marks on perturbation prediction for unseen combinations of cell type, cytokine, and exposure duration, and shows stable rankings under scFID and distributional metrics. It also improves apoptosis-focused bulk predictions and sc-specific QA when you fine-tune with targeted rewards. In short, it transfers to contexts the model never saw and still keeps the biology coherent.
4. From Hypothesis To Lab Bench, A Credible Cancer Result

Immunotherapy often fails because many tumors are “cold”: they do not show the immune system enough to trigger a strong response. The study set up a dual-context in-silico screen to predict drugs that would selectively boost MHC-I antigen presentation only when a modest interferon signal is present. That specificity matters in the clinic. You want a context-aware amplifier, not a blunt lever.
Cell2Sentence flagged silmitasertib, a CK2 inhibitor, as a top candidate with a pronounced context split: a strong predicted increase in the interferon-positive setting and minimal effect in the neutral one. The effect was novel with respect to prior literature, which is exactly the kind of fresh angle you want from a model that reads both text and cells. The team then validated the prediction in two human neuroendocrine cell models that were unseen during training, with low-dose interferon plus silmitasertib producing a marked increase in antigen presentation compared to either alone. That is how an idea leaves the model card and lands on the bench.
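The core of such a screen can be sketched as a ranking exercise: for each drug, compare the model's predicted change in an antigen-presentation readout with and without the interferon context, and sort by the size of the context split. The snippet below is a hypothetical scaffold, with `predict_mhc1_score` standing in for whatever model call or scoring pipeline you wire up; it is not the paper's code.

```python
def rank_context_conditional_hits(drugs, predict_mhc1_score):
    """Rank drugs by how much they boost a predicted MHC-I score only in the IFN context.

    `predict_mhc1_score(drug, interferon)` is a placeholder for a model-backed scorer
    that returns a predicted antigen-presentation readout for one condition.
    """
    results = []
    for drug in drugs:
        with_ifn = predict_mhc1_score(drug, interferon=True)
        without_ifn = predict_mhc1_score(drug, interferon=False)
        baseline_ifn = predict_mhc1_score(None, interferon=True)
        baseline_neutral = predict_mhc1_score(None, interferon=False)

        # drug effect within each context, then the split between the two contexts
        ifn_effect = with_ifn - baseline_ifn
        neutral_effect = without_ifn - baseline_neutral
        results.append((drug, ifn_effect - neutral_effect, ifn_effect, neutral_effect))

    # largest context split first: strong boost with IFN, little effect without it
    return sorted(results, key=lambda row: row[1], reverse=True)
```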
5. How It Compares To Geneformer And scGPT
Cell2Sentence sits in the same family as single-cell foundation models, but with key differences that matter for work in the lab.
| Model | Data Representation | Pretraining Objective | Base Architecture | Distinct Strengths | Notes |
|---|---|---|---|---|---|
| Cell2Sentence, C2S-Scale 27B | Ranked “cell sentences” of gene tokens | Next-token generation that learns to continue gene sequences conditioned on context | Gemma 2 and Pythia, decoder-only | Scales cleanly from 410M to 27B with consistent gains, transfers to unseen perturbation contexts, supports generation and QA | Pretrained on 50M+ transcriptomes plus metadata and text; flagship hypothesis validated in the wet lab |
| Geneformer | Gene sequences not cast as natural language | Masked-token prediction | Transformer | Solid baseline for sc tasks | Paper contrasts masked-token setup with the generative objective used by C2S-Scale |
| scGPT | Foundation model embeddings for single-cell profiles | Generative pretraining on gene expression profiles | Transformer | Good for representation learning | Used in this study as the embedding backbone for the scFID metric |
The theme is clear. Cell2Sentence wins by choosing a representation that lets you bring mainstream LLM strengths into biology, then aligning learning and evaluation with the structures biologists care about. That is why the model can scale cleanly and why its predictions remain grounded when you push into unseen contexts.
6. How To Use Cell2Sentence, A Minimal Step-By-Step Guide
You can approach this from two angles: the Python package for end-to-end workflows, or the Hugging Face model for quick inference with Gemma-based weights. The steps below optimize for a fast path to “hello cell type,” then point to the deeper tutorials.
6.1 Grab The Core Resources
- Paper, Scaling Large Language Models for Next-Generation Single-Cell Analysis (bioRxiv).
- Code and docs, van Dijk Lab GitHub, vandijklab/cell2sentence.
- Models, Hugging Face collection, including C2S-Scale Gemma-2 27B.
These are the canonical entry points for Cell2Sentence and C2S-Scale workflows.
6.2 Set Up A Clean Environment
The repo includes a straightforward setup. From a terminal:
```bash
# clone the repository
git clone https://github.com/vandijklab/cell2sentence.git
cd cell2sentence

# create and activate a conda environment
conda create -n cell2sentence python=3.8 -y
conda activate cell2sentence

# install dev deps and the package
make install

# or use pip for the package only
pip install cell2sentence==1.1.0

# optional speedups for long gene lists
pip install flash-attn --no-build-isolation
```

This gives you the Cell2Sentence package for core tasks, including inference with existing models and fine-tuning on your own datasets. The flash-attention step accelerates long sequences, useful when you generate or score cells with many genes.
6.3 Run A “Hello Cell Type” With The 27B Model
If you have a capable GPU, you can load the Gemma-based 27B model directly from the Hub:
```python
# pip install accelerate transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "vandijklab/C2S-Scale-Gemma-2-27B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load in bfloat16 and let accelerate place the 27B weights across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# cell sentence: gene names ordered by descending expression
# (use a few hundred or more of the cell's top genes; the prompt below assumes 1000)
cell_sentence = "MALAT1 TMSB4X B2M EEF1A1 H3F3B ACTB ..."

prompt = (
    "The following is a list of 1000 gene names ordered by descending expression "
    "in a Homo sapiens cell. Your task is to give the cell type which this cell belongs "
    "to based on its gene expression.\n"
    f"Cell sentence: {cell_sentence}.\n"
    "The cell type corresponding to these genes is:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That is the lowest-friction way to query the model for cell type predictions using the Cell2Sentence format. It demonstrates how to use Gemma for biology without writing custom architectures.
6.4 Prepare Your Own Data As Cell Sentences
You need per-cell ranked gene lists. A minimal recipe:
- Normalize your scRNA-seq counts.
- Rank genes per cell by expression.
- Convert the top N genes to a space-separated string, most to least expressed.
- Add simple metadata tokens if you want the model to condition on tissue or species.
- Prompt as shown above, or use the package’s tutorial notebooks for end-to-end pipelines.
This matches the study’s pretraining formulation and keeps your inputs aligned with what the model expects.
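Here is a minimal, numpy-only sketch of that recipe for a small counts matrix. The Cell2Sentence package and tutorial notebooks provide the maintained, AnnData-aware version of this preprocessing, so treat this as an illustration of the format rather than the official pipeline.

```python
import numpy as np

def counts_to_cell_sentences(counts, gene_names, top_n=200, target_sum=1e4):
    """Convert a cells x genes count matrix into ranked cell sentences.

    counts: 2D array of raw counts, one row per cell
    gene_names: gene symbols matching the columns of `counts`
    """
    counts = np.asarray(counts, dtype=np.float64)
    gene_names = np.asarray(gene_names)

    # simple library-size normalization followed by log1p
    libsize = counts.sum(axis=1, keepdims=True)
    normed = np.log1p(counts / np.clip(libsize, 1.0, None) * target_sum)

    sentences = []
    for row in normed:
        # rank genes by expression, highest first, and keep the top expressed genes
        order = np.argsort(row)[::-1]
        order = order[row[order] > 0][:top_n]
        sentences.append(" ".join(gene_names[order]))
    return sentences

# tiny toy example: 2 cells x 4 genes
counts = [[5, 0, 12, 3],
          [0, 7, 1, 9]]
genes = ["ACTB", "B2M", "MALAT1", "GAPDH"]
print(counts_to_cell_sentences(counts, genes, top_n=3))
```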
| Your Goal | Fastest Path | Notes |
|---|---|---|
| Try the model on a sample cell | Load Gemma-2 27B from Hugging Face and prompt with a cell sentence | Good for exploration and demos |
| Batch annotate a dataset | Use the Cell2Sentence Python package and tutorial notebooks | Adds preprocessing and batching utilities |
| Speed up long prompts | Install flash-attention | Helpful for large N gene lists |
| Tune for your tissue of interest | Follow LoRA or full fine-tune recipes in the docs, sketched after this table | Preserve base knowledge, adapt to domain |
| Explore perturbation predictions | Use the perturbation tutorials and reward-shaped prompts | Aligns with the study’s evaluation setup |
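If you go the LoRA route, a typical PEFT configuration for a Gemma-style decoder looks like the sketch below. The exact target modules, ranks, and training loop should follow the recipes in the Cell2Sentence docs; the hyperparameters here are placeholders, not the project's published settings.

```python
# pip install peft transformers accelerate
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_id = "vandijklab/C2S-Scale-Gemma-2-27B"  # or a smaller C2S variant for experimentation
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")

# low-rank adapters on the attention projections; ranks and targets are illustrative
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters train, base weights stay frozen
```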
7. Where This Leaves Computational Biology
Once you cast cells as sentences, the rest of the LLM toolkit starts to fit. The study shows improvements in sc-specific QA after reinforcement learning with a text-based reward, which suggests these models can learn to explain data, not just label it. That matters for lab workflows, because explanation helps you decide what to validate, and where.
On the generative side, the introduction of scFID adds an evaluation lens that respects cell-state geometry rather than only per-gene errors. That gives you a metric that tracks realism in the right space and produces stable model rankings. Pair that with distributional distances and rank correlations, and you get a healthier picture of whether generated or predicted profiles look like biology, not just like numbers.
The cancer result is the best signal. A credible AI cancer breakthrough requires more than a good scatter plot. You need a mechanism that makes biological sense, context dependence you can control, and wet-lab confirmation in human cells. The silmitasertib prediction checks those boxes and points to combination therapy ideas worth pursuing. It does not declare victory. It shows a process you can repeat across pathways and tissues.
There is also a practical upside for teams building tools. If you are asking how to use Gemma for biology, this is a blueprint. Fine-tune on cell sentences with lightweight adapters, align with rewards that reflect your assay, and evaluate in an embedding space that knows what a cell is. That is a much more direct route than maintaining a zoo of bespoke models per task.
8. Final Thoughts, And A Clear Next Step
Cell2Sentence gives biology a common language with AI. It lets you unify gene expression, metadata, and literature into one workflow, then scale compute and training like any strong LLM effort. The success of C2S-Scale 27B is not just about parameter count. It is about choosing the right representation, aligning the objective with the work, and validating predictions where it matters, in living cells.
If you lead a lab or a platform team, here is your call to action. Spin up a small project this week. Pick one tissue, one perturbation, and one question you already care about. Convert a subset of cells into sentences, prompt C2S-Scale, shortlist the top explanations, and design a quick in-vitro test. Share your wins and misses, then iterate. This is how you turn Cell2Sentence from a paper into a habit.
Appendix: Practical Notes For Teams
- Data hygiene first. Ranking garbage yields readable garbage. Keep preprocessing simple and transparent.
- Start with knowns. Validate against a pathway where you expect a signal before prospecting for novelty.
- Right-sized prompts. Use a few hundred top genes per cell to begin. Increase if you need finer discrimination.
- Mix natural language with cell sentences. Context tokens and short task instructions help the model focus.
- Prefer in-distribution tests for your first run. Then push into the unknowns and log what drifts.
FAQs
1) What is Cell2Sentence and how does it work?
Cell2Sentence converts a cell’s gene expression into a ranked list of gene tokens, a “cell sentence.” Large language models trained on these sentences can label cell types, predict perturbations, and generate realistic profiles by continuing or interpreting the sequence as text.
2) How does the C2S-Scale 27B model generate a new cancer hypothesis?
C2S-Scale 27B ran a dual-context virtual screen across thousands of drugs and predicted that silmitasertib, combined with low-dose interferon, would boost antigen presentation. Yale labs then validated the effect in human cell models, confirming the hypothesis from in-silico to in-vitro.
3) Is the Cell2Sentence model open-source and how can I access it?
The Cell2Sentence codebase and docs are public, and the C2S-Scale Gemma-2 27B weights are hosted on Hugging Face. You can read the preprint, clone the GitHub repo, and load the model card to run local inference or adapt workflows.
4) How does Cell2Sentence compare to models like scGPT or Geneformer?
All are single-cell foundation approaches. Cell2Sentence stands out by casting expression data as natural-language sequences and training with next-token objectives. This lets it leverage mainstream LLM tooling and improves transfer on tasks like QA, perturbation prediction, and dataset interpretation.
5) What are the practical applications of using Cell2Sentence for biological research?
Researchers can use it to annotate cell types, interpret datasets, predict perturbation and drug responses, and summarize results in natural language. Teams can fine-tune for tissues of interest and evaluate realism with single-cell aware metrics before planning wet-lab validation.
