Long-Read Sequencing Data Analysis Whitepaper

KMAP Standard Pipeline White Paper

Version 1.0 (Draft) · 2026-04-05

Introduction
Reference Genome Selection
PacBio HiFi WGS Pipeline
MAS-ISO-Seq Transcriptome Pipeline
Single-Cell RNA Sequencing Pipeline
Pipeline Implementation and Deployment
Visualization Strategy
Conclusion and Future Directions
References

1. Introduction

Despite advances in next-generation sequencing (NGS), short-read-based analyses have inherent limitations in structural variant (SV) detection, haplotype resolution, full-length transcriptome analysis, and direct methylation detection. PacBio HiFi (High-Fidelity) long-read sequencing technology achieves both average read lengths exceeding 15–20 kb and base-level accuracy above 99.9%, providing a powerful platform to overcome these limitations [1].

The Korean Molecular Atlas Project (KMAP) is a national initiative to build multi-omics molecular atlases of human tissues. Establishing a standardized, integrated pipeline for the analysis of PacBio HiFi long-read sequencing data is essential to the project's success. This white paper defines standard analysis procedures for long-read sequencing data based on internal KMAP discussions and multi-institutional consultations. Specifically, it covers three major pipelines: (i) PacBio HiFi whole-genome methylation analysis, (ii) MAS-ISO-Seq-based full-length transcriptome analysis, and (iii) single-cell RNA sequencing.

The design principles of this pipeline are:

Reproducibility: All software versions and parameters are explicitly documented, and workflows are automated through a workflow management system.
Multi-institutional compatibility: Flexible entry points are provided to accommodate data produced by different sequencing vendors.
Standardization: Reference genomes, annotation databases, and tool versions are unified across the entire project.

2. Reference Genome Selection

2.1 GRCh38 (HG38) as the Standard

The current KMAP standard pipeline adopts GRCh38 (HG38) as the default reference genome. Although T2T-CHM13 offers a complete telomere-to-telomere assembly with superior resolution of repetitive and centromeric regions [2], GRCh38 was selected as the standard at this time considering the maturity of gene annotations, compatibility with existing databases, and multi-institutional consensus.

2.2 Sub-version Unification

Since subtle differences exist between GRCh38 releases from different providers, it is essential to use an identical sub-version (e.g., GRCh38.p14, analysis set) across the entire project. This ensures coordinate compatibility between datasets from different vendors.

3. PacBio HiFi Whole Genome Sequencing Pipeline

3.1 Methylation Analysis

PacBio HiFi sequencing enables direct detection of 5mCpG methylation without bisulfite conversion by leveraging kinetic information from DNA polymerase activity (IPD: inter-pulse duration, PW: pulse width).

Step 1: 5mC Base Modification Calling (Jasmine)

Jasmine, developed by PacBio, computes methylation probabilities at each CpG site based on kinetic signatures in HiFi reads. The output is stored as MM (methylation position) and ML (methylation likelihood) tags in the BAM file.

Step 2: Reference Genome Alignment (pbmm2/minimap2)

HiFi reads containing MM/ML tags are aligned to the reference genome using minimap2 [3] or pbmm2. The --MD flag is included to preserve mismatch information, and mapping quality (MAPQ) statistics are computed to monitor alignment quality.

Step 3: Small Variant Calling (DeepVariant)

DeepVariant [4] is used to detect SNPs and small indels from aligned reads. As a deep convolutional neural network-based variant caller, DeepVariant provides a model optimized for PacBio HiFi data. Output is generated in gVCF format.

Step 4: Haplotype Phasing (WhatsHap / HiPhase)

Two tools are considered for haplotype phasing:

WhatsHap [5]: The standard tool for read-based phasing, performing accurate SNP-based phasing. VCF phasing and BAM haplotagging are performed as separate steps.
HiPhase [6]: Developed by PacBio, this tool jointly phases SNVs, indels, SVs, and tandem repeats, simultaneously outputting a phased VCF and haplotagged BAM. It supports multi-threading and is advantageous for large-scale data processing.

A performance benchmarking comparison between the two tools is recommended early in the project using the same dataset.

Step 5: Per-CpG-Site Methylation Scoring (pb-CpG-tools)

pb-CpG-tools is used to compute modification scores at each CpG site. When using a haplotype-tagged BAM as input, the haplotype option must be enabled to independently calculate methylation scores for HP1, HP2, and total reads. Mod scores are derived from the counts of methylated and unmethylated reads at each site.

Step 6: Segment Analysis and ASM Annotation

Methylated and unmethylated regions are segmented based on total reads, and Allele-Specific Methylation (ASM) regions are identified by computing per-CpG-site mod score differences between haplotypes [7]. ASM is determined by averaging HP1–HP2 mod score differences across samples and comparing against a predefined threshold.

Note: When a SNP exists at a CpG site causing cytosine substitution to another base, methylation information at that site may be lost. This is a characteristic of kinetic-based (non-bisulfite) detection, and results should be cross-validated with variant calling outputs.

3.2 Variant Calling and Haplotype Phasing

As described in Steps 3–4 above, DeepVariant-based small variant calling followed by WhatsHap/HiPhase phasing constitutes the standard procedure. For multi-sample projects, joint genotyping prior to phasing is recommended.

3.3 Structural Variant Detection

The following tools are considered for SV detection:

Sniffles2 [8]: The standard long-read SV caller supporting population-level SV calling and genotyping. It provides balanced sensitivity for SV detection.
cuteSV [9]: A fast, sensitive alignment-based SV detection tool suitable for large-scale genome projects.

Internal KMAP evaluations have shown Sniffles to provide stable calling of moderate-sized SVs, leading to its adoption as the default tool.

3.4 De Novo Assembly

When coverage exceeds 30–40x, haplotype-resolved de novo assembly using hifiasm [10] is performed. Assembly results can be mapped to the reference genome to supplement SV calling accuracy, particularly for the interpretation of complex structural variants. When coverage is below 20x, analysis proceeds with reference-based alignment only.

4. MAS-ISO-Seq Long-Read Transcriptome Pipeline

MAS-ISO-Seq (Multiplexed Arrays Sequencing Isoform Sequencing) is a PacBio HiFi-based long-read RNA sequencing technology that concatenates multiple full-length cDNAs using adapters during library preparation to increase throughput, followed by computational separation into individual transcripts during analysis [11].

4.1 Read Preprocessing

Step 1: Adapter Demultiplexing (Skera)

Skera is used to separate concatenated reads into individual transcripts by recognizing and cutting at MAS-ISO-Seq adapter junctions.

Step 2: Primer Removal (LIMA)

LIMA removes 5'/3' primer and adapter sequences. Depending on the sequencing vendor, this step may already be completed in the delivered data; therefore, the pipeline must support flexible entry points.

4.2 Isoform Identification and Quantification

Step 3: Isoform Clustering and Collapsing

Preprocessed reads are aligned to the reference genome, similar reads are grouped into isoform clusters, and expression counts per isoform are quantified. Available tools include:

IsoSeq3 collapse (official PacBio tool)
IsoQuant [12]: Supports both reference-based and reference-free modes with reportedly low false positive rates.
FLAIR2 [13]: Integrates alignment, collapsing, and quantification, outputting isoform FASTA and GTF files.

4.3 Structural Annotation and Quality Control

Step 4: Isoform Classification (Pigeon / SQANTI3)

Identified isoforms are compared against the reference transcriptome (GTF) and assigned structural categories:

FSM (Full Splice Match): All splice junctions exactly match the reference
ISM (Incomplete Splice Match): Splice junctions match the reference but terminal exons are partially missing
NIC (Novel In Catalog): Novel combinations of known splice sites (novel junctions from known donors/acceptors)
NNC (Novel Not in Catalog): Completely novel junctions involving new splice sites

SQANTI3 [14] provides these classifications along with transcript model quality metrics (TSS, TES, splice junction confidence, etc.), artifact filtering (SQANTI QC, SQANTI filter), and a rescue module to recover known genes/transcripts with low-quality features.

4.4 Isoform Filtering Strategy

One of the greatest challenges in long-read transcriptome analysis is selecting reliable isoforms. No standardized methodology has been established in this area, and the number of identified isoforms can differ by more than 10-fold depending on the tool used.

The recommended strategy is:

Multi-tool comparison: Apply multiple tools (IsoSeq3, IsoQuant, FLAIR2) to the same dataset to characterize each tool's isoform identification profile.
Intersection (ensemble) strategy: Adopt isoforms consistently detected across multiple tools as high-confidence isoforms.
SQANTI3-based filtering: Apply SQANTI QC/filter to remove artifacts and evaluate the reliability of novel transcripts.
Expression-based filtering: Set thresholds for low-count isoforms (1–2 counts). However, since the biological significance of low-frequency isoforms cannot be entirely excluded, orthogonal proteomics-based validation (peptide-level) is recommended.

Future work: A benchmark dataset must be constructed, and project-standard thresholds for isoform filtering must be established through multi-tool comparison.

5. Single-Cell RNA Sequencing Pipeline

The single-cell RNA sequencing pipeline follows these standard procedures:

Raw Data Processing: Cell Ranger is used to generate per-cell gene expression matrices from raw FASTQ files.
Quality Control: Low-quality cells are filtered based on per-cell gene count, UMI count, and mitochondrial gene percentage.
Integration: Multi-sample/batch integration is performed, with the optimal integration method selected through benchmarking.
Dimensionality Reduction and Clustering: UMAP-based dimensionality reduction and cell type classification are performed.

6. Pipeline Implementation and Deployment

6.1 Workflow Management System: Nextflow

All pipelines are implemented using Nextflow [15], which provides:

Compatibility with diverse HPC schedulers (SLURM, PBS, SGE, etc.) and cloud environments
Software dependency management via Docker/Singularity containers
Reproducible workflow execution
Minimized scheduler conflict issues

6.2 Flexible Entry Point Design

Since sequencing vendors deliver data at varying levels of preprocessing (e.g., with or without Jasmine tagging, LIMA processing), the pipeline is designed to support execution from multiple entry points. The goal is to implement automatic detection logic that identifies the preprocessing level of input data and initiates execution from the appropriate step.

6.3 Protocol Distribution Strategy

To ensure data consistency across sequencing vendors, pipeline protocols (software versions, parameters, reference data specifications) are documented and distributed to contracted vendors. This enforces standardization from the data production stage.

7. Visualization Strategy

7.1 Methylation Visualization

Box Plot: Visualize methylation scores of CpG sites or CpG islands within a specific gene/region by sample group
Scatter Plot with Gene Model: Display per-CpG-site average methylation scores alongside gene models, with methylated/unmethylated segments and ASM regions distinguished by color (cf. GTEx [16])

7.2 Transcriptome Visualization

Per-gene TPM-based violin/box plots (sample group comparison)
Multi-gene median TPM-based heatmaps
Integrated visualization of per-isoform expression levels with gene models
Exon/junction-level heatmaps

7.3 Single-Cell Visualization

Dot plot in tissue (X-axis) × cell type (Y-axis) matrix format
Color: expression level; dot size: cell fraction; dot border: specificity

Note: Whole-genome single-base-level methylation visualization requires prior validation of computational load for web service delivery.

8. Conclusion and Future Directions

This white paper presents the initial draft of the standard analysis pipeline for long-read sequencing data within the KMAP molecular atlas project. Tool selections, parameter configurations, and quality control strategies have been defined for the major analysis modules (whole-genome methylation, transcriptome, and single-cell).

Key issues requiring resolution include:

WhatsHap vs HiPhase benchmark: Comparative analysis of phasing accuracy and speed on the same dataset
Isoform filtering standard establishment: Validation of intersection/ensemble strategies through multi-tool comparison
Real tissue data validation: Assessing pipeline robustness by applying cell-line-based best practices to autopsy tissue data
GRCh38 sub-version confirmation: Finalizing the exact reference files for multi-institutional use
Web visualization optimization: Verifying the feasibility of serving single-base-level methylation data through web services
Methylation-transcriptome integration: Exploring foundation models for isoform prediction using RNA methylation patterns

This pipeline will be continuously updated as the project progresses, with version control ensuring analysis reproducibility.

9. References

Wenger AM, Peluso P, Rowell WJ, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155-1162. DOI
Gershman A, Sauria MEG, Guitart X, et al. Epigenetic patterns in a complete human genome. Science. 2022;376(6588):eabj5089. DOI
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094-3100. DOI
Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol. 2018;36(10):983-987. DOI
Martin M, Ebert P, Marschall T. Read-Based Phasing and Analysis of Phased Variants with WhatsHap. Methods Mol Biol. 2023;2590:127-138. DOI
Holt JM, Saunders CT, Rowell WJ, et al. HiPhase: jointly phasing small, structural, and tandem repeat variants from HiFi sequencing. Bioinformatics. 2024;40(2):btae042. DOI
O'Neill K, Pleasance E, Fan J, et al. Long-read sequencing of an advanced cancer cohort resolves rearrangements, unravels haplotypes, and reveals methylation landscapes. Cell Genomics. 2024;4(11):100674. DOI
Smolka M, Paulin LF, Grochowski CM, et al. Detection of mosaic and population-level structural variants with Sniffles2. Nat Biotechnol. 2024;42(10):1571-1580. DOI
Jiang T, Liu S, Cao S, Wang Y. Structural Variant Detection from Long-Read Sequencing Data with cuteSV. Methods Mol Biol. 2022;2493:137-151. DOI
Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170-175. DOI
Zajac N, Zhang Q, Bratus-Neuenschwander A, et al. Comparison of single-cell long-read and short-read transcriptome sequencing via cDNA molecule matching: quality evaluation of the MAS-ISO-seq approach. NAR Genom Bioinform. 2025;7(3):lqaf089. DOI
Prjibelski AD, Mikheenko A, Joglekar A, et al. Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol. 2023;41(7):915-918. DOI
Tang AD, Soulette CM, van Baren MJ, et al. Full-length transcript characterization of SF3B1 mutation in chronic lymphocytic leukemia reveals downregulation of retained introns. Nat Commun. 2020;11:1438. DOI
Pardo-Palacios FJ, Arzalluz-Luque A, Kondratova L, et al. SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. bioRxiv. 2023. DOI
Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35(4):316-319. DOI
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318-1330. DOI
Guizard S, Miedzinska K, Smith J, et al. nf-core/isoseq: simple gene and isoform annotation with PacBio Iso-Seq long-read sequencing. Bioinformatics. 2023;39(5):btad150. DOI

Publications