lens, align.

Long is the time, but the true comes to pass.

Oblivionum.

2023-08-31 20:07:08 | Science News

(Created with Midjourney V5.2)




□ StarSpace: Joint representation learning for retrieval and annotation of genomic interval sets

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554131v1

An application of the StarSpace method to convert annotated genomic interval data into low-dimensional distributed vector representations. A system that solves three related information retrieval tasks using embedding distance computations.

The StarSpace algorithm converts each region set and its corresponding label into an n-dimensional embedding vector, placing biologically related region sets and their labels close to one another in the shared latent space.
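As a rough illustration of the embedding-distance retrieval idea (not the authors' implementation; the embedding matrices and label names below are made up), a query region set's vector can be compared against label vectors by cosine similarity:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between one vector and each row of a matrix.
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a) + 1e-12)

# Hypothetical 50-dimensional embeddings learned by a StarSpace-style model.
rng = np.random.default_rng(0)
labels = ["K562", "HepG2", "GM12878"]
label_embeddings = rng.normal(size=(3, 50))   # one row per label
query_region_set = rng.normal(size=50)        # embedding of an unlabeled region set

# Retrieval / annotation = nearest labels in the shared latent space.
sims = cosine_sim(query_region_set, label_embeddings)
for lbl, s in sorted(zip(labels, sims), key=lambda x: -x[1]):
    print(f"{lbl}\t{s:.3f}")
```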





□ ClairS: a deep-learning method for long-read somatic small variant calling

>> https://www.biorxiv.org/content/10.1101/2023.08.17.553778v1

ClairS, a somatic variant caller designed for paired samples and primarily for ONT long reads. ClairS uses Clair3 and LongPhase for germline variant calling, phasing, and read haplotagging. The processed alignments are then used for pileup- and full-alignment-based somatic variant calling.

ClairS treats the two neural networks as equally powerful. Full-alignment-based calling performs well at mid-range VAFs. However, pileup-based calling requires less evidence than full-alignment calling to reach the same conclusion.





□ ETNA: Joint embedding of biological networks for cross-species functional alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad529/7252232

Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.

ETNA (Embeddings to Network Alignment) generates individual network embeddings based on network topological structure and then uses a natural language processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs.
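ETNA's actual alignment uses cross-training; the sketch below swaps in a much simpler stand-in (an ordinary least-squares linear map fit on ortholog anchor pairs) just to show what "aligning two embeddings with sequence-based orthologs" means in practice. All matrices and anchor pairs are toy data.

```python
import numpy as np

rng = np.random.default_rng(1)
emb_a = rng.normal(size=(1000, 64))   # toy node embeddings for species A
emb_b = rng.normal(size=(800, 64))    # toy node embeddings for species B

# Hypothetical ortholog anchors: pairs (index_in_A, index_in_B).
anchors = [(0, 10), (5, 42), (7, 3), (100, 250), (512, 600)]
A = emb_a[[i for i, _ in anchors]]
B = emb_b[[j for _, j in anchors]]

# Fit a linear map W so that A @ W approximates B (least squares); ETNA instead
# cross-trains the two embeddings, but the goal -- a shared space -- is the same.
W, *_ = np.linalg.lstsq(A, B, rcond=None)

# Functional alignment: nearest cross-species neighbours in the shared space.
proj_a = emb_a @ W
gene_a = 7
dists = np.linalg.norm(emb_b - proj_a[gene_a], axis=1)
print("closest species-B nodes to species-A node 7:", np.argsort(dists)[:5])
```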





□ DCAlign v1.0: Aligning biological sequences using co-evolution models and informed priors

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad537/7255914

DCAlign v1.0 is a new implementation of the DCA-based alignment technique DCAlign which, unlike the first implementation, allows for fast parametrization of the seed alignment.

DCAlign v1.0 uses an approximate message-passing algorithm coupled with an annealing scheme over β (i.e. β is iteratively increased) to obtain the best alignment for the query sequence.
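A minimal sketch of what an annealing scheme over β looks like, with a placeholder update standing in for DCAlign's actual message-passing equations (the alignment marginals here are random toy data):

```python
import numpy as np

def message_passing_step(marginals, beta):
    # Placeholder for DCAlign's approximate message-passing update at inverse
    # temperature beta; here we simply sharpen a toy alignment marginal.
    logits = np.log(marginals + 1e-12) * beta
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

# Toy marginals: probability that query position i aligns to seed column j.
rng = np.random.default_rng(2)
marginals = rng.dirichlet(np.ones(20), size=30)

# Annealing scheme: iteratively increase beta so the marginals concentrate
# on a single (hopefully optimal) alignment.
for beta in np.linspace(0.5, 5.0, 10):
    for _ in range(5):
        marginals = message_passing_step(marginals, beta)

alignment = marginals.argmax(axis=1)   # seed column assigned to each query position
print(alignment)
```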





□ Ariadne: synthetic long read deconvolution using assembly graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-03033-5

Ariadne, a novel assembly graph-based algorithm that can be used to deconvolve large metagenomic linked-read datasets. Ariadne is intuitive, computationally efficient, and scalable to other large-scale linked-read problems, such as human genome phasing.

Ariadne relies on cloudSPAdes parameters (iterative k-mer sizes) to generate the assembly graph; the program itself has only two parameters: search distance and size cutoff. The maximum search distance determines the maximum path length of the Dijkstra graphs surrounding the focal read.

Ariadne deconvolution generates read clouds that are enhanced up to 37.5-fold, containing only reads from a single fragment. Since each read is modeled as the center of a genomic fragment, the search distance can be thought of as the width of the fragment.
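A minimal sketch of the "maximum search distance" idea: a Dijkstra traversal of the assembly graph that stops once the path length exceeds the search distance. The toy graph, edge lengths, and node names below are hypothetical; Ariadne itself operates on cloudSPAdes assembly graphs.

```python
import heapq

def bounded_dijkstra(graph, source, max_dist):
    """Return nodes reachable from `source` within `max_dist` total edge length.

    `graph` maps node -> list of (neighbour, edge_length).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd <= max_dist and nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

# Toy assembly graph: nodes are unitigs, edge weights are unitig lengths.
graph = {"u1": [("u2", 300), ("u3", 800)], "u2": [("u4", 400)], "u3": [], "u4": []}
# Focal read of one cloud maps to u1; the search distance acts like the fragment width.
print(bounded_dijkstra(graph, "u1", max_dist=1000))
# {'u1': 0, 'u2': 300, 'u3': 800, 'u4': 700}
```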





□ ReDis: efficient metagenomic profiling via assigning ambiguous reads

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555244v1

ReDis combines Kraken2 with Minimap2 to align sequencing reads against reference databases hundreds of gigabytes (GB) in size, accurately and within feasible time.

ReDis's novel ambiguous-read assignment step significantly raises the accuracy of abundance estimation for organisms with many multi-mapped reads by building a statistical model that incorporates the unique mapping rate.
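The intuition behind the ambiguous-read step can be sketched as a reassignment proportional to each candidate organism's unique mapping rate. This is a simplified stand-in, not ReDis's full statistical model, and all counts below are made up.

```python
# Unique-read counts per organism (reads mapping only to that genome).
unique = {"E_coli": 900, "Shigella": 100}
# Total mapped-read counts (unique + multi-mapped), a crude mapping-rate denominator.
total = {"E_coli": 2000, "Shigella": 1500}
unique_rate = {org: unique[org] / total[org] for org in unique}

def assign_ambiguous(candidates, n_reads):
    """Split `n_reads` multi-mapped reads among candidate organisms
    in proportion to their unique mapping rates."""
    weights = {org: unique_rate[org] for org in candidates}
    z = sum(weights.values())
    return {org: n_reads * w / z for org, w in weights.items()}

# 500 reads map equally well to E. coli and Shigella.
print(assign_ambiguous(["E_coli", "Shigella"], 500))
# Most of them go to E. coli, whose unique mapping rate is much higher.
```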





□ IsoFrog: a Reversible Jump Monte Carlo Markov Chain feature selection-based method for predicting isoform functions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad530/7255910

IsoFrog adopts a Reversible Jump Monte Carlo Markov Chain (RJMCMC)-based feature selection framework to assess the feature importance to gene functions. A sequential feature selection (SFS) procedure is applied to select a subset of function-relevant features.

IsoFrog screens the features relevant to a specific function while eliminating irrelevant ones. The selected features are input into a modified domain-invariant partial least squares (diPLS) model, which prioritizes the most likely positive isoforms for isoform function prediction.





□ Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad512/7246743

The minmer winnowing scheme generalizes the minimizer scheme by using a rolling minhash with multiple sampled k-mers per window. By construction, minmers, unlike minimizers, enable an unbiased estimation of the local Jaccard similarity.

The minimizer scheme, in contrast, does not yield an unbiased Jaccard estimator. The density of the ⌊w/s⌋-minimizer scheme tracks closely with the density of (w, s)-minmers; minmer intervals, while not necessary for the use of minmers, serve as a helpful auxiliary index for improving query performance.
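A small sketch of the (w, s)-minmer idea as I read it: in every window of w consecutive k-mers, keep the s smallest hashes, so a minimizer is the special case s = 1. This is illustrative only and ignores the paper's efficient rolling implementation.

```python
import hashlib

def kmer_hashes(seq, k):
    return [int(hashlib.sha1(seq[i:i + k].encode()).hexdigest(), 16)
            for i in range(len(seq) - k + 1)]

def minmers(seq, k, w, s):
    """Union over all windows of the s smallest k-mer hashes per window."""
    hashes = kmer_hashes(seq, k)
    sampled = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        sampled.update(sorted(window)[:s])
    return sampled

seq = "ACGTACGTTGCAACGTTAGCACGTACGGT"
print(len(minmers(seq, k=5, w=8, s=1)))   # classic minimizers
print(len(minmers(seq, k=5, w=8, s=3)))   # (w, s)-minmers sample more k-mers per window
```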





□ R2C2+UMI: Combining concatemeric consensus sequencing with unique molecular identifiers enables ultra-accurate sequencing of amplicons on Oxford Nanopore Technologies sequencers

>> https://www.biorxiv.org/content/10.1101/2023.08.19.553937v1

The libraries are processed into high-molecular-weight DNA using R2C2. R2C2 circularizes library molecules using Gibson assembly and then uses rolling circle amplification to generate long, linear concatemers containing multiple tandem repeats of the original library molecule.

After sequencing this concatemeric DNA on ONT sequencers, the computational C3POa and BC1 tools generate consensus sequences for each original library molecule. C3POa parses concatemeric raw reads into subreads and generates accurate R2C2 consensus reads from these subreads.

BC1 parses R2C2 consensus reads using a highly flexible syntax for locating and parsing UMI sequences, enabling the detection of fixed bases used as spacers or IUPAC wildcard base codes, which can be used to optimize UMIs for more indel-prone long reads.
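To make the fixed-spacer/IUPAC-wildcard idea concrete, here is a regex-based stand-in (not BC1's actual syntax) that locates a UMI flanked by fixed spacer bases in a consensus read; the UMI design string and the read are hypothetical.

```python
import re

# Map IUPAC wildcard codes to regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]"}

def expand(block):
    # Expand a block of IUPAC codes into a regex fragment.
    return "".join(IUPAC[b] for b in block)

def umi_regex(pattern):
    """Turn a hypothetical UMI design like 'TTT NNNRN NNNYN TTT'
    (fixed spacers flanking wildcard blocks) into a regex that
    captures the wildcard blocks."""
    parts = pattern.split()
    head, blocks, tail = parts[0], parts[1:-1], parts[-1]
    captured = "".join(f"({expand(b)})" for b in blocks)
    return re.compile(expand(head) + captured + expand(tail))

read = "GGATTTACGAGTTGCTTTTCCA"
m = umi_regex("TTT NNNRN NNNYN TTT").search(read)
if m:
    print("UMI:", "".join(m.groups()))   # -> ACGAGTTGCT
```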





□ ggCaller: Accurate and fast graph-based pangenome annotation and clustering

>> https://genome.cshlp.org/content/early/2023/08/24/gr.277733.123

ggCaller (graph-gene-caller) uses population-frequency information to guide gene prediction, aiding the identification of homologous start codons across orthologues, and consistent scoring and functional annotation of orthologues.

ggCaller incorporates Balrog to filter open reading frames (ORFs), improving the specificity of gene calls, and Panaroo for orthologue clustering. ggCaller also includes a query mode, enabling reference-agnostic functional inference for sequences of interest, applicable in pangenome-wide association studies (PGWAS).

ggCaller identifies all stop codons in the de Bruijn graph (DBG) and traverses the graph to identify putative gene sequences. Each stop codon is paired with a downstream stop codon in the same reading frame using a depth-first search, thereby delineating the coordinates of all possible reading frames.
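The stop-to-stop pairing is easy to picture on a plain linear sequence, even though ggCaller actually performs it by traversing the DBG. The following toy sketch enumerates candidate ORF coordinates between consecutive stop codons in each reading frame (the sequence is made up):

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_to_stop_frames(seq):
    """Pair each stop codon with the next stop codon in the same reading frame,
    delineating candidate ORF coordinates (linear-sequence stand-in for the
    graph traversal that ggCaller performs)."""
    orfs = []
    for frame in range(3):
        prev_stop = None
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if prev_stop is not None:
                    orfs.append((prev_stop + 3, i, frame))  # region between the stops
                prev_stop = i
    return orfs

seq = "ATGTAAATGGCCGCATTATAGCCCTGACCC"
for start, end, frame in stop_to_stop_frames(seq):
    print(f"frame {frame}: candidate ORF {start}-{end} -> {seq[start:end]}")
```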





□ GraphCpG: Imputation of Single-cell Methylomes Based on Locus-aware Neighboring Subgraphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad533/7255916

GraphCpG, a graph-based deep learning method that uses locus-aware neighboring subgraphs to impute missing methylation states. GraphCpG generates an optimized representation of the target methylation state, which supports the follow-up neural networks in prediction.

Without CpG position information and DNA context, the completion of the methylation matrix is transformed into a graph-based link prediction problem in a non-Euclidean space and the computational complexity is also reduced.





□ Factorial state-space modelling for kinetic clustering and lineage inference

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554135v1

The directed signal obtained from RNA velocity enables the estimation of transition probabilities between cell-states. This information can be represented as a directed and asymmetric graph.

A latent state-space Markov model that utilises cell-state transitions to model differentiation as a sequence of latent state transitions and to perform soft kinetic clustering of cell-states that accommodates the transitional nature of cells in a differentiation process.
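As background for the model's input, a directed, asymmetric transition graph from RNA velocity can be row-normalized into a transition probability matrix and propagated as a Markov chain. This sketch only illustrates that input representation, not the paper's factorial state-space model itself; the weights are toy values.

```python
import numpy as np

# Hypothetical directed, asymmetric cell-state transition weights derived from
# RNA velocity (rows = source states, columns = target states).
W = np.array([[0.0, 3.0, 1.0],
              [0.2, 0.0, 4.0],
              [0.0, 0.1, 0.0]])

# Row-normalise into a transition probability matrix (self-transition if no out-edges).
P = W.copy()
for i, row in enumerate(P):
    s = row.sum()
    P[i] = row / s if s > 0 else np.eye(len(P))[i]

# Propagate an initial occupancy over cell-states through a few Markov steps.
occupancy = np.array([1.0, 0.0, 0.0])
for _ in range(5):
    occupancy = occupancy @ P
print(np.round(occupancy, 3))
```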





□ scProjection: Projecting RNA measurements onto single cell atlases to extract cell type-specific expression profiles

>> https://www.nature.com/articles/s41467-023-40744-6

scProjection uses deeply sequenced single-cell atlases to improve the precision of individual single-cell-resolution RNA measurements. It does so by jointly performing two tasks: deconvolution (estimating the percentage RNA contribution of each of a set of cell types to a single RNA measurement) and projection.

scProjection can impute the expression levels of genes not directly measured. scProjection can separate RNA contributions of the target neuron from neighboring glial cells when analyzing Patch-seq data, leading to more accurate prediction of one data modality from another.
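To show what "estimating % RNA contributions" means in the simplest terms, here is a non-negative least squares deconvolution against atlas-derived mean profiles. scProjection uses deep generative models rather than NNLS; this is only an illustrative stand-in with simulated data.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(3)
n_genes, n_types = 2000, 4

# Hypothetical atlas-derived mean expression profiles per cell type.
atlas = rng.gamma(2.0, 1.0, size=(n_genes, n_types))

# Simulate a mixed measurement: 70% from cell type 0, 30% from cell type 1.
true_props = np.array([0.7, 0.3, 0.0, 0.0])
mixture = atlas @ true_props + rng.normal(0, 0.1, n_genes)

# Non-negative least squares as a simple deconvolution stand-in.
coefs, _ = nnls(atlas, mixture)
print(np.round(coefs / coefs.sum(), 3))   # approximately recovers (0.7, 0.3, 0, 0)
```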





□ Scan: Scanning sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.08.21.554111v1

Scan (Sample-specific miRNA regulation) framework to scan sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data. Scan incorporates 27 network inference methods and two strategies to infer tissue-specific or cell-specific miRNA regulation.

Scan adopts two strategies, statistical perturbation and linear interpolation, to infer sample-specific miRNA regulatory networks. Scan can help cluster samples and construct sample correlation networks.





□ pareg: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad522/7248907

pareg follows the ideas of GSEA as it requires no stratification of the input gene list, of MGSA as it incorporates term-term relations in a database-agnostic way, and of LRPath as it makes use of the flexibility of the regression approach.

By regressing the differential-expression p-values of genes on their membership in multiple gene sets, while using LASSO and gene set similarity-based regularization terms, pareg requires no prior thresholding and incorporates term-term relations into the enrichment computation.
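A bare-bones version of the regression idea, using a plain LASSO on simulated data (pareg additionally adds a gene-set similarity-based regularizer and a proper likelihood for p-values; the membership matrix and p-values here are synthetic):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n_genes, n_sets = 500, 20

# Binary membership matrix: gene x gene set.
M = (rng.random((n_genes, n_sets)) < 0.05).astype(float)

# Differential-expression p-values; genes in set 0 tend to be significant.
pvals = rng.uniform(size=n_genes)
pvals[M[:, 0] == 1] = rng.uniform(0, 0.05, size=int(M[:, 0].sum()))

# Regress a transform of the p-values on set membership with an L1 penalty;
# non-zero coefficients flag enriched sets.
y = -np.log10(pvals)
model = Lasso(alpha=0.1).fit(M, y)
print(np.round(model.coef_, 2))
```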





□ CellAnn: A comprehensive, super-fast, and user-friendly single-cell annotation web server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad521/7248909

CellAnn, a reference-based cell annotation web server. CellAnn uses a cluster-to-cluster alignment method to transfer cell labels from reference to query datasets, outperforming existing methods in accuracy and scalability.

CellAnn calculates correlations and estimates correlation cutoffs between the query data and sub-clusters in the reference datasets. If a query cluster is similar to multiple reference sub-clusters, CellAnn performs a Wilcoxon rank-sum test to further determine the cell type.
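The cluster-to-cluster comparison can be pictured as correlating a query cluster's mean expression profile with each reference sub-cluster's profile and keeping labels above a cutoff. A toy sketch with made-up profiles and labels, not CellAnn's exact procedure:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(5)
genes = 300

# Mean expression profiles of reference sub-clusters (hypothetical labels).
ref_profiles = {"T_cell": rng.gamma(2, 1, genes), "B_cell": rng.gamma(2, 1, genes)}
# Mean expression profile of one query cluster, similar to the T-cell reference.
query = ref_profiles["T_cell"] * rng.normal(1.0, 0.1, genes)

# Cluster-to-cluster correlation; labels below a cutoff are discarded.
cutoff = 0.5
for label, prof in ref_profiles.items():
    rho, _ = spearmanr(query, prof)
    flag = "candidate" if rho > cutoff else "rejected"
    print(f"{label}\trho={rho:.2f}\t{flag}")
# If several references pass the cutoff, CellAnn additionally applies a
# Wilcoxon rank-sum test on marker genes to break the tie.
```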





□ GeneGPT: Augmenting Large Language Models with Domain Tools for Improved Access to Biomedical Information

>> https://arxiv.org/abs/2304.09667

GeneGPT, a novel method that prompts Codex to use NCBI Web APIs. GeneGPT consists of a specifically designed prompt containing documentation and demonstrations of API usage, and an inference algorithm that integrates API calls into the Codex decoding process.

GeneGPT generalizes to longer chains of subquestion decomposition and API calls with simple demonstrations; GeneGPT makes specific errors that are enriched for each task. GeneGPT uses chain-of-thought API calls to answer a multi-hop question in GeneHop.

GeneHop contains three new multi-hop QA tasks based on GeneTuring: SNP gene function; Disease gene location, where the task is to list the chromosome locations of the genes associated with a given disease; and Sequence gene alias, which asks for the aliases of the gene that contains a specific DNA sequence.
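The Web APIs GeneGPT is taught to call are the public NCBI E-utilities. A plain (non-LLM) example of one such call, an esearch lookup of a gene symbol, looks roughly like this; in GeneGPT the returned IDs would be appended to the model's context so it can issue follow-up calls.

```python
import json
import urllib.request

# One of the NCBI E-utilities endpoints that GeneGPT is prompted to call.
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
       "?db=gene&retmode=json&term=LMP10")

with urllib.request.urlopen(url, timeout=30) as resp:
    result = json.load(resp)

# The gene IDs returned here would be fed back into the decoding loop so the
# model can chain a follow-up esummary/efetch call (chain-of-thought API calls).
print(result["esearchresult"]["idlist"])
```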





□ CellAgentChat: Harnessing Agent-Based Modeling in CellAgentChat to Unravel Cell-Cell Interactions from Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2023.08.23.554489v1

CellAgentChat presents a unique agent-based perspective on cellular interactions, seamlessly integrating temporal, spatial, and biological data, offering a more precise and comprehensive understanding of cellular interaction dynamics.

CellAgentChat employs individual cell agents guided by simple behavior rules to investigate the arising complexity of cellular interactions. CellAgentChat enables in silico perturbations and in-depth analysis of the effects of cellular interactions on downstream gene expression.





□ SC2Spa: a deep learning based approach to map transcriptome to spatial origins at cellular resolution

>> https://www.biorxiv.org/content/10.1101/2023.08.22.554277v1

SC2Spa identified spatially variable genes and suggested negative regulatory relationships between genes. SC2Spa, armored with deep learning, provides a new way to map the transcriptome to its spatial location and perform subsequent analyses.

A key feature of SC2Spa is the ability to score SVGs from their weight space. SC2Spa can use either polar or Cartesian coordinates. As SC2Spa maps gene expression directly to coordinates, its computational complexity increases linearly.





□ eGADA: enhanced Genomic Alteration Detection Algorithm, a fast genomic segmentation algorithm

>> https://www.biorxiv.org/content/10.1101/2023.08.20.553622v1

eGADA is an enhanced version of GADA, which is a fast segmentation algorithm utilizing the Sparse Bayesian Learning (or Relevance Vector Machine) technique.

eGADA uses a red-black (RB) tree to store all segment breakpoints as nodes and then eliminates the least significant breakpoint based on the tree. Breakpoints are sorted by their corresponding t-statistics, and a breakpoint is removed if its t-statistic falls below a pre-set threshold.

The segment length of a breakpoint is defined as the length of the shorter flanking segment. The red-black tree has a time complexity of O(log(n)) for both building and querying, so the time complexity of the backward elimination (BE) step is improved from O(n^2) to O(n*log(n)).
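To make the BE step concrete, here is a plain-list version of the elimination loop, without the red-black tree that gives eGADA its O(n log n) behaviour, and with the t-statistic simplified to a mean difference scaled by segment sizes. The signal and candidate breakpoints are synthetic.

```python
import numpy as np

def t_stat(y, left, bp, right):
    """Simplified t-statistic for breakpoint `bp`, using the flanking
    segments (left, bp] and (bp, right]."""
    a, b = y[left + 1:bp + 1], y[bp + 1:right + 1]
    return abs(a.mean() - b.mean()) / np.sqrt(1 / len(a) + 1 / len(b))

def backward_elimination(y, breakpoints, threshold):
    # Sentinel boundaries; a breakpoint is the index of the last point of a segment.
    bps = [-1] + sorted(breakpoints) + [len(y) - 1]
    while len(bps) > 2:
        stats = [t_stat(y, bps[i - 1], bps[i], bps[i + 1])
                 for i in range(1, len(bps) - 1)]
        i_min = int(np.argmin(stats)) + 1
        if stats[i_min - 1] >= threshold:
            break
        del bps[i_min]        # drop the least significant breakpoint
    return bps[1:-1]

# One real jump at index 49; the other two candidates should be eliminated.
y = np.concatenate([np.zeros(50), np.full(50, 2.0), np.full(50, 2.05)])
print(backward_elimination(y, [49, 99, 120], threshold=4.0))   # -> [49]
```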





□ Gonomics: Uniting high performance and readability for genomics with Go

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad516/7251027

Gonomics, an open-source collection of command line programs and bioinformatic libraries implemented in Go that unites readability and performance for genomic analyses.

Gonomics contains packages to read, write, and manipulate a wide array of file formats (e.g. FASTA, FASTQ, BED, BEDPE, SAM, BAM, and VCF), and can convert and interface between these formats.




□ CoFrEE: An Application to Estimate DNA Copy Number from Genome-wide RNA Expression Data

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554898v1

Copy number from Expression Estimation (CoFrEE) is unique in providing an intuitively simple approach appropriate for both RNAseq and array-based expression cohorts. This is also the first such application to focus on facilitating copy number estimates.

The core methodology shares recursive median filtering with CaSpER [6] but employs dedicated by-gene pre-processing and by-sample post-processing to achieve the final copy number estimates. The pre-processing step shares similarities with CNVkit.
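Recursive median filtering simply means re-applying a median filter until the signal stops changing, which smooths per-gene expression towards the underlying copy-number trend. A minimal sketch on simulated log-ratios (not CoFrEE's pre-/post-processing):

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(6)

# Toy per-gene relative expression log-ratios along one chromosome arm:
# a simulated single-copy gain over genes 200-400 plus noise.
logr = rng.normal(0, 0.6, 600)
logr[200:400] += np.log2(3 / 2)

# Recursive median filtering: re-apply the filter until convergence.
smooth = logr.copy()
for _ in range(50):
    new = medfilt(smooth, kernel_size=31)
    if np.allclose(new, smooth):
        break
    smooth = new

# Roughly 0 outside the gain and roughly log2(1.5) ~ 0.58 inside it.
print(round(smooth[100], 2), round(smooth[300], 2))
```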





□ scNanoHi-C: a single-cell long-read concatemer sequencing method to reveal high-order chromatin structures within individual cells

>> https://www.nature.com/articles/s41592-023-01978-w

scNanoHi-C applies Nanopore long-read sequencing to explore genome-wide proximal high-order chromatin contacts within individual cells. scNanoHi-C can reliably and effectively profile 3D chromatin structures and distinguish structure subtypes among individual cells.

scNanoHi-C could also be used to detect genomic variations, including copy-number variations and structural variations, as well as to scaffold the de novo assembly of single-cell genomes.

Extensive high-order chromatin structures exist in active chromatin regions across the genome, and multiway interactions between enhancers and their target promoters were systematically identified within individual cells.

scNanoHi-C sequencing data was first demultiplexed to single cells by Nanoplexer using known cell barcodes with default parameters. Adapter sequences were trimmed by Cutadapt and reads shorter than 500bp were also removed.





□ Sandy: A user-friendly and versatile NGS simulator to facilitate sequencing assay design and optimization

>> https://www.biorxiv.org/content/10.1101/2023.08.25.554791v1

Sandy, a user-friendly and computationally efficient tool with complete computational methods for simulating NGS data from three platforms: Illumina, Oxford Nanopore, and Pacific Biosciences. Sandy generates reads requiring only a FASTA file as input.

Sandy simulates single-end and paired-end reads from both DNA and RNA sequencing. Sandy ships with a built-in database of predefined models extracted from real data, including sequencer quality profiles (e.g. Illumina HiSeq, MiSeq, NextSeq) and expression matrices generated from GTEx v8 data.





□ Flow: a web platform and open database to analyse, store, curate and share bioinformatics data at scale

>> https://www.biorxiv.org/content/10.1101/2023.08.22.544179v1

Flow uses established nf-core pipelines, along with some custom pipelines written to nf-core conventions, including demultiplexing and CLIP-seq pipelines. Once analysed, all stages of data processing can be seamlessly shared with the community via an open database model.





□ Accurate human genome analysis with Element Avidity sequencing

>> https://www.biorxiv.org/content/10.1101/2023.08.11.553043v1

Element whole-genome sequencing achieves higher mapping and variant calling accuracy compared to Illumina sequencing at the same coverage, with larger differences at lower coverages (20x-30x).

One new property of Element's AVITI platform is the ability to generate paired-end sequencing data with longer insert sizes (the distance between the paired reads) than is typical with Illumina preparations.





□ RichPathR: a gene set enrichment analysis and visualization tool

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555198v1

RichPathR fills the gap in available tools for rapid mining of pre-annotated pathway/term data. A single transcriptomic or epigenetic high-throughput sequencing experiment might generate several gene sets, and mining these gene sets one at a time could be time consuming.





□ ASTA-P: a pipeline for the detection, quantification and statistical analysis of complex alternative splicing events

>> https://www.biorxiv.org/content/10.1101/2023.08.28.555224v1

ASTA-P, a pipeline for the analysis of arbitrarily complex splice patterns, using ASTALAVISTA to mine complete splicing events of different dimensions, followed by quantification with a custom script, and modelling the event counts using the Dirichlet-multinomial regression.

ASTA-P combines full-length transcript reconstruction for enriching the existing annotation model before assembling the splicing graph for each gene. This is followed by mining and quantification of local splice events incl. binary as well as high dimensional patterns.





□ HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad535/7255913

HAPNEST simulates pairs of synthetic haplotypes, where each haplotype is constructed as a mosaic of segments of various lengths imperfectly copied from a reference set of real haplotypes.

HAPNEST additionally models the coalescence age of segments using an approximate model inspired by the sequential Markovian coalescent model.
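The mosaic-copying idea is easy to sketch: each synthetic haplotype is stitched together from segments of randomly chosen reference haplotypes, with a small error rate standing in for imperfect copying. The reference panel, segment-length model, and error rate below are toy choices, not HAPNEST's calibrated ones.

```python
import numpy as np

rng = np.random.default_rng(7)
n_ref, n_sites = 20, 1000

# Reference panel of real haplotypes (0/1 alleles); toy data here.
ref = rng.integers(0, 2, size=(n_ref, n_sites))

def synthetic_haplotype(ref, mean_segment=150, error_rate=0.001):
    """Build one synthetic haplotype as a mosaic of segments imperfectly
    copied from randomly chosen reference haplotypes."""
    hap = np.empty(ref.shape[1], dtype=int)
    pos = 0
    while pos < ref.shape[1]:
        length = max(1, int(rng.exponential(mean_segment)))  # segment length
        donor = rng.integers(ref.shape[0])                   # haplotype copied from
        end = min(pos + length, ref.shape[1])
        hap[pos:end] = ref[donor, pos:end]
        pos = end
    # Imperfect copying: flip a small fraction of alleles (mutation/error).
    flips = rng.random(ref.shape[1]) < error_rate
    hap[flips] = 1 - hap[flips]
    return hap

pair = np.vstack([synthetic_haplotype(ref), synthetic_haplotype(ref)])
genotypes = pair.sum(axis=0)   # 0/1/2 genotypes of the synthetic diploid individual
print(genotypes[:20])
```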





□ DosaCNV: Deep multiple-instance learning accurately predicts gene haploinsufficiency and deletion pathogenicity

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555384v1

DosaCNV is a supervised deep multiple-instance learning (MIL) framework designed to simultaneously infer the pathogenicity of coding deletions and the haploinsufficiency of genes, based on the assumption that the joint effect of gene haploinsufficiency determines deletion pathogenicity.





□ Galba: genome annotation with miniprot and AUGUSTUS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05449-z

GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes.

GALBA provides substantially higher accuracy than BRAKER2 in the genomes of large vertebrates because GeneMark-ES within BRAKER2 performs poorly in such genomes when generating seed regions for spliced-alignment of proteins to the genome.





□ DecentTree: Scalable Neighbour-Joining for the Genomic Era

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad536/7257068

DecentTree is designed as a stand-alone application and a header-only library that is easily integrated with other phylogenetic software (e.g., it is an integral part of IQ-TREE). DecentTree shows improved performance over existing software (BIONJ, QuickTree, FastME, and RapidNJ).

DecentTree uses the Vector Class Library and the multithreading OpenMP to parallelize the computations. DecentTree accepts either a distance matrix in Phylip format or a multiple sequence alignment in common formats such as Phylip or Fasta.





□ VData: Temporally annotated data manipulation and storage

>> https://www.biorxiv.org/content/10.1101/2023.08.29.555297v1

VData, a solution for storing and manipulating single cell datasets that extends the widely used AnnData format and is designed with synthetic data in mind.

VData adds a third 'time' dimension beyond the usual 'cell' and 'gene' axes to support time-stamped longitudinal data, and it heavily focuses on a low memory footprint to allow fast and efficient handling of datasets of tens of gigabytes even on regular laptops.




