Blog posts for October 31, 2022 - lens, align.

OUREA.

2022-10-31 22:13:31 | Science News

□ HAL-X: Scalable hierarchical clustering for rapid and tunable single-cell analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010349

HAL-X builds upon the idea that clustering can be viewed as a supervised learning problem where the goal is to predict the “true class labels”. HAL-X can generate multiple clusterings at varied depths to account for the specificity/sensitivity trade-off.

HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space. HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima.

□ SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac645/6731919

SpaceX employs a Bayesian model to infer spatially varying co-expression networks by incorporating spatial information when determining network topology. The probabilistic model quantifies uncertainty and is based on a coherent dimension reduction.

The SpaceX algorithm takes a gene expression matrix, spatial locations, and cluster annotations as input. It estimates the latent gene expression level using a Poisson mixed model while adjusting for covariates and spatial localization information.

SpaceX uses a tractable Bayesian estimation procedure along with a computationally efficient and scalable algorithm, as opposed to a full-scale Markov chain Monte Carlo (MCMC) algorithm, which tends to be computationally intensive.

The spatial Poisson mixed model (sPMM) has an additive structure that connects log-scaled Λ with the covariate effect. PQLseq, a scalable penalized quasi-likelihood algorithm for sPMM with Gaussian priors, is used to obtain the latent gene expressions.

□ RADIAN: Language-Informed Basecalling Architecture for Nanopore Direct RNA Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512968v1

RADIAN (RNA lAnguage informeD decodIng of nAnopore sigNals) is a nanopore direct RNA basecaller. It uses a probabilistic model of mRNA language that is incorporated into a modified CTC beam search decoding algorithm.

RADIAN combines chunk-level CTC matrices in a novel way: it averages the overlapping rows of adjacent chunks to assemble a global matrix prior to CTC beam search decoding, because chunk-level assembly is exact in matrix space but ambiguous in nucleotide space.
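As a rough illustration of the row-averaging step, the sketch below stitches chunk-level matrices into one global matrix. The function name `assemble_global_matrix`, the `(start_row, matrix)` chunk layout, and the toy probabilities are assumptions made for this example; they are not taken from the RADIAN code.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def assemble_global_matrix(chunks: Sequence[Tuple[int, Matrix]], total_len: int) -> Matrix:
    """Average overlapping rows of chunk-level matrices into one global matrix.

    Each chunk is (start_row, matrix); every matrix row is a per-timestep
    vector of symbol probabilities. The layout is hypothetical.
    """
    if not chunks:
        return []
    n_symbols = len(chunks[0][1][0])
    sums = [[0.0] * n_symbols for _ in range(total_len)]
    counts = [0] * total_len
    for start, matrix in chunks:
        for offset, row in enumerate(matrix):
            t = start + offset
            counts[t] += 1
            for j, p in enumerate(row):
                sums[t][j] += p
    # Rows covered by several chunks are averaged; uncovered rows stay at zero.
    return [
        [v / counts[t] for v in sums[t]] if counts[t] else sums[t]
        for t in range(total_len)
    ]


# Two chunks of two rows each, overlapping on global row 1.
chunks = [
    (0, [[1.0, 0.0], [0.5, 0.5]]),
    (1, [[0.0, 1.0], [0.2, 0.8]]),
]
global_matrix = assemble_global_matrix(chunks, total_len=3)
```

Only the middle row is covered twice, so only that row is an average; decoding would then run once over `global_matrix` instead of once per chunk.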

□ HALO: Towards Hierarchical Causal Representation Learning for Nonstationary Multi-Omics Data

>> https://www.biorxiv.org/content/10.1101/2022.10.17.512602v1

HALO (Hierarchical cAusal representation Learning for Omics data) adopts a causal approach to model non-stationary causal relations using independent changing mechanisms in co-profiled single-cell ATAC- and RNA-seq data.

HALO enforces hierarchical causal relations between coupled and decoupled omics information in latent space. It allows us to identify the dynamic interplay between chromatin accessibility and transcription through temporal modulations.

□ WarpSTR: Determining tandem repeat lengths using raw nanopore signals

>> https://www.biorxiv.org/content/10.1101/2022.11.05.515275v1

Nanopore signal is scaled and shifted differently in each sequencing read and it needs to be normalized before analysis so that the resulting values can be compared to the expected signal levels defined in the k-mer tables.

WarpSTR is an alignment-free algorithm for analysing STR alleles using raw nanopore sequencing reads. The method uses Guppy basecalling annotation output to extract the region of interest, and finite-state automata based on dynamic time warping.
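To make the two ingredients above concrete, here is a minimal sketch of per-read normalization followed by textbook dynamic time warping against expected signal levels. It is not WarpSTR's automaton-based method: the z-score normalization, the function names, and the toy level values are all assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def normalize(signal: Sequence[float]) -> List[float]:
    """Remove a per-read shift and scale (z-score; an assumed choice)."""
    mean = statistics.mean(signal)
    sd = statistics.pstdev(signal)
    if sd == 0:
        return [0.0 for _ in signal]
    return [(x - mean) / sd for x in signal]


def dtw_distance(signal: Sequence[float], expected: Sequence[float]) -> float:
    """Classic O(n*m) dynamic time warping cost with absolute difference."""
    n, m = len(signal), len(expected)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(signal[i - 1] - expected[j - 1])
            # Extend the cheapest of: stay on the level, skip ahead, or step both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same underlying levels, observed with a different scale and shift.
read_a = normalize([1.0, 2.0, 3.0])
read_b = normalize([15.0, 25.0, 35.0])
```

After normalization the two reads are directly comparable, which is the point of the scaling step described above; DTW then absorbs differences in how long each level is held.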

□ Falign: An effective alignment tool for long noisy 3C data

>> https://www.biorxiv.org/content/10.1101/2022.10.30.514399v1

Falign is a sequence alignment method that adapts to fragmented long noisy reads, such as Pore-C reads. It contains four modules: 1) long fragment candidate detection; 2) monosome long fragment candidate extension; 3) monosome gap filling; and 4) polysomy gap filling.

Falign uses a local DDF chain scoring algorithm to select fragment candidates and extend the long fragment candidates. Falign selects short fragments and uses a dynamic programming-based method to generate the most plausible set of fragment alignments.

□ Seed-chain-extend alignment is accurate and runs in close to O(m log n) time for similar sequences: a rigorous average-case analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512303v1

This work gives the first average-case bounds on runtime and optimality for the sketched k-mer seed-chain-extend alignment heuristic under a pairwise mutation model. It shows that the alignment is mostly constrained to be near the correct diagonal of the alignment matrix and that the runtime is close to linear.

Finding the smallest s-mer among the k − s + 1 s-mers in a k-mer takes k − s + 1 iterations, so finding all open syncmer seeds in S′ takes O((k − s + 1)m) = O(mk) = O(m log n) time. Subsampling Θ(1/log n) of k-mers asymptotically reduces the bounds on chaining time.
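The per-k-mer scan described above can be sketched as follows. A k-mer is treated as an open syncmer when its smallest s-mer starts at offset t. Lexicographic order stands in for the random hash a real implementation would use, and the function names and default `t=0` are assumptions for this example.

```python
from typing import List, Tuple


def is_open_syncmer(kmer: str, s: int, t: int = 0) -> bool:
    """Scan the k - s + 1 s-mers and test whether the smallest sits at offset t."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    # list.index returns the leftmost minimum, which breaks ties deterministically.
    return smers.index(min(smers)) == t


def open_syncmer_seeds(seq: str, k: int, s: int, t: int = 0) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for every open syncmer in seq."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_open_syncmer(seq[i:i + k], s, t)
    ]


seeds = open_syncmer_seeds("ACGTGCA", k=4, s=2)
```

Each call to `is_open_syncmer` does k − s + 1 comparisons, so scanning a sequence of length m costs O((k − s + 1)m), matching the bound quoted above.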

□ Aligning Distant Sequences to Graphs using Long Seed Sketches

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513890v1

MetaGraph Align (MG-Align) follows a seed-and-extend approach, with a dynamic program to determine which path to take in the graph, producing a semi-global alignment. A few modifications adjust for misaligned anchors in the MG-Sketch seeder.

The seeder uses long inexact seeds based on Tensor Sketching. To retrieve similar sketch vectors efficiently, the sketches of nodes are stored in a Hierarchical Navigable Small Worlds (HNSW) index.

The method scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length.

□ MetaGraph-MLA: Label-guided alignment to variable-order De Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.11.04.514718v1

Multi-label alignment (MLA) extends current sequence alignment scoring models with additional label change operations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content.

MetaGraph-MLA is an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework. It utilizes a variable-order De Bruijn graph and introduces node length change as an operation.

□ IntegratedLearner: An integrated Bayesian framework for multi-omics prediction and classification

>> https://www.biorxiv.org/content/10.1101/2022.11.06.514786v1

The IntegratedLearner algorithm proceeds by fitting a machine learning algorithm per layer to predict the outcome (base_learner) and combining the layer-wise cross-validated predictions using a meta model (meta_learner) to generate final predictions based on all available data points.

□ RecGraph: adding recombinations to sequence-to-graph alignments

>> https://www.biorxiv.org/content/10.1101/2022.10.27.513962v1

RecGraph is a sequence-to-graph aligner written in Rust. RecGraph is an exact approach that implements a dynamic programming algorithm for computing an optimal alignment that allows recombinations with an affine penalty.

RecGraph can allow recombinations in the alignment in a controlled (i.e., non heuristic) way. RecGraph identifies a new path of the variation graph which is a mosaic of two different paths, possibly joined by a new arc.

□ Echtvar: compressed variant representation for rapid annotation and filtering of SNPs and indels

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac931/6775383

Echtvar efficiently encodes variant allele frequency and other information from huge population datasets to enable rapid (1M variants/second) annotation of genetic variants. It chunks the genome into 1<<20 (~1 million) bases and encodes each variant into a 32-bit integer.
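The idea of splitting a position into a chunk index and a compact in-chunk integer can be sketched like this. The bit layout below (20 bits of offset, 2 bits each for the reference and alternate base, SNPs only) is purely hypothetical and chosen for readability; it is not Echtvar's actual encoding, which also has to handle indels.

```python
from typing import Tuple

CHUNK_BITS = 20                      # 2**20 bases, about one million, per chunk
CHUNK_MASK = (1 << CHUNK_BITS) - 1
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_snp(pos: int, ref: str, alt: str) -> Tuple[int, int]:
    """Return (chunk index, packed value) for a SNP; hypothetical layout."""
    chunk = pos >> CHUNK_BITS
    offset = pos & CHUNK_MASK
    # offset in the high bits, then 2 bits for ref and 2 bits for alt
    value = (offset << 4) | (BASE_CODE[ref] << 2) | BASE_CODE[alt]
    return chunk, value


def decode_snp(chunk: int, value: int) -> Tuple[int, str, str]:
    """Invert encode_snp."""
    offset = value >> 4
    ref = CODE_BASE[(value >> 2) & 3]
    alt = CODE_BASE[value & 3]
    return (chunk << CHUNK_BITS) | offset, ref, alt


chunk, packed = encode_snp(3_000_000, "A", "G")
```

Because the chunk index is stored once per chunk, each variant only needs its offset and alleles, which is what lets a variant fit into a single small integer.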

□ Sketching and sampling approaches for fast and accurate long read classification

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05014-0

Hierarchical clustering requires O(n^3) time / Ω(n^2) space to cluster n elements. Computation of a minimizer sketch can be done naively in O(nw) by choosing the minimum of the hashes in the O(n) windows, or in O(n) by using an integer representation of the k-mers in the sequence.
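The naive O(nw) variant mentioned above can be sketched as a direct scan over every window of w consecutive k-mers. The CRC32 hash, the leftmost tie-breaking rule, and the function names are assumptions for this example rather than details from the paper.

```python
import zlib
from typing import Callable, List, Tuple


def crc_hash(kmer: str) -> int:
    """Deterministic stand-in for a k-mer hash function (an assumed choice)."""
    return zlib.crc32(kmer.encode("ascii"))


def minimizer_sketch(
    seq: str, k: int, w: int, key: Callable[[str], object] = crc_hash
) -> List[Tuple[int, str]]:
    """Naive minimizer sketch: take the minimum-hash k-mer in each window."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [key(x) for x in kmers]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # O(w) scan per window; ties go to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


# Lexicographic order is used here only to keep the toy result easy to follow.
sketch = minimizer_sketch("ACGTAC", k=2, w=3, key=lambda kmer: kmer)
```

Adjacent windows often share their minimum, which is why the sketch is much smaller than the full k-mer list; the O(n) variant avoids rescanning each window from scratch.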

□ Targeting non-coding RNA family members with artificial endonuclease XNAzymes

>> https://www.nature.com/articles/s42003-022-03987-5

This work engineers a series of artificial oligonucleotide enzymes (XNAzymes) composed of 2’-deoxy-2’-fluoro-β-D-arabino nucleic acid (FANA) that specifically or preferentially cleave individual ncRNA family members under quasi-physiological conditions.

A catalytic XNA nanostructure has improved biostability and targets multiple microRNAs. An electrophoretic mobility shift equivalent to the assembled tetrahedron (207 nts) was observed when all three components were annealed.

□ SPACE: Exploiting spatial dimensions to enable parallelized continuous directed evolution

>> https://www.embopress.org/doi/full/10.15252/msb.202210934

SPACE is a system for rapid, parallelizable evolution of biomolecules that introduces spatial dimensions into the continuous evolution system. The system leverages competition over space, wherein evolutionary progress is closely associated with the production of spatial patterns.

SPACE uses a mathematical model, RESIR - Range Expansion with Susceptible Infected Recovered kinetics. SPACE is applied to evolve the promoter recognition of T7 RNA polymerase to a library of 96 random sequences in parallel.

□ Holographic-(V)AE: an end-to-end SO(3)-Equivariant (Variational) Autoencoder in Fourier Space

>> https://www.biorxiv.org/content/10.1101/2022.09.30.510350v1

As spherical harmonics form a basis for the irreps of SO(3), the SO(3) group acts on spherical Fourier space via a direct sum of irreps. The ZFT encodes a data point into a tensor composed of a direct sum of features, each associated with a degree l indicating the irrep.

These tensors are referred to as SO(3)-steerable tensors, and the vector spaces they occupy as SO(3)-steerable vector spaces, or simply steerable for short, since this work deals only with the SO(3) group.

H-(V)AE reconstructs the spherical Fourier space encoding of data, learning in the process a latent space with a maximally informative invariant embedding alongside an equivariant frame describing the orientation of the data.

□ Entropy predicts fuzzy-seed sensitivity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.512198v1

The entropy of a seed cover (a stretch of neighboring seeds) is a good predictor for seed sensitivity. The authors propose a model to estimate the entropy of a seed cover and find that seed covers with high entropy typically have high match sensitivity.

Altstrobes are modified randstrobes where the strobe length alternates between shorter and longer strobes. Mixedstrobes sample either a k-mer or a strobemer at a specified fraction. Subsampled randstrobes and mixedstrobes are used within minimap2 for the most divergent sequences.

□ The maximum entropy principle for compositional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05007-z

Compositional Maximum Entropy (CME) is a probabilistic framework for inferring the behaviors of compositional systems. By integrating the prior geometric structure of compositions, CME infers the underlying multivariate relationships between the constituent components.

The principle of maximum entropy deduces the simplex-truncated normal distribution from the given moment constraints. The simplex pseudolikelihood method provides consistent parameter estimates and is asymptotically equivalent to maximum likelihood estimation.

□ SDRAP for annotating scrambled or rearranged genomes

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513505v1

SDRAP, Scrambled DNA Rearrangement Annotation Protocol, annotates DNA segments in DNA rearrangement precursor and product genomes which describe the rearrangement, and computes properties of the rearrangements reflecting their complexity.

SDRAP implements a heuristic adaptation of the Smith-Waterman gapped local sequence alignment algorithm. The regions on the precursor sequence in between precursor intervals of the union of all arrangements are annotated as eliminated sequences.

□ Free decomposition spaces

>> https://arxiv.org/pdf/2210.11192v1.pdf

This work constructs an equivalence of ∞-categories. Left Kan extension along the inclusion j : Δ_inert → Δ takes general objects to Möbius decomposition spaces and general maps to CULF maps.

The Aguiar–Bergeron–Sottile map to the decomposition space of quasi-symmetric functions, from any Möbius decomposition space, factors through the free decomposition space of nondegenerate simplices, which offers an explanation of the zeta function in the universal property of QSym.

□ The central sheaf of a Grothendieck category

>> https://arxiv.org/pdf/2210.12419v1.pdf

The center Z(A) of an abelian category A is the endomorphism ring of the identity functor on that category. A localizing subcategory of a Grothendieck category C is said to be stable if it is stable under essential extensions.

When the Grothendieck category C is locally noetherian, an alternative version of the central sheaf ZC can be constructed; it is a sheaf on the topological space Sp(C) equipped with the so-called stable topology.

□ Enhanced Auslander-Reiten duality and tilting theory for singularity categories

>> https://arxiv.org/abs/2209.14090v1

An equivalence is shown to exist as soon as there is a triangle equivalence between the graded singularity category of a Gorenstein ring and the derived category of a finite dimensional algebra.

This applies to Gorenstein rings of dimension at most 1, quotient singularities, and Geigle-Lenzing complete intersections, including finite or infinite Grassmannian cluster categories, realizing their singularity categories as cluster categories of finite dimensional algebras.

□ MD-Cat: Expectation-Maximization enables Phylogenetic Dating under a Categorical Rate Model

>> https://www.biorxiv.org/content/10.1101/2022.10.06.511147v1

MD-Cat (Molecular Dating using Categorical-models) uses a categorical model to approximate the unknown continuous clock model. It is inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories.

Although the rate categories are discrete, the model has the power to approximate a continuous clock model if k is large and there are enough data. MD-Cat has fewer assumptions about the true clock model than parametric models such as Gamma or LogNormal distribution.

The EM algorithm maximizes the likelihood function associated with this model, where the k rate categories and branch lengths in time units are modeled as unknown parameters and co-estimated. The E-step and M-step can be computed efficiently, and the algorithm is guaranteed to converge.

□ STREAMLINE: Structural and Topological Performance Analysis of Algorithms for the Inference of Gene Regulatory Networks from Single-Cell Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514493v1

STREAMLINE quantifies the ability of algorithms to capture topological properties of networks and identify hubs. The repository contains all the files necessary to perform the analysis. The implementation is compatible with BEELINE.

□ SCOR: Estimating the optimal linear combination of predictors using spherically constrained optimization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04953-y

Spherically Constrained Optimization Routine (SCOR) can be used in various other statistical problems such as directional statistics or single-index models where fixing the norm of the coefficient vector is needed to avoid the issue of non-identifiability.

SCOR obtains better estimates of the empirical hypervolume under the manifold (EHUM). In the future, the SCOR algorithms can be extended to the variable selection problem over the coefficients belonging to the surface of a unit sphere.

□ BRANEnet: embedding multilayer networks for omics data integration

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04955-w

BRANEnet is a novel multi-omics integration framework for multilayer heterogeneous networks. It is an expressive, scalable, and versatile method to learn node embeddings, leveraging random walk information within a matrix factorization framework.

□ SCTC: inference of developmental potential from single-cell transcriptional complexity

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512265v1

The 0th-order complexities of a cell and a gene are calculated by summing the weights of the edges connected to them. The 1st-order complexities of a cell and a gene can be obtained by averaging the 0th-order complexities. SCTC calculates the complexity of each order and uses them to reconstruct the pseudo-temporal path.

□ DeepSelectNet: Deep Neural Network Based Selective Sequencing for Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513498v1

DeepSelectNet is an improved 1D ResNet based model to classify Oxford Nanopore raw electrical signals as target or non-target for Read-Until sequence enrichment or depletion. DeepSelectNet provides enhanced model performance.

DeepSelectNet relies on neural net regularization to minimise model complexity thereby reducing the overfitting of data. A longer signal segment means having a larger k-mer size that allows distinguishing species better, thereby the model may classify better with longer segments.

□ INSERT-seq enables high-resolution mapping of genomically integrated DNA using Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02778-9

INSERT-seq incorporates amplification based enrichment and UMI amplification with a computational pipeline to process integration sites. INSERT-seq can sensitively detect insertion sites with frequencies as low as 1%. Such sensitivity could be improved with more sequencing depth.

□ Ultra-fast joint-genotyping with SparkGOR

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513331v1

The pipeline accepts single sample gVCF-like input and generates pVCF-like output. By converting multi-allelic locus based variant calls to bi-allelic variants, it simplifies the joint-genotyping computation dramatically while maintaining quality and concordance with GIAB samples.
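A minimal sketch of the multi-allelic to bi-allelic conversion is shown below. The tuple layout, the function name, and especially the rule that other alternate alleles are recoded as reference (0) are assumptions for illustration; tools differ on that rule, and this is not SparkGOR's implementation.

```python
from typing import List, Sequence, Tuple

BiallelicCall = Tuple[str, int, str, str, Tuple[int, ...]]


def split_multiallelic(
    chrom: str, pos: int, ref: str, alts: Sequence[str], genotype: Sequence[int]
) -> List[BiallelicCall]:
    """Emit one bi-allelic record per ALT allele.

    genotype holds allele indices (0 = REF, 1.. = index into alts).
    Assumption: alleles other than the current ALT are recoded as 0.
    """
    records = []
    for alt_index, alt in enumerate(alts, start=1):
        recoded = tuple(1 if allele == alt_index else 0 for allele in genotype)
        records.append((chrom, pos, ref, alt, recoded))
    return records


# A heterozygous call carrying both alternate alleles (genotype 1/2).
calls = split_multiallelic("chr1", 100, "A", ["C", "T"], (1, 2))
```

Once every record has exactly one alternate allele, each genotype reduces to a count of that allele, which is what makes the downstream joint-genotyping arithmetic simpler.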

A Spark implementation of XGBoost is used to train and predict variant classification. The authors also used the Sentieon release of the GATK VQSR Gaussian-mixture algorithm with the features MQ, QD, DP, MQRankSum, ReadPosRankSum, FS, SOR, InbreedingCoeff.

□ Deep mendelian randomization: Investigating the causal knowledge of genomic deep learning models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009880

Deep Mendelian Randomization (DeepMR) obtains local and global estimates of linear causal relationships between marks. DeepMR gives accurate and unbiased estimates of the ‘true’ global causal effect, but its coverage decays in the presence of sequence-dependent confounding.

DeepMR can estimate overall per-exposure causal effects using a random effects meta-analysis across sequence regions (loci) and provide further evidence for previously hypothesized relationships between TFs identified by BPNet.

□ NanoBlot: A Simple Tool for Visualization of RNA Isoform Usage From Third Generation RNA-sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513894v1

NanoBlot takes aligned, positionally-sorted, and indexed BAM files as input. NanoBlot requires a series of target genomic regions referred to as “probes”. NanoBlot removes any reads which map to the antiprobe(s) region.

□ MetaLP: An integrative linear programming method for protein inference in metaproteomics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010603

MetaLP is a protein inference algorithm for metaproteomics that uses an integrative linear programming method. Taxonomic abundance information extracted from metagenomics shotgun sequencing or 16S rRNA gene amplicon sequencing is incorporated as prior information in MetaLP.

MetaLP expresses the joint probability with a chain rule to transform it into a chain of conditional probabilities, which could be easily added as logical constraints. The LP model can be solved quickly by existing LP solvers.

□ HAT: Haplotype Assembly Tool using short and error-prone long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac702/6779972

HAT creates seeds based on short read alignments and the location of SNPs. Then, it removes the combinations of alleles with low support as well as overlapping seeds. Next, HAT finds multiplicity blocks and creates the first phased blocks within them.

HAT assigns reads to the blocks and haplotypes; based on these read assignments it fills the unphased SNPs within blocks. Finally, HAT can also use miniasm to assemble haplotype sequences for each block and polish the assemblies using Pilon.

□ HaploDMF: viral Haplotype reconstruction from long reads via Deep Matrix Factorization

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac708/6780015

HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype.

□ kmdiff, large-scale and user-friendly differential k-mer analyses

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac689/6782954

kmdiff provides differential k-mers analysis between two populations (control and case). Each population is represented by a set of short-read sequencing. Outputs are differentially represented k-mers between controls and cases.

kmdiff deviates from HAWK in the k-mer counting part. HAWK counts k-mers of each sample before loading and testing batches of them using a hash table.

kmdiff constructs a k-mer matrix, i.e. an abundance matrix with k-mers in rows and samples in columns. This matrix is not represented as a whole; instead, sub-matrices are streamed in parallel using kmtricks.
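For intuition, the k-mer abundance matrix can be built in memory for a toy input as below. This is only a sketch of the data structure: the function names and sample reads are made up, and the whole point of kmdiff and kmtricks is to avoid materializing the full matrix like this.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_kmers(reads: List[str], k: int) -> Counter:
    """Count every k-mer occurring in a sample's reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_matrix(
    samples: Dict[str, List[str]], k: int
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (k-mers, sample names, matrix) with k-mers in rows, samples in columns."""
    names = list(samples)
    counts = {name: count_kmers(reads, k) for name, reads in samples.items()}
    kmers = sorted(set().union(*counts.values()))
    rows = [[counts[name][kmer] for name in names] for kmer in kmers]
    return kmers, names, rows


kmers, names, rows = kmer_matrix(
    {"control": ["ACGT"], "case": ["ACGA", "ACG"]}, k=3
)
```

Each row is the abundance profile of one k-mer across samples, so a differential test between controls and cases can be applied row by row, which is also what makes streaming sub-matrices possible.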

Goliath.

2022-10-31 22:13:13 | Science News

(Artwork by Carl Hsuser)

□ Velorama: Unraveling causal gene regulation from the RNA velocity graph using Velorama

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512766v1

Velorama is a novel conceptual approach to causal GRN inference that represents scRNA-seq differentiation dynamics as a partial ordering of cells and operates on the directed acyclic graph (DAG) of cells constructed from pseudotime or RNA velocity measurements.

Velorama substantially outperforms a diverse set of pseudotime-based GRN inference methods. It uses a generalization of Granger causality to partial orderings, built on a graph neural network framework.

□ Deep unfolded convolutional dictionary learning for motif discovery

>> https://www.biorxiv.org/content/10.1101/2022.11.06.515322v1

The CDL approximates each input sequence with a sparse linear combination of shift-invariant filters. The basic idea is to approximate each DNA string s as a sum of convolutions of feature vectors with sparse vectors.
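The reconstruction side of this model, a sum of filter-by-sparse-code convolutions, can be sketched in a few lines. For simplicity the example works on a one-dimensional numeric signal rather than a one-hot encoded DNA string, and the filter and code values are invented for illustration; learning the filters and codes is not shown.

```python
from typing import List, Sequence


def convolve(filt: Sequence[float], code: Sequence[float]) -> List[float]:
    """Full 1-D convolution of a filter with a (mostly zero) code vector."""
    out = [0.0] * (len(filt) + len(code) - 1)
    for i, z in enumerate(code):
        if z == 0:
            continue  # sparse codes: only non-zero activations contribute
        for j, d in enumerate(filt):
            out[i + j] += z * d
    return out


def reconstruct(filters: Sequence[Sequence[float]], codes: Sequence[Sequence[float]]) -> List[float]:
    """Sum of convolutions of each filter with its sparse code."""
    n = len(filters[0]) + len(codes[0]) - 1
    total = [0.0] * n
    for filt, code in zip(filters, codes):
        for t, v in enumerate(convolve(filt, code)):
            total[t] += v
    return total


filters = [[1.0, 2.0], [1.0, -1.0]]
codes = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]  # one activation per filter
approx = reconstruct(filters, codes)
```

Each non-zero code entry places a scaled copy of its filter at that position, which is what makes the representation shift-invariant: the same motif filter can fire anywhere along the sequence.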

The unfolded convolutional dictionary learning (uCDL) extends the resulting computational graph from deep unfolding for downstream regulatory genomics problems to extract the sparse code of syntactic and semantic structures in the DNA strings.

□ scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512320v1

scMultiSim is a unified framework to jointly model biological factors including cell-cell interactions, within-cell GRNs and chromatin accessibility. scMultiSim simulates discrete or continuous cell populations and outputs the ground truth.

scMultiSim models the cellular heterogeneity and stochasticity of gene regulation effects through a mechanism with Cell Identity Factors and Gene Identity Vectors. A Gaussian random walk along the tree is performed for each cell to generate the n-dimensional diff-CIF vector.

□ scCobra: Contrastive cell embedding learning with domain adaptation for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513389v1

scCobra employs contrastive learning and domain adaptation. The contrastive learning network is utilized to learn latent embeddings, domain-adaptation is employed to batch-normalize the latent embeddings, while generative adversarial networks further optimize the blending effect.

The cross-entropy discrimination loss will be backpropagated to optimize the encoder through adversarial training to remove the batch information from the cell embeddings. scCobra does not need to specify a batch as the anchor map.

□ FIST-nD: A tool for n-dimensional spatial transcriptomics data imputation via graph-regularized tensor completion

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511928v1

FIST-nD (Fast Imputation of Spatially-resolved transcriptomes by graph-regularized Tensor completion in n-Dimensions) minimizes an objective function of graph-regularized tensor completion over the GE and a tensor product graph of the spatial chain graphs of each spatial axis.

FIST-nD generalizes any n-dimensional tensor completion and the matched higher-order graph. The objective function minimizes the difference between the observed and the imputed tensor under a smoothness constraint defined on the graph Laplacian of a Cartesian product.

□ Protein-to-genome alignment with miniprot

>> https://arxiv.org/pdf/2210.08052.pdf

Miniprot is a new aligner for mapping protein sequences to a complete genome. It integrates recent techniques such as syncmer sketch and SIMD-based dynamic programming.

Miniprot broadly follows the seed-chain-extend strategy used by minimap2. Miniprot extracts syncmers on a query protein, finds seed matches (aka anchors), and then performs chaining. It closes unaligned regions between anchors and extends from terminal anchors.

□ Efficient minimizer orders for large values of k using minimum decycling sets

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512682v1

Decycling set-based minimizer orders are new orders based on minimum decycling sets, which are guaranteed to hit any infinitely long sequence. They select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets, and can also scale up to larger k.

An efficient method is developed to query in linear time if a k-mer belongs to a minimum decycling set without the need to construct, store, or query the whole set. The minimum decycling set is the one constructed by Mykkeltveit’s algorithm.

□ scGSEA / scMAP: Single-cell gene set enrichment analysis and transfer learning for functional annotation of scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513476v1

scGSEA is a statistical framework for scoring coordinated gene activity in individual cells to automatically determine which pathways are active in a cell. scGSEA leverages NMF expression latent factors to infer pathway activity at a single cell level.

scMAP (single-cell Mapper) is a transfer learning algorithm that combines text mining data transformation and a k-nearest neighbours (KNN) classifier to map a query set of single-cell transcriptional profiles onto a reference atlas.

□ transmorph: a unifying computational framework for single-cell data integration

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514912v1

The capabilities of transmorph and the value of its expressiveness are demonstrated by solving a variety of practical single-cell applications, including supervised and unsupervised joint dataset embedding, RNA-seq integration in gene space, and label transfer of cell cycle phase within cell cycle gene space.

□ iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02780-1

iDNA-ABF is a multi-scale biological language learning model that builds the mapping from natural language to biological language, and the mapping from methylation-related sequential determinants to their functions.

iDNA-ABF tokenizes a DNA sequence with k-mer representations. In this way, each token is represented by k bases, thus integrating richer contextual information for each nucleotide.
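A minimal sketch of k-mer tokenization is shown below, assuming overlapping k-mers with a stride of one base, which is a common convention for DNA language models. The stride and the function name are assumptions; the summary above does not state which scheme iDNA-ABF uses.

```python
from typing import List


def kmer_tokenize(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer tokens taken every `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


tokens_3mer = kmer_tokenize("ACGTA", k=3)
tokens_4mer = kmer_tokenize("ACGTA", k=4)
```

With a stride of one, each interior nucleotide appears in up to k tokens, so its representation is informed by its neighbors on both sides; using several values of k is one way to obtain multi-scale views of the same sequence.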

□ TRIAGE-Cluster: Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity

>> https://www.biorxiv.org/content/10.1101/2022.10.12.512003v1

TRIAGE-Cluster (Transcriptional Regulatory Inference Analysis of Gene Expression - Cluster) uses genome-wide repressive epigenetic data from diverse bio-samples to identify genes demarcating cell diversity in any scRNA-seq data set.

TRIAGE devises a genome-wide quantitative feature called a repressive tendency score (RTS) which can be used as an unsupervised independent reference point to infer cell-type regulatory potential for each protein-coding gene.

TRIAGE-Cluster integrates patterns of H3K27me3 domains deposited across hundreds of cell types with weighted density estimation to determine cell clusters. TRIAGE-ParseR parses any input rank gene list to define gene groups governing the identity and function of cell types.

□ AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac683/6762076

AIscEA defines a ranked similarity score to quantify the biological consistency between cell clusters across measurements. AIscEA uses the ranked similarity score and a novel permutation test to identify cluster alignment.

AIscEA further utilizes graph alignment for the aligned cell clusters to align the cells across measurements. AIscEA is highly robust to the choice of hyper-parameters and can better handle the cluster heterogeneity problem.

□ JAMIE: Joint Variational Autoencoders for Multi-Modal Imputation and Embedding

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512388v1

JAMIE takes multi-modal data that can have partially matched samples across modalities. VAEs learn the latent embeddings of each modality. Then, embeddings from matched samples across modalities are aggregated to identify joint cross-modal latent embeddings before reconstruction.

The resultant latent space may be processed by the opposite decoder. JAMIE is able to use partial correspondence information. JAMIE combines the reusability and flexible latent space generation of autoencoders with the automated correspondence estimation of alignment methods.

□ WGT: Tools and algorithms for recognizing, visualizing and generating Wheeler graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512390v1

Wheelie is an algorithm that combines a renaming heuristic with a Satisfiability Modulo Theory (SMT) solver to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a graph with thousands of nodes in seconds.

Graphs used for evaluation were generated using WGT’s generator algorithms, which can produce De Bruijn graphs, tries, reverse deterministic graphs derived from multiple alignments, complete random Wheeler graphs, and d-NFA random Wheeler graphs.

□ DISA: Discriminative and informative subspace assessment with categorical and numerical outcomes

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0276253

DISA (Discriminative and Informative Subspace Assessment) is proposed to evaluate patterns in the presence of numerical outcomes using two measures together w/ a novel principle able to statistically assess the correlation gain of the subspace against the overall space.

DISA accomplishes this by approximating two probability density functions (e.g. Gaussians), one for all the observed targets and the other with targets of the pattern coverage.

Two interestingness measures are the confidence, Φ(φJ→c)/Φ(φJ), which measures the probability of c occurring when φJ occurs, and the lift, (Φ(φJ→c)/(Φ(φJ)×Φ(c)))×N, which considers the probability of the consequent to assess the dependence between the consequent and antecedent.
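The two measures above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the counts are made up, with Φ read as a support count and N as the total number of observations.

```python
def confidence(support_pattern_and_class: int, support_pattern: int) -> float:
    """Share of the observations matching the pattern that also carry class c."""
    return support_pattern_and_class / support_pattern


def lift(support_pattern_and_class: int, support_pattern: int,
         support_class: int, n_observations: int) -> float:
    """Observed co-occurrence relative to what independence would predict."""
    return (support_pattern_and_class / (support_pattern * support_class)) * n_observations


# Hypothetical counts: 100 observations, 20 match the pattern,
# 40 carry class c, and 16 match both.
conf = confidence(16, 20)
lft = lift(16, 20, 40, 100)
```

A lift above 1 suggests that the pattern and the class co-occur more often than they would if they were independent.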

DISA extracts the element-wise indication of the sign of each number on the resulting array, calculates the discrete difference along the sign vector (value at position i+1 minus value at position i), and finally finds the indices of elements that are non-zero, grouped by element.
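The sign, difference and non-zero steps read much like NumPy's `sign`, `diff` and `nonzero`. The sketch below reproduces the same three steps in plain Python; the helper names and the input values are hypothetical and are not taken from the DISA code.

```python
def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def sign_change_indices(values):
    """Indices i where the sign at position i+1 differs from the sign at i."""
    signs = [sign(v) for v in values]
    # Discrete difference: value at position i+1 minus value at position i.
    diffs = [signs[i + 1] - signs[i] for i in range(len(signs) - 1)]
    # Keep the positions whose difference is non-zero.
    return [i for i, d in enumerate(diffs) if d != 0]


# Made-up array that changes sign twice.
values = [0.5, 0.2, -0.1, -0.4, 0.3]
crossings = sign_change_indices(values)
```

Each returned index marks the last position before the array changes sign.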

□ GAVISUNK: Genome assembly validation via inter-SUNK distances in Oxford Nanopore reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac714/6793851

GAVISUNK, an open-source pipeline that detects misassemblies and produces a set of reliable regions genome-wide by assessing concordance of distances between unique k-mers in Pacific Biosciences high-fidelity (HiFi) assemblies and raw Oxford Nanopore Technologies reads.

GAVISUNK may be applied to any region or genome assembly to identify misassemblies and potential collapses and is, thus, particularly valuable for validating the integrity of regions with large and highly identical repeats that are more prone to assembly error.

□ Filter inference: A scalable nonlinear mixed effects inference approach for snapshot time series data

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514702v1

Filter inference is a new variant of approximate Bayesian computation, with dominant computational costs that do not increase with the number of measured individuals, making efficient inferences from snapshot measurements possible.

Filter inference also scales well with the number of model parameters, using gradient-based Hamiltonian Monte Carlo (HMC) algorithms, such as the No-U-Turn Sampler (NUTS).

□ A graph clustering algorithm for detection and genotyping of structural variants from long reads

>> https://www.biorxiv.org/content/10.1101/2022.11.04.515241v1

The algorithm starts collecting evidence (Signatures) of SVs from read alignments. Then, signatures are clustered based on a Euclidean graph with coordinates calculated from lengths and genomic positions.

Clustering is performed by the DBSCAN algorithm, which provides the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model allows precise genotyping of SVs based on their supporting evidence.
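As a rough illustration of the clustering step, here is a minimal pure-Python DBSCAN over signatures encoded as (genomic position, length) points. The encoding, the `eps` radius and the `min_pts` threshold are assumptions made for this example; the actual coordinate scaling used by the tool is not described here.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, p in enumerate(points) if math.dist(points[idx], p) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Hypothetical signatures as (position, length) pairs.
signatures = [(1000, 50), (1005, 52), (998, 49), (5000, 300), (5003, 305), (9000, 10)]
labels = dbscan(signatures, eps=10, min_pts=2)
```

Points labelled -1 are isolated signatures that no cluster supports.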

□ Dashing 2: genomic sketching with multiplicities and locality-sensitive hashing

>> https://www.biorxiv.org/content/10.1101/2022.10.16.512384v1

Dashing 2, a method that builds on the SetSketch data structure. SetSketch is related to HyperLogLog, but discards use of leading zero count in favor of a truncated logarithm of adjustable base.

Dashing 2 can sketch BigWig inputs encoding numerical coverage vectors. Dashing 2 has modes for computing Jaccard coefficients in an exact manner, without sketching or estimation.

Unlike HLL, SetSketch can perform multiplicity-aware sketching when combined with the ProbMinHash method. Dashing 2 integrates locality-sensitive hashing to scale all-pairs comparisons to millions of sequences.
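For reference, the quantity that the exact (non-sketched) mode computes is the plain Jaccard coefficient between k-mer sets. The sketch below shows that definition on two toy sequences; the function names are hypothetical, and details such as canonical k-mers or hashing are left out.

```python
def kmers(sequence: str, k: int) -> set:
    """Set of all length-k substrings of the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


seq_a = "ACGTACGT"
seq_b = "ACGTTCGT"
similarity = jaccard(kmers(seq_a, 3), kmers(seq_b, 3))
```

Sketching replaces these full sets with small summaries so that the same coefficient can be estimated without holding every k-mer in memory.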

□ scGWAS: landscape of trait-cell type associations by integrating single-cell transcriptomics-wide and genome-wide association studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02785-w

scGWAS effectively leverages scRNA-seq data to achieve two goals: (1) to infer the cell types in which the disease-associated genes manifest and (2) to construct cellular modules which imply disease-specific activation of different processes.

scGWAS only utilizes the average gene expression for each cell type followed by virtual search processes to construct the null distributions of module scores. scGWAS uses a sequential feedforward module expansion coupled with backward examination (MEBE) algorithm.

□ Vector-clustering Multiple Sequence Alignment: Aligning into the twilight zone of protein sequence similarity with protein language models

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513099v1

vcMSA (vector-clustering Multiple Sequence Alignment) is a true multiple sequence aligner that aligns multiple sequences at once instead of progressively integrating pairwise alignments.

The core methodology diverges from standard MSA methods in that it avoids substitution matrices and gap penalties, and in most cases does not utilize guide tree construction.

vcMSA traces the path of each sequence through clusters and combines all paths into one network, taking edge weights from the number of sequences which traverse between the pairs of clusters.
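The path-to-network step can be pictured with a small counter over consecutive cluster pairs. This is a sketch under simple assumptions: cluster ids are plain strings, each sequence is already reduced to its ordered list of clusters, and `build_cluster_network` is a made-up name.

```python
from collections import Counter


def build_cluster_network(paths):
    """Count how many sequences traverse each directed pair of clusters."""
    edge_weights = Counter()
    for path in paths:
        for source, target in zip(path, path[1:]):
            edge_weights[(source, target)] += 1
    return edge_weights


# Hypothetical paths of three sequences through amino-acid clusters.
paths = [
    ["c1", "c2", "c3"],
    ["c1", "c2", "c4"],
    ["c1", "c3"],
]
network = build_cluster_network(paths)
```

Edges shared by many sequences get high weights, which is the signal used to order clusters into alignment columns.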

□ GGCAT: Extremely-fast construction and querying of compacted and colored de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513174v1

GGCAT, a tool for constructing both types of graphs. Compared to Cuttlefish 2, the state-of-the-art for constructing compacted de Bruijn graphs, GGCAT has a speedup of up to 3.4× for k = 63 and up to 20.8× for k = 255.

Compared to Bifrost, GGCAT achieves a speedup of up to 12.6× for k = 27. GGCAT is up to 480× faster than Bifrost for batch sequence queries on colored graphs. GGCAT is based on a new approach merging the k-mer counting step with the unitig construction step.

□ DNRS: Identifying the critical state of complex biological systems by the directed-network rank score method

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac707/6772809

The progression of a complex biological system is described by the dynamic evolution of a high-dimensional nonlinear system, where a drastic or qualitative shift in a biological process is regarded as a phase transition at a bifurcation point.

DNRS, a model-free approach to detect the early-warning signal of critical transition in complex biological systems. The DNRS can be utilized to quantify the dynamic changes in gene cooperative effects of a time-specific directed network.

□ BEDwARS: A Robust Bayesian Approach to Bulk Gene Expression Deconvolution with Noisy Reference Signatures

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513800v1

BEDwARS tackles the problem of signature mismatch from a complementary angle. It does not assume availability of multiple reference signatures, nor does it rely solely on transformations of data prior to deconvolution.

BEDwARS incorporates the possibility of reference signature mismatch directly into the statistical model used for deconvolution, using the reference to estimate the true cell type signatures underlying the given bulk profiles while simultaneously learning cell type proportions.

□ scTAM-seq enables targeted high-confidence analysis of DNA methylation in single cells

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02796-7

scTAM-seq, a targeted bisulfite-free method for profiling up to 650 CpGs in up to 10,000 cells per experiment, with a dropout rate as low as 7%. scTAM-seq can resolve DNA methylation dynamics across B-cell differentiation in blood and bone marrow, identifying intermediate differentiation states.

Since scTAM-seq exhibits a low FNR and FPR, it can also be used to further investigate imprinted regions, as well as other regions harbouring allele- and strand-specific methylation.

Ultimately, scDNAm values can help to discern cellular heterogeneity from allele-specific methylation, which in bulk data can only be achieved in special situations where SNPs are located on the same sequencing read.

Conversely, allele- and strand-specific methylation might lead to an overestimation of pseudo-bulk DNAm values by scTAM-seq.

□ GENLIB: new function to simulate haplotype transmission in large complex genealogies

>> https://www.biorxiv.org/content/10.1101/2022.10.28.514245v1

The gen.simuhaplo function builds on the GENLIB R package’s existing support for handling large genealogies, allowing users to simulate inheritance of large genomic regions even in genealogies with hundreds of thousands of individuals.

□ Bulk2Space: De novo analysis of bulk RNA-seq data at spatially resolved single-cell resolution

>> https://www.nature.com/articles/s41467-022-34271-z/

Bulk2Space, a spatial deconvolution algorithm based on deep learning frameworks, which generates spatially resolved single-cell expression profiles from bulk transcriptomes using existing high-quality scRNA-seq data and spatial transcriptomics as references.

Bulk2Space first generates single-cell transcriptomic data within the clustering space to find a set of cells whose aggregated data is proximate to the bulk data. Next, the generated single cells are allocated to optimal spatial locations using a spatial transcriptome reference.

□ Normalization and de-noising of single-cell Hi-C data with BandNorm and scVI-3D

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02774-z

BandNorm operates on the stratified off-diagonals (i.e., bands) of the contact matrix and its variants as fast baseline alternatives, namely CellScale and BandScale, which have been utilized for bulk Hi-C and have seen some uptake for scHi-C.

scVI-3D, a deep generative model which systematically takes into account the structural properties and accounts for genomic distance bias, sequencing depth effect, zero inflation, sparsity impact, and batch effects of scHi-C data.

□ Cooltools: enabling high-resolution Hi-C analysis in Python

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514564v1

Cooltools is built directly on top of the cooler storage format and library, which allows it to operate on sparse matrices and/or out-of-core, either on raw counts or normalized contact matrices. In particular, many operations are performed via iteration over chunks of non-zero pixels.

□ Singletrome: A method to analyze and enhance the transcriptome with long noncoding RNAs for single cell analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.31.514182v1

Singletrome interrogates lncRNAs in scRNA-seq data using a custom genome annotation of 110,599 genes consisting of 19,384 protein-coding genes from GENCODE and 91,215 lncRNA genes from LncExpDB.

□ GMMchi: gene expression clustering using Gaussian mixture modeling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05006-0

GMMchi, a Python package that leverages Gaussian Mixture Modeling to detect and characterize bimodal gene expression patterns across cancer samples, as a tool to analyze such correlations using 2 × 2 contingency table statistics.

As GMMchi determines the numbers of bins based on the Mann and Wald bin criterion, this renders the bin numbers dynamic as data are trimmed away during tail-trimming. The GMMchi iterative tail pruning process so far allows for only a single tail at either the upper or lower end of the overall distribution.

□ BioBERT: Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04994-3

BioBERT, a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, with a new attention module to integrate the auto-processed syntactic information for the BioNER task.

BioBERT uses a new attention mechanism, named Combined Feature Attention (CFA). The embeddings of context features are derived from BioBERT and the embeddings of syntactic labels are randomly initialized in the CFA module.

□ Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1

Branchwater, a petabase-scale querying system that uses containment searches based on FracMinHash sketching to search all public metagenome data sets in the SRA in 24-36 hours on commodity hardware with 1-1000 query genomes.

Branchwater uses a scatter-gather approach based on a cluster-aware workflow engine. Branchwater uses the Rust library underlying the sourmash implementation of FracMinHash to execute massively parallel searches of a presketched digest of the SRA.
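To give a feel for containment search on FracMinHash sketches, here is a toy version in plain Python. It is not the sourmash implementation: BLAKE2b stands in for the hash function, and the names `frac_minhash`, `containment` and `scaled` as used here are assumptions for this example.

```python
import hashlib

MAX_HASH = 2 ** 64


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (BLAKE2b used as a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(sequence: str, k: int, scaled: int) -> set:
    """Keep only the hashes that fall in the lowest 1/scaled of the hash space."""
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(sequence[i:i + k]) for i in range(len(sequence) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query_sketch: set, target_sketch: set) -> float:
    """Fraction of the query sketch that is also present in the target sketch."""
    if not query_sketch:
        return 0.0
    return len(query_sketch & target_sketch) / len(query_sketch)


genome = "ACGTTGCATGCCGATAGCTAGCTAGGCTA"
query = genome[5:20]
# scaled=1 keeps every hash, so this toy case is exact.
score = containment(frac_minhash(query, 5, 1), frac_minhash(genome, 5, 1))
```

Because the same threshold is applied to every data set, sketches built independently stay comparable, which is what makes a presketched digest searchable.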

□ Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05003-3

A multi-objective optimisation technique known as Pareto multi-task learning (Pareto MTL) is used to obtain the Pareto front between conservation of biological variation and batch effect removal.

A new batch effect measure based on the Mutual Information Neural Estimator (MINE) is proposed. MINE leverages the expressiveness of deep neural networks to learn the mutual information (MI) between two variables, which in this case is the MI between the latent z and batch s.

The 4th heaven.

2022-10-31 22:11:12 | Science News

(Paintings by Andrei (@Riabovitchev))

□ IReNA: Integrated regulatory network analysis of single-cell transcriptomes and chromatin accessibility profiles

>> https://www.cell.com/iscience/fulltext/S2589-0042(22)01631-5

Network decoding in IReNA included network modularization, identification of enriched transcription factors, and a unique function for the construction of simplified regulatory networks among modules. Network modularization was based on K-means clustering of gene expression.

IReNA statistically analyzes modular regulatory networks and identifies reliable transcription factors including known regulators. IReNA could directly calculate correlations using original expression data independent of the pseudotime.

□ EvoAug: Evolution-inspired augmentations improve deep learning for regulatory genomics

>> https://www.biorxiv.org/content/10.1101/2022.11.03.515117v1

EvoAug, an open-source PyTorch package that provides a suite of evolution-inspired data augmentations. EvoAug’s evolution-based augmentations use the same labels as the original wildtype sequence. This provides a modeling bias to learn invariances of the (un)natural symmetries.

EvoAug randomly applies augmentations, individually or in combinations, online during training to each sequence in a minibatch of data. Each augmentation is applied stochastically and controlled by hyperparameters intrinsic to each augmentation.
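The idea of stochastically applied augmentations that leave the label untouched can be sketched without any deep learning framework. The code below is not the EvoAug API: the function names, rates and the choice to pad deletions with random bases are assumptions for illustration only.

```python
import random


def random_mutation(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base with a different one with probability `rate`."""
    bases = "ACGT"
    return "".join(
        rng.choice(bases.replace(b, "")) if rng.random() < rate else b
        for b in seq
    )


def random_deletion(seq: str, max_len: int, rng: random.Random) -> str:
    """Delete a random stretch, then pad with random bases to keep the length."""
    n = rng.randint(0, min(max_len, len(seq)))
    start = rng.randint(0, len(seq) - n)
    pad = "".join(rng.choice("ACGT") for _ in range(n))
    return seq[:start] + seq[start + n:] + pad


def augment(seq: str, augmentations, apply_prob: float, rng: random.Random) -> str:
    """Apply each augmentation independently with probability `apply_prob`."""
    for aug in augmentations:
        if rng.random() < apply_prob:
            seq = aug(seq, rng)
    return seq


augmentations = [
    lambda s, r: random_mutation(s, rate=0.05, rng=r),
    lambda s, r: random_deletion(s, max_len=3, rng=r),
]
wildtype, label = "ACGTACGTACGTACGTACGT", 1.0
# The augmented sequence keeps the wildtype label.
augmented = (augment(wildtype, augmentations, 0.5, random.Random(0)), label)
```

In a training loop this would be called once per sequence in each minibatch, so the model sees a different perturbed copy every epoch.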

□ ASCARIS: Positional Feature Annotation and Protein Structure-Based Representation of Single Amino Acid Variations

>> https://www.biorxiv.org/content/10.1101/2022.11.03.514934v1

ASCARIS, a method for the featurization (i.e., quantitative representation) of SAVs, which could be used for a variety of purposes, such as predicting their functional effects or building multi-omics-based integrative models.

ASCARIS incorporates the correspondence between the location of the SAV on the sequence and 30 different types of positional feature annotations. ASCARIS constructs a 74-dimensional feature set to represent each SAV in a dataset composed of ~100,000 data points.

□ Computads and string diagrams for n-sesquicategories

>> https://arxiv.org/pdf/2210.07704.pdf

An n-sesquicategory is an n-globular set with strictly associative and unital composition and whiskering operations, which are however not required to satisfy the Godement interchange laws which hold in n-categories.

The category of computads for this monad is equivalent to the category of presheaves on a small category of computadic cell shapes. Each of these trees has a unique canonical form in its equivalence class.

□ A logical analysis of fixpoint theorems

>> https://arxiv.org/pdf/2211.01782v1.pdf

A fixpoint theorem for Cauchy-complete Q-categories that holds for any quantale Q whose underlying complete lattice is continuous and for a specific notion of contraction.

The contractions determine Cauchy distributors under the appropriate algebraic condition on the quantale Q, and finally we formulate the resulting fixpoint theorem for Cauchy-complete Q-categories.

□ VeChat: correcting errors in long reads using variation graphs

>> https://www.nature.com/articles/s41467-022-34381-8

VeChat, a self-correction method to perform haplotype-aware error correction for long reads. VeChat distinguishes errors from haplotype-specific true variants based on variation graphs, which reflect a popular type of data structure for pangenome reference systems.

Unlike single consensus sequences, which current self-correction approaches are generally centering on, variation graphs are able to represent the genetic diversity across multiple, evolutionarily or environmentally coherent genomes.

□ DeepOM: Single-molecule optical genome mapping via deep learning

>> https://www.biorxiv.org/content/10.1101/2022.11.04.512597v1

DeepOM was compared against the state-of-the-art commercial Bionano Solve on human cell-line DNA data acquired with the Bionano Saphyr system. DeepOM enables higher genome coverage from a given sample, enhancing the ability to detect low frequency structural variations.

The DeepOM alignment of a DNA molecule to a reference genome sequence starts from query images of molecules fluorescently labeled at specific motifs. The localization neural network of DeepOM enables the separation of multiple fluorescent emitters that are within a diffraction-limited spot.

□ BATCH-SCAMPP: Scaling phylogenetic placement methods to place many sequences

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513936v1

BATCH-SCAMPP, a technique that improves scalability in both dimensions: the number of query sequences being placed into the backbone tree and the size of the backbone tree.

BSCAMPP can facilitate the initial tree decomposition of the divide-and-conquer tree estimation pipeline GTM for better placement of shorter, fragmentary sequences into an initial tree containing the longer full-length sequences, potentially leading to final tree estimation.

□ ICLUST: Solving Anscombe's Quartet using a Transfer Learning Approach

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511920v1.full.pdf

ICLUST identifies distinct clustering. All scatterplots in the dataset were plotted and clustered using correlation strength alone and 4096-component feature vectors. Average image in each cluster as determined by correlation strength clustering, corresponding to the dendrogram.

□ Refphase: Multi-sample reference phasing reveals haplotype-specific copy number heterogeneity

>> https://www.biorxiv.org/content/10.1101/2022.10.13.511885v1

Refphase, an algorithm that leverages this multi-sampling approach to infer haplotype-specific copy numbers through multi-sample reference phasing. Unlike statistical phasing, Refphase does not require reference haplotype panels or large collections of genotypes.

Refphase creates a minimum consistent segmentation across the single-sample segmentations input. Allele-specific copy numbers are re-estimated for each sample, and the most parsimonious phasing solution along each chromosome is then chosen in horizontal phasing optimization.

□ ifCNV: A novel isolation-forest-based package to detect copy-number variations from various targeted NGS datasets

>> https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(22)00252-9

ifCNV is a CNV detection tool based on read-depth distribution obtained from targeted NGS data. ifCNV combines artificial intelligence using two isolation forests and a comprehensive scoring method to faithfully detect CNVs among various samples.

ifCNV integrates a pre-processing step to create a read-depth matrix using as input the aligned bam / bed files. This reads matrix is composed of the samples as columns and the targets as rows. Next, it uses an IF machine learning algorithm to detect the samples w/ a strong bias.

□ streammd: fast low-memory duplicate marking using a Bloom filter

>> https://www.biorxiv.org/content/10.1101/2022.10.12.511997v1

streammd closely reproduces the outputs of Picard MarkDuplicates, a widely-used duplicate marking program, while being substantially faster and suitable for pipelined applications. It also requires much less memory than SAMBLASTER, another single-pass duplicate marking tool.

With a conventional hash structure the memory requirements of this approach may be considerable for large libraries: a 60x coverage human whole-genome BAM file contains around 1B templates, and the resulting hash structure occupies tens of GB.
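A Bloom filter sidesteps this by storing only a fixed-size bit array. The sketch below shows single-pass duplicate marking with such a filter; it is not streammd's code, and the key format, bit-array size and number of hash functions are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several salted hash functions."""

    def __init__(self, n_bits: int, n_hashes: int):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: str):
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, key: str) -> bool:
        """Insert key; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def mark_duplicates(template_keys, n_bits=1 << 20, n_hashes=4):
    """Flag each template whose key has already been seen in the stream."""
    bloom = BloomFilter(n_bits, n_hashes)
    return [bloom.add(key) for key in template_keys]


# Hypothetical keys built from the two ends of each template.
keys = ["chr1:100:+|chr1:350:-", "chr1:100:+|chr1:350:-", "chr2:42:+|chr2:300:-"]
flags = mark_duplicates(keys)
```

The trade-off is a small, tunable false-positive rate: a template can occasionally be marked as a duplicate by chance, but a true duplicate is never missed.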

□ scDEF: Deep exponential families for single-cell data analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.15.512383v1

scDEF consists of a deep exponential family model tailored to single-cell data in order to cluster cells using multiple levels of abstraction, which can be mapped to different gene signature levels.

By enforcing non-negativity, biasing towards sparsity and including hierarchical relationships among factors without using batch annotations, scDEF is a general tool for hierarchical gene signature identification in scRNA-seq data for both single- and multiple-batch scenarios.

scDEF models the gene expression heterogeneity of the cells of a tissue as a set of sparse factors containing gene signatures for different cell states. These factors are related to each other through higher-level factors that encode coarser relationships.

□ LotuS2: an ultrafast and highly accurate tool for amplicon sequencing analysis

>> https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01365-1

LotuS2 is designed to run with a single command, where the only essential flags are the path to input files (fastq(.gz), fna(.gz) format), output directory, and mapping file.

The sequence input is flexible, allowing simultaneous demultiplexing of read files and/or integration of already demultiplexed reads.

The primary output is a set of tab-delimited OTU/ASV count tables, the phylogeny of OTUs/ASVs, their taxonomic assignments, and corresponding abundance tables at different taxonomic levels.

□ Adaptive Sampling as tool for Nanopore direct RNA-sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.14.512223v1

Taking advantage of a simple model system composed of two defined in vitro transcripts, they determine essential parameters of direct RNA-seq adaptive sampling (DRAS).

□ Cosbin: cosine score-based iterative normalization of biologically diverse samples

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac076/6764617

A Cosine score-based iterative normalization (Cosbin) method that eliminates aDEGs, identifies ideal CEGs (iCEGs) and calculates sample-wise normalization factors by equilibrating expression levels of iCEGs.

Impactful aDEGs with higher scores are sequentially identified and removed, then interim normalization is performed by equilibrating expression levels for the remaining genes, and Cosbin iterates to the next round of aDEG identification and interim normalization.

Sequential elimination of impactful aDEGs should ease the asymmetry in differential expression, reduce normalization bias and improve the efficiency of identifying the next aDEG. Iterations continue until aDEG identification or interim normalization converges at a stable point.

□ MAGScoT - a fast, lightweight, and accurate bin-refinement tool

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac694/6764585

MAGScoT relies on two sets of microbial single-copy marker genes from the Genome Taxonomy Database Toolkit, 120 bacterial and 53 archaeal, stored as HMM-profiles for fast annotation of amino acid sequences predicted from the assembled contigs.

□ Taxonium, a web-based tool for exploring large phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2022.06.03.494608v4

Taxonium, a new tool that uses WebGL to allow the exploration of trees with tens of millions of nodes in the browser for the first time.

Taxonium links each node to associated metadata and supports mutation-annotated trees, which are able to capture all known genetic variation in a dataset. It can either be run entirely locally in the browser, from a server-based backend, or as a desktop application.

□ Census: accurate, automated, deep, fast, and hierarchical scRNA-seq cell-type annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512926v1

Census implements a collection of hierarchically organized gradient-boosted decision tree models that successively classify individual cells according to a predefined cell hierarchy.

Census begins by identifying a cell-type hierarchy from reference scRNA-seq data by hierarchically clustering pseudo-bulk cell-type gene expression data using Ward’s method, which splits each node into two child nodes.

□ Mora: abundance aware metagenomic read re-assignment for disentangling similar strains

>> https://www.biorxiv.org/content/10.1101/2022.10.18.512733v1

Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads.

Mora maximizes read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes.

□ DANCE: A Deep Learning Library and Benchmark for Single-Cell Analysis

>> https://www.biorxiv.org/content/10.1101/2022.10.19.512741v1

DANCE platform, the first standard, generic, and extensible benchmark platform for accessing and evaluating computational methods across the spectrum of benchmark datasets for numerous single-cell analysis tasks.

DANCE supports five models for this task. It includes scDeepsort as a GNN-based method. ACTINN and singleCellNet are representative deep learning methods. It also covers support vector machine (SVM) and Celltypist as traditional machine learning baselines.

□ PolyHaplotyper: haplotyping in polyploids based on bi-allelic marker dosage data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04989-0

A new method to reconstruct haplotypes from SNP dosages derived from genotyping arrays, which is applicable to polyploids. This method is implemented in the software package PolyHaplotyper.

PolyHaplotyper is restricted to relatively small haploblocks: in practice the maxima are 8 markers in tetraploids and 6 markers in hexaploids. In theory this still allows many different haplotypes to be distinguished, precisely 256 for 8 markers and 64 for 6 markers.
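The counts follow from each bi-allelic marker having two possible alleles, so n markers give 2^n haplotypes. A quick check in Python, with a made-up helper name and alleles written as "0" and "1":

```python
from itertools import product


def enumerate_haplotypes(n_markers: int):
    """All combinations of reference (0) and alternative (1) alleles."""
    return ["".join(alleles) for alleles in product("01", repeat=n_markers)]


n_for_8_markers = len(enumerate_haplotypes(8))
n_for_6_markers = len(enumerate_haplotypes(6))
```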

□ SUsPECT: A pipeline for variant effect prediction based on custom long-read transcriptomes for improved clinical variant annotation

>> https://www.biorxiv.org/content/10.1101/2022.10.23.513417v1

SUsPECT (Solving Unsolved Patient Exomes/gEnomes using Custom Transcriptomes), a pipeline based on the Ensembl Variant Effect Predictor (VEP) to predict variant impact on custom transcript sets, such as those generated by long-read RNA-sequencing, for downstream prioritization.

□ KBeagle: An Adaptive Strategy and Tool for Improvement of Imputation Accuracy and Computing Efficiency

>> https://www.biorxiv.org/content/10.1101/2022.10.22.513369v1

Genotype imputation was performed using marker information from the linkage disequilibrium (LD) fragment. The estimated accuracy of fragments between individuals with known and unknown genotypes is the key factor in imputation ability.

KBeagle uses the K-Means algorithm to calculate the genetic distance of samples with missing genotypes, classifying the samples with close genetic distances into one clustered group, and then uses Beagle to estimate the missing genotypes of samples in each clustered group.

□ RFR: Improving fine-mapping by modeling infinitesimal effects

>> https://www.biorxiv.org/content/10.1101/2022.10.21.513123v1

The Replication Failure Rate (RFR) – a metric that assesses the stability of posterior inclusion probability by evaluating the consistency of PIPs in random subsamples of individuals from a larger well-powered cohort – in this instance for 10 quantitative traits in the UK Biobank.

The RFR was found to be higher than expected across traits for several Bayesian fine-mapping methods. Moreover, variants that failed to replicate at the higher sample size were less likely to be coding.

□ NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

>> https://www.biorxiv.org/content/10.1101/2022.10.24.513552v1

The NDEx Integrated Query (IQuery) combines novel sources of pathways, integration with Cytoscape, and the ability to store and share analysis results. The IQuery web application performs multiple gene set analyses based on diverse pathways and networks stored in NDEx.

The cosine similarity calculation uses values derived from each gene's term frequency-inverse document frequency (TF-IDF) in the query set and the network. IQuery uses the INDRA system to assemble the output of multiple automated literature mining systems.

□ Genome ARTIST_v2-An Autonomous Bioinformatics Tool for Annotation of Natural Transposons in Sequenced Genomes

>> https://www.mdpi.com/1422-0067/23/20/12686

The new functions of GA_v2 qualify it as a tool for the mapping and annotation of natural transposons (NTs) in long reads, contigs and assembled genomes.

The newly implemented functions allow users to retrieve subsequences from specific reference coordinates without a prior alignment with a query sequence, and to extract a list of target site duplications (TSDs) or of flanking sequences following the alignment of a set of transposon-genome junction query (JQ) sequences versus reference sequences.

□ uORF4u: a tool for annotation of conserved upstream open reading frames

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514069v1

uORF4u, a tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologues. It can also be used to find small ORFs within a set of nucleotide sequences.

If the input is a single RefSeq protein accession number, uORF4u performs a BlastP search against the online version of the RefSeq protein database.

For identified potential frames, the tool searches for conserved ORFs using a greedy algorithm: uORF4u iterates through sequences and tries to maximise the sum of pairwise alignment scores between uORFs.

□ ConsensuSV-from the whole genome sequencing data to the complete variant list

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac709/6782956

The ConsensuSV-core algorithm uses the calls from the individual SV identification algorithms. ConsensuSV starts by preprocessing all the individual VCF files to establish a unified format for further processing.

Every SV is loaded into memory and iterated over to find the list of closest ones in terms of their starting position, ending position and type. If the minimum requirement of the number of overlapping candidates is reached, the tool continues processing the list of variants.
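The candidate-gathering step might look roughly like the sketch below. It is an assumption-laden illustration rather than the ConsensuSV code: the `SV` record, the 100 bp tolerance and the minimum of two supporting callers are invented for the example.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str
    caller: str


def is_close(a: SV, b: SV, max_dist: int) -> bool:
    """Same chromosome and type, with both breakpoints within max_dist."""
    return (a.chrom == b.chrom and a.sv_type == b.sv_type
            and abs(a.start - b.start) <= max_dist
            and abs(a.end - b.end) <= max_dist)


def consensus_candidates(calls, max_dist=100, min_support=2):
    """Group nearby calls and keep groups backed by enough distinct callers."""
    remaining = sorted(calls, key=lambda sv: (sv.chrom, sv.start))
    groups = []
    while remaining:
        anchor = remaining.pop(0)
        close, rest = [], []
        for sv in remaining:
            (close if is_close(anchor, sv, max_dist) else rest).append(sv)
        remaining = rest
        group = [anchor] + close
        if len({sv.caller for sv in group}) >= min_support:
            groups.append(group)
    return groups


# Hypothetical calls from three callers.
calls = [
    SV("chr1", 1000, 2000, "DEL", "callerA"),
    SV("chr1", 1010, 1990, "DEL", "callerB"),
    SV("chr1", 1005, 2005, "DUP", "callerC"),
    SV("chr2", 500, 900, "INV", "callerA"),
]
groups = consensus_candidates(calls)
```

Only the deletion on chr1 is reported by two callers here, so it is the only group passed on for further processing.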

□ T1K: efficient and accurate KIR and HLA genotyping with next-generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.10.26.513955v1

□ Comparing 10x Genomics single-cell 3' and 5' assay in short-and long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2022.10.27.514084v1

Although the barcode detection, cell-type identification, and gene expression profile are similar in both assays, the 5’ assay captured more exonic molecules and fewer intronic molecules compared to the 3’ assay.

13.7% of genes sequenced have longer average read lengths and are more complete (spanning both polyA-site and TSS) in the long reads from the 5’ assay compared to the 3’ assay.

These genes are characterized by long average transcript length, high intron number, and low expression overall. Despite these differences, cell-type-specific isoform profiles observed from the two assays remain highly correlated.

□ Genetic determinism, essentialism and reductionism: semantic clarity for contested science

>> https://www.nature.com/articles/s41576-022-00537-x

□ ParseCNV2: efficient sequencing tool for copy number variation genome-wide association studies

>> https://www.nature.com/articles/s41431-022-01222-7

ParseCNV2 is a next-generation approach to CNV association that natively supports the popular VCF specification for sequencing-derived variants as well as SNP array calls in PennCNV format.

ParseCNV2 presents a critical addition to formalizing CNV association for inclusion with SNP associations in GWAS Catalog. Clinical CNV prioritization, interactive quality control (QC), and adjustment for covariates are revolutionary new features of ParseCNV2 vs. ParseCNV.

□ RabbitFX: Efficient Framework for FASTA/Q File Parsing on Modern Multi-Core Platforms

>> https://ieeexplore.ieee.org/document/9937043/

RabbitFX can efficiently read FASTA and FASTQ files by combining a lightweight parsing method with an optimized formatting implementation.

RabbitFX has been integrated into three I/O-intensive applications: fastp, Ktrim, and Mash. Compared to FQFeeder, RabbitFX is 2 times faster with 20 threads in the task of counting A, T, C and G bases in paired-end data.
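For reference, the base-counting benchmark task itself is simple to state. The sketch below is plain single-threaded Python and does not use RabbitFX's C++ API; the function name is hypothetical, and it assumes well-formed four-line FASTQ records.

```python
from collections import Counter
from itertools import islice


def count_bases(fastq_handles):
    """Count A, T, C and G across one or more open FASTQ files
    (for paired-end data, pass both the R1 and the R2 handle)."""
    counts = Counter()
    for handle in fastq_handles:
        # The sequence is the 2nd line of every 4-line FASTQ record.
        for seq_line in islice(handle, 1, None, 4):
            counts.update(seq_line.strip().upper())
    return {base: counts[base] for base in "ATCG"}
```

The work per record is tiny, so the cost is dominated by reading and splitting the file, which is why a task like this is a natural benchmark for a parsing framework.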

□ Venus: An efficient virus infection detection and fusion site discovery method using single-cell and bulk RNA-seq data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010636

Venus consists of two main modules: virus detection and integration site discovery. The recommended guideline is to always run the virus detection module, but to run the integration module only if the virus species is able to integrate its genomic information into the host.

Venus mapped to the integrSeq sequence. Venus classified its chimeric fusion transcripts by biological significance. Venus also ensured that each chimeric read had a clear junction breakpoint, with no gaps or overlaps between the two portions, which is characteristic of true integration sites.
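The "no gaps or overlaps" condition can be expressed as a small check on read coordinates. This is an illustrative sketch, not Venus's code: the function name and the half-open `(start, end)` span convention are assumptions.

```python
def junction_breakpoint(host_span, virus_span):
    """Return the breakpoint position within the read if the host-aligned and
    virus-aligned portions abut exactly, otherwise None.

    Spans are half-open (start, end) intervals in read coordinates.
    """
    first, second = sorted([host_span, virus_span])
    # A clean junction means the first portion ends where the second begins.
    return first[1] if first[1] == second[0] else None
```

For example, spans `(0, 60)` and `(60, 100)` give a breakpoint at 60, whereas `(0, 60)` with `(65, 100)` (a gap) or `(55, 100)` (an overlap) would be rejected.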

□ Sashimi.py: a flexible toolkit for combinatorial analysis of genomic data

>> https://www.biorxiv.org/content/10.1101/2022.11.02.514803v1

Sashimi.py can be used in several ways: users can generate the desired plots through an application programming interface (API) from a script or Jupyter Notebook, or through a command-line interface (CLI).

Sashimi.py is a platform to visually interpret genomic data from a large variety of data sources, including scRNA-seq, DNA/RNA interactions, long-read sequencing data, and Hi-C data, without any preprocessing. It also offers a broad degree of flexibility in output file formats.

□ TreeTerminus - Creating transcript trees using inferential replicate counts

>> https://www.biorxiv.org/content/10.1101/2022.11.01.514769v1

TreeTerminus is a data-driven approach for grouping transcripts into a tree structure where leaves represent individual transcripts and internal nodes represent an aggregation of a transcript set.

TreeTerminus constructs trees such that, on average, the inferential uncertainty decreases as we ascend the tree topology. TreeTerminus provides a dynamic programming approach that can be used to find a cut through the tree that optimizes one of several different objectives.
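A dynamic program of this general shape can be sketched as follows. It is not TreeTerminus's implementation: the `tree` and `cost` dictionaries and the additive per-node cost are hypothetical stand-ins for whichever objective is being optimized.

```python
def best_cut(tree, cost, node):
    """Return (total_cost, nodes) for the minimum-cost cut of the subtree
    rooted at `node`. A cut selects nodes so that every leaf is covered by
    exactly one selected node (itself or one of its ancestors).

    tree: dict mapping a node to its list of children (leaves are absent)
    cost: dict mapping every node to its cost under the chosen objective
    """
    children = tree.get(node, [])
    if not children:
        return cost[node], [node]
    child_total, child_nodes = 0, []
    for child in children:
        sub_cost, sub_nodes = best_cut(tree, cost, child)
        child_total += sub_cost
        child_nodes.extend(sub_nodes)
    # Either keep this node as one aggregated group, or cut below it.
    if cost[node] <= child_total:
        return cost[node], [node]
    return child_total, child_nodes
```

Each node is visited once, so the cut is found in time linear in the size of the tree.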

□ Proton transfer during DNA strand separation as a source of mutagenic guanine-cytosine tautomers

>> https://www.nature.com/articles/s42004-022-00760-x

□ Entropy: A visual representation of Entropy increasing on the blockchain. “Absolute Zero”

>> https://opensea.io/collection/entropy-by-nahiko

In Orbit.

2022-10-31 22:10:10 | Music20

□ Thomas Bergersen - In Orbit (feat. Cinda M.)

A Trip to Infinity.

2022-10-31 22:09:08 | Movies

□ 『A Trip to Infinity』(Netflix)

>> https://www.netflix.com/jp/title/81273453

“Eminent mathematicians, particle physicists and cosmologists dive into infinity and its mind-bending implications for the universe.”

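Eminent mathematicians and theoretical physicists each discuss the concept and definition of "infinity" from the perspective of their own fields. It is also moving to see an appearance by Steven Strogatz, the mathematician who changed my view of the world forever when I was in junior high school. - ”That the universe itself gets to have its window of life.”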

Lamb.

2022-10-31 18:06:06 | Movies

□ 『Lamb』

>> https://a24films.com/films/lamb

Director: Valdimar Jóhannsson
A24 films

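A supernatural thriller in which Iceland's midnight sun is beautifully captured. What looks surreal at first glance is merely a matter of how the symbols are arranged; the lamb, a "singularity" of origin and traits, is depicted purely as "a child with a heart". Only the laws of nature and the world leave her behind.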


lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre.