Blog posts for August 8, 2021 - lens, align.

Stella Regia.

2021-08-08 20:08:08 | Science News

□ HAL-x: Scalable Clustering with Supervised Linkage Methods

>> https://www.biorxiv.org/content/10.1101/2021.08.01.454697v1.full.pdf

HAL-x is a novel hierarchical density clustering algorithm that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. HAL-x is designed to cluster datasets with up to 100 million points embedded in a 50+ dimensional space.

HAL-x can ensure that the predictive power is limited by the reproducibility of our clustering assignments and not by the choice of classifier. HAL-x defines an extended density neighborhood for each pure cluster, identifying spurious clusters that are representative of the same density maxima.

□ dynDeepDRIM: a dynamic deep learning model to infer direct regulatory interactions using single cell time-course gene expression data

>> https://www.biorxiv.org/content/10.1101/2021.08.28.458048v1.full.pdf

dynDeepDRIM integrates the primary image and the neighbor images over the time course into a four-dimensional tensor and trains a convolutional neural network to predict the direct regulatory interactions between TFs and genes.

The dynDeepDRIM structure consists of T subcomponents and 3 fully connected layers that produce the prediction values using a sigmoid function. The embeddings are transformed into another condensed embedding with 512 dimensions, which is used to integrate with the results for the other time points.

□ β-VAE: Out-of-distribution prediction with disentangled representations for single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458535v1.full.pdf

In disentanglement learning, a single latent dimension is linked to a single generative feature, while being relatively invariant to changes in other features.

β-VAE is a fully unsupervised model for disentanglement learning. The deviation of the KL divergence loss from C is penalized by β. β-VAE outperforms dHSIC in both disentanglement learning and OOD prediction.
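The penalty described above can be written as a small function. This is a minimal sketch, assuming the capacity-controlled form (reconstruction loss plus β times the absolute deviation of the KL term from C); the function name and the scalar inputs are illustrative and not taken from the paper's code.

```python
def beta_vae_loss(recon_loss, kl, beta, capacity):
    """Reconstruction loss plus beta times the deviation of KL from the target C."""
    return recon_loss + beta * abs(kl - capacity)


# Example: a KL term of 7.0 against a capacity C of 5.0, with beta = 4
loss = beta_vae_loss(recon_loss=10.0, kl=7.0, beta=4.0, capacity=5.0)
print(loss)  # 18.0
```

With a larger β, the same deviation from C costs more, which pushes the latent code toward the target capacity.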

□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://www.biorxiv.org/content/10.1101/2021.09.01.457579v1.full.pdf

BWA-MEME performs exact match search with O(1) memory accesses leveraging the learned index. BWA-MEME is based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.

BWA-MEME achieves up to 3.45x speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x, memory accesses by 8.77x, and LLC misses by 2.21x, while ensuring the identical SAM output to BWA-MEM2.

BWA-MEME uses a partially-3-layer recursive model index (P-RMI) which adapts well to the imbalanced distribution of suffixes and provides accurate prediction, and an algorithm that encodes the input substring or suffixes into a numerical key.
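To illustrate the general idea of a learned index (predict a position from a numerical key, then correct it with a bounded local search), here is a toy sketch. It assumes a single linear model over sorted integer keys, which is far simpler than the P-RMI described above; `fit_linear_index` and `lookup` are hypothetical names, not part of BWA-MEME.

```python
import bisect


def fit_linear_index(keys):
    """Fit position ~ slope * key + intercept on sorted keys and record the worst error."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(max_err) + 1


def lookup(keys, model, key):
    """Predict a position, then search only inside the error window around it."""
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = max(lo, min(len(keys), guess + err + 1))
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1


keys = [3, 8, 15, 16, 23, 42, 108, 1000]
model = fit_linear_index(keys)
print(lookup(keys, model, 42))  # 5
print(lookup(keys, model, 9))   # -1
```

The more accurate the model, the smaller the error window, which is why a model that adapts to an imbalanced key distribution matters.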

□ AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02469-x

AMULET (ATAC-seq MULtiplet Estimation Tool) enumerates regions with greater than two uniquely aligned reads across the genome to effectively detect multiplets. AMULET can detect multiplets with a runtime that scales near linearly with the number of cells/valid reads.
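The counting step described above can be sketched with a simple sweep over read intervals. This is only an illustration of "regions with more than two overlapping reads" for one cell; it assumes half-open (start, end) intervals on a single chromosome and leaves out the statistical testing a real tool would need. The function name is hypothetical.

```python
def regions_over_two(reads):
    """Return (start, end) regions covered by more than two reads.

    `reads` is a list of half-open (start, end) intervals for one cell.
    """
    events = []
    for start, end in reads:
        events.append((start, 1))
        events.append((end, -1))
    # Sort by position; at ties, read ends (-1) are processed before starts (+1)
    events.sort()
    regions = []
    depth = 0
    region_start = None
    for pos, delta in events:
        depth += delta
        if depth > 2 and region_start is None:
            region_start = pos
        elif depth <= 2 and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    return regions


reads = [(0, 100), (50, 150), (80, 120), (300, 400)]
print(regions_over_two(reads))  # [(80, 100)]
```

Because the sweep is dominated by sorting the read boundaries, the cost grows close to linearly with the number of reads.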

AMULET detected multiplets with high precision (assessed by sample multiplexing) and high recall (assessed by simulated multiplets), especially when samples are sequenced to a certain read depth, serving as an effective alternative to simulation-based ArchR.

□ GAMMA: a tool for the rapid identification, classification, and annotation of translated gene matches from sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab607/6355578

GAMMA is a command line tool that finds gene matches in microbial genomic data using protein coding (rather than nucleotide) identity, and then translates and annotates the match by providing the type (i.e., mutant, truncation, etc.) and a translated description.

GAMMA uses protein sequence similarity as the initial filter for determining calls. Different calls occurred only when there were ambiguous, inexact matches at the protein level, which GAMMA resolves by using nucleotide similarity and then the least number of transversions.

□ STGATE: Deciphering spatial domains from spatially resolved transcriptomics with adaptive graph attention auto-encoder

>> https://www.biorxiv.org/content/10.1101/2021.08.21.457240v1.full.pdf

STGATE first constructs a spatial neighbor network (SNN) based on a pre-defined radius, and another optional one by pruning it according to the pre-clustering of gene expressions to better characterize the spatial similarity at the boundary of spatial domains.

STGATE learns low-dimensional latent representations with both spatial information and GE via a graph attention auto-encoder. The input of auto-encoder is the normalized expression matrix, and the graph attention layer is adopted in the middle of the encoder and decoder.

□ Supermeasured: Violating Statistical Independence without violating statistical independence

>> https://arxiv.org/pdf/2108.07292.pdf

Violations of Statistical Independence are commonly interpreted as correlations between the measurement settings and the hidden variables (which determine the measurement outcomes). Such correlations have been discarded as “finetuning” or a “conspiracy”.

The problem with the common interpretation is that Statistical Independence might be violated because of a non-trivial measure in state space, a possibility called “supermeasured”.

“supermeasured” is not under the control of the experimenter. ρBell contains information both about the intrinsic properties of the space and the distribution over the space. Interpretations of Bell’s theorem run afoul of physics whenever one is dealing with a theory with μ(λ,X) ≠ μ0.

□ AENET: Interfaces for accurate and efficient molecular dynamics simulations with machine learning potentials

>> https://aip.scitation.org/doi/10.1063/5.0063880

ænet enables accurate simulations of large and complex systems with low computational cost that scales linearly with the number of atoms.

ænet achieves excellent parallel efficiency on highly parallel distributed-memory systems and benefits from the highly optimized neighbor list. ænet makes it possible to simulate atomic structures with millions of atoms with an accuracy close to first-principles calculations.

□ SiGMoiD: A super-statistical generative model for binary data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009275

Super-statistical Generative Model for binary Data (SiGMoiD) is a maximum entropy-based framework where we imagine the data as arising from a super-statistical system.

SiGMoiD characterizes each binary variable using a K dimensional vector of features. SiGMoiD is significantly faster than typical max ent models, allowing us to analyze very high dimensional data sets (over 1000 dimensions) that remain well out of the reach of current max ent.

□ Regulus: a transcriptional regulatory networks inference tool based on Semantic Web technologies

>> https://www.biorxiv.org/content/10.1101/2021.08.02.454721v1.full.pdf

Regulus has been developed to be stringent and to limit the space of candidate TF-gene relations, highlighting the candidate relations that are the most likely to occur.

Regulus uses the system dynamics to decipher the inhibition and activation roles of regulators. Regulus relies on a principle of consistency between the genomic landscape and gene and TF expression to decide whether a relation is likely to exist.

□ MegaLMM: Mega-scale linear mixed models for genomic predictions with thousands of traits

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02416-w

MegaLMM (linear mixed models for millions of observations) is a novel statistical method for fitting massive-scale MvLMMs. MegaLMM dramatically improves upon existing methods that fit low-rank MvLMMs, allowing multiple random effects with large amounts of missing data.

MegaLMM decomposes a typical MvLMM into a two-level hierarchical model. MegaLMM is inherently a linear model and cannot effectively model trait relationships that are non-linear. MegaLMM estimates genetic values for all traits (both observed and missing) in a single step.

□ METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04284-4

METAMVGL applies the auto-weighted multi-view graph-based algorithm to optimize the weights of the two graphs and predict binning groups for the unlabeled contigs.

METAMVGL learns the two graphs’ weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph.

□ uLTRA: Accurate spliced alignment of long RNA sequencing reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab540/6327681

uLTRA is an alignment method for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides.

uLTRA uses minimap2’s primary alignments for reads aligned outside the regions indexed by uLTRA and chooses the best alignment of the two aligners for reads aligned in gene regions.

uLTRA uses a novel two-pass collinear chaining algorithm. In the first pass, uLTRA uses maximal exact matches (MEMs) between reads and the transcriptome as seeds. uLTRA solves the chaining instances by highest upper bound on coverage.

□ A divide and conquer metacell algorithm for scalable scRNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2021.08.08.453314v1.full.pdf

Metacell-2 is a recursive divide and conquer algorithm allowing efficient decomposition of scRNA-seq datasets of any size into small and cohesive groups of cells denoted as metacells.

Metacell-2 uses a new graph partition score to avoid time-consuming resampling and directly control metacell sizes, implements a new adaptive outlier detection module, and employs a rare-gene-module detector ensuring high sensitivity for detecting transcriptional states.

□ SIRV: Spatial inference of RNA velocity at the single-cell resolution

>> https://www.biorxiv.org/content/10.1101/2021.07.26.453774v1.full.pdf

The SIRV (Spatially Inferred RNA Velocity) algorithm consists of four major parts: (i) integration of the spatial transcriptomics and scRNA-seq datasets, (ii) predictions of un/spliced expressions, (iii) label/metadata transfer (optional), and (iv) estimation of RNA velocities within the spatial context.

SIRV calculates RNA velocity vectors for each cell that are then projected onto the two-dimensional spatial coordinates, which are then used to derive flow fields by averaging dynamics of spatially neighboring cells.
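The last step described above, deriving a flow field by averaging the dynamics of spatially neighboring cells, can be sketched as follows. This is a minimal illustration that assumes 2D coordinates, already-projected 2D velocity vectors, and a fixed neighborhood radius; the function name and the radius-based neighborhood are assumptions, not SIRV's actual implementation.

```python
import math


def smooth_velocities(coords, velocities, radius):
    """Average each cell's 2D velocity with those of cells within `radius`."""
    smoothed = []
    for (x, y) in coords:
        sum_vx = sum_vy = 0.0
        count = 0
        for (nx, ny), (vx, vy) in zip(coords, velocities):
            if math.hypot(nx - x, ny - y) <= radius:
                sum_vx += vx
                sum_vy += vy
                count += 1
        # count is at least 1 because every cell is its own neighbour
        smoothed.append((sum_vx / count, sum_vy / count))
    return smoothed


coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
velocities = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
print(smooth_velocities(coords, velocities, radius=1.5))
# [(0.5, 0.5), (0.5, 0.5), (2.0, 2.0)]
```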

□ TraSig: Inferring cell-cell interactions from pseudotime ordering of scRNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2021.07.28.454054v1.full.pdf

TraSig (Trajectory-based Signalling genes inference) identifies interacting cell types pairs and significant ligand-receptors based on the expression of genes as well as the pseudo-time ordering of cells.

TraSig uses a continuous-state Hidden Markov model (CSHMM). It learns a generative model on the expression data using transition states and emission probabilities, assumes a tree structure for the trajectory, and assigns cells to specific locations on its edges.

□ Bi-Directional PBWT: Efficient Haplotype Block Matching

>> https://drops.dagstuhl.de/opus/volltexte/2021/14372/pdf/LIPIcs-WABI-2021-19.pdf

Bi-directional PBWT finds blocks of matches around each variant site and the changes of matching blocks using forward and reverse PBWT at each variant site at the same time.

The time complexity of the algorithms to find matching blocks using bi-PBWT is linear in the size of the input. It provides an efficient solution that can tolerate genotyping errors. The divergence values in the forward PBWT can be updated using the block information in the reverse PBWT.

□ DeepNano-coral: Nanopore Base Calling on the Edge

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab528/6329259

DeepNano-coral is a new base caller for nanopore sequencing, optimized to run on the Coral Edge Tensor Processing Unit, a small USB-attached hardware accelerator.

It uses a new design of the residual block, which is a fundamental building block of the QuartzNet speech recognition architecture and was also deployed for base calling in Bonito.

DeepNano-coral provides real-time base calling that is energy efficient. The k-blueprint-separable convolution factorizes the convolution into the two parts differently, in effect reducing the depthwise operation at the cost of increasing computation in the pointwise operation.

□ New strategies to improve minimap2 alignment accuracy

>> https://arxiv.org/pdf/2108.03515.pdf

A new heuristic adds additional minimizers. If |x1 − x2| ≥ 500, minimap2 v2.22 selects ⌊|x1 − x2|/500⌋ minimizers of the lowest occurrence among minimizers between x1 and x2. It uses a binary heap data structure to select the minimizers of the lowest occurrence in this interval.
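The selection rule above can be sketched in a few lines. This is a minimal illustration, assuming the candidate minimizers between x1 and x2 are given as (position, occurrence) pairs; the function name and data layout are hypothetical and do not mirror minimap2's C code.

```python
import heapq


def extra_minimizers(x1, x2, between, gap=500):
    """Pick floor(|x1 - x2| / gap) lowest-occurrence minimizers between two positions.

    `between` is a list of (position, occurrence) pairs for minimizers
    located between x1 and x2.
    """
    span = abs(x1 - x2)
    if span < gap:
        return []
    k = span // gap
    # heapq provides the binary heap; nsmallest returns the k entries
    # with the lowest occurrence
    chosen = heapq.nsmallest(k, between, key=lambda m: m[1])
    return sorted(chosen)  # report in positional order


between = [(100, 50), (300, 3), (700, 9), (900, 1)]
print(extra_minimizers(0, 1200, between))  # [(300, 3), (900, 1)]
print(extra_minimizers(0, 400, between))   # []
```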

To see whether minimap2 v2.22 could improve long INDEL alignment, dipcall was run on contig-to-reference alignments, focusing on INDELs longer than 1kb (real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its advantage in more contiguous alignment.

□ Co-evolutionary Distance Predictions Contain Flexibility Information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab562/6349220

The predicted distance distribution of each residue pair was analysed for local maxima of probability indicating the most likely distance or distances between a pair of residues.

Rigid residue pairs tended to have only a single local maximum in their predicted distance distributions while flexible residue pairs more often had multiple local maxima.
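Counting local maxima in a binned distance distribution is straightforward. The sketch below assumes the predicted distribution for one residue pair is a list of per-bin probabilities; the function name and the optional `min_prob` threshold are illustrative additions, not the paper's procedure.

```python
def local_maxima(probs, min_prob=0.0):
    """Return indices of bins that are strictly higher than both neighbours."""
    peaks = []
    for i in range(1, len(probs) - 1):
        is_peak = probs[i] > probs[i - 1] and probs[i] > probs[i + 1]
        if is_peak and probs[i] >= min_prob:
            peaks.append(i)
    return peaks


rigid_pair = [0.0, 0.1, 0.7, 0.15, 0.05]
flexible_pair = [0.05, 0.4, 0.1, 0.05, 0.3, 0.1]
print(local_maxima(rigid_pair))     # [2]
print(local_maxima(flexible_pair))  # [1, 4]
```

A single peak corresponds to one preferred distance, while two peaks suggest the pair alternates between two distances.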

□ Learning Invariant Representations using Inverse Contrastive Loss

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8366266/

If the extraneous variable is binary, then optimizing ICL is equivalent to optimizing a regularized Maximum Mean Discrepancy divergence. The formulation of ICL can be decomposed into a sum of convex functions of the given distance metric.

Models obtained by optimizing ICL achieve significantly better invariance to the extraneous variable for a fixed desired level of accuracy. ICL is applicable to learning invariant representations for both continuous and discrete extraneous variables.

□ A scalable algorithm for clonal reconstruction from sparse time course genomic sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.08.19.457037v1.full.pdf

A novel scalable algorithm for clonal reconstruction from sparse time course data containing hundreds of novel mutations occurring at each sampled time point.

It employs a statistical method to estimate the sampling variance of VAFs derived from low coverage sequencing data and incorporates it into the maximum likelihood framework for clonal reconstruction.

□ MultiVI: deep generative model for the integration of multi-modal data

>> https://www.biorxiv.org/content/10.1101/2021.08.20.457057v1.full.pdf

MultiVI is a deep generative probabilistic framework that leverages deep neural networks to jointly analyze scRNA, scATAC and multiomic (scRNA + scATAC) data.

MultiVI creates an informative low-dimensional latent space that reflects both chromatin and transcriptional properties even when one of the modalities is missing. MultiVI provides a batch-corrected view of the high-dimensional data, along with quantification of uncertainty.

□ MultiK: an automated tool to determine optimal cluster numbers in single-cell RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02445-5

There exist different levels of cluster resolution (i.e., multi-resolution) that are biologically relevant in the data: some clusters are more distinct (e.g., cell types), and others are less distinct but still different (such as related subtypes within a common cell type).

MultiK presents multiple diagnostic plots to assist in the determination of meaningful Ks in the data and makes objective optimal K suggestions, which encompasses both high- and low-resolution parameters.

MultiK aggregates all the clustering runs that give rise to the same K groups regardless of the resolution parameter and computes a consensus matrix. To determine several multi-scale optimal K candidates, MultiK applies a convex hull approach.
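A consensus matrix of the kind mentioned above records how often each pair of cells lands in the same cluster across runs. The sketch below is a simplified illustration that assumes every run labels all cells in the same order; any subsampling between runs is not modeled, and the function name is hypothetical.

```python
def consensus_matrix(runs):
    """Fraction of runs in which each pair of cells shares a cluster label.

    `runs` is a list of label lists, one per clustering run, all covering
    the same cells in the same order.
    """
    n = len(runs[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[value / len(runs) for value in row] for row in matrix]


runs = [[0, 0, 1, 1], [0, 0, 0, 1]]
m = consensus_matrix(runs)
print(m[0][1], m[1][2], m[0][3])  # 1.0 0.5 0.0
```

Entries close to 0 or 1 indicate stable pairings, while intermediate values flag cells whose assignment depends on the run.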

MultiK first constructs a dendrogram of the cluster centroids using hierarchical clustering and then runs SigClust on each pair of terminal clusters to determine classes and subclasses.

□ Hierarchical Bayesian models of transcriptional and translational regulation processes with delays

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab618/6358716

Inferring the variability of parameters that determine gene dynamics is complicated by the fact that the effects of many reactions are not observable directly. Unobserved reactions can be replaced with time delays to reduce model dimensionality and simplify inference.

The paper presents a non-Markovian, hierarchical Bayesian inference framework for quantifying the variability of cellular processes within and across cells in a population. This hierarchical framework is robust and leads to improved estimates compared to its non-hierarchical counterpart.

□ scProject: Identifying Gene-wise Differences in Latent Space Projections Across Cell Types and Species in Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457650v1.full.pdf

scProject with projectionDrivers is a new framework to quantitatively examine latent space usage across single-cell experimental systems while concurrently extracting the genes driving the differential usage of the latent space across the defined testing parameters.

scProject uses unconstrained elastic net regression, allowing for the use of latent spaces containing negative weights. The elastic net regression in scProject encourages sparsity, a known feature of single-cell data, while also handling the potential for collinearity.

□ Tensor-decomposition--based unsupervised feature extraction in single-cell multiomics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457731v1.full.pdf

Singular value decomposition (SVD) was applied to individual omics profiles such that individual omics profiles have common L singular value vectors.

Then, K omics profiles are formatted as an L × M × K dimensional tensor, where M is the number of single cells. Then, higher-order singular value decomposition (HOSVD), which is a type of TD, is applied to the tensor.

UMAP applied to singular value vectors attributed to single cells by HOSVD successfully generated a two-dimensional embedding, coincident with the known classification of single cells.

□ ION: Inferring causality in biological oscillators

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab623/6360457

Conventional methods manipulate one or more components experimentally to investigate the effect on others in the system. However, these are time-consuming and costly, particularly as the number of components increases.

ION infers regulations within various network structures such as a cycle, multiple cycles, and a cycle with outputs from in silico oscillatory time-series data. ION predicts hidden regulations for the pS2 promoter after estradiol treatment, guiding experimental investigation.

□ NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

>> https://www.biorxiv.org/content/10.1101/2021.08.30.458194v1.full.pdf

NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format.

NetRAX uses a greedy hill climbing approach to search for network topologies. It deploys an outer search loop to iterate over different move types and an inner search loop to search for the best-scoring network using a specific move type.
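The nested search loop described above can be sketched generically. In this toy version the "network" is just an integer and each move type is a function that proposes neighbouring states; all names are hypothetical, and the sketch shows only the outer/inner loop pattern, not NetRAX's actual moves or likelihood scoring.

```python
def greedy_hill_climb(start, move_types, score):
    """Greedy search: the outer loop iterates over move types, the inner loop
    keeps applying the best candidate of one move type until it stops improving."""
    best, best_score = start, score(start)
    improved = True
    while improved:  # repeat the pass over move types until nothing improves
        improved = False
        for neighbours in move_types:  # outer loop over move types
            while True:  # inner loop for a single move type
                candidates = neighbours(best)
                if not candidates:
                    break
                top = max(candidates, key=score)
                if score(top) <= best_score:
                    break
                best, best_score = top, score(top)
                improved = True
    return best, best_score


# Toy problem: find the integer closest to 37 using coarse and fine moves
coarse = lambda x: [x + 10, x - 10]
fine = lambda x: [x + 1, x - 1]
print(greedy_hill_climb(0, [coarse, fine], lambda x: -(x - 37) ** 2))  # (37, 0)
```

As with any greedy search, the result is a local optimum with respect to the available move types.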

□ Ultrafast homomorphic encryption models enable secure outsourcing of genotype imputation

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(21)00288-X

Homomorphic Encryption -based imputation methods enable a general modular approach. The first step is imputation model building, where imputation models are trained using the reference genotype panel w/ a set of tag variants to impute the genotypes for a set of target variants.

The second step is the secure imputation step, where the encrypted tag variant genotypes are used to predict the target genotypes by using the imputation models. Imputation model evaluation using the encrypted tag variant genotypes, is where the HE-based methods are deployed.

□ RcppML NMF: Fast and robust non-negative matrix factorization for single-cell experiments

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458620v1.full.pdf

RcppML NMF is an accessible NMF implementation that is much faster than PCA and rivals the runtimes of state-of-the-art Singular Value Decomposition (SVD).

RcppML NMF uses random initialization. NMF models learned with this implementation from raw count matrices yield intuitive summaries of complex biological processes, capturing coordinated gene activity and enrichment of sample metadata.

□ Semantics in High-Dimensional Space

>> https://www.frontiersin.org/articles/10.3389/frai.2021.698809/full

If we are in a 128-dimensional, 1,000-dimensional, or 10-dimensional space, the natural sense of space, direction, or distance we have acquired poking around over our lifetime on the 2-dimensional surface of a 3-dimensional sphere does not quite cut it and risks leading us astray.

An increasing majority of the points in a hypercube lies far from the surface of the hypersphere, and any projected structure that depends on the differences in distance from the origin is lost. The structures in the vector space are partially shadowed onto the hypersphere cave.

Eppur Si Mouve.

2021-08-08 20:07:08 | Science News

□ Sparse least trimmed squares regression with compositional covariates for high dimensional data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab572/6343442

This work connects robustness and sparsity in the context of variable selection in regression with compositional covariates and a continuous response.

The compositional character of the covariates is taken into account by a linear log-contrast model, and elastic-net regularization achieves sparsity in the regression coefficient estimates.

□ Sfaira: accelerates data and model reuse in single cell genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02452-6

Sfaira accelerates parallelized model training across organs, model benchmarking, and comparative integrative data analysis through a streamlined data access backend while improving deployment and access to pre-trained parametric models.

Sfaira allows us to relate the dimensions of the latent space to all genes. The gene space is explicitly coupled to a genome assembly to allow controlled feature space mapping.

□ omicsGAN: Multi-omics Data Integration by Generative Adversarial Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab608/6355579

Using a random interaction network does not allow the framework to learn meaningful information from the omics datasets and therefore results in synthetic data with weaker predictive signals.

omicsGAN is a generative adversarial network (GAN) model that integrates two omics datasets and their interaction network. The model captures information from the interaction network as well as the two omics datasets and fuses them to generate synthetic data with better predictive signals.

□ scFlow: A Scalable and Reproducible Analysis Pipeline for Single-Cell RNA Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2021.08.16.456499v1.full.pdf

The scFlow R package is built to enable standardized workflows following best practices on top of popular single-cell R packages, including Seurat, Monocle, scater, emptyDrops, DoubletFinder, LIGER, and MAST.

scFlow uses Leiden/Louvain detection, automated cell-type annotation with rich cell-type metrics, flexible differential GE for categorical and numerical dependent variables, impacted pathway analysis with multiple methods, and Dirichlet modeling of cell-type composition changes.

□ DSGRN: Rational design of complex phenotype via network models

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009189

Dynamic Signatures Generated by Regulatory Networks (DSGRN) is agnostic to the specific biophysical design of the elements of the circuits. The input consists of a mathematical abstraction of GRN that consists of nodes and annotated directed edges indicating activation.

DSGRN provides a modeling framework that is capable of analyzing all 3-node RN for prevalence over a large range of parameter values. DSGRN captures complex dynamics—hysteresis arises from global organization of multiple phenotypes - monostability, bistability, monostability.

□ EnGRNT: Inference of gene regulatory networks using ensemble methods and topological feature extraction

>> https://www.biorxiv.org/content/10.1101/2021.08.05.455202v1.full.pdf

EnGRNT can be used to infer GRNs with acceptable accuracy for networks nodes using Gaussian kernel in experimental conditions.

EnGRNT is a supervised learning method that transforms the GRN inference problem into a binary classification problem for each transcription factor and ultimately improves the GRN structure.

□ Straglr: discovering and genotyping tandem repeat expansions using whole genome long-read sequences

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02447-3

Straglr is a new software tool that scans the entire genome for potential TR expansions by first extracting insertions composed of TRs and then genotyping the identified “expanded” loci.

Straglr not only spares the time and computing resources required for genotyping thousands of non-expanded TR loci but also enables the discovery of expansions at previously unannotated loci.

□ CLEIT: A Cross-Level Information Transmission Network for Hierarchical Omics Data Integration and Phenotype Prediction from a New Genotype

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab580/6352488

Cross-LEvel Information Transmission network (CLEIT) aims to represent the asymmetrical multi-level organization of the biological system by integrating multiple incoherent omics data and to improve the prediction power of low-level features.

CLEIT learns the latent representation of the high-level domain, then uses it as ground-truth embedding to improve the representation learning of the low-level domain in the form of contrastive loss. It can also leverage unlabeled data to improve the generalizability of the predictive model.

□ NetworkDynamics.jl—Composing and simulating complex networks in Julia

>> https://aip.scitation.org/doi/10.1063/5.0051387

The structure of the problem leads to several difficulties that a simulation has to deal with: coupled dynamical systems are usually defined on a high-dimensional phase space; often the asymptotic properties of the system are of interest, leading to a need for long integration times; subsystems may contain algebraic constraints or exhibit chaotic dynamics; and interactions may introduce a time delay or the system might be subject to noise.

Future development goals are an interface to the symbolic modeling package ModelingToolkit.jl, support for heterogeneous time-delays and automatically deriving Jacobian-Vector product operators in order to speed up implicit solver algorithms.

□ BioSANS: A Software Package for Symbolic and Numeric Biological Simulation

>> https://www.biorxiv.org/content/10.1101/2021.08.17.456661v1.full.pdf

BioSANS exact stochastic algorithms are tested by using the SBML discrete stochastic model test suite (SBML DSMTS).

The symbolic computation capability in BioSANS provides analytical expression of solvable cases without the need to type the ODE expression and declaring variables. BioSANS provides reliable algorithms that can facilitate the modeling process.

□ RefRGim: an intelligent reference panel reconstruction method for genotype imputation with convolutional neural networks

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab326/6353381

RefRGim is an intelligent genotype imputation reference reconstruction method with convolutional neural networks based on the genetic similarity of individuals from input data and current references.

RefRGim estimates global genetic similarity to construct a universal reference panel. RefRGim can rank reference haplotypes by their genetic similarity with study individuals and select the most comparable haplotype group for each study individual to organize them into SSRP.

□ NOREC4DNA: using near-optimal rateless erasure codes for DNA storage

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04318-x

NOREC4DNA is an all-in-one Suite for analyzing, testing and converting Data into DNA-Chunks to use for a DNA-Storage-System using integrated DNA-Rules as well as the MOSLA DNA-Simulation-API. NOREC4DNA implements Luby transform (LT) code and Raptor Codes.

□ ksrates: positioning whole-genome duplications relative to speciation events in KS distributions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab602/6354354

ksrates is a tool to position whole-genome duplications (WGDs) relative to speciation events using substitution-rate-adjusted mixed paralog–ortholog distributions of synonymous substitutions per synonymous site (KS).

ksrates generates adjusted mixed plots by rescaling ortholog KS estimates of species divergence times to the paralog scale, producing shifts in the estimated KS position of speciation events proportional to the substitution rate difference between the diverged lineages/focal species.

□ qTeller: A tool for comparative multi-genomic gene expression analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab604/6354355

qTeller identifies potential evidence of regulatory subfunctionalization, or patterns of expression of equivalent gene models between different genetic backgrounds/genomics to identify genotype-specific patterns of regulation as the result of cis- or trans- regulatory divergence.

□ VariantStore: an index for large-scale genomic variant search

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02442-8

VariantStore is a system for efficiently indexing and querying genomic information (genomic variants and phasing information) from thousands of samples containing millions of variants.

The inverted index design allows one to quickly find all the samples and positions in sample coordinates corresponding to a variant.

□ Searchlight: automated bulk RNA-seq exploration and visualisation using dynamically generated R scripts

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04321-2

Searchlight provides a level of bulk RNA-seq EVI automation that is broadly comparable to commercial tools. Searchlight2 accepts typical downstream analysis inputs - such as a sample sheet, expression matrix and any number of differential expression tables.

□ Hasindu Gamaarachchi RT

>> https://twitter.com/hasindu2008/status/1428636104094224386?s=21

Demonstrating how fast (both implementation time and runtime) SLOW5 format can be:
spent around 15 minutes to get slow5 working on
@haowen_zhang's sigmap tool.
Result: mapping 80k reads that took around 2 hours with FAST5, now takes only 5 minutes with SLOW5!
That is >100X faster

□ SWALO: scaffolding with assembly likelihood optimization

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkab717/6355875

SWALO learns parameters automatically from the data and is largely free of user parameters making it more consistent than other scaffolders. It is also able to make use of multi-mapped read pairs through probabilistic disambiguation which most other scaffolding tools ignore.

SWALO is grounded in rigorous probabilistic models yet proper approximations make the implementation efficient and applicable to practical datasets. SWALO may also be extended to scaffolding with long reads generated by SMRT and nanopore sequencing.

□ ExOrthist: a tool to infer exon orthologies at any evolutionary distance

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02441-9

ExOrthist, a fully reproducible Nextflow-based software enabling inference of exon homologs and orthogroups, visualization of evolution of exon-intron structures, and assessment of conservation of alternative splicing patterns.

ExOrthist evaluates exon sequence conservation and considers the surrounding exon-intron context to derive genome-wide multi-species exon homologies at any evolutionary distance.

□ RNABERT: Informative RNA-base embedding for functional RNA structural alignment and clustering by deep representation learning

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457433v1.full.pdf

By performing RNA sequence alignment that combines this informative base embedding with a simple Needleman-Wunsch alignment algorithm, they succeed in calculating a structural alignment in O(n^2) time instead of the O(n^6) time of Sankoff-style algorithms.
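A minimal sketch of Needleman-Wunsch over per-base embeddings, assuming cosine similarity as the match score and a fixed linear gap penalty. The one-hot vectors stand in for learned embeddings and all names are hypothetical; RNABERT's actual scoring may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(emb_x, emb_y, gap=-1.0):
    """Global alignment of two embedded sequences in O(n * m) time."""
    n, m = len(emb_x), len(emb_y)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal move: align base i-1 of x with base j-1 of y.
            best = score[i - 1][j - 1] + cosine(emb_x[i - 1], emb_y[j - 1])
            step = "D"
            if score[i - 1][j] + gap > best:  # gap in y
                best, step = score[i - 1][j] + gap, "U"
            if score[i][j - 1] + gap > best:  # gap in x
                best, step = score[i][j - 1] + gap, "L"
            score[i][j], move[i][j] = best, step
    # Trace back the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "D":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "U":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Stand-in embeddings: one-hot vectors instead of learned 120-dimensional ones.
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "U": [0, 0, 0, 1]}


def embed(seq):
    return [ONE_HOT[base] for base in seq]


total, pairs = align(embed("GAUC"), embed("GAC"))
```

The dynamic-programming table has (n + 1) × (m + 1) cells and each is filled in constant time, which is where the quadratic cost comes from.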

RNABERT model consists of three components: token and position embedding, transformer layer, and pre-training tasks. Token embedding randomly generates a 120-dimensional vector for each of the four RNA bases, so that every occurrence of a given base is assigned the same vector.

□ iSEEEK: A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457305v1.full.pdf

iSEEEK was trained in a stochastic manner, such that only a small batch of samples is processed at each time step.

iSEEEK is quite different from traditional methods, which require selection of highly variable genes (HVGs), batch correction and data normalization: iSEEEK uses the ranking of top-expressing genes and does not require selection of HVGs.
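As a rough illustration of the gene-ranking representation, the sketch below turns one cell's expression profile into an ordered list of its top-expressing genes. The function name, the tie-breaking rule and the toy counts are assumptions, not iSEEEK's implementation.

```python
def top_gene_ranking(expression, top_k):
    """Return the top_k expressed genes, highest expression first.

    `expression` maps gene name -> count for one cell. Genes with zero
    counts are dropped; ties are broken alphabetically (an assumption).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in expressed[:top_k]]


# Hypothetical counts for a single cell
cell = {"GAPDH": 120, "ACTB": 300, "CD3E": 0, "MALAT1": 950, "CD19": 4}
ranking = top_gene_ranking(cell, top_k=3)

# The same cell sequenced ten times deeper yields the same ranking.
deeper = {gene: value * 10 for gene, value in cell.items()}
ranking_deeper = top_gene_ranking(deeper, top_k=3)
```

Because only the order of genes is kept, a uniform rescaling of the counts leaves the representation unchanged, which is why no normalization step is needed in this sketch.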

□ GLUE: Multi-omics integration and regulatory inference for unpaired single-cell data with a graph-linked unified embedding framework

>> https://www.biorxiv.org/content/10.1101/2021.08.22.457275v1.full.pdf

GLUE (graph-linked unified embedding) utilizes accessible prior knowledge about regulatory interactions to bridge the gaps between feature spaces. The GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined upon the arrival of new data.

GLUE enables notable scalability for whole-atlas alignment over millions of unpaired cells, which remains a serious challenge for in silico integration.

□ proovframe: frameshift-correction for long-read (meta)genomics

>> https://www.biorxiv.org/content/10.1101/2021.08.23.457338v1.full.pdf

Gene prediction on long reads (e.g. PacBio and Nanopore) is often impaired by indels causing frameshifts. Proovframe detects and corrects frameshifts in coding sequences from raw long reads or long-read derived assemblies.

Proovframe uses frameshift-aware alignments to reference proteins as guides, and conservatively restores frame-fidelity by 1/2-base deletions or insertions of “N/NN”s, and masking of premature stops (“NNN”).
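The two repair operations can be sketched as simple string edits. This is only an illustration that assumes the frameshift position is already known from the protein-guided alignment and that the standard stop codons apply; the function names are hypothetical and not part of proovframe.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def pad_frameshift(seq, pos, missing):
    """Restore the reading frame by inserting 1 or 2 Ns at `pos`."""
    if missing not in (1, 2):
        raise ValueError("missing must be 1 or 2")
    return seq[:pos] + "N" * missing + seq[pos:]


def mask_premature_stops(cds):
    """Replace in-frame stop codons with NNN, except the final codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    last = len(codons) - 1
    masked = [
        "NNN" if i < last and codon.upper() in STOP_CODONS else codon
        for i, codon in enumerate(codons)
    ]
    return "".join(masked)


# A read that lost one base inside the second codon
repaired = pad_frameshift("ATGAACCCTAA", pos=5, missing=1)

# A coding sequence with an internal TAG stop
masked = mask_premature_stops("ATGAAATAGCCCTAA")
```

Using N rather than a guessed base keeps the correction conservative: the frame is restored without asserting a nucleotide that was never observed.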

□ SALT: Fast and SNP-aware short read alignment

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04088-6

SALT, a BWT-based short read aligner that incorporates genetic SNPs to augment the reference genome. It can effectively map reads to a reference genome with low memory requirements.

SALT was run with different overlap lengths in the seeding phase, leading to differences in speed and accuracy. BWA-MEM was run with the default settings. SALT can achieve higher accuracy and sensitivity than aligners that do not incorporate variation information.

□ TLGP: a flexible transfer learning algorithm for gene prioritization based on heterogeneous source domain

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04190-9

TLGP quantifies the similarity between the target and source domain by calculating the affinity matrix for genes. The TLGP algorithm also offers an alternative for integrative analysis of the heterogeneous genomic data.

TLGP consists of the affinity matrix construction, dimension reduction in source domain, fusion network construction and gene ranking. The fusion network is based on the integration of source and target data. The gene ranking is performed via exploring fusion matrix.

□ DCap: A novel method for predicting cell abundance based on single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04187-4

Most of the existing methods need the cell-type-specific gene expression profile as the input of the signature matrix. However, in real applications, it is not always possible to find an available signature matrix.

DCap is a deconvolution method based on non-negative least squares. DCap considers the weight resulting from measurement noise of bulk RNA-seq and calculation error, during the calculation process of non-negative least squares and performs the weighted iterative calculation.

□ indelPost: harmonizing ambiguities in simple and complex indel alignments

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab601/6357697

indelPost, a Python library that harmonizes alignment ambiguities for simple and complex indels via realignment and read-based phasing.

indelPost enables accurate analysis of ambiguous data and can derive the correct complex indel alleles from the simple indel predictions provided by standard small variant detectors, with improved performance over a specialized tool for complex indel analysis.

□ CALANGO: an annotation-based, phylogeny-aware comparative genomics framework for exploring and interpreting complex genotypes and phenotypes

>> https://www.biorxiv.org/content/10.1101/2021.08.25.457574v1.full.pdf

CALANGO (Comparative AnaLysis with ANnotation-based Genomic cOmponentes), a first-principles comparative genomics tool to search for annotation terms associated with a quantitative variable used to rank species data, after correcting for phylogenetic relatedness.

CALANGO can leverage annotation information and phylogeny-aware protocols to enable the investigation of sophisticated biological questions.

□ FRMC: a fast and robust method for the imputation of scRNA-seq data

>> https://www.tandfonline.com/doi/full/10.1080/15476286.2021.1960688

The existing imputation methods all have their drawbacks and limitations: some require a pre-assumed data distribution, some cannot distinguish between technical and biological zeros, and some have poor computational performance.

FRMC can not only precisely distinguish "true zeros" from dropout events and correctly impute missing values attributed to technical noise, but also effectively enhance intracellular and intergenic connections and achieve accurate clustering of cells in biological applications.

□ MOGAMUN: A multi-objective genetic algorithm to find active modules in multiplex biological networks

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009263

MOGAMUN optimizes both the density of interactions and the scores of the nodes (e.g., their differential expression). We compare MOGAMUN with state-of-the-art methods, representative of different algorithms dedicated to the identification of active modules in single networks.
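A simplified sketch of the two objectives and of Pareto dominance between candidate modules. It assumes a single network layer and a plain average of node scores; MOGAMUN itself works on multiplex networks, and the function names and toy data here are hypothetical.

```python
def density(nodes, edges):
    """Fraction of possible node pairs in `nodes` that are connected."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = {
        frozenset(edge)
        for edge in edges
        if edge[0] in nodes and edge[1] in nodes and edge[0] != edge[1]
    }
    return 2 * len(inside) / (n * (n - 1))


def average_node_score(nodes, scores):
    """Mean score (e.g. from differential expression) of the module's nodes."""
    nodes = set(nodes)
    return sum(scores[node] for node in nodes) / len(nodes)


def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
scores = {"A": 2.0, "B": 1.0, "C": 3.0, "D": 0.5}

small = {"A", "B", "C"}
large = {"A", "B", "C", "D"}
obj_small = (density(small, edges), average_node_score(small, scores))
obj_large = (density(large, edges), average_node_score(large, scores))
```

Here the smaller module is both denser and higher scoring, so it dominates the larger one. When neither module dominates, a multi-objective search keeps both on the Pareto front.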

MOGAMUN's running time is, similarly to that of the other genetic algorithm COSINE, one order of magnitude longer than that of jActiveModules and PinnacleZ. This running time could be improved by using surrogate-assisted multi-objective evolutionary algorithms.

□ DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction

>> https://www.biorxiv.org/content/10.1101/2021.08.31.458403v1.full.pdf

DeepConsensus, which uses a unique alignment-based loss to train a gap-aware transformer-encoder (GATE) for sequence correction. DeepConsensus incorporates the signal-to-noise ratio for each nucleotide, and strand information.

DeepConsensus improves variant calling performance across samples in both SNP and INDEL categories with reads from two and three SMRT Cells.

□ DTUrtle: Differential transcript usage analysis of bulk and single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab629/6361547

DTUrtle is the first DTU calling workflow for bulk and single-cell RNA-seq data, and performs a ‘classical’ DTU analysis in a single-cell context. DTUrtle extends one recently presented DTU calling workflow, adding the capability to analyze (sparse) single-cell expression matrices.

DTUrtle extends established statistical frameworks, offers various result aggregation and visualization options and a novel detection probability score for tagged-end data. DTUrtle utilizes sparseDRIMSeq, which allows usage of dense as well as sparse data matrices.

□ miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009290

miQC is a data-driven QC metric that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset.

miQC also maximizes the information gain from an individual experiment, often preserving hundreds or thousands of potentially informative cells that would be thrown out by uniform QC approaches.

□ CellRegMap: A statistical framework for mapping context-specific regulatory variants using scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458524v1.full.pdf

CellRegMap provides a principled approach to identify and characterize heterogeneity in allelic effects across cellular contexts of different granularity, including cell subtypes and continuous cell transitions.

CellRegMap incorporates the estimated cellular context covariance to account for interaction effects within the linear mixed model (LMM) framework. CellRegMap builds on and extends StructLMM, an LMM-based method to assess genotype-environment interactions.

□ Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab630/6362873

Prowler, a trimmer that uses a window-based approach inspired by algorithms used to trim short read data. Importantly, they retain the phase and read length information by optionally replacing trimmed sections with Ns.
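A minimal sketch of window-based trimming with N replacement, assuming non-overlapping windows and a mean-quality threshold. The function name, window size and threshold are hypothetical and do not reproduce Prowler's actual modes or defaults.

```python
def trim_windows(seq, quals, window=4, min_q=7):
    """Replace low-quality windows with Ns, keeping the read length.

    `quals` holds one Phred score per base. A window whose mean score is
    below `min_q` is replaced by Ns of the same length.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    out = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        chunk_q = quals[start:start + window]
        if sum(chunk_q) / len(chunk_q) < min_q:
            out.append("N" * len(chunk))  # mask, but keep coordinates
        else:
            out.append(chunk)
    return "".join(out)


read = "ACGTACGTACGT"
phred = [12, 11, 10, 13, 3, 4, 2, 5, 9, 10, 8, 12]
trimmed = trim_windows(read, phred)
```

Because masked windows keep their length, the two good segments stay at their original distance from each other, which is what preserves phase and read-length information.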

Compared to data filtered with Nanofilt, alignments of data trimmed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler trimmed data had a lower error rate than those filtered with Nanofilt; however, this came at some cost to assembly contiguity.

□ DEMETER: Efficient simultaneous curation of genome-scale reconstructions guided by experimental data and refined gene annotations

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab622/6362871

DEMETER (Data-drivEn METabolic nEtwork Refinement), a reconstruction pipeline that enables the efficient and simultaneous refinement of thousands of draft genome-scale reconstructions.

The refinement of draft reconstructions in DEMETER is guided by a wealth of experimental data, such as carbon sources, fermentation pathways, and growth requirements, for over 1,000 species, as well as by strain-specific comparative genomic analyses.

□ SHOOT: phylogenetic gene search and ortholog inference

>> https://www.biorxiv.org/content/10.1101/2021.09.01.458564v1.full.pdf

The output of a SHOOT search is not an ordered list of similar sequences but is instead a maximum likelihood phylogenetic tree with bootstrap support values inferred from a multiple sequence alignment with the query gene embedded within it.

□ SCSit: A high-efficiency preprocessing tool for single-cell sequencing data from SPLiT-seq

>> https://www.sciencedirect.com/science/article/pii/S2001037021003524

SCSit automatically identifies three rounds of barcodes and the UMI, and significantly increases the number of clean SCS reads thanks to the accurate detection of insertions and deletions in barcodes during alignment.

The consistency of identified reads from SCSit increases to 97%, and twice as many reads are mapped as with the original alignment methods (e.g. BLAST and BWA).

Cosmogonia.

2021-08-08 20:06:04 | Music20

Mysterons • Cosmogonia
Entitled "Cosmogonia", this ambient electronic piece was commissioned to accompany the launch of @viceversawines vintage Mysterons Cabernet Sauvignon (2019).
Music by Richard Devine
A film by Sean Curtis Patrick @seancurtispatrick pic.twitter.com/3JEhIehRRR
— Richard Devine (@RichardDevine) August 6, 2021

Pavel Karmanov - Oratorio 5 ANGELS - Normunds Sne - Latvian Radio Choir - Sinfonietta Riga

2021-08-08 20:04:02 | art music

□ Pavel Karmanov - Oratorio 5 ANGELS - Normunds Sne - Latvian Radio Choir - Sinfonietta Riga

>> https://youtu.be/NP18seGjsD8

Boy Daniels Cingujevs
16.06.16
Jaunā Sv. Ģertrūdes baznīca
Riga, Latvia

The most beautiful choir music in the last decade.🔊👼

Ludovico Einaudi / “Twice (Reimagined by Mercan Dede)”

2021-08-08 19:08:07 | Music20

□ Ludovico Einaudi / “Twice (Reimagined by Mercan Dede)”

From “Reimagined. Chapter 1, Volume 1”

℗ A Decca Records Recording; ℗ 2021 Ponderosa Music Records, under exclusive distribution to Universal Music Operations Limited

Released on: 2021-07-23

Producer: Ponderosa Music Records
Producer: Titti Santini
Studio Personnel, Recording Engineer, Mix Engineer: Gianluca Mancini
Studio Personnel, Asst. Recording Engineer: Patrick Phillips
Studio Personnel, Remixer: Mercan Dede
Studio Personnel, Mix Engineer: Tim Oliver
Studio Personnel, Mastering Engineer: Haig Vartzbedian
Studio Personnel, Mastering Engineer: Dexter Crowe
Associated Performer, Ney Flute: Burak Malcok
Associated Performer, Piano, Rhodes, Acoustic Guitar, Electric Guitar: Ludovico Einaudi
Associated Performer, Shaker, Percussion, Marimba, Drum: Mauro Refosco
Associated Performer, Kalimba: Francesco Arcuri
Associated Performer, Cello: Redi Hasa
Associated Performer, Electric Bass: Alberto Fabris
Composer: Ludovico Einaudi

