lens, align.

Long is the time, but what is true comes to pass.

Ascend.

2021-06-17 06:07:13 | Science News

(Murat Pak)




□ SIGMA: A clusterability measure for single-cell transcriptomics reveals phenotypic subpopulations

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443685v1.full.pdf

Using just the singular values and the dimensions of the measurement matrix, SIGMA calculates the angles between the singular vectors of the measured expression matrix and those of the (unobserved) signal matrix.

SIGnal-Measurement-Angle (SIGMA) is a clusterability measure derived from random matrix theory that can be used to identify cell clusters with non-random sub-structure, testably leading to the discovery of previously overlooked phenotypes.

SIGMA corresponded well with a visual inspection of the cluster UMAPs. For all clusters, the bulk of the singular value distribution was well-described by the MP distribution and, by construction, only clusters with SIGMA > 0 had significant singular values.

SIGMA identifies variance-driving genes and brings renewed awareness to random noise as a factor setting hard limits on clustering and identifying differential expression. The relationship between the largest singular values and SIGMA only depends on the dimensions of the expression matrix.
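A minimal sketch of the random-matrix cut-off mentioned above, not of the SIGMA formula itself. It assumes the noise entries are i.i.d. with a known standard deviation `sigma`, in which case the largest singular value of an N x M pure-noise matrix is approximately sigma * (sqrt(N) + sqrt(M)), the upper edge of the MP bulk. The function names and the planted rank-one toy signal are illustrative assumptions.

```python
import numpy as np

def noise_edge(n_rows, n_cols, sigma=1.0):
    """Approximate upper edge of the singular values of a pure-noise matrix."""
    return sigma * (np.sqrt(n_rows) + np.sqrt(n_cols))

def significant_singular_values(matrix, sigma=1.0):
    """Return the singular values that lie above the noise edge."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return s[s > noise_edge(*matrix.shape, sigma=sigma)]

# Toy data (hypothetical): unit-variance noise plus one strong rank-one signal.
rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 100))
u = rng.normal(size=200)
u /= np.linalg.norm(u)
v = rng.normal(size=100)
v /= np.linalg.norm(v)
measured = 100.0 * np.outer(u, v) + noise

print("noise edge:", noise_edge(200, 100))
print("significant singular values:", significant_singular_values(measured))
```

Because the edge depends only on the matrix dimensions and the noise level, the same threshold can be reused for every cluster of the same shape.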





□ XCVATR: Detection and Characterization of Variant Impact on the Embeddings of Single-Cell and Bulk RNA-Sequencing Samples

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446668v1.full.pdf

XCVATR makes use of local spatial geometry of the embedding and multiscale analysis to provide a comprehensive workflow for detecting expressed variant clumps.

XCVATR relies on the distance matrix between cells based on the transcriptomic profiles. The first step is read count quantification for each cell; the counts are used either to compute the embedding coordinates or to build the distance matrix directly from the expression levels.





□ scPhere: Deep generative model embedding of single-cell RNA-Seq profiles on hyperspheres and hyperbolic spaces

>> https://www.nature.com/articles/s41467-021-22851-4

scPhere minimizes the distortion by embedding cells to a lower-dimensional hypersphere instead of a low-dimensional Euclidean space, using von Mises–Fisher (vMF) distributions on hyperspheres as the posteriors for the latent variables.

Because the prior is a uniform distribution on a unit hypersphere and the uniform distribution on a hypersphere has no centers, points are no longer forced to cluster in the center of the latent space.

Applying scPhere with a hyperspherical latent space to each of the “small” datasets readily distinguished cell subsets. scPhere can also embed cells in the hyperbolic space of the Lorentz model and visualize the embedding in a Poincaré disk.
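The step from the Lorentz model to a Poincaré disk can be sketched with the standard projection y_i = x_i / (1 + x_0), where (x_0, x_1, ..., x_n) satisfies -x_0^2 + x_1^2 + ... + x_n^2 = -1 with x_0 > 0. This is the textbook map between the two models, not code taken from scPhere; the helper names are hypothetical.

```python
import numpy as np

def lift_to_hyperboloid(spatial):
    """Attach the time-like coordinate x0 so each point lies on the hyperboloid."""
    spatial = np.asarray(spatial, dtype=float)
    x0 = np.sqrt(1.0 + np.sum(spatial ** 2, axis=-1, keepdims=True))
    return np.concatenate([x0, spatial], axis=-1)

def lorentz_to_poincare(points):
    """Project hyperboloid points into the open unit ball: y = x[1:] / (1 + x[0])."""
    points = np.asarray(points, dtype=float)
    return points[..., 1:] / (1.0 + points[..., :1])

# Toy latent coordinates (hypothetical): the origin and a point far from it.
latent = lift_to_hyperboloid([[0.0, 0.0], [3.0, 4.0]])
disk = lorentz_to_poincare(latent)
print(disk)
```

Every projected point has norm below 1, so a two-dimensional hyperbolic latent space can be drawn directly inside a unit circle.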





□ Scelestial: fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445405v1.full.pdf

Scelestial is a method for lineage tree reconstruction from single-cell datasets, based on the Berman approximation algorithm for the Steiner tree problem. Scelestial infers the evolutionary history for single-cell data in the form of a lineage tree and imputes the missing values accordingly.

Scelestial is built around a dynamic program that finds internal node sequences with non-missing values. Scelestial models a hypercube corresponding to the missing values with one representative vertex.





□ Minimizer-space de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447586v1.full.pdf

The paper introduces the concept of minimizer-space sequencing, where the minimizers rather than DNA nucleotides are the atomic tokens. DNA sequences are projected into ordered lists of minimizers, and the key is to enumerate k-min-mers: k-mers over a larger alphabet consisting of minimizer tokens.

mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without much loss of accuracy. To handle higher sequencing error rates, mdBG corrects for base errors by performing partial order alignment in minimizer space.
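The k-min-mer idea can be illustrated with a small sketch. It assumes a density-based selection scheme (keep every l-mer whose hash falls below a threshold) and a BLAKE2 hash for reproducibility; the parameters `l`, `density` and `k` and all function names are illustrative choices, not values from mdBG.

```python
import hashlib

def lmer_hash(lmer):
    """Deterministic hash in [0, 1); Python's built-in hash() is salted per process."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2 ** 64

def minimizer_tokens(seq, l=4, density=0.3):
    """Keep, in sequence order, every l-mer whose hash is below `density`."""
    lmers = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return [m for m in lmers if lmer_hash(m) < density]

def k_min_mers(tokens, k=3):
    """Slide a window of k consecutive minimizer tokens over the token list."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

read = "ACGTTGCATGCAAGCTTAGC" * 3
tokens = minimizer_tokens(read)
print(len(read), "bases ->", len(tokens), "tokens ->", len(k_min_mers(tokens)), "k-min-mers")
```

Each k-min-mer would become one node of the graph, so the graph is built over far fewer tokens than there are bases.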




□ VSTseed: periodic spaced seeds for reads with substitutions

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447791v1.full.pdf

The minimum length of reads required for seeds of given weight is almost a linear function (or more strictly, an affine function) of the number of substitutions allowed.

VSTseed generates the seeds for reads of a given length and a known maximum number of substitutions, and converts the spaced seeds into contiguous arrays (in order to generate “signatures”) using SIMD instructions.





□ Reverse-Complement Equivariant Networks for DNA Sequences

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446953v1.full.pdf

A given DNA segment can be sequenced as two RC DNA sequences, depending on which strand is sequenced; any predictive model for, e.g., DNA sequence classification should therefore be reverse complement-invariant, which calls for RC-equivariant architectures.

The work also covers RC-equivariant pointwise nonlinearities adapted to different representations, as well as RC-equivariant embeddings of k-mers as an alternative to one-hot encoding of nucleotides.
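To make the invariance requirement concrete, here is a small sketch that wraps an arbitrary scoring function so that a sequence and its reverse complement receive the same score. This is the simple post-hoc averaging trick, shown only as an illustration; it is not the equivariant-layer construction of the paper, and the toy scoring function is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def rc_invariant(score_fn):
    """Average a score over both strands so the result ignores strand choice."""
    def wrapped(seq):
        return 0.5 * (score_fn(seq) + score_fn(reverse_complement(seq)))
    return wrapped

def count_ac(seq):
    """Toy strand-dependent score: occurrences of the dinucleotide AC."""
    return seq.count("AC")

score = rc_invariant(count_ac)
print(count_ac("AACG"), count_ac(reverse_complement("AACG")))  # strand-dependent
print(score("AACG"), score(reverse_complement("AACG")))        # identical
```

Averaging only makes the final output invariant; equivariant architectures instead constrain every layer, which is the design the paper argues for.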





□ Automated Boolean rule inference for models of biological processes: Unsupervised logic-based mechanism inference for network-driven biological processes

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009035

A generalizable, unsupervised approach generates parameter-free, Boolean logic-based models of cellular processes, described by multiple discrete states. The algorithm employs a Hamming-distance based approach to formulate, test, and identify optimized logic rules.

The algorithm automatically recovers the relevant dynamics for the explored models and recapitulates key aspects of the biochemical species concentration dynamics by the Boolean formalism.




□ muon: Multimodal omics Python framework

>> https://github.com/gtca/muon

muon operates on multimodal data (MuData) that represents modalities as collections of AnnData objects. These collections can be saved to disk and retrieved using HDF5-based .h5mu files, whose design is based on the .h5ad file structure.

muon can incorporate disjoint multimodal experiments, i.e. the ones with different cells having different modalities measured. No redundant empty measurements are stored due to the distinct feature sets per assay as well as distinct cell sets mapped to a global set of observations.





□ MC-eNN: A multi-modal coarse grained model of DNA flexibility mappable to the atomistic level

>> https://academic.oup.com/nar/article/48/5/e29/5709710

MC-eNN is an evolution of the helical CG model that assumes a novel multi-normal model, which accounts for the non-Gaussian nature of some inter-base-pair deformations, and considers a flexible extended nearest-neighbor model.

The authors introduce a new Hamiltonian inspired by empirical valence bond theory, assuming that the distribution of inter-base-pair parameters (shift, slide, rise, tilt, roll, twist) arises from a Boltzmann-averaged combination of Gaussian distributions.

The bi-dimensional inter base pair parameter distributions of MD and MC-eNN simulations are indistinguishable even when correlated in a highly non-linear manner which is impossible to capture by a standard harmonic model.





□ SOPHIE: Generative neural networks separate common and specific transcriptional responses

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445440v1.full.pdf

SOPHIE (“Specific cOntext Pattern Highlighting In Expression data”) produces a background set of transcriptomic experiments from which a gene- and pathway-specific null distribution can be generated.

SOPHIE’s measure of specificity can complement log fold change activity generated from traditional differential expression analyses by, for example, filtering the set of changed genes to identify those that are specifically relevant to the experimental condition of interest.

SOPHIE uses a VAE approach to simulate realistic-looking transcriptome experiments that serve as a background set for analyzing common versus specific transcriptional signals.





□ Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04162-z

Using the SEE-approach to sequence alignment, Accel-Align can align 280,000 100bp reads per second on a commodity quad-core CPU, and is up to 9× faster than BWA-MEM, 12× faster than Bowtie2, and 3× faster than Minimap2.

Accel-Align calculates the Hamming distance between each embedded reference and the read, and selects the best candidates with the lowest Hamming distance for extension. Accel-Align processes each read by first extracting seeds to find candidate locations similar to SFE aligners.
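The candidate-selection step can be sketched as a ranking by Hamming distance. The sketch assumes the read and each candidate reference window have already been embedded into equal-length strings; the embedding itself is not shown, and the function names and toy data are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

def best_candidates(embedded_read, embedded_refs, top=2):
    """Return the `top` reference positions with the lowest Hamming distance.

    `embedded_refs` maps a candidate reference position to its embedded string.
    """
    ranked = sorted(embedded_refs,
                    key=lambda pos: hamming(embedded_read, embedded_refs[pos]))
    return ranked[:top]

# Toy candidates (hypothetical positions and embeddings).
candidates = {10: "ACGTAC", 57: "ACGTTC", 93: "TTTTTT"}
print(best_candidates("ACGTAC", candidates, top=2))
```

Only the survivors of this cheap filter would go on to the expensive extension step, which is where the speed-up comes from.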





□ SKSV: ultrafast structural variation detection from circular consensus sequencing reads

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab341/6272511

SKSV constructs a directed acyclic graph with all the extended matches and implements sparse dynamic programming to find an optimal path in the graph to build an alignment skeleton. SKSV greedily extracts potential SV signatures by identifying non-co-linear alignment segments.

SKSV collects the maximal exact matches between unitigs in the reference de Bruijn graph and the read (U-MEMs), and uses the Landau-Vishkin algorithm to extend the U-MEMs along the reference genome with a user-defined maximal edit distance.




□ GSpace: an exact coalescence simulator of recombining genomes under isolation by distance

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab261/6272571

Simulation-based inference can bypass the limitations of statistical methods based on analytical approximations, but software allowing simulation of structured population genetic data without the classical n-coalescent approximations is scarce or slow.

GSpace is a simulator for genomic data, based on a generation-by-generation coalescence algorithm that takes into account small population size, recombination, and isolation by distance.





□ SC3s - efficient scaling of single cell consensus clustering to millions of cells

>> https://www.biorxiv.org/content/10.1101/2021.05.20.445027v1.full.pdf

SC3s (Single Cell Consensus Clustering with Speed) optimizes several steps of the original workflow to ensure that both run time and memory usage scale linearly with the number of cells.

SC3s uses a streaming approach for the k-means clustering, which makes it possible to process only a small subset of cells in each iteration. As part of an intermediary step, which was not part of the original method, a large number of microclusters are calculated.





□ SHINE: Structure Learning for Hierarchical Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2021.05.27.446022v1.full.pdf

SHINE (Structure Learning for Hierarchical Networks) is a framework for defining data-driven structural constraints and incorporating a shared learning paradigm for efficiently learning multiple networks from high-dimensional data.

SHINE uses the Random Walk with Restart algorithm, and improves performance when relatively few samples are available and multiple networks are desired, by reducing the complexity of the graphical search space and by taking advantage of shared structural information.
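For readers unfamiliar with Random Walk with Restart, a generic sketch follows. It iterates p <- (1 - r) W p + r e, where W is the column-normalized adjacency matrix, e is the indicator of the seed node and r is the restart probability. How SHINE plugs the result into its constraints is not shown; the restart value and the toy graph are assumptions.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that jumps back to `seed`."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    transition = adjacency / col_sums      # each column sums to 1
    e = np.zeros(adjacency.shape[0])
    e[seed] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        nxt = (1.0 - restart) * transition @ p + restart * e
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p

# Toy path graph 0 - 1 - 2 - 3 (hypothetical).
path_graph = np.array([[0, 1, 0, 0],
                       [1, 0, 1, 0],
                       [0, 1, 0, 1],
                       [0, 0, 1, 0]])
proximity = random_walk_with_restart(path_graph, seed=0)
print(proximity)
```

The resulting vector ranks nodes by network proximity to the seed, which is the usual reason to use the algorithm.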




□ AQUARIUM: accurate quantification of circular isoforms using model-based strategy

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab435/6296829

AQUARIUM (Accurate QUAntification of circulaR Isoforms Using Model-based strategy) accepts output of circRNA identification tools (CIRI, CIRI-full) or a BED-format file to specify the circular RNA transcripts. Then, it transforms all circular transcripts to pseudo-linear transcripts. Finally, it estimates the expression of both linear and circular transcripts using the salmon framework.





□ Deep cross-omics cycle attention model for joint analysis of single-cell multi-omics data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab403/6283577

The deep cross-omics cycle attention (DCCA) model is a computational tool for joint analysis of single-cell multi-omics data that combines variational autoencoders (VAEs) and attention-transfer.

The DCCA model learns a coordinated but separate representation for each omics dataset: the representations supervise each other based on the semantic similarity between embeddings, and each is then reconstructed back to its original dimension by a decoder for that omics type.





□ Identifying strengths and weaknesses of methods for computational network inference from single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446671v1.full.pdf

While no method is a universal winner and most methods have a modest recovery of experimentally derived interactions based on global metrics such as AUPR, methods are able to capture targets of regulators that are relevant to the system under study.

LEAP and SILGGM form a cluster on the Shalek dataset, distinct from another group comprising SCRIBE, PIDC, SCENIC, Pearson, MERLIN and Inferelator. PIDC, SCENIC, MERLIN and Pearson correlation were most stable in their performance across datasets based on F-score and AUPR.




□ IIMLP: integrated information-entropy-based method for LncRNA prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03884-w

Characteristic features are extracted from the nucleic acid sequence itself, and the topological entropy and generalized topological entropy are regarded as new information-theoretical features.

The features used constitute a 35-dimensional vector, which includes: 1 sequence length feature, 4 ORF, 4 Shannon entropy, 3 topological entropy, 3 generalized topological entropy, 17 mutual information and 3 Kullback–Leibler divergence.
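As an illustration of one feature family, the sketch below computes the Shannon entropy of the k-mer distribution of a sequence and checks that the listed feature counts add up to 35. Which values of k and which normalization IIMLP uses are not stated above, so `k` and the function name are assumptions.

```python
import math
from collections import Counter

def kmer_shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the empirical k-mer distribution of `seq`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Feature counts as listed in the text.
feature_counts = {
    "sequence length": 1, "ORF": 4, "Shannon entropy": 4,
    "topological entropy": 3, "generalized topological entropy": 3,
    "mutual information": 17, "Kullback-Leibler divergence": 3,
}

print(kmer_shannon_entropy("ACGTACGTAAAA", k=2))
print(sum(feature_counts.values()))
```

A uniform base composition gives the maximum of 2 bits for k = 1, while a homopolymer gives 0.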




□ BoardION: real-time monitoring of Oxford Nanopore sequencing instruments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04161-0

BoardION offers the possibility for sequencing platforms to remotely and simultaneously monitor all their ONT devices (MinION, Mk1C, GridION and PromethION).

BoardION’s dynamic and interactive interface allows users to explore sequencing metrics easily and to optimize in real time the quantity and the quality of the data generated by the ONT basecaller.






□ SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2021.05.13.443959v1.full.pdf

SCRaPL (Single Cell Regulatory Pattern Learning) is a Bayesian hierarchical model that infers associations between different omics components. SCRaPL identifies a series of statistical associations between epigenomic and transcriptomic layers by addressing noise.

SCRaPL combines a latent multivariate Gaussian structure with noise models that are tailored to single cell sequencing data. Inference is implemented using a combination of Hamiltonian Monte Carlo and Gibbs sampling.





□ Demuxalot: scaled up genetic demultiplexing for single-cell sequencing

>> https://www.biorxiv.org/content/10.1101/2021.05.22.443646v1.full.pdf

Demuxalot offers a novel and highly performant tradeoff between methods that rely on reference genotypes and methods that learn variants from the data, by selecting a small number of highly informative variants that maximize the marginal information with respect to reference SNVs.

Demuxalot’s conjugate Bayesian model smoothly integrates genotype information from reference SNVs and dataset-specific detected putative SNVs, as well as from historical experiments in a multi-batch setting.




□ DeLUCS: Deep Learning for Unsupervised Classification of DNA Sequences

>> https://www.biorxiv.org/content/10.1101/2021.05.13.444008v1.full.pdf

The Deep Learning method for the Unsupervised Classification of DNA Sequences (DeLUCS) is a fully automated method that determines cluster label assignments for its input sequences independent of any homology or same-length assumptions, and oblivious to sequence taxonomic labels.

DeLUCS uses Chaos Game Representations (CGRs) of primary DNA sequences, and generates “mimic” sequence CGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks.
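A Chaos Game Representation can be computed in a few lines: start at the centre of the unit square and, for each base, move halfway towards that base's corner. Counting the visited points on a 2^k x 2^k grid gives a frequency CGR. The corner assignment below is one common convention and is an assumption here, as are the function names and the choice of `k`; the sketch expects only A, C, G and T.

```python
import numpy as np

# Corner convention (assumed): A bottom-left, C top-left, G top-right, T bottom-right.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos game: move halfway towards the corner of each successive base."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return np.array(points)

def frequency_cgr(seq, k=3):
    """Count CGR points on a 2^k x 2^k grid, skipping the first k - 1 points."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=int)
    for x, y in cgr_points(seq)[k - 1:]:
        grid[int(y * size), int(x * size)] += 1
    return grid

image = frequency_cgr("ACGTACGTAC", k=2)
print(image)
```

Each grid cell corresponds to one k-mer, so the image is a fixed-size input for a neural network regardless of sequence length.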





□ COSLIR: Direct Reconstruction of Gene Regulatory Networks underlying Cellular state Transitions without Pseudo-time Inference

>> https://www.biorxiv.org/content/10.1101/2021.05.12.443928v1.full.pdf

COSLIR (COvariance restricted Sparse LInear Regression) directly reconstructs the gene regulatory networks (GRNs) that drive the cell-state transition.

COSLIR uses the alternating direction method of multipliers (ADMM) to solve this optimization problem, and applies bootstrapping and clip thresholding to select significant gene-gene interactions, improving the precision and stability of the estimator.


□ MetaWorks: Profile hidden Markov model sequence analysis can help remove putative pseudogenes from DNA barcoding and metabarcoding datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04180-x

It is possible to screen out apparent pseudogenes using ORF length filtering alone or combined with HMM profile analysis for greater sensitivity when pseudogene sequences contain frameshift mutations.

MetaWorks is a multi-marker metabarcode Snakemake pipeline that processes paired-end Illumina reads and provides a pseudogene filtering step for protein-coding markers.





□ Tejaas: reverse regression increases power for detecting trans-eQTLs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02361-8

In forward regression (FR), the authors perform univariate regression of the expression level of each gene individually on the candidate SNP’s genotype (= centered minor allele frequency) and estimate whether the distribution of resulting association p values is enriched near zero.

In reverse regression, Tejaas performs L2-regularized multiple regression of the candidate SNP’s genotype jointly on all gene expression levels. Crucially, reverse regression is not negatively affected by correlations between gene expression levels.
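The two regression directions can be contrasted in a short sketch. Forward regression fits one slope per gene; reverse regression fits the genotype on all genes at once with an L2 penalty, here via the closed form beta = (X^T X + lambda I)^{-1} X^T y. This shows only the regressions, not the test statistics Tejaas derives from them; the penalty value and the simulated data are assumptions.

```python
import numpy as np

def forward_regression_slopes(genotype, expression):
    """Univariate least-squares slope of each gene's expression on the genotype."""
    x = genotype - genotype.mean()
    y = expression - expression.mean(axis=0)
    return (x @ y) / (x @ x)

def reverse_ridge_regression(genotype, expression, penalty=1.0):
    """L2-regularized regression of the genotype jointly on all genes."""
    x = expression - expression.mean(axis=0)
    y = genotype - genotype.mean()
    gram = x.T @ x + penalty * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

# Simulated data (hypothetical): 50 samples, 8 genes, gene 2 is a target.
rng = np.random.default_rng(1)
genotype = rng.integers(0, 3, size=50).astype(float)
expression = rng.normal(size=(50, 8))
expression[:, 2] += 2.0 * genotype

slopes = forward_regression_slopes(genotype, expression)
beta = reverse_ridge_regression(genotype, expression, penalty=1.0)
print(np.round(slopes, 2))
print(np.round(beta, 2))
```

The penalty keeps the joint fit well-posed even when there are more genes than samples, which is the usual situation in eQTL data.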





□ IDEMAX: Inferring the experimental design for accurate gene regulatory network inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab367/6274652

IDEMAX (Infer DEsign MAtriX) infers the effective perturbation design from gene expression data in order to eliminate the potential risk of fitting a disconnected perturbation design to gene expression.

IDEMAX is able to identify the perturbation matrix P. P is a sparse matrix of the same size as the input expression data with n non-zero values in each row, where n is the requested number of replicates for each gene.




□ IDEAS: Individual Level Differential Expression Analysis for Single Cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.05.10.443350v1.full.pdf

The input data for IDEAS include gene expression data (a matrix of scRNA-seq fragment counts per gene and per cell), the variable of interest (e.g., case-control status), together with two sets of covariates.




□ Distinguishing chaotic from stochastic dynamics via the complexity of ordinal patterns

>> https://aip.scitation.org/doi/10.1063/5.0045731

Complexity-measure-based approaches do not work well for short time series or discrete chaotic systems. Zunino noted that the presence of equalities may introduce spurious temporal correlations and can thus lead to a false judgment on the nature of the dynamics.

The paper proposes a new fuzzy entropy, Fuzzy Permutation Entropy (FPE), which can be used to detect determinism in time series. FPE is, to some extent, immune to repeated equal values in signals, especially for chaotic series.
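For context, the sketch below implements the classical (non-fuzzy) permutation entropy based on ordinal patterns, which is the baseline that FPE modifies; it is not FPE itself. It also shows the tie problem: with a stable sort, equal values are silently ranked by time order, so a constant signal looks like a perfectly increasing one. The embedding order and function names are assumptions.

```python
import math
from collections import Counter

def ordinal_pattern(window):
    """Indices of the window sorted by value; ties keep their time order."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

def permutation_entropy(series, order=3):
    """Normalized Shannon entropy of the ordinal-pattern distribution (0 to 1)."""
    patterns = Counter(
        ordinal_pattern(series[i:i + order])
        for i in range(len(series) - order + 1)
    )
    total = sum(patterns.values())
    entropy = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return entropy / math.log(math.factorial(order))

print(permutation_entropy(list(range(20))))          # strictly increasing
print(permutation_entropy([5] * 20))                 # constant: ties mimic a trend
print(permutation_entropy([4, 7, 9, 10, 6, 11, 3]))  # mixed patterns
```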




□ LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants

>> https://www.biorxiv.org/content/10.1101/2021.05.10.443497v1.full.pdf

LYRUS is a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features.





□ Hierarchical confounder discovery in the experiment–machine learning cycle

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443616v1.full.pdf

A simple non-parametric statistical method called the Rank-to-Group (RTG) score can identify hierarchical confounder effects in raw data and ML-derived data embeddings.

RTG scores correctly assign the effects of hierarchical confounders in cases where linear methods such as regression fail. RTG scores discover cross-modal correlated variability in a complex multi-phenotypic biological dataset.





□ Sparse Allele Vectors and the Savvy Software Suite

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab378/6275747

The sparse allele vectors (SAV) file format is an efficient storage format for large-scale DNA variation data and is designed for high throughput association analysis by leveraging techniques for fast deserialization of data into computer memory.




□ SSBER: removing batch effect for single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04165-w

SSBER normalizes each cell using a natural-logarithm transformation with a scale factor of 10,000. Next, it uses z-score transformation to standardize the expression value of each gene.

SSBER considers the partial shared cell types predicted by a cell annotation algorithm and detects mutual neighbor cell pairs among the shared cell types, which improves the accuracy of anchors. SSBER calculates a correction vector for each cell with Gaussian kernel weights.
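The normalization described above matches the common log-normalization recipe, sketched here with NumPy. It assumes a cells x genes count matrix and a pseudocount of 1 inside the logarithm (`log1p`); both are assumptions, since the text does not spell them out.

```python
import numpy as np

def log_normalize(counts, scale=10_000):
    """Scale each cell to `scale` total counts, then apply ln(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0              # leave empty cells as all zeros
    return np.log1p(counts / totals * scale)

def zscore_genes(values):
    """Standardize every gene (column) to mean 0 and standard deviation 1."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    std[std == 0] = 1.0                    # constant genes stay at 0
    return (values - mean) / std

# Toy count matrix (hypothetical): 3 cells x 3 genes.
raw_counts = np.array([[1, 0, 3],
                       [2, 2, 0],
                       [0, 5, 5]])
standardized = zscore_genes(log_normalize(raw_counts))
print(np.round(standardized, 3))
```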





□ scHPL: Hierarchical progressive learning of cell identities in single-cell data

>> https://www.nature.com/articles/s41467-021-23196-8

scHPL is a hierarchical progressive learning method that allows continuous learning from single-cell data by leveraging the different resolutions of annotations across multiple datasets to learn and continuously update a classification tree.

scHPL adopts two alternatives to classify cells: a linear and a one-class SVM. scHPL can potentially be used to map these relations, irrespective of the assigned labels, and improve the Cell Ontology database.




□ ExTraMapper: Exon- and Transcript-level mappings for orthologous gene pairs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab393/6278896

ExTraMapper leverages sequence conservation between exons of a pair of organisms and identifies a fine-scale orthology mapping at the exon and then transcript level.

ExTraMapper identifies a larger number of exon and transcript mappings compared to previous methods. Further, it identifies exon fusions, splits, and losses due to splice site mutations, and finds mappings between microexons that were previously missed.





□ sigGCN: Single-Cell Classification Using Graph Convolutional Networks

>> https://www.biorxiv.org/content/10.1101/2021.06.13.448259v1.full.pdf

sigGCN is a multimodal end-to-end deep learning model for cell classification that combines a graph convolutional network (GCN) and a neural network to exploit gene interaction networks.

sigGCN employs a GCN paralleled with an NN model. Since sigGCN outputs the probability of cell class assignments, it also provides an additional function to predict a cell class as “unassigned” by setting a prediction threshold.
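The “unassigned” rule amounts to thresholding the highest class probability. A minimal sketch, assuming one row of class probabilities per cell; the threshold value, the class names and the function name are hypothetical.

```python
def assign_labels(probabilities, class_names, threshold=0.5):
    """Pick the most probable class, or "unassigned" if it is below `threshold`."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(class_names[best] if row[best] >= threshold else "unassigned")
    return labels

# Toy classifier output (hypothetical): three cells, two classes.
predicted = [[0.90, 0.10],
             [0.55, 0.45],
             [0.20, 0.80]]
print(assign_labels(predicted, ["T cell", "B cell"], threshold=0.6))
```

Raising the threshold trades coverage for precision: fewer cells get a label, but the labels that remain are more confident.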




Descend.

2021-06-17 06:06:12 | Science News

(The Genocide Memorial in Yerevan, Armenia, by architects Artur Tarkhanyan and Sashur Kalashyan.)





□ Dynamo: Mapping Vector Field of Single Cells

>> https://dynamo-release.readthedocs.io/en/latest/Differential_geometry.html

Dynamo goes beyond discrete RNA velocity vectors to continuous RNA vector field functions. With differential geometry analysis of the continuous vector field functions, Dynamo calculates the RNA Jacobian, which is a cell by gene by gene tensor, encoding the gene regulatory network.

Dynamo builds a cell-wise transition matrix by translating the velocity vector direction and the spatial relationship of each cell to transition probabilities. Dynamo uses a few different kernels to build a transition matrix which can then be used to run Markov chain simulations.
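One way to picture such a transition matrix is a cosine kernel: a cell is more likely to move to a neighbour that lies in the direction of its velocity. The sketch below is a generic illustration under that assumption and does not use Dynamo's API or its actual kernels; it also connects every cell to every other cell instead of using a nearest-neighbour graph, and `temperature` is a hypothetical parameter.

```python
import numpy as np

def velocity_transition_matrix(positions, velocities, temperature=0.1):
    """Row-stochastic matrix with P[i, j] proportional to exp(cos(v_i, x_j - x_i) / T)."""
    positions = np.asarray(positions, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    n = len(positions)
    transition = np.zeros((n, n))
    for i in range(n):
        delta = positions - positions[i]
        norms = np.linalg.norm(delta, axis=1) * np.linalg.norm(velocities[i])
        cosine = np.divide(delta @ velocities[i], norms,
                           out=np.zeros(n), where=norms > 0)
        weights = np.exp(cosine / temperature)
        weights[i] = 0.0                   # no self-transitions
        transition[i] = weights / weights.sum()
    return transition

def simulate_chain(transition, start, steps, rng):
    """Sample a path of cell indices from the Markov chain."""
    path = [start]
    for _ in range(steps):
        path.append(int(rng.choice(len(transition), p=transition[path[-1]])))
    return path

# Toy data (hypothetical): three cells on a line, all moving to the right.
cells = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
velocity = np.array([[1.0, 0.0]] * 3)
P = velocity_transition_matrix(cells, velocity)
print(np.round(P, 3))
print(simulate_chain(P, start=0, steps=5, rng=np.random.default_rng(0)))
```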





□ scAEspy: Analysis of single-cell RNA sequencing data based on autoencoders

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04150-3

scAEspy can be used to deal with the existing batch-effects among samples. Indeed, the application of batch-effect removal tools into the latent space allowed us to outperform state-of-the-art methods as well as the same batch-effect removal tools applied on the PCA space.

GMMMD and GMMMDVAE are two novel Gaussian-mixture AEs that combine MMDAE and MMDVAE with GMVAE to exploit more than one Gaussian distribution.

scAEspy is used to reduce the HVG space (k dimensions), and the obtained latent space can be used to calculate a t-SNE space. The corrected latent space by Harmony is then used to build a neighbourhood graph, which is clustered by using the Leiden algorithm.





□ Model guided trait-specific co-expression network estimation as a new perspective for identifying molecular interactions and pathways

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008960

The paper builds a mathematically justified bridge between parametric approaches and co-expression networks for identifying molecular interactions underlying complex traits. It is a methodological fusion that cross-exploits all scheme-specific strengths via a built-in information-sharing mechanism.

A novel dependency metric is provided to account for certain collinearities in data that are considered problematic with the parametric methods. The underlying parametric model is used again to provide a parametric interpretation for the estimated co-expression network elements.





□ Recovering Spatially-Varying Cell-Specific Gene Co-expression Networks for Single-Cell Spatial Expression Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.656637/full

A simple and computationally efficient two-step algorithm recovers spatially-varying cell-specific gene co-expression networks for single-cell spatial expression data.

The algorithm first estimates the gene expression covariance matrix for each cell type and then leverages the spatial locations of cells to construct cell-specific networks.

The second step uses expression covariance matrices estimated in step one and label information from neighboring cells as an empirical prior to obtain thresholded Bayesian posterior estimates.





□ scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02364-5

Identifying single nucleotide variants has become common practice for dscRNA-seq; however, a pipeline does not exist to maximize variant calling accuracy. Molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression.

scSNV is designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. scSNV has fewer false-positive SNV calls than Cell Ranger and STARsolo when using pseudo-bulk samples.





□ Capturing dynamic relevance in Boolean networks using graph theoretical measures

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab277/6275260

The selection captures two types of compounds based on static properties: first, the detectable, highly connected, dynamically influential drivers; second, a new set of dynamic drivers called gatekeepers, which are nodes with high dynamic relevance but no high connectivity.

The existence of paths from gatekeeper nodes to hubs having a higher maximal mutual information than other classes further demonstrates that this principle extends to longer paths; that is, there exist channels of information flow which are more stable carriers of signals.





□ DSBS: A new approach to decode DNA methylome and genomic variants simultaneously from double strand bisulfite sequencing

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab201/6289882

DSBS analyzer is a pipeline for analyzing Double Strand Bisulfite Sequencing data, which can simultaneously identify SNVs and evaluate DNA methylation levels at single-base resolution.

In DSBS, the bisulfite-converted Watson strand and the reverse complement of the bisulfite-converted Crick strand, derived from the same double-strand DNA fragment, are sequenced in read 1 and read 2 and aligned to the same position on the reference genome.





□ A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-03999-8

The semi-supervised deep learning model coupled with pseudo labeling has advantages in studying with limited datasets, which is not unusual in biology. This study provides an effective approach in finding non-coding mutations potentially associated with various biological phenomena.

This model included three fully-connected (FC) layers, which are also known as dense layers. The input to the first FC layer is generated by concatenating the output of the max pooling function with the additional feature map of the epigenetic and nucleotide composition features.





□ Deciphering biological evolution exploiting the topology of Protein Locality Graph

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446976v1.full.pdf

The lossless graph compression from PLG to a power graph called Protein Cluster Interaction Network (PCIN) results in a 90% size reduction and aids in improving computational time.

The authors exploit the topology of PCIN and its capability of deriving the correct species tree by focusing on the cross-talk between the protein modules. Traces of evolution are not only present at the level of the PPI, but are also very much present at the level of the inter-module interactions.




□ SSG-LUGIA: Single Sequence based Genome Level Unsupervised Genomic Island Prediction Algorithm

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab116/6290171

SSG-LUGIA is a completely automated and unsupervised approach for identifying GIs and horizontally transferred genes.

SSG-LUGIA leverages the atypical compositional biases of the alien genes to localize GIs in prokaryotic genomes. The anomalous segments thus identified are further refined following a post-processing step, and finally, the proximal segments are merged to produce the list of GIs.





□ TreeVAE: Reconstructing unobserved cellular states from paired single-cell lineage tracing and transcriptomics data

>> https://www.biorxiv.org/content/10.1101/2021.05.28.446021v1.full.pdf

TreeVAE uses a variational autoencoder (VAE) to model the observed transcriptomic data while accounting for the phylogenetic relationships between cells.

TreeVAE couples a complex non-linear observation model with a simpler correlation model in latent space (any marginal distribution for the GRW is tractable). TreeVAE could be improved by exploiting just-in-time compilation (e.g. JAX) to speed up the message passing algorithm.




□ FDDH: Fast Discriminative Discrete Hashing for Large-Scale Cross-Modal Retrieval

>> https://ieeexplore.ieee.org/document/9429177/

The authors formulate the learning of similarity-preserving hash codes in terms of orthogonally rotating the semantic data, so as to minimize the quantization loss of mapping data to Hamming space, and propose fast discriminative discrete hashing (FDDH) for large-scale cross-modal retrieval.

FDDH introduces an orthogonal basis to regress the targeted hash codes of training examples to their corresponding semantic labels and utilizes the ϵ-dragging technique to provide provable large semantic margins.

FDDH theoretically approximates the bi-Lipschitz continuity. An orthogonal transformation scheme is further proposed to map the nonlinear embedding data into the semantic subspace. The discriminative power of semantic information can be explicitly captured and maximized.





□ BAVARIA: Simultaneous dimensionality reduction and integration for single-cell ATAC-seq data using deep learning

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443540v1.full.pdf

Several methods have been introduced for dimensionality reduction using scATAC-seq data, including latent Dirichlet allocation (cisTopic), latent semantic indexing (LSI), SnapATAC and SCALE.

BAVARIA is a batch-adversarial variational auto-encoder (VAE) that facilitates simultaneous dimensionality reduction and batch correction for scATAC-seq data via an adversarial learning strategy.





□ ontoFAST: An R package for interactive and semi-automatic annotation of characters with biological ontologies

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443562v1.full.pdf

The commonly used Entity-Quality (EQ) syntax provides rich semantics and high granularity for annotating phenotypes and characters using ontologies. However, EQ syntax might be time inefficient if this granularity is unnecessary for downstream analysis.

ontoFAST aids production of fast annotations of characters and character matrices with biological ontologies. ontoFAST enhances data interoperability between various applications and supports further integration of ontological and phylogenetic methods.




□ Unsupervised weights selection for optimal transport based dataset integration

>> https://www.biorxiv.org/content/10.1101/2021.05.12.443561v1.full.pdf

Horizontal integration describes the problem of merging two or more datasets expressed in a common feature space, each of those containing samples gathered across distinct sources or experiments.

Vertical dataset integration reduces to horizontal dataset integration in this latent space. The extra layer of difficulty in this approach comes from constructing a relevant latent space via mappings that preserve enough information.

The method is a variant of the optimal transport (OT)- and Gromov-Wasserstein (GW)-based dataset integration algorithm introduced in SCOT.

It formulates a constrained quadratic program to adjust sample weights before OT or GW so that the weighted point density is close to uniform over the point cloud, for a given kernel.
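The weight adjustment described above can be sketched as a small quadratic program. This is only an illustration under stated assumptions, not the authors' formulation: the objective (variance of the kernel density `K @ w`), the simplex constraint on the weights, the Gaussian kernel, the projected-gradient solver and all function names are hypothetical choices made for this example.

```python
import math


def gaussian_kernel(points, bandwidth):
    """Dense Gaussian kernel matrix between all pairs of points (assumed kernel)."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2 * bandwidth ** 2))
         for q in points]
        for p in points
    ]


def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def density(K, w):
    """Weighted kernel density at every sample: d = K w."""
    return [sum(k * x for k, x in zip(row, w)) for row in K]


def density_variance(K, w):
    d = density(K, w)
    mean = sum(d) / len(d)
    return sum((x - mean) ** 2 for x in d) / len(d)


def balance_weights(K, n_iter=500):
    """Minimise sum_i (d_i - mean(d))^2 over the simplex by projected gradient."""
    n = len(K)
    w = [1.0 / n] * n
    # Step size from a bound on the Lipschitz constant: L <= 2 * ||K||_F^2.
    step = 1.0 / (2.0 * sum(k * k for row in K for k in row))
    for _ in range(n_iter):
        d = density(K, w)
        mean = sum(d) / n
        centered = [x - mean for x in d]
        # Gradient of the objective with respect to w is 2 * K^T (d - mean).
        grad = [2.0 * sum(K[i][j] * centered[i] for i in range(n)) for j in range(n)]
        w = project_to_simplex([wj - step * g for wj, g in zip(w, grad)])
    return w
```

With three points packed together and one isolated point, the isolated sample should end up with a larger weight than any of the crowded ones, which is the intended effect before running OT or GW.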





□ Novel feature selection via kernel tensor decomposition for improved multi-omics data analysis

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445049v1.full.pdf

Kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) was extended to integrate multi-omics datasets measured over common samples in a weight-free manner.




□ A graphical, interactive and GPU-enabled workflow to process long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.05.11.443665v1.full.pdf

An Extended Biodepot-workflow-builder (Bwb) to provide a modular and easy-to-use graphical interface that allows users to create, customize, execute, and monitor bioinformatics workflows.

The authors observed a 34x speedup and a 109x reduction in costs for the rate-limiting basecalling step in the cell line data. The graphical interface and greatly simplified deployment facilitate the adoption of GPUs for rapid, cost-effective analysis of long-read sequencing.




□ bathometer: lightning fast depth-of-reads query

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab372/6275265

Bathometer aims for an index that is compact and can be used without having to be read into memory completely. An index stores for each strand of each reference sequence the list of starting positions and the list of end positions of all reads.
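Why two sorted position lists are enough for a depth query can be shown with a minimal in-memory sketch. It assumes half-open read intervals `[start, end)` and does not reproduce bathometer's on-disk format; the function names are hypothetical.

```python
from bisect import bisect_right


def build_index(reads):
    """reads: iterable of (start, end) half-open intervals on one strand."""
    reads = list(reads)
    starts = sorted(start for start, _ in reads)
    ends = sorted(end for _, end in reads)
    return starts, ends


def depth_at(index, pos):
    """Reads that started at or before pos, minus reads that already ended."""
    starts, ends = index
    return bisect_right(starts, pos) - bisect_right(ends, pos)


index = build_index([(0, 10), (5, 15), (12, 20)])
print(depth_at(index, 7))   # two reads overlap position 7
```

Each query costs two binary searches, so only a small part of each sorted list has to be touched, which fits the goal of not reading the whole index into memory.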





□ Crinet: A computational tool to infer genome-wide competing endogenous RNA (ceRNA) interactions

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0251399

Crinet (CeRna Interaction NETwork) considers all mRNAs, lncRNAs, and pseudogenes as potential ceRNAs and incorporates a network deconvolution method to exclude the spurious ceRNA pairs.


Crinet incorporates miRNA-target interactions with binding scores, gene-centric copy number aberration (CNA), and expression datasets. If binding scores are not available, the same score for all interactions could be used.




□ FAME: A framework for prospective, adaptive meta-analysis (FAME) of aggregate data from randomised trials

>> https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003629

FAME can reduce the potential for bias, and produce more timely, thorough and reliable systematic reviews of aggregate data.

The FAME estimates of absolute information size and power, and the associated decision on meta-analysis timing should be included. FAME is suited to situations where quick and robust answers are needed, but prospective IPD meta-analysis would be too protracted.




□ POEMColoc: Estimating colocalization probability from limited summary statistics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04170-z

POEMColoc (POint EstiMation of Colocalization) imputes missing summary statistics for one or both traits using LD structure in a reference panel, and performs colocalization using the imputed summary statistics.

POEMColoc does not discard information when full summary statistics are available for one but not both of the traits and does not assume that both traits have a causal variant in the region.



□ Swarm: A federated cloud framework for large-scale variant analysis

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008977

With Swarm, large genomic datasets hosted on different cloud platforms or on-premise systems can be jointly analyzed with reduced data motion. Swarm can in principle facilitate federated learning by transferring models across clouds.

Swarm can help transfer intermediate results of the machine learning models across the cloud, so that the model can continue to learn and improve using the new data in the second cloud. For instance, gradients of deep learning models can be transferred by Swarm.




□ Adversarial generation of gene expression data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab282/6278292

This model preserves several gene expression properties significantly better than widely used simulators such as SynTReN or GeneNetWeaver.

It exhibits real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way.



□ XGraphBoost: Extracting Graph Neural Network-Based Features for a Better Prediction of Molecular Properties

>> https://pubs.acs.org/doi/10.1021/acs.jcim.0c01489

XGraphBoost is an algorithm combining a GNN and XGBoost: it introduces the machine learning algorithm XGBoost under the existing GNN network architecture to improve predictive capability. The GNNs used in this paper include DMPNN, GGNN and GCN.

The integrated framework XGraphBoost extracts the features using a GNN and builds an accurate prediction model of molecular properties using the classifier XGBoost. The XGraphBoost framework fully inherits the merits of the GNN-based automatic molecular feature extraction.




□ Caution against examining the role of reverse causality in Mendelian Randomization

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22385

The MR Steiger approach may fail to correctly identify the direction of causality. This is true especially in the presence of pleiotropy.

reverseDirection runs simulations for user-specified scenarios to examine when the MR Steiger approach can correctly determine the causal direction between two phenotypes.




□ Robust Inference for Mediated Effects in Partially Linear Models

>> https://link.springer.com/article/10.1007/s11336-021-09768-z

The authors propose G-estimators for the direct and indirect effects and demonstrate consistent asymptotic normality for indirect effects when models for the conditional means of M or X/Y are correctly specified, and for direct effects when models for the conditional means of Y or X/M are correct.

The GMM-based tests perform better in terms of power and small sample performance compared with traditional tests in the partially linear setting, with drastic improvement under model misspecification.




□ ARAMIS: From systematic errors of NGS long reads to accurate assemblies

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbab170/6278148

Within the hybrid methodologies, there are two main approaches: aligning short reads to long reads using a variety of aligners to achieve maximum accuracy (e.g., HECIL); or first performing an assembly with short reads and then aligning the long reads against it to correct them (e.g., HALC).

Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-reads indels correction pipeline that combines several correction software in just one step using accurate short reads.




□ Superscan: Supervised Single-Cell Annotation

>> https://www.biorxiv.org/content/10.1101/2021.05.20.445014v1.full.pdf

Superscan (Supervised Single-Cell Annotation): a supervised classification approach built around a simple XGBoost model trained on manually labelled data.

Superscan aims to reach high overall performance across a range of datasets by including a large collection of training data. This is in contrast to a method like CaSTLE, which also employs an XGBoost model but requires specification of a sufficiently similar pre-labeled dataset.





□ Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-Seq datasets.

>> https://www.biorxiv.org/content/10.1101/2021.05.20.444982v1.full.pdf

The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measure of gene expression with high accuracy in simulated datasets.

KmerExploR, a direct application of Kmerator, uses a set of k-mers specific to predictor genes to infer metadata, including library protocol, sample features or contaminations, from RNA-seq datasets.




□ ccdf: Distribution-free complex hypothesis testing for single-cell RNA-seq differential expression analysis

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445165v1.full.pdf

ccdf tests the association of each gene expression with one or many variables of interest (that can be either continuous or discrete), while potentially adjusting for additional covariates.

To test such complex hypotheses, ccdf uses a conditional independence test relying on the conditional cumulative distribution function, estimated through multiple regressions.





□ EM-MUL: An effective method to resolve ambiguous bisulfite-treated reads

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04204-6

EM-MUL not only rescues multireads overlapped with unique reads, but also uses the overall coverage and accurate base-level alignment to resolve multireads that cannot be handled by current methods.

The EM-MUL method can align partial BS-reads to the repeated regions, which is beneficial to the further analysis of the repeated regions.





□ Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment

>> https://www.biorxiv.org/content/10.1101/2021.05.29.446291v1.full.pdf

Vulcan leverages the computed normalized edit distance of the mapped reads via e.g. minimap2 to identify poorly aligned reads and realigns them using the more accurate yet computationally more expensive long read mapper.

Vulcan runs up to 4X faster than NGMLR alone and produces lower edit distance alignments than minimap2, on both simulated and real datasets. Vulcan could be used for any combination of long-read mappers that output the edit distance (NM tag) directly within sam/bam file output.
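The read-selection step can be sketched as follows. This is a hedged illustration, not Vulcan's code: the input tuples, the nearest-rank percentile cutoff and the function names are assumptions; only the idea of normalising the edit distance (NM tag) by alignment length and re-mapping the worst reads comes from the description above.

```python
import math


def normalized_edit_distance(nm, aligned_length):
    """Edit distance (NM tag) divided by the aligned length of the read."""
    return nm / aligned_length if aligned_length > 0 else 1.0


def select_for_realignment(alignments, percentile=90):
    """alignments: list of (read_name, nm, aligned_length) from the fast mapper.

    Returns the names of reads whose normalised edit distance lies above the
    given percentile; these would be handed to the slower, more accurate mapper.
    """
    scored = [(name, normalized_edit_distance(nm, length))
              for name, nm, length in alignments]
    if not scored:
        return []
    ranked = sorted(score for _, score in scored)
    # Nearest-rank percentile (an assumption; any cutoff rule could be used).
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))
    cutoff = ranked[rank - 1]
    return [name for name, score in scored if score > cutoff]
```

A lower percentile sends more reads to the expensive mapper, trading run time for alignment quality.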




□ findere: fast and precise approximate membership query

>> https://www.biorxiv.org/content/10.1101/2021.05.31.446182v1.full.pdf

findere is a simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure (AMQ). With no drawbacks, queries are two times faster with two orders of magnitude fewer false positive calls.

The findere implementation proposed here uses a Bloom filter as AMQ. It proposes a way to index and query Kmers from biological sequences (fastq or fasta, gzipped or not, possibly considering only canonical Kmers) or from any textual data.
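For readers unfamiliar with the underlying data structure, here is a minimal sketch of a Bloom filter used as an AMQ for canonical k-mers. It only illustrates the baseline that findere builds on and does not implement the findere query strategy itself; the class, the SHA-256-based hashing and the default sizes are assumptions made for this example.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(sequence, k):
    for i in range(len(sequence) - k + 1):
        yield canonical(sequence[i:i + k])


class BloomFilter:
    """Bit array with several hash functions; no false negatives, some false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))
```

Indexing a sequence means calling `add` on each of its canonical k-mers; a query k-mer is canonicalised the same way before testing membership.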





□ LazyB: fast and cheap genome assembly

>> https://almob.biomedcentral.com/articles/10.1186/s13015-021-00186-5

LazyB starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph G.

Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts, step by step, subgraphs whose global properties approach a disjoint union of paths.





□ DIMA: Data-Driven Selection of an Imputation Algorithm

>> https://pubs.acs.org/doi/10.1021/acs.jproteome.1c00119

DIMA can take a numeric matrix or the file path to a MaxQuant ProteinGroups file as an input. The data is reduced to the columns whose sample names include a given pattern.

DIMA reliably suggests a high-performing imputation algorithm, which is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 80% of the cases.





□ scRegulocity: Detection of local RNA velocity patterns in embeddings of single cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2021.06.01.446674v1.full.pdf

scRegulocity focuses on velocity switching patterns, local patterns where the velocity of nearby cells changes abruptly. These different transcriptional dynamics patterns can be indicative of transitioning cell states.

scRegulocity annotates these patterns with genes and enriched pathways and also analyzes and visualizes the velocity switching patterns at the regulatory network level. scRegulocity also combines velocity estimation, pattern detection and visualization steps.




□ Optimizing Network Propagation for Multi-Omics Data Integration

>> https://www.biorxiv.org/content/10.1101/2021.06.10.447856v1.full.pdf

Examining Random Walk with Restart (RWR) and Heat Diffusion revealed specific characteristics of the algorithms. Optimal parameters could also be obtained by either maximizing the agreement between different omics layers or by maximizing the consistency between biological replicates.
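As a reminder of what is being tuned, Random Walk with Restart iterates p ← (1 − r)·W·p + r·p0 on a column-normalised adjacency matrix W until convergence. The sketch below is a generic textbook version, not the paper's code; the function name and default values are illustrative, and the restart probability `restart` is the kind of parameter the optimisation above would select.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """adjacency: square list of lists; seeds: non-negative start scores per node."""
    n = len(adjacency)
    # Column-normalise so each column of W sums to 1 (columns of zeros stay zero).
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    total = sum(seeds)
    p0 = [s / total for s in seeds]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


# Path graph 0 - 1 - 2 with all of the seed signal on node 0.
scores = random_walk_with_restart([[0, 1, 0], [1, 0, 1], [0, 1, 0]], [1, 0, 0])
print(scores)
```

A larger restart probability keeps the signal closer to the seed nodes, while a smaller one lets it spread further through the network.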




□ The reciprocal Bayesian LASSO

>> https://onlinelibrary.wiley.com/doi/10.1002/sim.9098

BayesRecipe includes a set of computationally efficient MCMC algorithms for solving the Bayesian reciprocal LASSO in linear models. It also includes a modified S5 algorithm to solve the reduced reciprocal LASSO problem in linear regression.

The paper presents a fully Bayesian formulation of the rLASSO problem, which is based on the observation that the rLASSO estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the regression parameters are assigned independent inverse Laplace priors.




Apparition.

2021-06-17 06:06:06 | Science News




□ LANTERN: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power

>> https://www.biorxiv.org/content/10.1101/2021.06.11.448129v1.full.pdf

LANTERN learns interpretable models of GPLs by finding a latent, low-dimensional space where mutational effects combine additively. LANTERN then captures the non-linear effects of epistasis through a multi-dimensional, non-parametric Gaussian-process model.





□ OptICA: Optimal dimensionality selection for independent component analysis of transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445885v1.full.pdf

OptICA is a novel method for effectively finding the optimal dimensionality that consistently maximizes the number of biologically relevant components revealed while minimizing the potential for over-decomposition.

The authors validated OptICA against known transcriptional regulatory networks and found that it outperformed previously published algorithms for identifying the optimal dimensionality. OptICA is organism-invariant.





□ Theory of local k-mer selection with applications to long-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.05.22.445262v1.full.pdf

This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers.

Colinear sets of k-mer matches are collected into chains, and then dynamic programming based alignment is performed to fill gaps between chains. The modification swaps out the k-mer selection method, originally random minimizers, for open syncmers.
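For orientation, an open syncmer is a k-mer whose smallest s-mer (under some ordering) sits at a fixed offset t inside the k-mer. The sketch below is a minimal illustration of that selection rule, not the paper's implementation; the SHA-256-based ordering, the tie-breaking by leftmost position and the function names are assumptions.

```python
import hashlib


def _hash_order(smer):
    """Pseudo-random but deterministic ordering of s-mers (an assumed choice)."""
    return int.from_bytes(hashlib.sha256(smer.encode()).digest()[:8], "big")


def open_syncmers(sequence, k, s, t=0, key=None):
    """Return (position, k-mer) pairs whose minimal s-mer is at offset t.

    key maps an s-mer to a sortable value; ties go to the leftmost s-mer.
    """
    key = key or _hash_order
    selected = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        min_pos = min(range(len(smers)), key=lambda j: (key(smers[j]), j))
        if min_pos == t:
            selected.append((i, kmer))
    return selected


# Lexicographic ordering is used here only to make the example easy to follow.
print(open_syncmers("ACGTACGA", k=4, s=2, t=0, key=lambda smer: smer))
```

Unlike minimizers, whether a k-mer is selected depends only on the k-mer itself and not on its neighbours in a window, which is what makes the selection "local".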




□ GENIES: A new method to study genome mutations using the information entropy

>> https://www.biorxiv.org/content/10.1101/2021.05.27.445958v1.full.pdf

GENIES (GENetic Entropy Information Spectrum) is a fully functional code that has an easy-to-use graphical interface and allows maximum versatility in choosing the computational parameters such as SS, WS and m-block size.





□ Super-cells untangle large and complex single-cell transcriptome networks

>> https://www.biorxiv.org/content/10.1101/2021.06.07.447430v1.full.pdf

The authors present a network-based coarse-graining framework where highly similar cells are merged into super-cells. Super-cells not only preserve but often improve the results of downstream analyses including clustering, DE, cell type annotation, gene correlation, RNA velocity and data integration.

A super-cell gene expression matrix is computed by averaging gene expression within super-cells. Using the walktrap algorithm, it enables users to explore different graining levels without having to recompute the super-cells for each choice of 𝛾.
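The averaging step is simple enough to sketch directly. This is an illustration only: it assumes a cells-by-genes list of expression values and a precomputed super-cell label per cell (for example, from a graph clustering step), and the function name is hypothetical.

```python
def supercell_expression(expression, membership):
    """Average gene expression over the cells assigned to each super-cell.

    expression: list of per-cell expression lists (cells x genes)
    membership: super-cell identifier for each cell, in the same order
    """
    sums, counts = {}, {}
    for cell, supercell in zip(expression, membership):
        if supercell not in sums:
            sums[supercell] = [0.0] * len(cell)
            counts[supercell] = 0
        sums[supercell] = [a + b for a, b in zip(sums[supercell], cell)]
        counts[supercell] += 1
    return {sc: [value / counts[sc] for value in total] for sc, total in sums.items()}


cells = [[1, 0], [3, 2], [10, 10]]
print(supercell_expression(cells, membership=[0, 0, 1]))
```

The first two cells are merged into one super-cell and the third forms its own, so the result has one averaged expression vector per super-cell.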




Heng Li

>> https://github.com/lh3/minimap2/releases/tag/v2.19

Minimap2 v2.19 released with better and more contiguous alignment over long INDELs and in highly repetitive regions, improvements backported from unimap. These represent the most significant algorithmic change since v2.1. Use with caution.





Adam Phillippy RT

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1.full.pdf
>> https://www.biorxiv.org/content/10.1101/2021.05.26.445678v1.full.pdf
>> http://github.com/marbl/CHM13

"Segmental duplications and their variation in a complete human genome" led by @mrvollger identifies double the number of previously known near-identical SD alignments, revealing massive evolutionary differences in SD organization between humans and apes.




□ Vcflib and tools for processing the VCF variant call format

>> https://www.biorxiv.org/content/10.1101/2021.05.21.445151v1.full.pdf

The vcflib toolkit contains both a library and a collection of executable programs for transforming VCF files, consisting of over 30,000 lines of source code written in C++. vcflib also comes with a toolkit for population genetics: the Genotype Phenotype Association Toolkit (GPAT).





□ Tracking cell lineages to improve research reproducibility go.nature.com/3oDxZ2k

>> https://www.nature.com/articles/s41587-021-00928-1

Sophie Zaaijer

Cell lineage tracking is important, and is actually pretty easy given the right tools.

Academics please check out our (FREE!) tool called "FIND Cell": you can digitize, organize, and verify your cell line info.


https://twitter.com/sophie_zaaijer/status/1395083592368336901?s=21




□ HCMB: A stable and efficient algorithm for processing the normalization of highly sparse Hi-C contact data

>> https://www.sciencedirect.com/science/article/pii/S2001037021001768

Hi-C Matrix Balancing (HCMB) is built on an iterative solution of equations, combined with a linear search and projection strategy, to normalize the original Hi-C interaction data.

HCMB can be seen as a variant of the Levenberg-Marquardt-type method, of which one salient characteristic is that the coefficient matrix of linear equations will be dense during the iterative process. The HCMB algorithm shows more robust practical behavior on highly sparse matrices.




□ G2S3: A gene graph-based imputation method for single-cell RNA sequencing data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009029

G2S3 imputes dropouts by borrowing information from adjacent genes in a sparse gene graph learned from gene expression profiles across cells.

G2S3 has superior overall performance in recovering gene expression, identifying cell subtypes, reconstructing cell trajectories, identifying differentially expressed genes, and recovering gene regulatory and correlation relationships.

G2S3 optimizes the gene graph structure using graph signal processing that captures nonlinear correlations among genes.

The computational complexity of the G2S3 algorithm is a polynomial of the total number of genes in the graph, so it is computationally efficient, especially for large scRNA-seq datasets with hundreds of thousands of cells.




□ MultiTrans: an algorithm for path extraction through mixed integer linear programming for transcriptome assembly

>> https://ieeexplore.ieee.org/document/9440797/

The authors formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm, MultiTrans, for path extraction using mixed integer linear programming.

MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. MultiTrans generates more accurate transcripts compared to TransLiG and rnaSPAdes.





□ Automated Generation of Novel Fragments Using Screening Data, a Dual SMILES Autoencoder, Transfer Learning and Syntax Correction

>> https://pubs.acs.org/doi/10.1021/acs.jcim.0c01226

The dual model produced valid SMILES with improved features, considering a range of properties including aromatic ring counts, heavy atom count, synthetic accessibility, and a new fragment complexity score we term Feature Complexity.





□ SRC: Accelerating RepeatClassifier Based on Spark and Greedy Algorithm with Dynamic Upper Boundary

>> https://www.biorxiv.org/content/10.1101/2021.06.03.446998v1.full.pdf

Spark-based RepeatClassifier (SRC) which uses Greedy Algorithm with Dynamic Upper Boundary (GDUB) for data division and load balancing, and Spark to improve the parallelism of RepeatClassifier.

SRC not only maintains the same level of accuracy as RepeatClassifier, but also achieves a 42- to 88-fold speedup over it. At the same time, a modular interface is provided to facilitate subsequent upgrades and optimization.




□ BaySiCle: A Bayesian Inference joint kNN method for imputation of single-cell RNA-sequencing data making use of local effect

>> https://www.biorxiv.org/content/10.1101/2021.05.24.445309v1.full.pdf

BaySiCle allows robust imputation of missing values generating realistic transcript distributions that match single molecule fluorescence in situ hybridization measurements.

By using priors obtained from the dataset structure, not just within the experimental set-up batch but also within the same group of cells, BaySiCle improves the accuracy of imputation compared with similar alternatives.




□ nf-LO: A scalable, containerised workflow for genome-to-genome lift over

>> https://www.biorxiv.org/content/10.1101/2021.05.25.445595v1.full.pdf

nf-LO (nextflow-LiftOver), a containerised and scalable Nextflow pipeline that enables liftovers within and between any species for which assemblies are available. nf-LO is a workflow to facilitate the generation of genome alignment chain files compatible with the LiftOver utility.

nf-LO can directly pull genomes from public repositories, supports parallelised alignment using a range of alignment tools and can be finely tuned to achieve the desired sensitivity, speed of process and repeatability of analyses.




□ Pseudo-supervised Deep Subspace Clustering

>> https://ieeexplore.ieee.org/document/9440402/

Self-reconstruction loss of an AE ignores rich useful relation information and might lead to indiscriminative representation, which inevitably degrades the clustering performance. It is also challenging to learn high-level similarity without feeding semantic labels.

The method uses pairwise similarity to weigh the reconstruction loss to capture local structure information, while the similarity is learned by the self-expression layer.

Pseudo-graphs and pseudo-labels, which allow benefiting from uncertain knowledge acquired during network training, are further employed to supervise similarity learning. Joint learning and iterative training help to obtain an overall optimal solution.





□ Samplot: a platform for structural variant visual validation and automated filtering

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5

Samplot provides a quick platform for rapidly identifying false positives and enhancing the analysis of true-positive SV calls. Samplot images are a concise SV visualization that highlights the most relevant evidence in the variable region and hides less informative reads.

Samplot-ML is a resnet-like model that takes Samplot images of putative deletion SVs as input and predicts a genotype. This model will remove false positives from the output set of an SV caller or genotyper.





□ RMAPPER: Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

>> https://almob.biomedcentral.com/articles/10.1186/s13015-021-00182-9

There the term bi-label refers to two k-mers separated by a specified genomic distance. The redefinition of the de Bruijn graph with this extra information was shown to de-tangle the resulting graph, making traversal more efficient and accurate.

An equivalent paradigm can be effective for Rmap assembly. RMAPPER was more than 130 times faster and used less than five times less memory than Solve, and was more than 2,000 times faster than Valouev et al.

RMAPPER successfully assembled the 3.1 million Rmaps of the climbing perch genome into contigs that covered over 95% of the draft genome with zero mis-assemblies.





□ diffBUM-HMM: a robust statistical modeling approach for detecting RNA flexibility changes in high-throughput structure probing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02379-y

diffBUM-HMM is widely compatible, accounting for sampling variation and sequence coverage biases, and displays higher sensitivity than existing methods while robust against false positives.

diffBUM-HMM detects more differentially reactive nucleotides (DRNs) in the Xist lncRNA that are preferentially single-stranded A’s and U’s. diffBUM-HMM outperforms deltaSHAPE and dStruct in sensitivity and/or specificity.





□ contrastive-sc: Contrastive self-supervised clustering of scRNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04210-8

contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture.

contrastive-sc computes by default a cell partitioning with KMeans or Leiden. This phenomenon can be explained by the documented tendency of KMeans to identify equal-sized clusters, combined with the significant class imbalance associated with the datasets having more than 8 clusters.




□ baredSC: Bayesian Approach to Retrieve Expression Distribution of Single-Cell

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445740v1.full.pdf

baredSC is a Bayesian approach to disentangle the intrinsic variability in gene expression from the sampling noise. baredSC approximates the expression distribution of a gene by a Gaussian mixture model.

The authors also use real biological data sets to illustrate the power of baredSC to assess the correlation between genes or to reveal the multi-modality of a lowly expressed gene. baredSC reveals the trimodal distribution.





□ GenomicSuperSignature: interpretation of RNA-seq experiments through robust, efficient comparison to public databases

>> https://www.biorxiv.org/content/10.1101/2021.05.26.445900v1.full.pdf

GenomicSuperSignature matches PCA axes in a new dataset to an annotated index of replicable axes of variation (RAV) that are represented in previously published independent datasets.

GenomicSuperSignature also can be used as a tool for transfer learning, utilizing RAVs as well-defined and replicable latent variables defined by multiple previous studies in place of de novo latent variables.





Nature Genetics

>> https://www.nature.com/articles/s41576-021-00367-3

Long-read sequencing at the population scale presents specific challenges but is becoming increasingly accessible. The authors discuss the major platforms and analytical tools, considerations in project design and challenges in scaling long-read sequencing to populations.




□ Dysgu: efficient structural variant calling using short or long reads

>> https://www.biorxiv.org/content/10.1101/2021.05.28.446147v1.full.pdf

Dysgu detects signals from alignment gaps, discordant and supplementary mappings, and generates consensus contigs, before classifying events using machine learning.

Dysgu employs a fast consensus sequence algorithm, inspired by the positional de Bruijn graph, followed by remapping of anomalous sequences to discover additional small SVs.




□ GeneGrouper: Density-based binning of gene clusters to infer function or evolutionary history

>> https://www.biorxiv.org/content/10.1101/2021.05.27.446007v1.full.pdf

GeneGrouper identified a novel, frequently occurring pduN pseudogene. When replicated in vivo, disruption of pduN with a frameshift mutation negatively impacted microcompartment formation.

Sequences are clustered using mmseqs2 linclust to generate a set of proximate orthology relationships, producing a set of representative amino acid sequences in FASTA format. The E-values from the filtered hits table are used as an input for Markov Graph Clustering with MCL.




□ A phylogenetic approach for weighting genetic sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04183-8

The authors formalise the principle by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights that are called ‘phylogenetic novelty scores’.

These phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos.





□ PRESCIENT: Generative modeling of single-cell time series with PRESCIENT enables prediction of cell trajectories with interventions

>> https://www.nature.com/articles/s41467-021-23518-w

PRESCIENT (Potential eneRgy undErlying Single Cell gradIENTs) builds upon a diffusion-based model by enabling the model to operate on large numbers of cells over many timepoints with high-dimensional features, and by incorporating cellular growth estimates.

The authors demonstrate PRESCIENT’s ability to generate held-out timepoints and to predict cell fate bias, i.e. the probability a cell enters a particular fate given its initial state. PRESCIENT’s objective can be modified to maximize the likelihood of observing individual trajectories given lineage tracing data.





□ MetaVelvet-DL: a MetaVelvet deep learning extension for de novo metagenome assembly

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03737-6

MetaVelvet-DL builds an end-to-end architecture using Convolutional Neural Network and Long Short-Term Memory units. MetaVelvet-DL can more accurately predict how to partition a de Bruijn graph than the Support Vector Machine-based model in MetaVelvet-SL.




□ CaFew: Boosting scRNA-seq data clustering by cluster-aware feature weighting

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04033-7

By resolving the optimization problem of clustering, a weight matrix indicating the importance of features in different clusters is derived. CaFew filters out genes with small weight in all clusters or a small weight variation across all clusters.

With CaFew, the clustering performance of distance-based methods like k-means and SC3 can be considerably improved, but its effectiveness is less obvious for other types of methods such as Seurat.




□ MiMiC: a bioinformatic approach for generation of synthetic communities from metagenomes

>> https://pubmed.ncbi.nlm.nih.gov/34081399/

MiMiC is a computational approach for data-driven design of simplified communities from shotgun metagenomes.

MiMiC predicts the composition of minimal consortia using an iterative scoring system based on maximal match-to-mismatch ratios between this database and the Pfam binary vector of any input metagenome.
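
The iterative scoring idea can be illustrated with a small greedy sketch. This is not MiMiC's actual implementation: the scoring formula (matches divided by mismatches plus one), the helper names `match_mismatch_score` and `pick_consortium`, and the toy Pfam identifiers are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def match_mismatch_score(genome: Set[str], remaining: Set[str], target: Set[str]) -> float:
    """Hypothetical score: Pfams still needed and present in the genome,
    divided by Pfams in the genome that the metagenome lacks (+1 avoids /0)."""
    matches = len(genome & remaining)
    mismatches = len(genome - target)
    return matches / (mismatches + 1)


def pick_consortium(genomes: Dict[str, Set[str]], metagenome: Set[str],
                    max_members: int = 5) -> List[str]:
    """Greedily pick genomes until the metagenome's Pfams are covered."""
    remaining = set(metagenome)
    candidates = dict(genomes)
    chosen: List[str] = []
    while remaining and candidates and len(chosen) < max_members:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates),
                   key=lambda name: match_mismatch_score(candidates[name], remaining, metagenome))
        if not candidates[best] & remaining:
            break  # no candidate adds anything new
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen


genomes = {"A": {"PF1", "PF2", "PF3"}, "B": {"PF3", "PF4", "PF9"}, "C": {"PF5"}}
metagenome = {"PF1", "PF2", "PF3", "PF4", "PF5"}
consortium = pick_consortium(genomes, metagenome)
```

Representing each binary Pfam vector as a set keeps the match and mismatch counts to simple set operations.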




□ TIGA: Target illumination GWAS analytics

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab427/6292081

TIGA enables rational ranking, filtering and interpretation of inferred gene–trait associations, and aggregates data across studies by leveraging existing curation and harmonization efforts.

TIGA is a method for assessing confidence in gene–trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite Relative Citation Ratio, and meanRank scores to aggregate multivariate evidence.





□ Overcoming uncollapsed haplotypes in long-read assemblies of non-model organisms

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04118-3

The haploidy score is based on the identification of two peaks in the per-base coverage depth distribution: a high-coverage peak that corresponds to bases in collapsed haplotypes, and a peak at about half-coverage of the latter that corresponds to bases in uncollapsed haplotypes.

The haploidy score represents the fraction of collapsed bases in the assembly, and is equal to C/(C+U/2), i.e. the area of the collapsed peak (C) divided by the sum of the area of the collapsed peak (C) and half of the area of the uncollapsed peak (U/2).

This metric reaches its maximum of 1.0 when there is no uncollapsed peak, in a perfectly collapsed assembly, whereas it returns 0.0 when the assembly is not collapsed at all.
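
The formula C/(C+U/2) is simple enough to check by hand. The sketch below assumes the two peak areas have already been estimated from the coverage histogram; the function name `haploidy_score` and the error handling are illustrative choices, not part of the published tool.

```python
def haploidy_score(collapsed_area: float, uncollapsed_area: float) -> float:
    """Return C / (C + U/2) for collapsed (C) and uncollapsed (U) peak areas."""
    if collapsed_area < 0 or uncollapsed_area < 0:
        raise ValueError("peak areas must be non-negative")
    denominator = collapsed_area + uncollapsed_area / 2
    if denominator == 0:
        raise ValueError("both peaks are empty; the score is undefined")
    return collapsed_area / denominator


perfectly_collapsed = haploidy_score(100.0, 0.0)   # no uncollapsed peak
fully_uncollapsed = haploidy_score(0.0, 100.0)     # no collapsed peak
mixed = haploidy_score(50.0, 100.0)                # 50 / (50 + 50)
```

The three calls reproduce the two extremes described above, plus an intermediate case in which half of the haploid-equivalent bases are collapsed.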





□ BUTTERFLY: addressing the pooled amplification paradox with unique molecular identifiers in single-cell RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02386-z

The naïve removal of duplicates can lead to a bias due to a “pooled amplification paradox.” BUTTERFLY uses estimation of unseen species to address the bias caused by incomplete sampling of differentially amplified molecules.

BUTTERFLY uses a zero truncated negative binomial estimator implemented in the kallisto bustools workflow.

BUTTERFLY correction can be used to scale the expression of each gene to resemble the expression that more reads would yield, but this does not necessarily imply that the corrected expression values are closer to ground truth.





□ NanoSpring: reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447198v1.full.pdf

NanoSpring uses an approximate assembly approach partly inspired by existing assembly algorithms but adapted for significantly better performance, especially for the recent higher quality datasets. NanoSpring achieves close to 3x improvement in compression as compared to ENANO.

NanoSpring uses MinHash to index the reads and find overlapping reads during contig generation. NanoSpring uses the minimap2 aligner to align candidate reads to the consensus sequence and add them to the graph during contig generation.
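
As a rough illustration of how MinHash can flag overlapping reads, here is a minimal sketch. It is not NanoSpring's implementation: the k-mer size, the number of hash functions, the use of salted BLAKE2b as the hash family, and the function names are all assumptions.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq: str, k: int = 8, num_hashes: int = 32) -> List[int]:
    """For each of num_hashes salted hash functions, keep the minimum k-mer hash."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(), "big")
            for km in kmer_set))
    return signature


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates k-mer set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


read = "ACGTACGGTTCAGGCTAACGTTAGC"
same = estimate_jaccard(minhash_signature(read), minhash_signature(read))
unrelated = estimate_jaccard(minhash_signature("A" * 12), minhash_signature("C" * 12))
```

Reads whose signatures agree in many slots are likely to share many k-mers, so they become candidates for the more expensive alignment step.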





□ EPIC: Inferring relevant tissues and cell types for complex traits in genome-wide association studies

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447805v1.full.pdf

EPIC (cEll tyPe enrIChment) is a statistical framework that relates large-scale GWAS summary statistics to cell-type-specific omics measurements from single-cell sequencing.

EPIC is the first method that prioritizes tissues and/or cell types for both common and rare variants with a rigorous statistical framework to account for both within- and between-gene correlations.





□ ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2021.06.09.447731v1.full.pdf

ASURAT simultaneously performs unsupervised cell clustering and biological interpretation in a semi-automatic manner, in terms of cell type and various biological functions.

ASURAT creates a functional spectrum matrix, termed a sign-by-sample matrix (SSM). By analyzing SSMs, users can cluster samples to aid their interpretation.





□ eQTLsingle: Discovering single-cell eQTLs from scRNA-seq data only

>> https://www.biorxiv.org/content/10.1101/2021.06.10.447906v1.full.pdf

eQTLsingle discovers eQTLs only with scRNA-seq data, without genomic data. It detects mutations from scRNA-seq data and models gene expression of different genotypes with the ZINB model to find associations between genotypes and phenotypes at single-cell level.
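
For reference, a zero-inflated negative binomial (ZINB) mixes a point mass at zero with an ordinary negative binomial. The sketch below uses the common mean/dispersion parameterization; eQTLsingle's exact parameterization and fitting procedure may differ, and the function names are illustrative.

```python
import math


def nb_log_pmf(k: int, mean: float, dispersion: float) -> float:
    """Log-probability of count k under a negative binomial (mean > 0, dispersion r > 0)."""
    r = dispersion
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mean))
            + k * math.log(mean / (r + mean)))


def zinb_pmf(k: int, mean: float, dispersion: float, zero_prob: float) -> float:
    """ZINB probability: extra zeros with probability zero_prob, otherwise NB."""
    nb = math.exp(nb_log_pmf(k, mean, dispersion))
    if k == 0:
        return zero_prob + (1.0 - zero_prob) * nb
    return (1.0 - zero_prob) * nb


# Sanity check: the probabilities over a long range of counts should sum to about 1.
total = sum(zinb_pmf(k, mean=2.0, dispersion=1.5, zero_prob=0.3) for k in range(500))
```

Fitting one such model per genotype group and comparing the fitted parameters is one way to test whether expression differs between genotypes.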





□ EIR: Deep integrative models for large-scale human genomics

>> https://www.biorxiv.org/content/10.1101/2021.06.11.447883v1.full.pdf

EIR is a deep learning framework for PRS prediction that includes a model, genome-local-net (GLN), specifically designed for large-scale genomics data. The framework supports multi-task (MT) learning, automatic integration of clinical and biochemical data, and model explainability.




□ PuffAligner: A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab408/6297388

PuffAligner begins read alignment by collecting unique maximal exact matches, querying k-mers from the read in the Pufferfish index.

The aligner then chains together the collected uni-MEMs using a dynamic programming approach, choosing the chains with the highest coverage as potential alignment positions for the reads.
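
The chaining step can be sketched as a small dynamic program over seed hits. This is a generic colinear chaining sketch, not PuffAligner's code: the hit tuple layout, the `max_gap` constraint, and scoring a chain by the total length of its seeds are assumptions.

```python
from typing import List, Tuple

Hit = Tuple[int, int, int]  # (read_start, ref_start, length)


def best_chain(hits: List[Hit], max_gap: int = 100) -> Tuple[int, List[Hit]]:
    """Return (coverage, chain) for the colinear chain covering the most read bases."""
    hits = sorted(hits)  # predecessors always start earlier on the read
    if not hits:
        return 0, []
    score = [h[2] for h in hits]  # best coverage of a chain ending at each hit
    prev = [-1] * len(hits)       # back-pointers for traceback
    for j, (rj, fj, lj) in enumerate(hits):
        for i in range(j):
            ri, fi, li = hits[i]
            read_gap = rj - (ri + li)
            ref_gap = fj - (fi + li)
            # Hit j may extend hit i only if it follows it on both read and reference.
            if 0 <= read_gap <= max_gap and 0 <= ref_gap <= max_gap:
                if score[i] + lj > score[j]:
                    score[j] = score[i] + lj
                    prev[j] = i
    end = max(range(len(hits)), key=score.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(hits[k])
        k = prev[k]
    chain.reverse()
    return score[end], chain


hits = [(0, 100, 10), (12, 112, 8), (5, 500, 6), (25, 125, 10)]
coverage, chain = best_chain(hits)
```

In the example, the hit at reference position 500 is not colinear with the others, so it is left out of the best chain.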




The Year Earth Changed — Official Trailer | Apple TV+

2021-06-13 06:07:13 | Movies


□ “The Year Earth Changed” (Apple TV+)
『その年、地球が変わった』

>> https://www.imdb.com/title/tt14372240/

Directed by Tom Beard
Narrator: David Attenborough

A fresh new approach to the global lockdown and the uplifting stories that have come out of it. People all over the world have had the chance to engage with nature like never before.

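A documentary examining how the global pandemic and the worldwide lockdowns affected the natural environment and animal ecosystems. Its perspective is one-sided, but it is a real-world demonstration for gauging the environmental burden of human economic activity, and it could well serve as a valuable indicator.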








“The Man from Laramie”

2021-06-12 10:37:28 | Movies


□ The Man from Laramie『ララミーから来た男』 (1955)

>> https://www.imdb.com/title/tt0048342/

Directed by Anthony Mann

Cast: James Stewart


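A Western suspense film. Each character's motives are passive and full of bias, and the protagonist too is fixated on revenge as his motive. The protagonist grants forgiveness, while the setting collapses under its own inherent flaws and inevitably carries out the "retribution" itself. The vast landscape of the West stands cold and boundless, as if it were the embodiment of providence.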








Apple Music Hi-Res/Lossless/Dolby Atmos

2021-06-09 06:09:13 | Digital / Internet


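Apple Music's Hi-Res Lossless / Dolby Atmos streaming has launched❗️(っ’ヮ’c) Woohoooo! That said, there is currently no way to output hi-res audio wirelessly from Apple Music (UPnP does not support streaming), so enjoying Atmos on the HomePod is about the best I can do 😭✨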




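When I listen to hi-res audio, I usually send lossless files from my MacBook Air to the Mu-So QB over UPnP. What I hope for from Apple Music's Hi-Res support is that AirPlay will eventually be able to output lossless audio directly 😌🔊✨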





iPad Pro - 5th generation.

2021-06-01 06:06:06 | Digital / Internet


□ iPad Pro 2021 (12.9-inch, 256GB, Silver, Wi-Fi + Cellular)

>> https://www.apple.com/ipad-pro/

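The Liquid Retina XDR display is simply gorgeous. This is the ultimate optimum of media interface and mobility. As long as each OS is adapted to the form of its hardware and its expected usage environment, there is no need to unify the operating systems. Compatibility across the Apple product ecosystem is already nearing completion.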




□ Apple TV 4K (2021 64GB)

>> https://www.apple.com/apple-tv-4k/

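Extremely good as a content viewer. There was no major performance boost, but UI behavior and accessibility have improved dramatically, with simplified setup (iPhone integration) and the new remote (the Siri button is easy to use!), and it is super comfortable to use. Dolby Atmos playback through the HomePod is also welcome. Still waiting for hi-res support… 😌🔊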