lens, align.

Long is the time, but what is true comes to pass.

Celestial.

2022-09-17 23:13:39 | Science News




□ SpaCeNet: Spatial Cellular Networks from omics data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.506219v1

SpaCeNet analyzes patterns of correlation in spatial transcriptomics data by extending the concept of conditional independence to spatially distributed information, facilitating reconstruction of both intracellular and intercellular interaction networks.

SpaCeNet is built on Gaussian Graphical Models (GGMs). It infers a joint density function describing spatially distributed, potentially high-dimensional molecular features, and it is optimized using proximal gradient descent with Nesterov acceleration.
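As a rough illustration of the optimizer mentioned here, the following is a minimal sketch of proximal gradient descent with Nesterov acceleration (FISTA-style) on a toy L1-penalized quadratic. The objective, the names `soft_threshold` and `fista`, and all numbers are assumptions made for illustration; this is not SpaCeNet's objective or code.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fista(grad, dim, step, lam, n_iter=200):
    """Minimize f(x) + lam * ||x||_1, where grad returns the gradient of the smooth part f."""
    x = [0.0] * dim
    y = list(x)          # extrapolated point used for the gradient step
    t = 1.0
    for _ in range(n_iter):
        g = grad(y)
        # Gradient step on the smooth part, then the proximal (shrinkage) step.
        x_new = [soft_threshold(yi - step * gi, step * lam) for yi, gi in zip(y, g)]
        # Nesterov momentum update.
        t_new = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        momentum = (t - 1.0) / t_new
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t_new
    return x


# Toy smooth part: f(x) = 0.5 * sum_i a_i * (x_i - b_i)^2 (hypothetical values).
a = [1.0, 2.0, 4.0]
b = [3.0, -0.2, 1.0]


def toy_grad(x):
    return [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]


# Step size 1/L, where L = max(a) is the Lipschitz constant of the gradient.
x_hat = fista(toy_grad, dim=3, step=1.0 / max(a), lam=0.5)
```

For this separable toy problem the minimizer is known in closed form (each coordinate is `b_i` shrunk by `lam / a_i`), which makes it easy to check that the iteration converges; the second coordinate is driven exactly to zero by the L1 penalty.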





□ Ultima sequencing: Mostly natural sequencing-by-synthesis for scRNA-seq

>> https://www.nature.com/articles/s41587-022-01452-6

Mostly natural sequencing-by-synthesis (mnSBS) is a new sequencing chemistry that relies on a low fraction of labeled nucleotides, combining the efficiency of non-terminating chemistry w/ the throughput and scalability of optical endpoint scanning within an open fluidics system.

The results from mnSBS-based scRNA-seq are very similar to those using Illumina, with minor differences in results related to the position of reads relative to annotated gene boundaries, owing to single-end reads of Ultima being closer to gene ends than reads from Illumina.





□ Sequence-based Optimized Chaos Game Representation and Deep Learning for Peptide/Protein Classification

>> https://www.biorxiv.org/content/10.1101/2022.09.10.507145v1

The authors introduce a novel energy function and enhance the encoder quality by constructing a Supervised Autoencoder (SAE) neural network. They compare the numerical Chaos Game Representation (CGR) with the SAE-encoded representation and find that the two are equivalent in the latent space.

The encoder φ can be used to encode the original sequences into new sets of points in the latent space. It can be used to measure the distance b/n different sequences through calculating the Jensen-Shannon Divergence, and compute the corresponding LCGR of the whole system.
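To make the two ingredients concrete, here is a small sketch of a classic chaos game representation and a Jensen-Shannon divergence between two point distributions. It uses the textbook four-corner CGR for DNA rather than the paper's optimized peptide/protein encoding, and the quadrant histogram is a deliberately coarse summary; the function names and the corner layout are assumptions for illustration.

```python
import math

# Assumed corner layout for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Chaos game: move halfway from the current point to the corner of each base."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def quadrant_histogram(points):
    """Coarse 2x2 occupancy distribution of the CGR points."""
    counts = [0, 0, 0, 0]
    for x, y in points:
        counts[(2 if x >= 0.5 else 0) + (1 if y >= 0.5 else 0)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 for identical, 1 for disjoint distributions)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


d = jensen_shannon(quadrant_histogram(cgr_points("AAAA")),
                   quadrant_histogram(cgr_points("TTTT")))
```

A finer grid (or the latent-space points produced by an encoder) could replace the quadrant histogram without changing the divergence computation.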





□ Genome assembly with variable order de Bruijn graphs

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506758v1

The definition of voDBG resembles a generalized suffix trie. Both the nodes of the generalized suffix trie and the nodes of the voDBG correspond to all substrings occurring in the read set.

Thus the nodes of voDBG correspond one-to-one to the generalized suffix trie nodes, extension edges correspond one-to-one to the trie edges and contraction edges correspond one-to-one to the suffix links.

For the node-centric definition of a DBG, the DBG edges of voDBG correspond to transitive edges composed of a contraction edge followed by an extension edge. For the edge-centric definition, the DBG edges of voDBG correspond to transitive edges composed of an extension edge followed by a contraction edge.
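The correspondence can be played with on a toy read set. The sketch below builds the node set as all substrings of the reads, derives extension and contraction edges, and then composes them in the two orders described above. It is a naive quadratic illustration with hypothetical function names, not the data structure used in the paper.

```python
def substrings(reads):
    """All substrings occurring in the read set, plus the empty string (the trie root)."""
    nodes = {""}
    for r in reads:
        for i in range(len(r)):
            for j in range(i + 1, len(r) + 1):
                nodes.add(r[i:j])
    return nodes


def extension_edges(nodes):
    """Trie edges: from a string to the same string with one character appended."""
    return {(s[:-1], s) for s in nodes if s}


def contraction_edges(nodes):
    """Suffix links: from a string to the same string with its first character removed."""
    return {(s, s[1:]) for s in nodes if s}


def dbg_edges_node_centric(nodes, k, alphabet="ACGT"):
    """Contraction followed by extension: u -> u[1:] -> u[1:] + c, if that k-mer occurs."""
    edges = set()
    for u in nodes:
        if len(u) == k:
            for c in alphabet:
                v = u[1:] + c
                if v in nodes:
                    edges.add((u, v))
    return edges


def dbg_edges_edge_centric(nodes, k, alphabet="ACGT"):
    """Extension followed by contraction: u -> u + c -> (u + c)[1:], if the (k+1)-mer occurs."""
    edges = set()
    for u in nodes:
        if len(u) == k:
            for c in alphabet:
                if u + c in nodes:
                    edges.add((u, (u + c)[1:]))
    return edges


split_nodes = substrings(["AC", "CG"])   # ACG never occurs as a whole
joined_nodes = substrings(["ACG"])
```

With the reads `AC` and `CG`, the node-centric definition connects `AC` to `CG`, while the edge-centric one does not, because the 3-mer `ACG` is absent.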





□ Pyro-Velocity: Probabilistic RNA Velocity inference from single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.09.12.507691v1

Pyro-Velocity is a multivariate RNA Velocity model that estimates the future states of cells. Pyro-Velocity models raw sequencing counts w/ the synchronized cell time across all expressed genes to provide quantifiable and improved information on cell fate choices and trajectory dynamics.

Pyro-Velocity recasts the velocity estimation problem into a latent variable posterior probability inference. The method is generative / fully Bayesian, w/ the different parameters considered as latent random variables. Central to the Pyro-Velocity model is a shared latent time.





□ scHiMe: Predicting single-cell DNA methylation levels based on single-cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2022.09.13.507815v1

scHiMe is a computational tool for predicting the base-pair-specific methylation levels in the promoter regions genome-wide based on the single-cell Hi-C data and DNA nucleotide sequences using the graph transformer algorithm.

The true base-pair-specific DNA methylation values or target values for the 1000 base pairs in the target promoter were generated based on meta-cell. Node / Edge features were generated and input into the graph transformer network, which contained 5 blocks of graph transformer.





□ MeHi-SCC: A Meta-learning based Graph-Hierarchical Clustering Method for Single Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2022.09.06.506784v1

MeHi-SCC features a whole-graph-tuning based hierarchical clustering section. LANDER, the separator, learns only how inter-cellular relationships help the clustering move step by step toward the ground truth, ignoring specific expression values.

Unlike a GNN with a fixed adjacency matrix, LANDER updates both edge connections and the related node features. MeHi-SCC enables sub-cell-type detection.

Hierarchical LANDER divides cell graphs into sub-cell graphs and aggregates them into more detailed clusters for all cells until they cannot be divided into sub-graphs any further. The resulting cluster number is usually larger than the ground truth given by manual annotations from morphology.





□ Ingres: from single-cell RNA-seq data to single-cell probabilistic Boolean networks

>> https://www.biorxiv.org/content/10.1101/2022.09.04.506528v1

Ingres provides another solution to this problem by representing different levels of activation/expression while still working with Boolean functions. Ingres uses the VIPER algorithm to infer protein activity starting from a gene expression matrix and a list of regulons.

Ingres facilitates fitting models with cell-specific expression information without needing to infer a new network for each cell or cluster.

Ingres runs the metaVIPER algorithm. Ingres provides several wrapper functions for relevant parts of BoolNet, which can be used to perform analyses on any PBN produced by Ingres, such as computing its attractors.





□ HexSE: Simulating evolution in overlapping reading frames

>> https://www.biorxiv.org/content/10.1101/2022.09.09.453067v1

HexSE is a Python module designed to simulate sequence evolution along a phylogeny while considering the coding context of the nucleotides. The ultimate purpose of HexSE is to account for multiple selection pressures on Overlapping Reading Frames.

HexSE uses the Gillespie algorithm to simulate mutations along branches of the phylogenetic tree in order to create a nucleotide alignment. Traversing the event probability tree from the root to a tip resolves the shared characteristics for a subset of substitution events.
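For readers unfamiliar with the Gillespie algorithm, here is a minimal sketch of its core loop: draw an exponential waiting time from the total rate, then pick which event fires in proportion to its rate. The event names and rates are hypothetical and held constant, whereas a real simulator such as HexSE would update rates as the sequence changes; this is not HexSE's code.

```python
import random


def gillespie(rates, t_end, rng):
    """Simulate events with constant rates until time t_end.

    rates: dict mapping an event label to its rate (assumed fixed here).
    Returns a list of (time, event) tuples in chronological order.
    """
    names = list(rates)
    weights = [rates[n] for n in names]
    total = sum(weights)
    events = []
    t = 0.0
    while True:
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(total)
        if t > t_end:
            break
        # The event type is chosen with probability proportional to its rate.
        events.append((t, rng.choices(names, weights=weights)[0]))
    return events


# Hypothetical substitution classes along one branch of length 5.0.
events = gillespie({"A>G": 2.0, "C>T": 1.0}, t_end=5.0, rng=random.Random(0))
```

Passing the random generator in explicitly keeps the simulation reproducible, which is handy when comparing runs along different branches.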





□ DeepZ: Graph Neural Networks for Z-DNA prediction in Genomes

>> https://www.biorxiv.org/content/10.1101/2022.08.23.504929v1

There is potential for improvement of GNN architecture by incorporating long-range interactions b/n DNA nodes into the graph representation, by using different weighting schemes that capture the correlation b/n features of adjacent nodes and the use of L1 metrics.

GraphZ follows the DeepZ approach but uses a GNN deep learning model instead of an RNN. GraphZ is based on three major types of graph neural network models: two types of Graph Convolutional Networks, two types of Graph Attention Networks and the inductive representation learning network GraphSAGE.





□ Scelestial: Fast and accurate single-cell lineage tree inference based on a Steiner tree approximation algorithm

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009100

Scelestial is a method for lineage tree reconstruction from single-cell data. In its representation, the phylogeny inference problem can be considered a geometric Steiner tree problem, in which edge weights are calculated as the Euclidean distances between the points.

Scelestial’s input is a set of genome sequences given as a matrix of point mutations, which may contain missing values. Scelestial iteratively improves the inferred tree by considering all subsets of samples of a size up to a constant parameter and all the potential phylogenies.





□ Sequence to graph alignment using gap-sensitive co-linear chaining

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505691v1

The paper proposes novel co-linear chaining problem formulations for sequence-to-DAG alignment that penalize gaps. The gap cost functions are designed so that the sparse dynamic programming framework can be adapted, and the chaining problem is solved optimally in O(KN log KN) time.

The baseline algorithm for Problems 1a-1c uses a brute-force approach that evaluates all O(N^2) pairs of anchors, and uses Dijkstra’s algorithm with a Fibonacci heap for shortest-path calculations. Problems 1a, 1b and 1c can be solved optimally in O(N^2(|V| log |V| + |E|)) time.
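The brute-force flavour of chaining is easy to show on two linear sequences. The sketch below evaluates all anchor pairs in O(N^2) and penalizes the gaps between consecutive anchors; it deliberately ignores the DAG (so no shortest-path queries are needed), and the anchor format, the gap cost and the function name are assumptions rather than the paper's formulations.

```python
def chain_anchors(anchors, gap_weight=1.0):
    """Quadratic-time co-linear chaining with a simple gap penalty.

    anchors: list of (query_start, ref_start, length) exact matches.
    Returns (best_score, chain), where chain is the list of chosen anchors.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return 0.0, []
    best = [0.0] * n      # best chain score ending at anchor i
    prev = [-1] * n       # predecessor of anchor i in that chain
    for i, (qi, ri, li) in enumerate(anchors):
        best[i] = float(li)
        for j in range(i):
            qj, rj, lj = anchors[j]
            gap_q = qi - (qj + lj)
            gap_r = ri - (rj + lj)
            if gap_q < 0 or gap_r < 0:
                continue  # anchor j does not precede anchor i on both sequences
            cand = best[j] + li - gap_weight * (gap_q + gap_r)
            if cand > best[i]:
                best[i], prev[i] = cand, j
    end = max(range(n), key=lambda i: best[i])
    score = best[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    chain.reverse()
    return score, chain


score, chain = chain_anchors([(0, 0, 5), (6, 6, 4), (3, 20, 3), (12, 11, 5)])
```

In this toy input the anchor at reference position 20 is left out of the chain because reaching it would require a large gap on the reference side.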





□ CANTATA - prediction of missing links in Boolean networks using genetic programming

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac623/6696209

The CANTATA algorithm optimizes network models towards a certain behaviour based on a multi-objective genetic programming approach. CANTATA allows for perturbed network conditions with knocked-out or overexpressed compounds.

CANTATA is designed to guide an evolutionary transformation process, yielding network models that closely resemble the initial model drafts while matching the observed dynamic behaviour. The algorithm ensures minimal interventions by relying on symbolic representation.





□ SCING: Single Cell INtegrative Gene regulatory network inference elucidates robust, interpretable gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.09.07.506959v1

SCING is a gradient boosting and mutual information based approach for identifying robust GRNs from scRNAseq, snRNAseq, and spatial transcriptomics data.

SCING GRNs reveal unique disease subnetwork modeling capabilities, have an intrinsic capacity to correct for batch effects, and retrieve disease-relevant genes and pathways.

SCING uses a random walk framework to determine the increase in performance of a GRN to model disease subnetworks versus a random GRN with similar node attributes. It also uses the Leiden graph partitioning algorithm to identify GRN subnetworks.





□ nasw: Dynamic programming for aa-to-nt alignment with affine gap, splicing and frameshift

>> https://github.com/lh3/nasw

The DP involves 6 states and 20 transitions, similar to the GeneWise model. Different from GeneWise, nasw explicitly implements the DP recursion with SSE2 or NEON intrinsics and is tens of times faster.

nasw supports global alignment and left or right extension. In the extension mode, only extension ends and alignment score are computed. Users need to call the function again to get CIGAR.
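nasw's actual recursion has 6 states and deals with codons, splicing and frameshifts. As a much smaller illustration of the affine-gap idea it builds on, the sketch below computes a global alignment score between two plain strings with a three-state (Gotoh-style) DP. The scoring values and the function name are assumptions, and nothing here reflects nasw's API or its SIMD implementation.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=2, mismatch=-3, gap_open=4, gap_ext=1):
    """Global alignment score where a gap of length L costs gap_open + L * gap_ext."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + gap_ext * i)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + gap_ext * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Either open a new gap (from M or the other gap state) or extend the current one.
            X[i][j] = max(M[i - 1][j] - gap_open - gap_ext,
                          Y[i - 1][j] - gap_open - gap_ext,
                          X[i - 1][j] - gap_ext)
            Y[i][j] = max(M[i][j - 1] - gap_open - gap_ext,
                          X[i][j - 1] - gap_open - gap_ext,
                          Y[i][j - 1] - gap_ext)
    return max(M[n][m], X[n][m], Y[n][m])
```

For example, aligning `ACGT` to `AGT` yields three matches and one gap of length 1, which under these assumed scores gives 3 * 2 - (4 + 1) = 1.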





□ miniprot: a new mapper for aligning proteins to genomes with splicing and frameshift.

>> https://github.com/lh3/miniprot

Miniprot aligns a protein sequence against a genome with affine gap penalty, splicing and frameshift. It is primarily intended for annotating protein-coding genes in a new species using known genes from other species.

Miniprot is not optimized for mapping distant homologs because distant homologs are less informative for gene annotation. Miniprot outputs alignments in the protein PAF format, using additional CIGAR operators to encode introns and frameshifts.





□ Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0274338

Deep-MOCCA has a layer of longer convolutions, and in order to model dinucleotides, a layer of 2bp convolutions. These two convolutional layers are concatenated. The 5-spectrum SVM achieves the highest sensitivity to independent PREs, but also the lowest precision.

Gnocis is a system for the interactive and reproducible analysis and modelling of CRE DNA sequences. Gnocis employs Cython and a variety of other techniques to efficiently implement the glue needed to apply machine learning to CRE analysis and prediction.





□ AEON.py: Python Library for Attractor Analysis in Asynchronous Boolean Networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac624/6697883

AEON.py combines a known symbolic detection algorithm (adapted to better handle partially specified BNs) with a more advanced reduction method guided by the fire-ability of transitions in the Boolean network.

AEON.py allows solving attractor detection and source-target control problems on large, non-trivial networks. Furthermore, these problems can be addressed even in networks with logical parameters or partially unknown dynamics.





□ GPN: DNA language models are powerful zero-shot predictors of non-coding variant effects

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1

GPN (Genomic Pre-trained Network) learns variant effects in non-coding DNA using unsupervised pre-training on genomic DNA sequence alone. GPN is also able to learn gene structure and DNA motifs without any supervision.

GPN outperforms the DeepSEA model trained on functional genomics data. GPN’s internal representation of DNA sequences is able to accurately distinguish genomic regions such as introns, untranslated regions and coding sequences.





□ SCsnvcna: Integrating SNVs and CNAs on a phylogenetic tree from single-cell DNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505465v1

SCARLET requires that the SNVs and CNAs are detected from the same sets of cells, which is technically challenging due to the sequencing errors or the low sequencing coverage associated with a particular WGA procedure.

SCsnvcna is a Bayesian probabilistic model that utilizes both the genotype constraints on the tree and the cellular prevalence to search for the solution that has the highest joint probability. SCsnvcna aims at placing SNVs on a CNA tree even though the sets of cells yielding the SNVs and the CNAs are independent.





□ IndepthPathway: an integrated tool for in-depth pathway enrichment analysis based on bulk and single cell sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505179v1

The WCSEA algorithm takes a broader approach to assessing the functional relations of pathway gene sets to differentially expressed genes, and leverages the cumulative signature of molecular concepts characteristic of the highly differentially expressed genes.

IndepthPathway performs deep pathway enrichment analysis from bulk and single cell sequencing data. It takes a broader approach to assessing gene set relations and leverages the universal concept signature of the target gene list to tolerate high noise and low gene coverage.





□ LncDLSM: Identification of Long Non-coding RNAs with Deep Learning-based Sequence Model

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506180v1

lncDLSM consists of two parts. The first is based on hierarchical input neural networks, called the HINN-based analyzer, and is designed to extract advanced features from the k-mer frequency features.

The second is a CNN-based detector, designed to extract advanced features from the spectrum features. lncDLSM then merges these high-level features using another neural network-based prediction module to identify lncRNAs.





□ SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504505v1.full.pdf

SCENIC+ predicts genomic enhancers along w/ candidate upstream TF and links these enhancers to candidate target genes. Specific TFs for each cell type or cell state are predicted based on the concordance of TF binding site accessibility, TF expression, and target gene expression.

SCENIC+ combines the gene expression values, the denoised region accessibility, and the cistromes to predict TF-region-gene triplets. Region-to-gene and TF-to-gene relationships are inferred using Pearson correlation and Gradient Boosting Machines.
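As a reminder of what the correlation half of that step measures, here is a tiny sketch that scores hypothetical region-to-gene pairs with Pearson correlation across cells. The data, the region names and the helper `pearson` are made up for illustration; SCENIC+ itself also uses Gradient Boosting Machines and its own pipeline, which are not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical accessibility of two regions and expression of one gene in four cells.
accessibility = {
    "region_1": [0.1, 0.4, 0.8, 0.9],
    "region_2": [0.9, 0.7, 0.2, 0.1],
}
expression = [1.0, 2.0, 4.0, 5.0]

links = {region: pearson(acc, expression) for region, acc in accessibility.items()}
```

A positive score suggests a candidate activating link and a negative score a candidate repressive one, though correlation alone cannot establish regulation.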





□ Differential kinetic analysis using nucleotide recoding RNA-seq and bakR

>> https://www.biorxiv.org/content/10.1101/2022.09.02.505697v1

bakR (Bayesian analysis of the kinetics of RNA) relies on Bayesian hierarchical modeling of nucleotide recoding RNA-seq (NR-seq) data to increase statistical power by sharing information across transcripts.

bakR includes three distinct computational implementations of the Bayesian hierarchical mixture model (MLE / Hybrid / MCMC). Partial pooling across fraction new and variance estimates in a given replicate is performed to make use of the high-throughput nature of NR-seq datasets.





□ SiGra: Single-cell spatial elucidation through image-augmented graph transformer

>> https://www.biorxiv.org/content/10.1101/2022.08.18.504464v1.full.pdf

SiGra deciphers spatial domains and enhances spatial signals simultaneously. SiGra is one of the first methods to utilize multi-modalities including multi-channel images of cell morphology and functions to address technology limitations and achieve augmented spatial profiles.

In SiGra, the multi-modal information from images and original transcriptomics are summarized at single-cell level, with the information from neighboring cells selectively captured by the attention mechanism.





□ BWA-MEM2-LISA

>> https://github.com/bwa-mem2/bwa-mem2/tree/bwa-mem2-lisa

bwa-mem2-lisa is an accelerated version of bwa-mem2. It accelerates the seeding phase of bwa-mem2 using i) LISA (Learned-Indexes for Sequence Analysis) and ii) a binary interval tree.

The BWA-MEM2-LISA accelerated seeding kernels achieve up to 4.5x speedup compared to the seeding phase of bwa-mem2. The ert branch of the bwa-mem2 repository contains the codebase of the Enumerated Radix Tree based acceleration.





□ ntHash2: recursive spaced seed hashing for nucleotide sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac564/6674501

ntHash2 is up to 2.1x faster at hashing various spaced seeds than the previous version and 3.8x faster than conventional hashing algorithms with naïve adaptation.

ntHash2 performs reverse-complement hashing w/o requiring extra iterations by swapping the corresponding indices in the blocks. It also reduces the collision rate for longer k-mer lengths and improves the uniformity of the hash distribution by modifying the canonical hashing mechanism.
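The recursive idea underlying ntHash-style hashing can be sketched for contiguous k-mers: each new hash is derived from the previous one with a rotation and two XORs instead of rehashing all k bases. The sketch below covers only this basic forward-strand recurrence; the spaced-seed blocks, reverse-complement handling and canonical hashing of ntHash2 are not reproduced, and the seed constants should be read as arbitrary 64-bit values chosen for illustration.

```python
MASK = (1 << 64) - 1

# Arbitrary per-base 64-bit seeds (illustrative values, an assumption of this sketch).
SEED = {
    "A": 0x3C8BFBB395C60474,
    "C": 0x3193C18562A02B4C,
    "G": 0x20323ED082572324,
    "T": 0x295549F54BE24456,
}


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def hash_kmer(kmer):
    """Direct hash: XOR of per-base seeds, each rotated by its distance from the k-mer end."""
    k = len(kmer)
    h = 0
    for j, base in enumerate(kmer):
        h ^= rol(SEED[base], k - 1 - j)
    return h


def rolling_hashes(seq, k):
    """Hashes of all k-mers of seq, each computed in O(1) from the previous one."""
    if len(seq) < k:
        return []
    h = hash_kmer(seq[:k])
    out = [h]
    for i in range(1, len(seq) - k + 1):
        # Rotate the old hash, remove the base that left, add the base that entered.
        h = rol(h, 1) ^ rol(SEED[seq[i - 1]], k) ^ SEED[seq[i + k - 1]]
        out.append(h)
    return out
```

The rolling values agree with hashing every k-mer from scratch, which is the property that makes the recurrence safe to use.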





□ Paella: Decomposing spatial heterogeneity of cell trajectories

>> https://www.biorxiv.org/content/10.1101/2022.09.05.506682v1

Paella requires as input the spatial locations of cells or spatial spots and the cell trajectory information. Paella then identifies a parsimonious set of spatially continuous sub-trajectories where each sub-trajectory represents a unidirectional process of cell progression.

Paella constructs an undirected Delaunay network. Paella converts the undirected network into two directed networks by comparing the pseudotime values of the two nodes connected by an edge, and identifies with three modes all node sets where nodes in each set are reachable.
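The orientation step can be illustrated without any geometry. The sketch below takes an undirected edge list (which in Paella would come from the Delaunay network; here it is simply given), directs every edge by comparing pseudotime, and computes the nodes reachable from a start node. The function names, the treatment of ties and the toy data are assumptions, and Paella's three modes are not modelled.

```python
from collections import deque


def orient_by_pseudotime(edges, pseudotime):
    """Return (forward, backward) adjacency dicts from undirected edges.

    forward points from the earlier node to the later one; backward is its reverse.
    Edges whose endpoints have equal pseudotime are skipped (an assumption of this sketch).
    """
    forward, backward = {}, {}
    for u, v in edges:
        if pseudotime[u] == pseudotime[v]:
            continue
        early, late = (u, v) if pseudotime[u] < pseudotime[v] else (v, u)
        forward.setdefault(early, set()).add(late)
        backward.setdefault(late, set()).add(early)
    return forward, backward


def reachable(graph, start):
    """All nodes reachable from start by breadth-first search (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical neighbouring cells and their pseudotime values.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
pseudotime = {"a": 0.0, "b": 0.4, "c": 0.9, "d": 0.6}
forward, backward = orient_by_pseudotime(edges, pseudotime)
```

Because every directed edge strictly increases pseudotime, both directed networks are acyclic, so reachable sets describe one-way progressions.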





□ SEMgsa: topology-based pathway enrichment analysis with structural equation models

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04884-8

SEMgsa() is a topology-based, self-contained hypothesis method, in line with NetGSA, DEGraph and topologyGSA. SEMgsa() accepts as input directed and/or undirected networks that define pathway interconnectedness.





□ SCIΦN: Single-cell mutation calling and phylogenetic tree reconstruction with loss and recurrence

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac577/6674502

SCIPhIN considers the full read and variant counts for each cell at each genomic position to better distinguish mutations from sequencing and amplification noise. SCIPhIN allows for mutation loss and parallel mutations, relaxing the infinite sites assumption.





□ New algorithms for accurate and efficient de-novo genome assembly from long DNA sequencing reads

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505891v1

The authors present a new hashing scheme for minimizers to efficiently identify overlaps and build OLC graphs, and implement algorithms to build an overlap graph and a layout.

The graph construction is similar to that of the Best Overlap Graph, having two vertices for each read representing the start (5’-end) and the end (3’-end) of the read.

Edge features are combined based on their likelihood, replacing edge filtering by edge prioritization. This approach eliminates the need for hard filtering decisions and makes the algorithm adaptable to genomic regions with different repeat structures.





□ KMer-Node2Vec: Learning Vector Representations of K-mers from the K-mer Graph

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505832v1

KMer-Node2Vec is a graph-based DNA embedding algorithm that converts a large DNA corpus into a k-mer co-occurrence graph, samples k-mer sequences from this graph by random walks, and finally trains the k-mer embedding on the sampled corpus.

KMer-Node2Vec uses an effective sampling strategy to generate the k-mer sequences, and the Skip-Gram algorithm is used to calculate the k-mer embedding on these k-mer sequences. KMer-Node2Vec’s time complexity is O(|N| + nl + nl log(|V|)) and its space complexity is O(m|V| + nl + d|V|).
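The first two stages (graph construction and walk sampling) fit in a few lines. The sketch below builds a weighted k-mer co-occurrence graph from adjacent k-mers and samples a plain weighted random walk; it does not implement node2vec's biased second-order walk or the Skip-Gram training, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def kmer_graph(seqs, k):
    """Directed graph: edge a -> b weighted by how often k-mer b directly follows k-mer a."""
    graph = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            graph[a][b] += 1
    return graph


def random_walk(graph, start, length, rng):
    """Weighted first-order random walk of at most `length` k-mers."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1])
        if not neighbours:
            break  # dead end: no outgoing edges
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights)[0])
    return walk


graph = kmer_graph(["ACGACG"], k=2)
walk = random_walk(graph, "AC", length=5, rng=random.Random(0))
```

The sampled walks play the role of "sentences" whose "words" are k-mers, which is what a Skip-Gram model would then be trained on.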





□ bootRanges: Flexible generation of null sets of genomic ranges for hypothesis testing

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506382v1

bootRanges provides efficient vectorized code for performing block bootstrap sampling of genomic ranges. bootRanges is part of a modular analysis workflow, where bootstrapped ranges can be analyzed at block or genome scale using tidy analysis with plyranges.

bootRanges offers a simple “unsegmented” block bootstrap as well as a “segmented” block bootstrap: since the distribution of ranges in the genome exhibits multi-scale structure, it follows the logic of Bickel et al. and performs block bootstrapping within segments of the genome.
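To show what an "unsegmented" block bootstrap does to a set of ranges, here is a language-agnostic sketch in Python on a single toy chromosome. It tiles the genome with blocks, fills each tile with the ranges from a randomly chosen source block, and shifts them into place. It does not use the bootRanges (R) API, and the handling of ranges that straddle block boundaries (they are simply dropped) is a simplifying assumption.

```python
import random


def block_bootstrap(ranges, genome_len, block_len, rng):
    """One block-bootstrap sample of half-open (start, end) ranges.

    Requires block_len <= genome_len.
    """
    sample = []
    for dest in range(0, genome_len, block_len):
        # Pick a random source block and move its contents to the destination tile.
        src = rng.randrange(0, genome_len - block_len + 1)
        shift = dest - src
        for start, end in ranges:
            inside_block = src <= start and end <= src + block_len
            if inside_block and end + shift <= genome_len:
                sample.append((start + shift, end + shift))
    return sorted(sample)


# Hypothetical features on a 100 bp toy genome.
ranges = [(5, 10), (22, 30), (41, 44), (70, 90)]
boot = block_bootstrap(ranges, genome_len=100, block_len=25, rng=random.Random(1))
```

Because whole blocks are moved together, ranges that were close in the original data stay close in the sample, which is the point of bootstrapping blocks rather than individual ranges.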





□ A fast and efficient path elimination algorithm for large-scale multiple common longest sequence problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04906-5

A mini Directed Acyclic Graph (mini-DAG) model and a novel Path Elimination Algorithm are proposed to address large-scale MLCS issues efficiently. mini-DAG employs the branch and bound approach to reduce paths during DAG construction, resulting in a very small DAG.

Before obtaining the final MLCS, if we can judge that the currently calculated match point is not the point that constitutes the MLCS, then the path through this point will not be the longest; these are called the non-point and non-optimal paths.





□ Cuttlefish 2: Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02743-6

CUTTLEFISH 2 can seamlessly extract such maximal path covers by simply constraining the algorithm to operate on some specific subgraph(s) of the original graph. The edges ((k+1)-mers) are enumerated from the input, and optionally filtered based on the user-defined threshold.
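The enumerate-then-filter step for edges can be sketched in a few lines: count every (k+1)-mer in the input and keep those that reach an abundance threshold. This toy version ignores reverse complements and canonical forms and holds everything in memory, so it only illustrates the idea and not Cuttlefish 2's implementation; the function name is hypothetical.

```python
from collections import Counter


def filtered_edges(reads, k, min_count):
    """Return the (k+1)-mers (de Bruijn graph edges) seen at least min_count times."""
    counts = Counter(
        read[i:i + k + 1]
        for read in reads
        for i in range(len(read) - k)   # a read of length n has n - k (k+1)-mers
    )
    return {edge for edge, c in counts.items() if c >= min_count}


edges = filtered_edges(["ACGT", "ACGA"], k=2, min_count=2)
```

Each surviving edge connects its k-length prefix to its k-length suffix, so here the only remaining edge `ACG` links the vertices `AC` and `CG`.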








Spherical.

2022-09-17 23:13:37 | Science News


We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time
– T.S. Eliot



□ What puzzle are you in?

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02748-1

What you mistake for a complex jigsaw puzzle, where all you need to do is put the pieces in front of you into the right arrangement, may in fact be a puzzle you can only solve by identifying a connection to a different field.

We subsequently discover obstacles that force us to follow unforeseen connections to other phenomena (Class III), to dive into deeper logical or mathematical problems (Class II), or to identify wrong assumptions that we had initially not questioned (Class IV).

We needed to reformulate the puzzle from a Class III to a Class IV puzzle to gain a deeper insight into the nature of the relationship b/n gene duplication and alternative splicing. The second example is a project that uses deep learning to predict the substrate scope of enzymes.






□ scWMC: Weighted Matrix Completion-based Imputation of scRNA-seq Data via Prior Subspace Information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac570/6671838

scWMC introduces a regularization that leverages imperfect prior information to estimate the true underlying prior subspace and then embeds it in a typical low-rank matrix completion-based framework.

scWMC adopts the Frobenius norm of the difference between the true gene expression matrix and the imputed gene expression matrix only to the zero-values yielded by the different computational models as the imputation error.
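The error measure described here amounts to a Frobenius norm restricted to the entries that were zero in the observed matrix. A minimal sketch, with hypothetical matrices and a hypothetical function name:

```python
import math


def masked_frobenius_error(true, imputed, observed):
    """Frobenius norm of (true - imputed), taken only where the observed matrix is zero."""
    total = 0.0
    for true_row, imputed_row, observed_row in zip(true, imputed, observed):
        for t, i, o in zip(true_row, imputed_row, observed_row):
            if o == 0:
                total += (t - i) ** 2
    return math.sqrt(total)


# Toy 2 x 2 example: two entries were dropped to zero in the observed data.
true = [[1.0, 2.0], [3.0, 4.0]]
observed = [[1.0, 0.0], [0.0, 4.0]]
imputed = [[1.0, 2.0], [0.0, 4.0]]   # one zero recovered, one left untouched
error = masked_frobenius_error(true, imputed, observed)
```

Restricting the sum to the zero entries keeps the score from being dominated by values that were observed correctly and never needed imputation.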





□ LatentVelo: Inferring single-cell dynamics with structured dynamical representations of RNA velocity

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504858v1

LatentVelo embeds cells into a latent space with a variational auto-encoder, and describes differentiation dynamics on this latent space with neural ordinary differential equations.

LatentVelo’s main application is describing complex developmental dynamics in a low-dimensional latent space. Lineage-dependent dynamics are enabled by modelling state-dependent regulation of transcription. LatentVelo also enables constructing general dynamical models.





□ Re-genotyping structural variants through an accurate force-calling method

>> https://www.biorxiv.org/content/10.1101/2022.08.29.505534v1

cuteSV2 is a long-read-based re-genotyping approach that can force-call genotypes. cuteSV2 is an upgraded version of cuteSV and applies a strategy of refining and purifying the heuristically extracted signatures through spatial and allele similarity estimation.

cuteSV2 applies a strategy for fragile signatures affected by the erroneous read-alignment and generates agglomerated signatures. It computes the distribution of reads around each re-genotyped SV breakpoint. cuteSV2 records all alignment reads that cover the SV on the chromosome.





□ Multiple genome alignment in the telomere-to-telomere assembly era

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02735-6

Given a set of anchors represented as a graph, the next step is to identify locally colinear blocks (LCBs), i.e. regions which share a common ordering of anchors. While the initial set of anchors is sufficient to construct LCBs, it may contain artifacts of micro-rearrangements.

SibeliaZ constructs LCBs by iteratively extracting “carrier paths”. These carrier paths are constructed by starting from a random edge in the graph and iteratively following the heaviest unvisited edge, where the weight of an edge is the number of genomes that it represents.
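The greedy walk in that description can be sketched as follows. The graph format (a dict of weighted out-edges), the tie-breaking rule and the function name are assumptions, and the start edge is passed in rather than drawn at random so the example is reproducible; this is a toy, not SibeliaZ's implementation.

```python
def carrier_path(graph, start_edge):
    """Follow the heaviest unvisited out-edge until none is left.

    graph: dict mapping a node to a list of (neighbour, weight) pairs,
           where weight is the number of genomes the edge represents.
    start_edge: (u, v) pair to start from.
    """
    u, v = start_edge
    visited = {(u, v)}
    path = [u, v]
    while True:
        current = path[-1]
        candidates = [
            (weight, nxt)
            for nxt, weight in graph.get(current, [])
            if (current, nxt) not in visited
        ]
        if not candidates:
            break
        # Heaviest edge wins; ties are broken by neighbour name (arbitrary choice).
        _, nxt = max(candidates)
        visited.add((current, nxt))
        path.append(nxt)
    return path


toy_graph = {
    "a": [("b", 3)],
    "b": [("c", 2), ("d", 5)],
    "d": [("e", 1)],
}
path = carrier_path(toy_graph, ("a", "b"))
```

Since each edge can be visited only once, the walk always terminates, even on graphs with cycles.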

The Cactus aligner seeks to construct another cactus graph from the set of adjacencies within a net. Cactus uses the Base-level Alignment Refinement algorithm (BAR). BAR uses a modification of the Pecan aligner to align adjacencies within a net that share an endpoint.





□ TBLDA: Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues

>> https://www.life-science-alliance.org/content/5/12/e202101297

A natural question that arises for all parametric latent factor models is how to determine the number of topics. There is no “correct” topic number and the user will want to make a reasonable trade-off b/n computational speed for inference and the granularity of signal captured.

A telescoping bimodal latent Dirichlet allocation (TBLDA) framework learns shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype.





□ Clover: tree structure-based efficient DNA clustering for DNA-based data storage

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac336/6668252

Clover is an efficient DNA sequence clustering algorithm, which applies to a large number of disordered DNA sequences generated after DNA sequencing in the DNA storage field.

Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. Clover can cluster 10 million DNA sequences into 50 000 classes in 10 seconds.





□ Statistical evidence for the presence of trajectory in single-cell data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04875-9

They employ clustering to partition the data into homogeneous partitions, which are ideal for capturing trajectory-like structures. The statistics promote trajectory patterns, and non-randomness is between linear pattern and star trees, when there is maximum branching.

Intuitively, different numbers of partitions on the same data may capture distinct types of structures. However, when the trajectory is perfectly linear, different numbers of partitions capture the same underlying trajectory structure.





□ mOTUpan: a robust Bayesian approach to leverage metagenome-assembled genomes for core-genome estimation

>> https://academic.oup.com/nargab/article/4/3/lqac060/6667502

As it is looking for patterns of synteny to determine the persistent fraction of the genomes, too much fragmentation could cause problems in calculations of the persistent fraction.

The core-genome prediction is computationally efficient and can be scaled up to thousands of genomes.

mOTUpan is a novel iterative Bayesian estimator of the observed presence/absence patterns of discrete genome-encoded traits (any trait that can be encoded in a genome, e.g. gene cluster, COG, functional annotations, etc.) in sets of incomplete MAGs/SAGs and complete genomes.





□ Fec: a fast error correction method based on two-rounds overlapping and caching

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac565/6670778

Fec is an error correction tool based on two-rounds overlapping and caching. The first round of overlapping uses a large window size to quickly find enough overlaps to correct most of the reads.

Based on these overlaps, some reads can be corrected immediately; the remaining reads go through a second round of overlapping with finely tuned parameters to find as many overlaps as possible.

Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache.





□ FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac554/6670620

FastRemap provides up to a 7.19× speedup (5.97×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap.

To remap reads from one (source) reference to another (target) reference, FastRemap relies on a chain file (specific to the pair of references), which indicates regions that are shared between the two references.
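A stripped-down view of chain-based remapping: each aligned block says that a stretch of the source reference corresponds to a stretch of the target reference, so a coordinate inside a block is shifted by that block's offset and a coordinate outside any block cannot be remapped. The block tuple format and function name below are assumptions; real chain files also carry chromosome names, strands and scores, which this sketch leaves out.

```python
import bisect


def remap_position(pos, blocks):
    """Map a source coordinate to the target reference, or return None if unmapped.

    blocks: list of (src_start, src_end, tgt_start) tuples with half-open source
            intervals, sorted by src_start and non-overlapping.
    """
    starts = [block[0] for block in blocks]
    i = bisect.bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    src_start, src_end, tgt_start = blocks[i]
    if pos >= src_end:
        return None                            # pos falls in a gap between blocks
    return tgt_start + (pos - src_start)


# Hypothetical shared regions between two assemblies.
blocks = [(0, 100, 10), (150, 200, 120)]
```

Binary search over the block starts keeps each lookup logarithmic in the number of blocks, which matters when millions of read positions are remapped.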





□ InteRD: Omnibus and Robust Deconvolution Scheme for Bulk RNA Sequencing Data Integrating Multiple Single-Cell Reference Sets and Prior Biological Knowledge

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac563/6671214

Integrated and Robust Deconvolution (InteRD) infers cell-type proportions from bulk RNA-seq data. InteRD integrates deconvolution results from multiple scRNA-seq datasets without assuming that GEPs in different reference sets are similar to those in the underlying bulk tissue.

InteRD calibrates the RB estimates by incorporating a reference-free approach and taking into account prior biological knowledge. This boosts the deconvolution performance by incorporating more information into the deconvolution system.





□ Beacon V2 Reference Implementation: a Toolkit to enable federated sharing of genomic and phenotypic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac568/6671215

Overall, two basic elements are needed to implement a local instance of Beacon v2: i) an internal database (where the biological data are stored), and ii) a REST API that provides a standardized way to receive requests and send responses.

The B2RI consists of: a set of tools for extraction, transformation and loading of metadata, phenotypic data and genomic variants into a database; the database itself; the Beacon v2 query engine; and an example dataset consisting of synthetic data (CINECA synthetic cohort EUROPE UK1).





□ CausalCell: applying causal discovery to single-cell analyses

>> https://www.biorxiv.org/content/10.1101/2022.08.19.504494v1.full.pdf

CausalCell performs causal discovery. Several measures are developed and embedded into the pipeline to ensure the reliability of causal discovery. The results indicate that complicated CI tests are crucial for generating reliable results.

The CausalCell pipeline consists mainly of feature selection and causal discovery. A parallel version of the PC algorithm is used to realize the parallel multi-task causal discovery, which is supported by a cluster of computers.





□ NSB: Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

>> https://academic.oup.com/bioinformaticsadvances/article/doi/10.1093/bioadv/vbac055/6663762

Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes-Cantor.

NSB uses a base-substitution technique on k-mers to identify the frequencies of transitions and transversions, and allows the use of more complex sequence evolution models. This enables NSB to estimate more accurate phylogenetic distances, even when the true distances are high.





□ Analysis of the Hamiltonian Monte Carlo genotyping algorithm on PROVEDIt mixtures including a novel precision benchmark

>> https://www.biorxiv.org/content/10.1101/2022.08.28.505600v1

An internal validation study of a DNA mixture algorithm based on Hamiltonian Monte Carlo sampling. HMC exhibited a lower misclassification rate, a significantly better ability to provide negative evidence, and a slightly higher area under the ROC curve for 3-contributor mixtures.

A novel large-scale precision benchmark of the Hamiltonian Monte Carlo method, indicating its improvements over existing solutions. This provided additional arguments that the strength of the evidence decreases with decreasing total amount of DNA material in the mixture.





□ Evaluation of vicinity-based hidden Markov models for genotype imputation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04896-4

This study focuses on Li–Stephens HMM-based imputation models and assesses the performance of “vicinity-based HMMs”, i.e., HMMs that evaluate the paths over only a short stretch of variants around the untyped variants.

This model describes a probability distribution on possible “paths” that pass over the reference haplotypes. The transitions between the haplotypes and errors on the haplotypes are probabilistic.

In the simplest sense, the minimal number of haplotype transitions and allelic errors can be thought of as the most likely path that describes the query haplotype.
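The "minimal number of transitions and errors" view can be sketched as a small dynamic program. This is a simplification under a stated assumption: every haplotype switch and every allelic mismatch costs exactly 1, whereas the actual HMM weights them by probabilities. The function name `min_switches_and_errors` is made up for the example.

```python
from typing import List


def min_switches_and_errors(reference: List[str], query: str) -> int:
    """Minimum (haplotype transitions + allelic errors) needed to copy `query`.

    `reference` holds equal-length haplotype strings; `query` has the same length.
    """
    # cost[h]: cheapest path that ends on reference haplotype h at the current site
    cost = [0 if hap[0] == query[0] else 1 for hap in reference]
    for i in range(1, len(query)):
        best_prev = min(cost)
        cost = [
            # either stay on haplotype h, or switch from the best one (cost 1),
            # then pay 1 more if the allele disagrees with the query
            min(cost[h], best_prev + 1) + (0 if reference[h][i] == query[i] else 1)
            for h in range(len(reference))
        ]
    return min(cost)
```

For a reference panel of `"00000"` and `"11111"`, the query `"00111"` is explained by a single switch, while `"00100"` is cheaper to explain as one allelic error than as two switches.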





□ SEMgraph: an R Package for Causal Network Inference of High-Throughput Data with Structural Equation Models

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac567/6678980

Within SEMgraph, this is practically achieved through algorithm-assisted search for the optimal trade-off b/n best model fitting (i.e., the optimal context) and perturbation (exogenous influence) given data, in which knowledge is used as supplementary confirmatory information.

Interchangeable model representation as either an igraph object or the corresponding SEM in lavaan syntax. Model management functions incl. graph-to-SEM conversion, automated covariance matrix regularization, graph conversion to DAG, and graph creation from correlation matrices.





□ A Genealogical Interpretation of Principal Components Analysis

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000686

The underlying genealogical history of the samples can be related directly to the PC projection. The expected location of samples on the principal components can, for single nucleotide polymorphism (SNP) data, be predicted directly from the pairwise coalescence times between samples.

It is worth pointing out that because PCA effectively summarizes structure in the matrix of average pairwise coalescent times, but in a manner that is influenced by sample composition, more direct inferences can potentially be made from the matrix of pairwise differences.





□ pcnaDeep: A Fast and Robust Single-Cell Tracking Method Using Deep-Learning Mediated Cell Cycle Profiling

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac602/6680181

pcnaDeep integrates cutting-edge detection techniques with tracking and cell cycle resolving models. Using the Mask R-CNN model under FAIR's Detectron2 framework, pcnaDeep is able to detect and resolve very dense cell tracks with PCNA fluorescence.

pcnaDeep uses a Greedy Phase Searching (GPS) algorithm to detect targeted phases in a noisy background. Tracks with detected mitosis phase are broken into mother and daughter tracks at the frame of maximum velocity, as an approximation of cytokinesis.





□ Archetypal Analysis for population genetics

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010301

Archetypal Analysis yields cluster structure similar to that of existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. Since Archetypal Analysis can be used with lower-dimensional representations, it also offers significant reductions in computational time.

A method that combines the singular value decomposition (SVD) with Archetypal Analysis to perform fast and accurate genetic clustering by first reducing the dimensionality of the space of genomic sequences.





□ RedRibbon: A new rank-rank hypergeometric overlap pipeline to compare gene and transcript expression signatures

>> https://www.biorxiv.org/content/10.1101/2022.08.31.505818v1

RedRibbon is a complete rewrite of the original RRHO package that substantially increases performance and accuracy and introduces novel data structures and algorithms. It features the capability to analyse lists one or two orders of magnitude longer without any loss of accuracy.

Locating the minimal P-value coordinates is independent of the visualization map resolution. For the grid algorithm, this minimal P-value search keeps only the best coordinate in memory.
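A rank-rank hypergeometric overlap search that keeps only the best coordinate can be sketched as below. This is an illustrative, brute-force version under assumptions: both lists rank the same genes, only enrichment (one-sided overlap) is tested, and the names `hypergeom_sf` and `min_pvalue_coordinate` are invented for the example. It is not RedRibbon's algorithm, which uses its own data structures.

```python
from math import comb
from typing import List, Tuple


def hypergeom_sf(k: int, n_total: int, n_a: int, n_b: int) -> float:
    """P(overlap >= k) when drawing n_b genes from n_total, n_a of them marked."""
    upper = min(n_a, n_b)
    hits = sum(comb(n_a, x) * comb(n_total - n_a, n_b - x)
               for x in range(k, upper + 1))
    return hits / comb(n_total, n_b)


def min_pvalue_coordinate(list_a: List[str], list_b: List[str],
                          step: int = 1) -> Tuple[float, int, int]:
    """Scan a grid of (i, j) cut-offs and keep only the best coordinate."""
    n = len(list_a)
    rank_b = {gene: r for r, gene in enumerate(list_b)}
    best = (1.0, 0, 0)  # (P-value, i, j); no full P-value map is stored
    for i in range(step, n + 1, step):
        top_a = list_a[:i]
        for j in range(step, n + 1, step):
            overlap = sum(1 for gene in top_a if rank_b[gene] < j)
            p = hypergeom_sf(overlap, n, i, j)
            if p < best[0]:
                best = (p, i, j)
    return best
```

Because only the running minimum is retained, memory use does not depend on the resolution of any map that is drawn later.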






□ grenepipe: A flexible, scalable, and reproducible pipeline to automate variant calling from sequence reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac600/6687127

Although grenepipe is agnostic to the genomic application, an important use is Pool-Seq for eco-evolutionary studies, where DNA of a population is combined (“pooled”) in the same sequencing library.

Allele frequencies, rather than genotype states, can be extracted from the VCF file or directly from BAM files using the complementary tool GRENEDALF; this lists frequencies of biallelic SNPs of each library based on base ratios within samples for downstream computations.
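As a rough illustration of deriving an allele frequency from base ratios within one pooled library, consider the sketch below. The input format (a dictionary of per-base read counts at one position) and the `biallelic_frequency` helper are assumptions for the example; they do not reproduce GRENEDALF's interface or filters.

```python
from typing import Dict, Optional, Tuple


def biallelic_frequency(counts: Dict[str, int]) -> Optional[Tuple[str, str, float]]:
    """Return (major, minor, minor-allele frequency) for one position.

    `counts` maps each base to its read count in one pooled library.
    Positions without exactly two observed bases are skipped (None).
    """
    observed = sorted(((n, base) for base, n in counts.items() if n > 0),
                      reverse=True)
    if len(observed) != 2:
        return None  # not biallelic in this library
    (n_major, major), (n_minor, minor) = observed
    return major, minor, n_minor / (n_major + n_minor)
```

In Pool-Seq the ratio reflects the frequency of the allele in the pooled population rather than the genotype of any individual.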





□ Heritability estimation for a linear combination of phenotypes via ridge regression

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac587/6687124

Existing methods for estimating heritability mainly focus on single phenotypes under random-effect models. These methods require some stringent conditions, which calls for a more flexible method for estimating heritability. Fixed-effect models emerge as a useful alternative.

A novel heritability estimator based on multivariate ridge regression for linear combinations of phenotypes, yielding accurate estimates in both sparse and dense cases. In the high-dimensional setting, it appears to be consistent and asymptotically normally distributed.





□ PEcnv: accurate and efficient detection of copy number variations of various lengths

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac375/6686740

PEcnv corrects the coverage of each target base with an exponentially weighted moving average of the base coverage around it. Considering the coverage around the target base can effectively address the complex distribution of the read depth.
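An exponentially weighted moving average over per-base coverage can be sketched as follows. The smoothing factor `alpha` and the choice to average a forward and a backward pass, so that both flanks of the target base contribute, are assumptions of this example rather than PEcnv's documented settings.

```python
from typing import List


def ewma(values: List[float], alpha: float = 0.3) -> List[float]:
    """Standard EWMA: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed: List[float] = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def smooth_coverage(depth: List[float], alpha: float = 0.3) -> List[float]:
    """Average a forward and a backward EWMA so both flanks are used."""
    forward = ewma(depth, alpha)
    backward = ewma(depth[::-1], alpha)[::-1]
    return [(f + b) / 2 for f, b in zip(forward, backward)]
```

A single-base coverage spike is pulled towards its neighbours, while a flat region is left unchanged.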

PEcnv improves the identification of CNVs of varying sizes by using a dynamic sliding window. It divides the genome into candidate / non-candidate CNV regions and sets the dynamic sliding window bin sizes according to the different regions in the bias correction / segmentation steps.





□ ggcoverage: an R package to visualize and annotate genome coverage for various NGS data

>> https://www.biorxiv.org/content/10.1101/2022.09.01.503744v1

ggcoverage provides a flexible and user-friendly way to visualize genome coverage, and multiple available annotations such as base and amino acid annotation, GC content annotation, gene / transcript structure annotation, peak annotation and chromosome ideogram annotation.

ggcoverage can generate publication-ready plots with the help of ggplot2. The input file for ggcoverage can be in BAM, BigWig, BedGraph or tab-separated formats. For BAM files, ggcoverage can convert them to BigWig files with various normalization methods using deeptools.





□ ABEILLE: a novel method for ABerrant Expression Identification empLoying machine Learning from RNA-sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac603/6692305

ABEILLE (ABerrant Expression Identification empLoying machine LEarning from sequencing data) is a variational autoencoder (VAE) based method for the identification of AGEs from RNA-seq data without the need for replicates or a control group.

ABEILLE combines the use of a VAE, able to model any data without specific assumptions on their distribution, and a decision tree to classify genes as AGE or non-AGE. An anomaly score is associated to each gene in order to stratify AGE by severity of aberration.





□ TVAR: Assessing Tissue-specific Functional Effects of Non-coding Variants with Deep Learning

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac608/6692425

TVAR integrates multi-label learning and multi-instance learning. TVAR learns the differences and connections between tissues, and jointly considers the functional utility of a variant across 49 tissues simultaneously to leverage the sharing of eQTL among tissues.

Using 1247-dimensional functional genomics features, TVAR assesses the tissue-specific functional scores of each variant across the GTEx tissues. G-score, a multi-instance learning algorithm, provides an integrated functional score for each variant at the organism level.





□ ChimeraTE: A pipeline to detect chimeric transcripts derived from genes and transposable elements

>> https://www.biorxiv.org/content/10.1101/2022.09.05.505575v1

ChimeraTE was developed to detect chimeric transcripts from paired-end RNA-seq reads. It is written in Bash and fully automates the process with a single command line.

ChimeraTE has two Modes: Mode 1 is a genome-guided approach that employs the canonical method of genome alignment, whereas Mode 2 identifies chimeric transcripts without a reference genome, being able to predict chimeras derived from fixed or polymorphic TEs.





□ DMRscaler: a scale-aware method to identify regions of differential DNA methylation spanning basepair to multi-megabase features

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04899-1

DMRscaler accurately identifies regions of differential methylation across the global DNA methylation landscape, from regions spanning several basepairs to much larger ones spanning many megabases of sequence.

DMRscaler uses an iterative windowing procedure to capture differentially methylated regions (DMRs) ranging in size from single basepairs to whole chromosomes. DMRscaler was the only method that accurately called DMRs ranging in size from 100 bp to 1 Mb and up to 152 Mb on the X-chromosome.





□ Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac389/6693602

Bulk-cell transcriptomic data are a valuable resource that could improve the prediction of single-cell GRNs. GRN-Transformer achieves state-of-the-art prediction accuracy in comparison to existing supervised and unsupervised approaches.

GRN-Transformer infers cell-type-specific GRNs from both the single-cell RNA sequencing data and the generic GRN derived from the bulk cells by constructing a weakly supervised learning framework based on the axial transformer.





□ CAMLU: A machine learning-based method for automatically identifying novel cells in annotating single cell RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac617/6694844

CAMLU trains an autoencoder with the labeled training data and applies the autoencoder to the testing data to obtain reconstruction errors.

By iteratively selecting features that demonstrate a bi-modal pattern and reclustering the cells using the selected feature, CAMLU can accurately identify novel cells that are not present in the training data.





Epsilon.

2022-09-17 23:13:17 | Science News




□ GCNCMI: A Graph Convolutional Neural Network Approach for Predicting circRNA-miRNA Interactions

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.959701/full

GCNCMI predicts potential interactions between circRNAs and miRNAs. GCNCMI mines the latent interactions of adjacent nodes in a graph convolutional neural network, and then recursively propagates the interaction information on the graph convolutional layers.

GCNCMI propagates the information flow recursively over the graph structure and continuously aggregates the information of neighboring nodes to refine the embedding representation. GCNCMI concatenates the embeddings from different propagation layers and makes the final prediction.
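The propagate-then-concatenate idea can be sketched in a few lines. This is a deliberately stripped-down illustration: it uses plain neighbour averaging with no learned weights or non-linearities, and the function names and the dot-product `score` are assumptions, not GCNCMI's architecture.

```python
from typing import Dict, List

Vector = List[float]


def propagate(adj: Dict[int, List[int]], emb: Dict[int, Vector]) -> Dict[int, Vector]:
    """One propagation layer: average each node's embedding with its neighbours'."""
    out: Dict[int, Vector] = {}
    for node, vec in emb.items():
        group = [vec] + [emb[n] for n in adj.get(node, [])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def layered_embedding(adj: Dict[int, List[int]], emb: Dict[int, Vector],
                      n_layers: int = 2) -> Dict[int, Vector]:
    """Concatenate the embeddings of every propagation layer (layer 0 included)."""
    layers = [emb]
    for _ in range(n_layers):
        layers.append(propagate(adj, layers[-1]))
    return {node: [x for layer in layers for x in layer[node]] for node in emb}


def score(u: Vector, v: Vector) -> float:
    """Illustrative interaction score: dot product of two final embeddings."""
    return sum(a * b for a, b in zip(u, v))
```

Concatenating the layers lets the final representation keep both the node's own features and increasingly wide neighbourhood summaries.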





□ FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04889-3

Fractal dimension describes the complexity of geometric objects. Smits used HFD to monitor the complexity of brain activity. There is similarity between the whole and the parts of a protein sequence, so protein sequences can be represented by a fractal curve.

FFP is a hybrid method for APPA. The primary amino acid sequence is converted into a digital sequence using the pKa(COOH) value, which is critical for the dissociation constant. The feature vector of each protein is generated by integrating FFT and HFD.





□ BayesRCπ: Accounting for overlapping annotations in genomic prediction models of complex traits

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04914-5

BayesRCπ and BayesRC+ incorporate biological information in different ways; their performance is likely to be highly dependent on the underlying genetic architecture, the construction of annotation categories, and the biological relevance of the prior information.

The BayesRCπ model places a mixture-of-mixtures prior distribution on SNP effects, thus allowing multi-annotated SNPs to be assigned a posteriori to the most informative annotation. The BayesRC+ model assigns an additive impact of multiple annotation categories.





□ Mapping coalgebras II: Operads

>> https://arxiv.org/pdf/2208.14395v1.pdf

The Hadamard tensor product defines the structure of a monoidal context on the 2-category of enriched operads that lifts that of coloured symmetric sequences, in the sense that the forgetful functor sk(OperadE) → S−mod is a strict monoidal functor.

Monochromatic enriched operads are themselves algebras over a set-theoretical operad Op. Hence, the category AlgE (Op) of monochromatic enriched operads has the structure of a symmetric monoidal category; this is the Hadamard tensor product.

Moreover, the category AlgE (Op) of Op-coalgebras is endowed with a closed symmetric monoidal structure, and operads are enriched-tensored-cotensored over Op-coalgebras.





□ Dual Fusion 2-Categories

>> https://arxiv.org/pdf/2208.08722v1.pdf

Given a fusion 2-category and a suitable module 2-category, the dual tensor 2-category is the associated 2-category of module 2-endofunctors. The paper proves that the relative tensor product of modules over a separable algebra in a fusion 2-category exists.

Over a fusion 2-category, the 2-adjoint of a left module 2-functor carries a canonical left module structure. The dual tensor 2-category with respect to a separable module 2-category is a multifusion 2-category.





□ hCoCena: Horizontal integration and analysis of transcriptomics datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac589/6677225

Horizontal-CoCena (hCoCena: horizontal construction of co-expression networks and analysis) allows for the analysis of a single transcriptomic dataset, using a co-expression network for the identification of gene clusters and their subsequent functional analysis.

hCoCena is completely remastered and stand-alone. Its ready-to-use workflow implementation is provided as an R markdown file utilizing the package functions with minimal code exposure and detailed descriptions of all inputs and outputs as well as function parameters.





□ MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATAC-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac572/6673903

MMGraph is based on GNN and coexisting probability of k-mers, where the coexisting probability represents the degree of association between k-mers. MMGraph decomposes the heterogeneous graph into three sub-graphs, i.e. similarity graph, coexisting graph, and inclusive graph.

MMGraph consists of three components: a heterogeneous graph; a three-layer GNN model to get embeddings of k-mers and sequences; coexisting probability calculation for finding multiple motifs.





□ DA-DSL-L2: A novel meta-analysis based on data augmentation and elastic data shared lasso regularization for gene expression

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04887-5

DA-DSL-L2 is based on a new data augmentation (DA) strategy and elastic data shared lasso method. Various CPN methods exist that can preserve original biological information of gene expression datasets from different angles and add different “perturbations” to the dataset.

DA-DSL-L2 transforms the DSL-L2 method into a standard Lasso problem. Although the Lasso problem can be solved by very efficient methods such as glmnet, solving it for a large matrix, e.g., one of over 40,000 × 40,000, remains demanding.





□ IsofunGO: Isoform function prediction by Gene Ontology embedding

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac576/6673907

IsofunGO firstly introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones.

It develops an attention based multi-instance learning network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations.





□ scraps: an end-to-end pipeline for measuring alternative polyadenylation at high resolution using single-cell RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504859v1

scraps (Single Cell RNA PolyA Site Discovery) is a scalable and reproducible end-to-end workflow that identifies polyadenylation sites at near-nucleotide resolution in single cells using 10X Genomics and other TVN-primed single-cell RNA-seq (scRNA-seq) libraries.

scraps performs best with long read 1 sequencing and paired alignment. It is unbiased relative to existing methods that use only read 2 and recovers more sites, despite the reduction in read quality observed on most modern DNA sequencers following homopolymer stretches.





□ CellDrift: inferring perturbation responses in temporally sampled single-cell data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac324/6673850

CellDrift is a generalized linear model (GLM)-based functional data analysis model that disentangles temporal patterns in perturbation responses in scRNA-seq data.

CellDrift first captures cell-type specific perturbation effects by adding an interaction term in the GLM and then utilizes predicted coefficients to calculate contrast coefficients, which represent perturbation effects.

Contrast coefficients concatenated over time are defined as functions. Fuzzy C-means clustering is used to identify temporal patterns, accompanied by FPCA to find the major components that account for the most temporal variance.





□ DeepGenePrior: A deep learning model to prioritize genes affected by copy number variants

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504862v1

DeepGenePrior aims to uncover the genes contributing to the target disease and the underlying relationship patterns. Based on the copy number variants of all cases and controls, the authors train a network, then use the model weights to calculate scores.

The model tries to encode the inputs into a Gaussian distribution with estimated mean and covariance. Using the DECIPHER data source, DeepGenePrior investigates how mutations in the detected genes influence other traits; gene ontology analyses were also conducted.





□ PyWGCNA: A Python package for weighted gene co-expression network analysis

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504852v1

PyWGCNA is designed to perform weighted correlation network analysis (WGCNA), which can be used for finding clusters of highly correlated genes, for summarizing such clusters using the module eigengene, and for relating modules to one another and to external sample traits.

PyWGCNA can directly perform Gene Ontology enrichment on co-expression modules to characterize the functional activity of each module and supports addition or removal of data to allow for iterative improvement on network construction as new samples become available or defunct.





□ SPEX: A modular end-to-end analytics tool for spatially resolved omics of tissues

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504841v1

SPEX (Spatial Expression Explorer), a comprehensive image analysis software implemented as a user-friendly web-based application with modules that can be put together by the user as pipelines conveniently through a graphical user interface.

SPEX introduced the novel application of the CLQ methodology. SPEX provides a clustering module that accommodates both proteomics and transcriptomics inputs. SPEX includes a modular pipeline to facilitate tissue-based single-cell segmentation.





□ Identifying and correcting repeat-calling errors in nanopore sequencing of telomeres

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02751-6

Telomeric regions were frequently miscalled as other types of repeats in a strand-specific manner. Specifically, although human telomeres are typically represented by (TTAGGG)n repeats, these regions were frequently recorded as (TTAAAA)n repeats.

When examining the reverse complementary strand of the telomeres which are represented as (CCCTAA)n repeats, we instead observed frequent substitution of these regions by (CTTCTT)n and (CCCTGG)n repeats.

The examination of each telomeric long read also indicates that these error repeats frequently co-occur with telomeric repeats at the ends of each read, and are observed on all chromosomal arms of CHM13.
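A quick way to look for these patterns in a read is to count the repeat units directly. The sketch below is only an illustration: the motif labels and the `count_repeat_units` helper are invented for the example, and simple non-overlapping motif counting is an assumption, not the authors' detection procedure.

```python
import re
from typing import Dict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# Repeat units named in the text; the labels are descriptive only.
MOTIFS = {
    "TTAGGG": "telomeric repeat",
    "TTAAAA": "error repeat seen in place of TTAGGG",
    "CCCTAA": "telomeric repeat, reverse complementary strand",
    "CTTCTT": "error repeat seen in place of CCCTAA",
    "CCCTGG": "error repeat seen in place of CCCTAA",
}


def count_repeat_units(read: str) -> Dict[str, int]:
    """Count non-overlapping occurrences of each repeat unit in a read."""
    return {motif: len(re.findall(motif, read)) for motif in MOTIFS}
```

Note that `reverse_complement("TTAGGG")` gives `CCCTAA`, which is why the two strands show different error repeats.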





□ DeDoc2 identifies and characterizes the hierarchy and dynamics of chromatin TAD-like domains in the single cells

>> https://www.biorxiv.org/content/10.1101/2022.08.23.505046v1

deDoc2 is a TAD-like domain (TLD) prediction tool based on structural information theory. It treats the Hi-C contact map as a weighted graph and applies a dynamic programming algorithm to globally optimize the two-dimensional structural entropy of the graph partition.

The deDoc2.w minimizes the structural entropy in the whole Hi-C contact map, while the deDoc2.s minimizes the structural entropy in the matrices of sliding windows along the genome. deDoc2.binsize determines the optimal binsize with normalized decoding information.
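The quantity being minimized can be illustrated with a short function that evaluates the two-dimensional structural entropy of a given partition, using the definition commonly given in structural information theory. This sketch only scores a partition; it does not include the dynamic programming search, and the function name and edge-list input format are assumptions for the example.

```python
from math import log2
from typing import Dict, List, Tuple


def structural_entropy_2d(edges: List[Tuple[int, int, float]],
                          partition: List[List[int]]) -> float:
    """Two-dimensional structural entropy of a weighted graph under a partition.

    Assumes every node in `partition` has at least one edge.
    """
    degree: Dict[int, float] = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
    total = sum(degree.values())  # 2m: twice the total edge weight

    module_of = {node: j for j, module in enumerate(partition) for node in module}
    volume = [sum(degree[n] for n in module) for module in partition]
    cut = [0.0] * len(partition)  # weight of edges leaving each module
    for u, v, w in edges:
        if module_of[u] != module_of[v]:
            cut[module_of[u]] += w
            cut[module_of[v]] += w

    entropy = 0.0
    for j, module in enumerate(partition):
        for n in module:
            entropy -= degree[n] / total * log2(degree[n] / volume[j])
        entropy -= cut[j] / total * log2(volume[j] / total)
    return entropy
```

On a toy graph made of two triangles joined by one edge, the partition that separates the triangles scores lower than one that mixes them, which is the behaviour a domain caller relies on.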





□ Deep surveys of transcriptional modules with Massive Associative K-biclustering (MAK)

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505372v1

The unsupervised Massive Associative K-biclustering (MAK) approach corrects this size bias while preserving high bicluster coherence, both on simulated datasets with known ground truth and on real-world data without it, where a new measure is applied to evaluate biclustering.

MAK jointly maximizes bicluster coherence with biological enrichment and finds the most enriched biological functions. MAK reports the second-most enriched non-protein production functions, with higher bicluster coherence and arrayed across a large number of biclusters.





□ UltraSEQ: a universal bioinformatic platform for information-based clinical metagenomics and beyond

>> https://www.biorxiv.org/content/10.1101/2022.08.24.505213v1.full.pdf

UltraSEQ uses a novel, information-based approach that leverages a fast aligner able to handle both DNA and protein databases to make sample-level predictions at the most specific taxonomic levels possible given the information in the sample and the database(s) used.

UltraSEQ was built from the ground up to make predictions for regions of sequences (including taxonomic binning), full sequences, and collections of sequences (i.e., a sample) without complicated user settings and the necessity for background subtraction.





□ SCOIT: Probabilistic tensor decomposition extracts better latent embeddings from single-cell multiomic data

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505382v1

SCOIT incorporates various distributions, including Gaussian, Poisson, and negative binomial distributions. First, SCOIT constructs a multiomic tensor with a union set of features. Second, it performs the probabilistic tensor decomposition.

SCOIT generates embedding matrices for omics, cells, and genes. SCOIT incorporates the global and local embeddings to capture global and local variability. SCOIT applies the Gaussian distribution for the continuous data type and the NBD for the count data with high variance.





□ APSCALE: advanced pipeline for simple yet comprehensive analyses of DNA Meta-barcoding data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac588/6677653

Apscale is a metabarcoding pipeline that handles the most common metabarcoding tasks, such as paired-end merging, primer trimming, quality filtering, OTU clustering and denoising, as well as an OTU filtering step.

APSCALE offers an internal Python-based version of LULU (Frøslev et al. 2017), an algorithm for post-clustering curation that aims to provide more reliable biodiversity estimates. Both OTUs and ESVs are filtered using LULU to reduce the number of erroneous OTUs and ESVs.





□ Aclust2.0: a revamped unsupervised R tool for Infinium methylation beadchips data analyses

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac583/6677241

Aclust, one of the first unsupervised algorithms, was originally designed to analyze regional methylation of Infinium’s 27K and 450K arrays by clustering neighboring methylation sites prior to downstream analyses.

“aclust2.0.R” script provides all the necessary guidelines. “function GEE.clusters” runs GEE models with the identified clusters and takes as input “clusters.list”, betas data, exposure, covariates, “id” which is the column name of betas, and the correlation structure specification.





□ DeepBSA: A deep-learning algorithm improves bulked segregant analysis for dissecting complex traits

>> https://www.cell.com/molecular-plant/pdf/S1674-2052(22)00267-2.pdf

DeepBSA performs well in QTL mapping of multiple loci with marginal effects. DeepBSA usually requires shallower sequencing depth than alternative methods, making it more easily adoptable.

DeepBSA identifies the number of bulked pools automatically and integrates multiple algorithms. DeepBSA only requires pooled data; ΔSNP-index requires parental sequencing as a control. DeepBSA requires a simple input with standard VCF, whereas QTG-seq requires a gff annotation.





□ CoAtGIN: Marrying Convolution and Attention for Graph-based Molecule Property Prediction

>> https://www.biorxiv.org/content/10.1101/2022.08.26.505499v1

CoAtGIN uses the k-hop convolution in a graph convolution network for faster message aggregation within one iteration. CoAtGIN presents a new way to accomplish global message passing through the graph using the linear transformer.

CoAtGIN is composed of L layers. Each layer takes the Node Embedding (NE) and Graph Embedding (GE) as input and updates both embeddings for the layerwise iteration. CoAtGIN initializes the NE as the atom type of each node, and the GE is set to zeros.





□ SeQuiLa: Cloud-native distributed genomic pileup operations

>> https://www.biorxiv.org/content/10.1101/2022.08.27.475646v1

SeQuiLa, a scalable, distributed, and efficient implementation of a pileup algorithm that is suitable for deploying in cloud computing environments.

SeQuiLa implements a novel approach to processing alignment events from sequencing reads using the MD tags, source code micro-optimizations for recurrent operations, and a modular structure of the algorithm.
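For readers unfamiliar with MD tags, the sketch below splits an MD string into alignment events. It follows the MD format of the SAM optional-fields specification (match lengths, mismatched reference bases, and `^`-prefixed deleted reference bases). The `parse_md` helper and its event tuples are assumptions for the example and are unrelated to SeQuiLa's Scala implementation.

```python
import re
from typing import List, Tuple, Union

# One token is a match length, a deletion (^ followed by reference bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")

Event = Tuple[str, Union[int, str]]


def parse_md(md: str) -> List[Event]:
    """Split an MD tag value into ("match", n), ("mismatch", base), ("deletion", bases)."""
    events: List[Event] = []
    for match_len, deletion, mismatch in MD_TOKEN.findall(md):
        if match_len:
            if int(match_len) > 0:  # "0" only separates adjacent events
                events.append(("match", int(match_len)))
        elif deletion:
            events.append(("deletion", deletion[1:]))
        else:
            events.append(("mismatch", mismatch))
    return events
```

The bases reported by the MD tag are reference bases, so mismatches and deletions can be recovered without loading the reference sequence.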






□ Multiset partial least squares with rank order of groups for integrating multi-omics data

>> https://www.biorxiv.org/content/10.1101/2022.08.30.505949v1

Multiset partial least squares (PLS) is formulated as maximization of the sum of covariances between the scores for all combinations of explanatory variables, and between the score for the response and that for each explanatory variable.

Multiset PLS-ROG is formulated as maximization of the sum of covariances under almost the same constraint condition as PLS-ROG. The multiset PLS-ROG loading is defined as the weighted correlation coefficient and can identify statistically significant compounds.





□ MELT: Metric learning for comparing genomic data with triplet network

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac345/6679451

MEtric Learning with Triplet network (MELT) learns a nonlinear mapping from the original space to the embedding space in order to keep similar data close together and dissimilar data far apart.

MELT is a weakly supervised and data-driven comparison framework that offers more adaptive and accurate dissimilarity learned in the absence of the label information when the supervised methods are not applicable.
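The objective behind a triplet network can be sketched with the standard triplet margin loss. This is a generic formulation, assuming Euclidean distance and a margin of 1.0; the source does not state which distance or margin MELT uses, and the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Minimizing this loss over many (anchor, similar, dissimilar) triplets pulls similar samples together in the embedding space and pushes dissimilar ones apart.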





□ genomicSimulation: fast R functions for stochastic simulation of breeding programs

>> https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkac216/6687129

genomicSimulation works as a scripting tool, with functions for performing targeted crosses, random crosses, doubled haploids and selfing. genomicSimulation’s inbuilt genotypic value calculator uses an additive model of marker effects.
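An additive model of marker effects simply sums the effects of the alleles an individual carries. The sketch below is written in Python for illustration and does not use genomicSimulation's R functions; the dictionary layouts and the `genotypic_value` name are assumptions.

```python
from typing import Dict


def genotypic_value(genotype: Dict[str, str],
                    effects: Dict[str, Dict[str, float]]) -> float:
    """Sum allele effects over all markers (additive model, no dominance or epistasis).

    `genotype` maps a marker name to its two alleles, e.g. {"m1": "AT"}.
    `effects` maps a marker name to {allele: effect}; unlisted alleles count as 0.
    """
    total = 0.0
    for marker, alleles in genotype.items():
        marker_effects = effects.get(marker, {})
        for allele in alleles:
            total += marker_effects.get(allele, 0.0)
    return total
```

For example, with effects `{"m1": {"A": 0.5, "T": -0.5}, "m2": {"G": 1.0}}`, a genotype of `AA` at `m1` and `GC` at `m2` has a value of 2.0.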

Every genotype loaded or produced in genomicSimulation is allocated to a group. Mixing and separating groups gives significant flexibility in simulating multi-generational breeding pools, or in running several interacting streams within the breeding program.





□ Var | Decrypt: a novel and user-friendly tool to explore and prioritize variants in whole-exome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506346v1

Var | Decrypt offers a wide range of gene and variant filtering possibilities, clustering and enrichment tools, providing an efficient way to derive patient-specific functional information and to prioritize gene variants for functional analyses.

Var | Decrypt imports the output of the Exome-seq pipeline and provides many built-in enrichment analysis options. Var | Decrypt contains separate tabs for disease ontology, gene ontology, and Reactome/KEGG pathway enrichment.





□ MacSyFinder v2: Improved modelling and search engine to identify molecular systems in genomes

>> https://www.biorxiv.org/content/10.1101/2022.09.02.506364v1

MacSyFinder version 2 (v2) was improved and rationalized to facilitate future maintainability. The novel v2 search engine explores the space of possible solutions more thoroughly. It provides optimal solutions with an explicit scoring system favouring complete but concise systems.

The systems are now searched one by one: the identified components are filtered by type of system and assembled in clusters if relevant. Using a system-by-system approach prevents the spurious elimination of relevant candidate systems.





□ MVsim is a toolset for quantifying and designing multivalent interactions

>> https://www.nature.com/articles/s41467-022-32496-6

MVsim is an interactive toolset with a simple graphical user interface (GUI) for the design, prediction, multidimensional parameter exploration, and quantification of multivalent binding phenomena.

MVsim accurately simulates both monospecific multivalent interactions (i.e., a single repeated ligand domain on one binding partner and a single repeated target domain on the other) and multispecific multivalent interactions.





□ Bridging The Evolving Semantics: A Data Driven Approach to Knowledge Discovery In Biomedicine

>> https://www.biorxiv.org/content/10.1101/2022.09.05.506661v1

Dynamic MeSH Embeddings: MeSH embeddings are a powerful diachronic tool capable of capturing semantic evolution. In the B-Med framework, MeSH embeddings are augmented with a time component to capture the evolutionary properties of medical concepts.

In the dynamic embedding space, the semantic change of a MeSH term can be easily modeled as the location shift of this term. Hence, MeSH terms are projected into the vector space based on their medical properties and gradually drift over time as they evolve.
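
One simple way to quantify such a location shift is the cosine distance between a term's vectors at consecutive time points. The sketch below assumes the vectors for all years already live in one shared, comparable space. The function names and the toy vectors are hypothetical, and the summary does not say which distance B-Med actually uses.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def semantic_drift(embeddings_by_year):
    """Drift of one term between consecutive time points.

    embeddings_by_year: dict year -> embedding vector of the same term.
    Returns a list of (year_from, year_to, cosine_distance).
    """
    years = sorted(embeddings_by_year)
    return [
        (a, b, cosine_distance(embeddings_by_year[a], embeddings_by_year[b]))
        for a, b in zip(years, years[1:])
    ]

# Toy vectors for a single term (illustrative values only).
term = {2000: [1.0, 0.0], 2010: [1.0, 0.0], 2020: [0.0, 1.0]}
drift = semantic_drift(term)
```

A term whose meaning is stable yields distances near zero, while a large distance between two periods flags a candidate semantic change.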





□ muSignAl: An algorithm to search for multiple omic signatures with similar predictive performance

>> https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/pmic.202200252

muSignAl (multiple signature algorithm) selects multiple signatures with similar predictive performance while systematically bypassing the requirement of exploring all the combinations of features.

muSignAl is applicable to various bioinformatics-driven explorations, such as understanding the relationship between multiple biological feature sets and phenotypes, and developing biomarker panels while offering the opportunity to optimise their development cost.





□ Multi-agent Feature Selection for Integrative Multi-omics Analysis

>> https://ieeexplore.ieee.org/document/9871758/

MAgentOmics extends the ant colony optimization algorithm to multi-omics data, which iteratively builds candidate solutions and evaluates them.

Moreover, a new fitness function is introduced to assess the candidate feature subsets without using a prediction target such as patient survival time.
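
The build-and-evaluate loop of ant colony optimization can be sketched for feature selection as follows. This is a generic single-colony sketch, not the MAgentOmics algorithm: `aco_select`, its parameters, and the toy `variance_fitness` (a target-free score that sums feature variances) are assumptions chosen only to show the mechanics.

```python
import random
import statistics

def aco_select(n_features, subset_size, fitness, n_ants=10, n_iters=20,
               evaporation=0.1, seed=0):
    """Generic ant colony search for a feature subset maximizing `fitness`."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iters):
        candidates = []
        for _ in range(n_ants):
            # Each ant builds one candidate: sample features without
            # replacement, with probability proportional to pheromone.
            remaining = list(range(n_features))
            subset = []
            for _ in range(subset_size):
                weights = [pheromone[f] for f in remaining]
                pick = rng.choices(remaining, weights=weights, k=1)[0]
                remaining.remove(pick)
                subset.append(pick)
            score = fitness(subset)
            candidates.append((score, subset))
            if score > best_score:
                best_score, best_subset = score, sorted(subset)
        # Evaporate all trails, then let this iteration's best ant deposit.
        pheromone = [(1.0 - evaporation) * p for p in pheromone]
        top_score, top_subset = max(candidates, key=lambda c: c[0])
        for f in top_subset:
            pheromone[f] += max(top_score, 0.0)
    return best_subset, best_score

# Toy data: one list of values per feature (illustrative only).
columns = [[1, 1, 1, 1], [1, 2, 3, 4], [0, 5, 0, 5], [2, 2, 3, 3]]

def variance_fitness(subset):
    """Hypothetical target-free fitness: total variance of chosen features."""
    return sum(statistics.pvariance(columns[f]) for f in subset)

best, score = aco_select(len(columns), 2, variance_fitness)
```

The point of a target-free fitness is that candidate subsets can be ranked without labels such as survival time. Any such score could be swapped in for the toy variance sum here.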





□ ScanExitronLR: characterization and quantification of exitron splicing events in long-read RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac626/6696711

ScanExitronLR is an application for the characterization and quantification of exitron splicing events in long reads. From a BAM alignment file, a reference genome, and a reference gene annotation, ScanExitronLR outputs exitron events at the individual transcript level.

Outputs of ScanExitronLR can be used in downstream analyses of differential exitron splicing. In addition, ScanExitronLR optionally reports exitron annotations such as truncation or frameshift type, nonsense-mediated decay status, and Pfam domain interruptions.
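
An exitron is an intron-like region spliced out from within an annotated exon, so in an alignment it appears as a skipped region (CIGAR `N`) lying inside a single exon. The sketch below shows only that core test on a CIGAR string and a list of exon intervals. It is not ScanExitronLR's implementation: the function names, the 0-based half-open coordinates, and the strict-containment rule are assumptions, and the real tool adds filtering, quantification, and annotation on top.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def spliced_gaps(start, cigar):
    """Reference intervals (0-based, half-open) skipped by N operations."""
    pos = start
    gaps = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            gaps.append((pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return gaps

def exitron_candidates(start, cigar, exons):
    """Skipped regions that fall strictly inside one annotated exon.

    exons: list of (exon_start, exon_end), 0-based half-open.
    """
    return [
        (gap_start, gap_end)
        for gap_start, gap_end in spliced_gaps(start, cigar)
        for exon_start, exon_end in exons
        if exon_start < gap_start and gap_end < exon_end
    ]

# One read with two skipped regions; only the first lies within an exon.
exons = [(90, 210), (400, 460)]
hits = exitron_candidates(100, "50M30N20M200N40M", exons)
```

The second skipped region spans from one exon to the next, so it is an ordinary intron and is not reported.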















iPhone 14 Pro.

2022-09-17 03:09:25 | Digital / Internet


□ Purchased: iPhone 14 Pro / Space Black / 256GB (MQ0Q3J/A)

>> https://www.apple.com/iphone-14-pro/

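My "Dark Star"! The texture and color of Space Black are supremely cool. The upgraded ultra-wide camera is welcome, but the real draw is the UI and customizability renewed by iOS 16, how well they pair with the A16 chip that meets their performance demands, and the better user experience that results.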








Shimmer..

2022-09-13 21:37:56 | Diary / Essay / Column


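People are not each kept in a separate fish tank; just by living, we get hurt and we hurt someone. Wanting to run away is proof that your heart is trying to live, and being afraid is proof that there is something you do not want to lose. So you do not have to return to your original shape. Carve your scars right here, right now.

If you were hurt because you cared for someone, you need not be ashamed of it before anyone. The more it hurts, the greater your capacity to love. If you can keep moving forward while carrying the pain, you are fortunate. For the sake of someone you have not yet met, learn kindness one piece at a time. And forgive yourself for crying over a pain you can do nothing about.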













Ludovico Einaudi - Nuvole Bianche (Reimagined by Mercan Dede)

2022-09-09 21:17:37 | Music20


□ Ludovico Einaudi - Nuvole Bianche (Reimagined by Mercan Dede)

>> https://music.apple.com/album/reimagined-volume-2-chapter-3-single/1640602234


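Ludovico Einaudi's Reimagined series. This time it is an Arabtronica arrangement of the contemporary piano piece that struck me most deeply. With beats and an arrangement that showcase Mercan Dede's sharp-edged sensibility, the melancholy melody takes on Sufi colors.

"Yerevan" is a fine track that I especially recommend to anyone who loved Robert Rich or Delerium.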







Under an Umbrella.

2022-09-09 21:07:13 | Photos


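The fireworks festival 🎆 was a downpour ☔️. We walked a ridiculous distance, got sticky with sweat, our hems were caked in mud, the movie we ducked into to kill time was a dud, I got blisters, the yakisoba we queued for was undercooked, the convenience store was out of underwear in my size, the road home was jammed, and the hot spring was packed ♨️ And still, the best summer memory ❗️🎐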