lens, align.

Long is the time, but the true comes to pass.

More than words for the voices that can’t be heard.

2023-01-31 23:11:11 | Science News





□ Quantum mechanical electronic and geometric parameters for DNA k-mers as features for machine learning

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525597v1

A large-scale calculation of semi-empirical quantum mechanical (QM) and geometric features for all possible DNA heptamers in their three representative conformations (B, A and Z). The calculations use the PM6-DH+ Hamiltonian with COSMO solvation.

The DNA structures are optimized using the semi-empirical Hamiltonian under the restricted Hartree-Fock approach. The procedure comprises building the all-atom DNA models, geometry optimisation, and feature extraction w/ the corresponding single-point calculations.
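
As a rough illustration of the enumeration step, the sketch below drives a per-conformation feature pipeline over all 4^7 = 16,384 heptamers; `optimize_geometry` and `extract_features` are hypothetical placeholders for the PM6-DH+/COSMO optimisation and single-point steps described above.

```python
# Sketch: enumerate all 4^7 = 16,384 DNA heptamers and drive a per-conformation
# feature pipeline. `optimize_geometry` and `extract_features` are hypothetical
# placeholders for the semi-empirical (PM6-DH+ / COSMO) steps described above.
from itertools import product

BASES = "ACGT"
CONFORMATIONS = ("A", "B", "Z")

def heptamers():
    """Yield all possible 7-mers over the DNA alphabet."""
    for kmer in product(BASES, repeat=7):
        yield "".join(kmer)

def run_pipeline(optimize_geometry, extract_features):
    """Build model -> optimize -> single-point feature extraction for every heptamer."""
    features = {}
    for seq in heptamers():
        for conf in CONFORMATIONS:
            geom = optimize_geometry(seq, conformation=conf)   # restricted HF, PM6-DH+
            features[(seq, conf)] = extract_features(geom)     # QM + geometric features
    return features
```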




□ BLTSA: pseudotime prediction for single cells by Branched Local Tangent Space Alignment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad054/7000337

BLTSA infers single cell pseudotime for multi-furcation trajectories. By assuming that single cells are sampled from a low-dimensional self-intersecting manifold, BLTSA identifies the tip and branching cells in the trajectory based on cells’ local Euclidean neighborhoods.

A small nonlinearity value implies a large gap between the d-th and (d+1)-th singular values, i.e. the neighborhood shows strong d-dimensional linearity; a large nonlinearity value implies a small gap between the two singular values and weak d-dimensional linearity.
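
A minimal numpy sketch of this neighborhood-nonlinearity idea (the exact BLTSA formulation may differ): score each cell by the gap between the d-th and (d+1)-th singular values of its centered local neighborhood.

```python
import numpy as np

def local_nonlinearity(X, neighbors, d=1):
    """Score each cell's Euclidean neighborhood: sigma_{d+1} / sigma_d of the centered
    neighborhood. A small ratio (big gap) means strong d-dimensional linearity."""
    scores = np.empty(len(neighbors))
    for i, idx in enumerate(neighbors):
        local = X[idx] - X[idx].mean(axis=0)        # center the local neighborhood
        s = np.linalg.svd(local, compute_uv=False)  # singular values, descending
        scores[i] = s[d] / (s[d - 1] + 1e-12)       # large gap -> small nonlinearity
    return scores
```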

BLTSA maps cells directly from the high-dimensional space to a one-dimensional space. BLTSA propagates the reliable tangent information from non-branching cells to branching cells. Global coordinates for all the single cells are determined by aligning the local coordinates based on the tangent spaces.





□ Gemini: Memory-efficient integration of hundreds of gene networks with high-order pooling

>> https://www.biorxiv.org/content/10.1101/2023.01.21.525026v1

Gemini uses random walk with restart to compute the diffusion states. Gemini then uses fourth-order kurtosis pooling of the diffusion state matrix as the feature vectors to cluster all networks. Gemini assigns each network a weight inversely proportional to its cluster size.
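
A hedged numpy/scikit-learn sketch of the pooling-and-weighting step as described (not Gemini's actual code); `diffusion_states` is assumed to be a list of per-network diffusion state matrices.

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.cluster import KMeans

def gemini_style_weights(diffusion_states, n_clusters=5, seed=0):
    """Pool each network's diffusion-state matrix with fourth-order kurtosis, cluster the
    networks on the pooled vectors, and weight each network inversely to its cluster size."""
    feats = np.vstack([kurtosis(D, axis=0) for D in diffusion_states])  # one vector per network
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(feats)
    sizes = np.bincount(labels)
    weights = 1.0 / sizes[labels]       # inverse cluster size
    return weights / weights.sum()      # normalized network weights
```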

Gemini randomly samples pairs of networks. These pairs of diffusion state matrices are then mixed up to create a new simulated network collection. Gemini aggregates the synthetic dataset and performs an efficient singular value decomposition to produce embeddings for all vertices.





□ HQAlign: Aligning nanopore reads for SV detection using current-level modeling

>> https://www.biorxiv.org/content/10.1101/2023.01.08.523172v1

HQAlign (based on QAlign) is designed specifically for detecting SVs while incorporating the error biases inherent in the nanopore sequencing process. The HQAlign pipeline is modified to enable detection of inversion variants.

HQAlign takes the dependence of the Q-mer map into account to perform accurate alignment, with modifications specifically for the discovery of SVs. The nucleotide sequences that have indistinguishable current levels from the lens of the Q-mer map are mapped to a common quantized sequence.





□ Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling: Ankh unlocks the language of life via learning superior representations of its "letters", the amino acids.

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524265v1

The Ankh architecture constructs the information flow in the network starting from the input sequences, through pre-processing and the transformer, and then either a residue-level or a protein-level prediction network; the latter differs only in being preceded by a global max pooling layer.

Ankh provides a protein variant generation analysis on High-N and One-N input data scales where it succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics.





□ FAME: Efficiently Quantifying DNA Methylation for Bulk- and Single-cell Bisulfite Data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525734v1

FAME, the first bisulfite-aware (BA) mapping method with an index that is tailored for the alignment of BS reads with direct computation of CpGm values. The algorithm works on the full alphabet (A, C, G, T), resolving the asymmetric mapping problem correctly.

FAME enables ultra-fast and parallel querying of reads without I/O overhead. FAME is built on a novel data structure that exploits gapped k-mer counting within short segments of the genome to quickly reduce the genomic search space.





□ xAtlas: scalable small variant calling across heterogeneous next-generation sequencing experiments

>> https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giac125/6987867

xAtlas, a lightweight and accurate single-sample SNV and small indel variant caller. xAtlas includes features that allow it to easily scale to population-scale sample sets, incl. support for CRAM and gVCF file formats, minimal computational requirements, and fast runtimes.

xAtlas determines the most likely genotype and reports the candidate variant. Candidate SNVs and indels are evaluated with separate logistic regression models. At a given position, xAtlas reports only the variant with the greatest number of reads supporting the variant sequence.





□ TransImp: Towards a reliable spatial analysis of missing features via spatially-regularized imputation

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524992v1

TransImp leverages a spatial auto-correlation metric as a regularization for imputing missing features in ST. Evaluation results from multiple platforms demonstrate that TransImp preserves the spatial patterns, hence substantially improving the accuracy of downstream analysis.

TransImp learns a mapping function to translate the scRNA-seq reference to ST data. Related to the Tangram model, TransImp learns a linear mapping matrix from the ST data. One can view it as a multivariate regression model, treating genes as samples and cells as dimensions.
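
The spatial auto-correlation regularizer could, for instance, be built around Moran's I over a spot-neighborhood graph; a minimal sketch of that metric (TransImp's actual regularization term may differ):

```python
import numpy as np

def morans_i(values, W):
    """Moran's I of one gene over spatial spots; W is a (spots x spots) neighborhood
    weight matrix, e.g. a kNN adjacency built from spot coordinates."""
    z = values - values.mean()
    num = (W * np.outer(z, z)).sum()       # sum_ij w_ij * z_i * z_j
    den = (z ** 2).sum()
    return (len(values) / W.sum()) * num / den
```

A penalty on the squared difference between Moran's I of an imputed gene and that of a reference spatial profile would then encourage preservation of spatial patterns.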





□ scGREAT: Graph-based regulatory element analysis tool for single-cell multi-omics data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525916v1

scGREAT can generate the regulatory state matrix, which is a new layer of information. With the graph-based correlation scores, scGREAT filled the gap in multi-omics regulatory analysis by enabling labeled and unlabeled analysis, functional annotation, and visualization.

Using the same KNN graph constructed in the sub-clustering process, trajectory analysis was performed with functions in scGREAT utilizing diffusion pseudo-time implemented by Scanpy, and the pseudo-time labels were transferred back to single-cell data.






□ VIMCCA: A multi-view latent variable model reveals cellular heterogeneity in complex tissues for paired multimodal single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad005/6978155

VIMCCA uses a common latent variable to interpret the common source of variances in two different data modalities. VIMCCA jointly learns an inference model and two modality-specific non-linear models via variational optimization and multilayer neural network backpropagation.

VIMCCA projects the single latent factor into multi-modal observation spaces by modality-specific non-linear functions. VIMCCA allows us to directly integrate raw peak counts of scATAC-seq and gene expression of scRNA-seq without converting peak counts into gene activity matrix.





□ MetaCortex: Capturing variation in metagenomic assembly graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad020/6986127

MetaCortex, a de Bruijn graph metagenomic assembler that is built upon data structures and graph-traversal algorithms developed for the Cortex assembler.

MetaCortex captures variation by looking for signatures of polymorphisms in the de Bruijn graph constructed from the reads and represents this in sequence graph format (both FASTG and GFA v2), and the usual FASTA format.

MetaCortex generates sequence graph files that preserve intra-species variation (e.g. viral haplotypes), and implements a new graph traversal algorithm to output variant contig sequences.





□ Gentrius: Identifying equally scoring trees in phylogenomics with incomplete data

>> https://www.biorxiv.org/content/10.1101/2023.01.19.524678v1

Gentrius - a deterministic algorithm to generate binary unrooted trees from incomplete unrooted subtrees. For a tree inferred with any phylogenomic method and a species per locus presence-absence matrix, Gentrius generates all trees from the corresponding stand.

Gentrius systematically assesses the influence of missing data on phylogenomic analysis and enhances the confidence of evolutionary conclusions. When all trees from a stand are generated, one can subsequently study their topological differences with routine phylogenetic methods.





□ ggCaller: Accurate and fast graph-based pangenome annotation and clustering

>> https://www.biorxiv.org/content/10.1101/2023.01.24.524926v1

ggCaller (graph gene-caller), a population-wide gene-caller based on De Bruijn Graphs. ggCaller uses population-frequency information to guide gene prediction, aiding the identification of homologous start codons across orthologues, and consistent scoring of orthologues.

ggCaller traverses Bifrost graphs constructed from genomes to identify putative gene sequences, known as open reading frames (ORFs). ggCaller can be applied in pangenome-wide association studies (PGWAS), enabling reference-agnostic functional inference of significant hits.





□ RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes

>> https://www.biorxiv.org/content/10.1101/2023.01.22.525080v1

RawHash provides the mechanisms for generating hash values from both a raw nanopore signal and a reference genome such that similar regions between the two can be efficiently and accurately found by matching their hash values.

RawHash combines multiple consecutive quantized events into a single hash value. RawHash uses a chaining algorithm that finds collinear matching hash values generated from regions that are close to each other both in the reference genome and the raw nanopore signal.
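
A small sketch of the quantize-and-pack idea, assuming normalized event values and an illustrative bin count (not RawHash's exact quantization or packing scheme):

```python
def quantize(event, n_bins=16, lo=-3.0, hi=3.0):
    """Map a normalized current value to one of n_bins levels."""
    step = (hi - lo) / n_bins
    return min(n_bins - 1, max(0, int((event - lo) / step)))

def hash_events(events, k=6, n_bins=16):
    """Pack k consecutive quantized events into a single integer hash."""
    hashes = []
    for i in range(len(events) - k + 1):
        h = 0
        for e in events[i:i + k]:
            h = h * n_bins + quantize(e, n_bins)
        hashes.append(h)
    return hashes
```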





□ HNNVAT: Adversarial dense graph convolutional networks for single-cell classification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad043/6994183

HNNVAT, a hybrid neural network that not only extracts both low-order and high-order features of the data but also adaptively balances the features of the data extracted by different convolutional layers with a self-attention mechanism.

HNNVAT uses virtual adversarial training to improve the generalization and robustness. A convolutional network structure w/ a dense connectivity mechanism is developed to extract comprehensive cell features and expression relationships b/n cells and genes in different dimensions.





□ ResActNet: Secure Deep Learning on Genomics Data via a Homomorphic Encrypted Residue Activation Network

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524344v1

ResActNet, a novel homomorphic encryption (HE) scheme to address the nonlinear mapping issues in deploying secure deep models utilizing HE. ResActNet is built on a residue activation layer to fit the nonlinear mapping in hidden layers of deep models.

ResActNet employs a scaled power function as the nonlinear activation, where a scalar term is used to tune the convergence of the network. ResActNet deploys a residue activation strategy: it constrains the Scaled Power Activation (SPA) to the residue of the latent vector.





□ EMERALD: Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters

>> https://www.biorxiv.org/content/10.1101/2023.01.11.523286v1

EMERALD effectively explores suboptimal alignment paths within the pairwise dynamic programming matrix. EMERALD embraces the diversity of possible alignment solutions, by revealing alignment-safe intervals of the two sequences.

EMERALD projects the safety intervals (safety windows) back to the representative sequence, thereby annotating the sequence intervals that are robust across all possible alignment configurations within the suboptimal alignment space.





□ PS-SNC: A partially shared joint clustering framework for detecting protein complexes from multiple state-specific signed interaction networks

>> https://www.biorxiv.org/content/10.1101/2023.01.16.524205v1

PS-SNC, a partially shared non-negative matrix factorization model to identify protein complexes in two state-specific signed PPI networks jointly. PS-SNC can not only consider the signs of PPIs, but also identify the common and unique protein complexes in different states.

PS-SNC employs the Hilbert-Schmidt Independence Criterion (HSIC) to construct the diversity constraint. HSIC can measure the dependence of variables by mapping variables to a Reproducing Kernel Hilbert Space (RKHS), which can measure more complicated correlations.
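
For reference, a minimal numpy implementation of the biased empirical HSIC with RBF kernels (the kernel choice here is an assumption):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with RBF kernels; larger values indicate stronger dependence
    between the two factor matrices X, Y (n_samples x n_features each)."""
    def rbf(A):
        sq = np.sum(A ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
        return np.exp(-d2 / (2.0 * sigma ** 2))
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = rbf(X), rbf(Y)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```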





□ micrographs of 1D anatase-like materials, or 1DA, with each dot representing a Ti atom. (Cell Press)


□ NGC 346, one of the most dynamic star-forming regions in nearby galaxies. (esawebb)





EUROfusion

>> https://www.mpg.de/19734973/brennpunkte-der-kernfusion

#fusionenergy promises to be a clean and practically inexhaustible #energy source. But how do the different fusion designs compare?





□ UPP2: Fast and Accurate Alignment of Datasets with Fragmentary Sequences

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad007/6982552

UPP2, a direct improvement on UPP (Ultra-large alignments using Phylogeny-aware Profiles). The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime.

UPP2 computes a set of subset alignments by hierarchically decomposing the backbone tree at a centroid edge. UPP2 builds an HMM on each set created during this decomposition, incl. the full set, thus producing an ensemble of HMMs (eHMM) for the backbone alignment.





□ scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac625/6984787

By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation.

scDCCA extracts valuable features and realizes cell segregation end-to-end by introducing contrastive learning and a denoising ZINB-based auto-encoder into a deep clustering framework. scDCCA incorporates a dual contrastive learning module to capture the pairwise cell proximity.





□ SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.01.09.523201v1

SemiBin2 uses self-supervised learning to learn feature embeddings from the contigs. SemiBin2 can reconstruct 8.3%–21.5% more high-quality bins and requires only 25% of the running time and 11% of peak memory usage in real short-read sequencing samples.





□ xcore: an R package for inference of gene expression regulators

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05084-0

Xcore provides a flexible framework for integrative analysis of gene expression and publicly available TF binding data to unravel putative transcriptional regulators and their activities.

Xcore takes a promoter or gene expression counts matrix as input; the data are then filtered for lowly expressed features, normalized for library size and transformed into counts per million.

Xcore intersects the peaks with promoter regions and uses linear ridge regression to infer the regulators associated with observed gene expression changes.
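
A minimal scikit-learn sketch of that ridge step, with assumed input shapes (expression log-fold-changes against a binary promoter-by-TF overlap matrix):

```python
import numpy as np
from sklearn.linear_model import Ridge

def infer_activities(logfc, tf_binding, alpha=1.0):
    """Regress per-promoter expression changes (logfc, shape (n_promoters,)) on a binary
    promoter x TF overlap matrix (tf_binding); coefficients act as inferred activities."""
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(tf_binding, logfc)
    return model.coef_      # one activity estimate per TF signature
```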





□ SiFT: Uncovering hidden biological processes by probabilistic filtering of single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524512v1

SiFT (SIgnal FilTering) uncovers underlying processes of interest. Utilizing existing prior knowledge and reconstruction tools for a specific biological signal, such as spatial structure, SiFT filters the signal and uncovers additional biological attributes.

SiFT computes a probabilistic cell-cell similarity kernel, which captures the similarity between cells according to the biological signal we wish to filter. Using this kernel, we obtain a projection of the cells onto the signal in gene expression space.
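
Conceptually, filtering then amounts to subtracting the kernel-smoothed projection from the expression matrix; a minimal sketch (SiFT's actual implementation includes more machinery):

```python
import numpy as np

def sift_filter(X, K):
    """X: (cells x genes) expression; K: (cells x cells) row-stochastic similarity kernel.
    K @ X is the part of the expression explained by the signal being filtered;
    the residual is returned as the filtered expression."""
    return X - K @ X
```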





□ skani: Fast and robust metagenomic sequence comparison through sparse chaining with skani

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524587v1

skani, a method for calculating average nucleotide identity (ANI) using sparse approximate alignments. skani is more accurate than FastANI for comparing incomplete, fragmented MAGs.

skani uses a very sparse k-mer chaining procedure to quickly find orthologous regions between two genomes. skani's fast ANI filter first computes the max-containment index for a very sparse set of marker FracMinHash k-mers to approximate ANI.
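
The containment-to-ANI idea behind such a filter can be illustrated with the standard k-mer identity approximation (this is the generic Mash-style estimate, not necessarily skani's exact formula):

```python
def max_containment(sketch_a, sketch_b):
    """Max-containment of two k-mer sketches given as Python sets."""
    inter = len(sketch_a & sketch_b)
    return inter / min(len(sketch_a), len(sketch_b))

def approx_ani(sketch_a, sketch_b, k=21):
    """Convert containment into an approximate per-base identity: if identity is p,
    a k-mer survives with probability ~p^k, so p ~ containment^(1/k)."""
    c = max_containment(sketch_a, sketch_b)
    return 0.0 if c == 0 else c ** (1.0 / k)
```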





□ VAG: Visualization and review of reads alignment on the graphical pan-genome

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524849v1

VAG includes multifunctional modules integrated into a single command line and an online visualization platform supported through a web server. VAG can extract specific sequence regions from a graph pangenome and display read alignments on different paths of a graph pangenome.

The utilization of mate-pair information in VAG provides a reliable reference for variation identification. VAG can display inversions in the graph pangenome and the direction of read alignments on the forward or reverse strands.





□ NORTA: Investigating the Complexity of Gene Co-expression Estimation for Single-cell Data

>> https://www.biorxiv.org/content/10.1101/2023.01.24.525447v1

Zero-inflated Gaussian (ZI-Gaussian) assumes non-zero values of the normalized gene expression matrix following a Gaussian distribution. This strategy generates a co-expression network and constructs a partial correlation matrix (i.e., the inverse of the covariance matrix).

Zero-inflated Poisson (ZI-Poisson) generates a gene expression matrix through a linear combination. In order to have zeros, it then multiplies each element in the GE matrix with a Bernoulli random variable.

NORmal-To-Anything (NORTA) is based on the normal-to-anything approach, which transforms multivariate Gaussian samples to samples with any given marginal distributions while preserving a given covariance.
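
A minimal NORTA sketch for Poisson marginals (the marginal choice and parameters are illustrative):

```python
import numpy as np
from scipy import stats

def norta_poisson(cov, lams, n_samples, seed=0):
    """Draw correlated Gaussians with covariance `cov`, push them through the normal CDF,
    then invert Poisson marginals with rates `lams`."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(lams)), cov, size=n_samples)
    u = stats.norm.cdf(z)                                   # correlated uniforms
    counts = [stats.poisson.ppf(u[:, j], lam) for j, lam in enumerate(lams)]
    return np.column_stack(counts)                          # (n_samples, n_genes) counts
```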

Single-cell ExpRession of Genes In silicO (SERGIO) models the stochasticity of transcription and regulators with stochastic differential equations (SDEs). Concretely, it first generates a dense gene expression matrix in logarithmic scale at stationary state.





□ Species-aware DNA language modeling

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525670v1

In MLM, parts of an input sequence are hidden (masked) and a model is tasked to reconstruct them. Models trained in this way learn syntax and semantics of natural language and achieve state-of-the-art performance on many downstream tasks.

A state space model for language modeling in genomics. A species-aware masked nucleotide language model trained on a large corpus of species genomes can be used to reconstruct known RNA binding consensus motifs significantly better than chance and species-agnostic models.





□ DeepSelectNet: deep neural network based selective sequencing for oxford nanopore sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05151-0

DeepSelectNet is a deep neural network-based method capable of classifying species DNA directly using nanopore current signals with superior classification accuracy. DeepSelectNet is built on a convolutional architecture based on ResNet’s residual blocks.

DeepSelectNet utilizes one-dimensional convolutional layers to perform 1D convolution over nanopore current signals in the time domain. Additionally, DeepSelectNet relies on neural net regularization to minimise model complexity thereby reducing the overfitting of data.





□ Co-evolution integrated deep learning framework for variants generation and fitness prediction

>> https://www.biorxiv.org/content/10.1101/2023.01.28.526023v1

EVPMM (evolutionary integrated viral protein mutation machine), a co-evolution profiles integrated deep learning framework for dominant variants forecasting, vital mutation sites prediction and fitness landscape depicting.

EVPMM consists of a position detector to directly detect the functional positions as well as a mutant predictor to depict fitness landscape. Moreover, pairwise dependencies between residues obtained by a Markov Random Field are also incorporated to promote reasonable variant generation.





□ SSWD: A clustering method for small scRNA-seq data based on subspace and weighted distance

>> https://peerj.com/articles/14706/

SSWD follows the assumption that the sets of gene subspace composed of similar density-distributing genes can better distinguish cell groups. SSWD uses a new distance metric EP_dis, which integrates Euclidean and Pearson distance through a weighting strategy.

Each of the gene subspace’s clustering results was summarized using the consensus matrix integrated by PAM clustering. The relative Calinski-Harabasz (CH) index was used to estimate the cluster numbers instead of the CH index because it is comparable across degrees of freedom.





□ scDASFK: Denoising adaptive deep clustering with self-attention mechanism on single-cell sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad021/7008799

scDASFK, a new adaptive fuzzy clustering model based on the denoising autoencoder and self-attention mechanism. It implements comparative learning to integrate cell-similarity information into the clustering method and uses a deep denoising network module to denoise the data.

scDASFK consists of a self-attention mechanism for further denoising where an adaptive clustering optimization function for iterative clustering is implemented. scDASFK uses a new adaptive feedback mechanism to supervise the denoising process through the clustering.





It transformed to human form for her to see.

2023-01-31 23:10:11 | Science News

(A portrait of her looking into light. #midjourney)




□ Holographic properties of quantum space are recovered from the entanglement structure of spin network states in group field theories, revealing deep connections between quantum information and gravity.

>> https://avs.scitation.org/doi/full/10.1116/5.0087122

The focus is on finite regions of 3D quantum space modeled by spin networks, i.e., graphs decorated by quantum geometric data, which enter, as kinematical states, various background-independent approaches to quantum gravity.

Crucially, such states are understood as arising from the entanglement of the quantum entities (“atoms of space”) composing the spacetime microstructure in the group field theory (GFT) framework, that is, as graphs of entanglement.

The computation of the entanglement entropy of spin network states can be highly simplified by the use of random tensor network techniques. It shows how to compute the Rényi-2 entropy of a certain class of spin network states via a statistical model.





□ OKseqHMM: a genome-wide replication fork directionality analysis toolkit

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac1239/6984591

OKseqHMM, an integrative bioinformatics toolkit to directly obtain RFD profiles genome-wide and at high resolution.

In addition to the fork progression direction, OKseqHMM gives information on replication initiation/termination zones and on long-travelling unidirectional forks using an HMM-based algorithm, and calculates the OEM to visualize the transition of the RFD profile at multiple scales.





□ PRANA: A pseudo-value regression approach for differential network analysis of co-expression data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05123-w

A regression modeling method that regresses the jackknife pseudo-values derived from a measure of connectivity of genes in a network to estimate the effects of predictors.

PRANA, a novel pseudo-value regression approach for the DN analysis, which can incorporate additional clinical covariates in the model. This is a direct regression modeling, and it is therefore computationally amenable.
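
The jackknife pseudo-value construction at the core of this approach can be sketched as follows, with `connectivity_fn` standing in for any per-gene connectivity measure (a hypothetical placeholder):

```python
import numpy as np

def jackknife_pseudovalues(connectivity_fn, expr):
    """expr: (n_samples, n_genes); connectivity_fn maps an expression matrix to a
    (n_genes,) connectivity vector. Returns (n_samples, n_genes) pseudo-values to be
    regressed on clinical covariates, one gene at a time."""
    n = expr.shape[0]
    theta_all = connectivity_fn(expr)
    pseudo = np.empty((n, theta_all.size))
    for i in range(n):
        theta_i = connectivity_fn(np.delete(expr, i, axis=0))   # leave sample i out
        pseudo[i] = n * theta_all - (n - 1) * theta_i           # i-th jackknife pseudo-value
    return pseudo
```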





□ FastRecomb: Fast inference of genetic recombination rates in biobank scale data

>> https://www.biorxiv.org/content/10.1101/2023.01.09.523304v1

FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. FastRecomb avoids explicit outputting of IBD segments, a potential I/O bottleneck.

FastRecomb leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. FastRecomb uses PBWT blocks to avoid redundant counting of pairwise matches.





□ Alignment-free estimation of sequence conservation for identifying functional sites using protein sequence embeddings

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac599/6984799

A method for estimating protein sequence conservation using embedding vectors generated from protein language models. The embedding vectors generated from the ESM2 family of protein language models provide the best performance to computational cost ratio.

The sequence embedding is shown as a two-dimensional numerical matrix where each column corresponds to the embedding of one residue position. Conservation scores can be calculated for each residue position using regression.





□ CellCharter: a scalable framework to chart and compare cell niches across multiple samples and spatial -omics technologies.

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523386v1

CellCharter, an algorithmic framework for the identification, characterization, and comparison of cellular niches from heterogeneous spatial transcriptomics and proteomics datasets comprising multiple samples.

CellCharter introduces an approach that assesses the stability of a given number of clusters based on the Fowlkes-Mallows index. Switching from one VAE to another will not affect the rest of the analyses. CellCharter builds a network of cells/spots based on spatial proximity.
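
A hedged sketch of one way to assess cluster-number stability with the Fowlkes-Mallows index, clustering random subsamples of an embedding with a Gaussian mixture (not CellCharter's exact protocol):

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score
from sklearn.mixture import GaussianMixture

def cluster_stability(emb, n_clusters, n_repeats=5, subsample=0.8, seed=0):
    """Cluster random subsamples of the embedding and score pairwise agreement
    with the Fowlkes-Mallows index on the cells shared between subsamples."""
    rng = np.random.default_rng(seed)
    n = emb.shape[0]
    labelings = []
    for r in range(n_repeats):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        gm = GaussianMixture(n_components=n_clusters, random_state=r).fit(emb[idx])
        labelings.append(dict(zip(idx, gm.predict(emb[idx]))))
    scores = []
    for a in range(n_repeats):
        for b in range(a + 1, n_repeats):
            shared = sorted(set(labelings[a]) & set(labelings[b]))
            la = [labelings[a][i] for i in shared]
            lb = [labelings[b][i] for i in shared]
            scores.append(fowlkes_mallows_score(la, lb))
    return float(np.mean(scores))
```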





□ nleval: A Python Toolkit for Generating Benchmarking Datasets for Machine Learning with Biological Networks

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523485v1

nleval (biological network learning evaluation), a Python package providing unified data (pre-)processing tools to set up ML-ready network biology datasets with standardized data splitting strategies.

nleval can show the need for specialized GNN architectures. nleval comes with seven genome-scale human gene interaction networks and four collections of gene classification tasks, which can be combined into 28 datasets to benchmark different graph ML methods’ capability.





□ MutExMatSorting: A heuristic algorithm solving the mutual-exclusivity sorting problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad016/6986128

MutExMatSorting: an R package implementing a computationally efficient algorithm able to sort rows and columns of a binary matrix to highlight mutual exclusivity patterns.

The MutExMatSorting algorithm minimises the extent of collective vertical overlap between consecutive non-zero entries across rows while maximising the number of adjacent non-zero entries in the same row.





□ EWF : simulating exact paths of the Wright-Fisher diffusion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad017/6984715

EWF, a robust and efficient sampler which returns exact draws for the diffusion and diffusion bridge processes, accounting for general models of selection including those with frequency-dependence.

EWF returns draws at the requested sampling times from the law of the corresponding Wright–Fisher process. Output was validated by comparison to approximations of the transition density via the Kolmogorov–Smirnov test and QQ plots.





□ ODNA: Identification of Organellar DNA by Machine Learning

>> https://www.biorxiv.org/content/10.1101/2023.01.10.523051v1

ODNA, a minimal pre-defined genome annotation tool based on MOSGA, which gathers the same annotation features and includes the best ML model. ODNA can classify whether a sequence inside a genome assembly is of organellar origin.

ODNA annotates, for each sequence in each genome assembly, the repeating elements via Red, ribosomal RNAs with barrnap, transfer RNAs with tRNAScan-SE 2, CpG islands with newcpgreport from EMBOSS, and DIAMOND searches against mitochondrial and plastid gene databases.





□ DeepSom: a CNN-based approach to somatic variant calling in WGS samples without a matched normal

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac828/6986966

DeepSom - a new pipeline for identifying somatic SNP and short INDEL variants in tumor WGS samples without a matched normal. DeepSom can effectively filter out both artefacts and germline variants under conditions of a typical WGS experiment.

DeepSom could potentially be extended to a three-class problem, simultaneously classifying somatic vs germline vs artefact variants, by modifying the CNN architecture and changing the loss function accordingly.

The current design of DeepSom already considers mutational context, VAF, and read orientation-specific information encoded in the variant tensor, so DeepSom could potentially further classify detected artefacts into subclasses, including oxoG, FFPE or other strand bias artefacts.





□ GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data

>> https://www.biorxiv.org/content/10.1101/2023.01.13.524024v1

GeneMark-ETP, a new computational tool integrating genomic, transcriptomic, and protein information throughout all the stages of the algorithm’s training and gene prediction.

Protein-based evidence, producing hints to the locations of introns and exons in genomic DNA, is generated by using homologous proteins of any evolutionary distance. If the number of high-confidence genes is sufficiently large, the GHMM training is done in a single iteration.





□ Scalable Nanopore sequencing of human genomes provides a comprehensive view of haplotype-resolved variation and methylation

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523790v1

An efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short-reads for large-scale genomics projects.

Small indel calling remains difficult inside homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Using ONT-based phasing, small and structural variants can then be combined and phased at megabase scales.





□ Modeling and analyzing single-cell multimodal data with deep parametric inference

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad005/6987655

Deep Parametric Inference (DPI) transforms single-cell multimodal data into a multimodal parameter space by inferring individual modal parameters. DPI can reference and query cell types without batch effects.

DPI can successfully analyze the progression of COVID-19 disease in peripheral blood mononuclear cells (PBMC). Notably, they further propose a cell state vector field and analyze the transformation pattern of bone marrow cells (BMC) states.





□ GSEL: A fast, flexible python package for detecting signatures of diverse evolutionary forces on genomic regions

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad037/6992660

GSEL’s built-in parallelization and vectorization enable rapid processing of large numbers of sets (each of which may contain many genomic regions), even when generating empirical backgrounds based on thousands of permutations each with thousands of control regions.

GSEL begins by identifying independent LD blocks among the input regions using the '--clump' flag in PLINK. The score for each set is calculated from a summary statistic computed across the extreme values at each region (e.g., mean or max).





□ NGSNGS: Next generation simulator for next generation sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad041/6994180

NGSNGS, a multithreaded next-generation simulation tool for NGS. NGSNGS can simulate reads with platform specific characteristics based on nucleotide quality score profiles, as well as incl. a post-mortem damage model which is relevant for simulating ancient DNA.

The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies, or even population haplotypes and allows the user to simulate known variable sites directly.





□ LRphase: an efficient method for assigning haplotype identity to long reads

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524565v1

LRphase is a command-line utility for phasing long sequencing-reads based on haplotype-resolved heterozygous variants from all contributing genomes, for example, the maternal and paternal genomes of a diploid organism.

In LRphase, long sequencing reads are prepared from genomic DNA fragments isolated from cells w/ available haplotype data for all parental phases. Reads are mapped to the reference genome, either within LRphase using minimap2, or externally using any desired mapping/filtering workflow.





□ HiCLift: A fast and efficient tool for converting chromatin interaction data between genome assemblies

>> https://www.biorxiv.org/content/10.1101/2023.01.17.524475v1

HiCLift (previously known as pairLiftOver), a fast and efficient tool that can convert the genomic coordinates of chromatin contacts such as Hi-C and Micro-C from one assembly to another, including the latest T2T genome.

To maximize the mappability ratio, for each pair of bins, HiCLift searches for loci that can be uniquely mapped to the target genome, and randomly samples a pair of mappable loci for each contact between corresponding bins.





□ 4CAC: 4-class classification of metagenome assemblies using machine learning and assembly graphs

>> https://www.biorxiv.org/content/10.1101/2023.01.20.524935v1

4CAC (4-Class Adjacency-based Classifier) generates an initial four-way classification using several sequence length-adjusted XGBoost algorithms and further improves the classification using the assembly graph.

4CAC dynamically maintains a list of implied contigs sorted in decreasing order of the number of their classified neighbors. 4CAC utilizes the adjacency information in the assembly graph to improve the classification of short contigs and of contigs classified w/ lower confidence.





□ IGSimpute: Accurate and interpretable gene expression imputation on scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.22.525114v1

IGSimpute, an accurate and interpretable deep learning model, to impute the missing values in gene expression profiles derived from scRNA-seq by integrating instance-wise gene selection and gene-gene interaction layers into an autoencoder.

IGSimpute accepts all types of input gene expression matrices including raw counts, counts per million (CPM), reads per kilobase of exons per million mapped reads (RPKM), fragments per kilobase exons per million mapped fragments (FPKM) and transcripts per million (TPM).





□ NetProphet 3: A Machine-learning framework for transcription factor network mapping and multi-omics integration

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad038/7000334

NetProphet3 combines scores from multiple analyses automatically, using a tree boosting algorithm trained on TF binding location data. NP3 combines four weighted networks DE, LASSO, BART, and PWM using XGBoost.

Each possible (TF, target) edge is an instance with features consisting of its evidence scores and binary labels based on whether there is evidence that the TF binds the target’s regulatory DNA.





□ Utility of long-read sequencing for All of Us

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525236v1

Investigating the utility of long-reads for the All of Us program using a combination of publicly available control / long-read data collected using a range of tissue types and extraction methods from samples previously used inside All of Us to establish the short-read pipeline.

To make the work scalable and reproducible, the pipeline is implemented using the Workflow Definition Language (WDL). They compare this pipeline with Illumina whole genome data processed with DRAGEN, the All of Us production short-read pipeline, to assess long-read utility.





□ Explain-seq: an end-to-end pipeline from training to interpretation of sequence-based deep learning models

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525250v1

Explain-seq, an end-to-end computational pipeline to automate the process of developing and interpreting deep learning models in the context of genomics. Explain-seq takes genomic sequences as input and outputs predictive motifs derived from the model trained on the sequences.

Explain-seq takes as input genomic region coordinates with labels for classification tasks, or one-hot encoded sequences with continuous values for regression tasks. Optionally, weights from a pre-trained model can be transferred to the new model for transfer learning.





□ scNanoGPS: Delineating genotypes and phenotypes of individual cells from long-read single cell transcriptomes

>> https://www.biorxiv.org/content/10.1101/2023.01.24.525264v1

scNanoGPS (single cell Nanopore sequencing analysis of Genotypes and Phenotypes Simultaneously) deconvolutes error-prone long-reads into single-cells and single-molecules and calculates both genotypes and phenotypes in individual cells from high throughput scNanoRNAseq data.

iCARLO (Anchoring and Refinery Local Optimization), an algorithm to detect true cell barcodes. The CBs within two Levenshtein Distances (LDs) are curated and merged to rescue mis-assigned reads due to errors in CB sequences.
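
A generic sketch of merging barcodes within two edit distances into higher-count anchors (iCARLO's actual anchoring and refinery steps are more involved):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two barcode strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def merge_barcodes(counts, max_dist=2):
    """counts: dict barcode -> read count. Returns a mapping barcode -> anchor barcode,
    absorbing low-count barcodes into nearby high-count anchors."""
    anchors = sorted(counts, key=counts.get, reverse=True)
    mapping = {}
    for bc in anchors:
        if bc in mapping:
            continue
        mapping[bc] = bc
        for other in anchors:
            if other not in mapping and levenshtein(bc, other) <= max_dist:
                mapping[other] = bc
    return mapping
```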

scNanoGPS detects transcriptome-wide point-mutations with accuracy by building consensus sequences of single molecules and performing consensus filtering of cellular prevalence, which removes most false calls due to random sequencing errors.





□ tiSFM: An intrinsically interpretable neural network architecture for sequence to function learning

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525572v1

tiSFM (totally interpretable sequence to function model) improves the performance of multi-layer convolutional models. While tiSFM is itself technically a multi-layer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs.

tiSFM’s model architecture makes use of convolutions with a fixed set of kernel weights representing known transcription factor (TF) binding site motifs. The final linear layer directly maps TFs to outputs and can produce a meaningful TF by output matrix with no post processing.





□ Gdaphen: R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05111-0

Gdaphen takes as input behavioral/clinical data and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or anonymize genotype-based recordings.

Gdaphen uses as optimized input the non-correlated variables with 30% correlation or higher on the MFA-Principal Component Analysis (PCA), increasing the discriminative power and the classifier’s predictive model efficiency.

Gdaphen can determine the strongest variables that predict gene dosage effects thanks to the General Linear Model (GLM)-based classifiers or determine the most discriminative not linear distributed variables thanks to Random Forest (RF) implementation.





□ macrosyntR : Drawing automatically ordered Oxford Grids from standard genomic files in R

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525673v1

Macrosynteny refers to the conservation of chromosomal to sub-chromosomal domains across species. Pairwise comparisons of syntenic relationships between de-novo assembled genomes, based on predicted protein sequences, often use a graphical visualization called an Oxford grid.

macrosyntR automatically identifies, orders and plots the relative spatial arrangement of orthologous genes on Oxford Grids. It features an option to use a network-based greedy algorithm to cluster the sequences that are likely to originate from the same ancestral chromosome.





□ L0 segmentation enables data-driven concise representations of diverse epigenomic data

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525794v1

L0 segmentation as a universal framework for extracting locally coherent signals for diverse epigenetic sources. L0 segmentation retains salient genomic features.

L0 segmentation efficiently represents epigenetic tracks while retaining many salient features such as peaks, promoters and ChromHMM states, and makes no assumptions about the underlying signal structure beyond piecewise constancy.





□ EigenDel: Detecting genomic deletions from high-throughput sequence data with unsupervised learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05139-w

EigenDel first takes advantage of discordant read-pairs and clipped reads to get initial deletion candidates, and then it clusters similar candidates by using unsupervised learning methods. After that, EigenDel uses a carefully designed approach for calling true deletions from each cluster.

EigenDel processes each chromosome separately to call deletions. For each chromosome, EigenDel extracts discordant read-pairs and clipped reads from mapped reads. Then, the initial deletion candidates are determined by grouping nearby discordant read-pairs.





□ STARE: The adapted Activity-By-Contact model for enhancer-gene assignment and its application to single-cell data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad062/7008325

Any model to annotate enhancer-gene interactions is only a prediction and likely does not capture the whole regulatory complexity of genes. The ABC-model requires two data types, which makes it applicable in a range of scenarios, but it might also miss out on relevant epigenetic information.

STARE can compute enhancer-gene interactions from single-cell chromatin accessibility data. After mapping candidate enhancers to genes, using either the ABC-score or a window-based approach, STARE summarises TF affinities on a gene level.





□ FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac091/6851693

FUSTA is a FUSE-based virtual filesystem mirroring a (multi)FASTA file as a hierarchy of individual virtual files, simplifying efficient data extraction and bulk/automated processing of FASTA files.

The virtual files exposed by FUSTA behave like standard flat text files, and provide automatic compatibility w/ all existing programs. FUSTA can operate on gapped and wrapped files, and files containing empty sequences, and supports any character within the sequences themselves.





□ iEnhancer-SKNN: a stacking ensemble learning-based method for enhancer identification and classification using sequence information

>> https://academic.oup.com/bfg/advance-article-abstract/doi/10.1093/bfgp/elac057/7008796

iEnhancer-SKNN, a two-layer prediction model, in which the function of the first layer is to predict whether the given DNA sequences are enhancers or non-enhancers, and the function of the second layer is to distinguish whether the predicted enhancers are strong enhancers or weak enhancers.

iEnhancer-SKNN achieves an accuracy of 81.75%, an improvement of 1.35% to 8.75% compared with other predictors, and in enhancer classification, iEnhancer-SKNN achieves an accuracy of 80.50%, an improvement of 5.5% to 25.5% compared with other predictors.






□ PyGenePlexus: A Python package for gene discovery using network-based machine learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad064/7017525

PyGenePlexus provides predictions of how associated every gene is to the input gene set, offers interpretability by comparing the model trained on the input gene set to models trained on thousands of gene sets, and returns the network connectivity of the top predicted genes.

PyGenePlexus lets the user input a set of genes and choose a desired network. PyGenePlexus trains a custom ML model and returns the probability of how associated every gene in the network is to the user-supplied gene set, along w/ the network connectivity of the top predicted genes.



Octanium.

2023-01-31 23:09:11 | Science News

(Art by kalsloos)


The difficulty of the individual facing the individual is that admonishing "intolerance" is itself deemed intolerant. Intolerance within social norms lends effectiveness to a temporary dynamic equilibrium and to the internalization of order. Intolerance toward others is a self-binding that spreads across the plane. Since "things going in the wrong direction" stems from mutual bias, there is not a single thing about it that can be brought about by intention.



□ MaxFuse: Integration of spatial and single-cell data across modalities with weak linkage

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523851v1

MaxFuse (MAtching X-modality via FUzzy Smoothed Embedding) is modality-agnostic and is validated through comprehensive benchmarks on single-cell and spatial ground-truth multiome datasets. MaxFuse boosts the signal-to-noise ratio in the linked features within each modality.

MaxFuse goes beyond label transfer and attempts to match cells to precise positions on a graph-smoothed low-dimensional embedding. MaxFuse iteratively refines the matching step based on graph smoothing, linear assignment, and Canonical Correlation Analysis.
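
A hedged sketch of such an iterative match-then-refine loop using linear assignment and CCA (a simplification of the idea, not MaxFuse's implementation; it assumes equal cell counts and a shared set of standardized weakly linked features):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cross_decomposition import CCA

def match_and_refine(X_linked, Y_linked, X_all, Y_all, n_components=10, n_iter=3):
    """X_linked / Y_linked: standardized weakly linked features shared by both modalities
    (same columns, one row per cell); X_all / Y_all: full per-modality profiles."""
    Xs, Ys = X_linked, Y_linked
    for _ in range(n_iter):
        cost = -Xs @ Ys.T                              # higher similarity = lower cost
        rows, cols = linear_sum_assignment(cost)       # cross-modality cell matching
        cca = CCA(n_components=n_components)
        cca.fit(X_all[rows], Y_all[cols])              # refine the shared space on matched pairs
        Xs, Ys = cca.transform(X_all, Y_all)           # re-embed all cells, original order
    return rows, cols, Xs, Ys
```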





□ Revolution: Self-supervised learning for DNA sequences with circular dilated convolutional networks

>> https://www.biorxiv.org/content/10.1101/2023.01.30.526193v1

Revolution (ciRcular dilatEd conVOLUTIONal), a self-supervised learning for long DNA sequences. A circular dilated design of Revolution allows it to capture the long-range interactions in DNA sequences, while the pretraining benefits Revolution with only a few supervised labels.

Revolution can handle long sequences and accurately conduct DNA-sequence-based inference. The Revolution network in the predictor mixes the encoded information toward the inference target, and the pooling and linear layer perform the final ensemble.





□ SPEAR: a Sparse Supervised Bayesian Factor Model for Multi-omic Integration

>> https://www.biorxiv.org/content/10.1101/2023.01.25.525545v1

SPEAR jointly models multi-omics data w/ the response in a probabilistic Bayesian framework and models a variety of response types in regression / classification tasks, distinguishing itself from existing response-guided dimensionality reduction methods such as sMBPLS and DIABLO.

SPEAR decomposes high-dimensional multi-omic datasets into interpretable low-dimensional factors w/ high predictive power. SPEAR returns both sparse regression and full projection coefficients as well as feature-wise posterior probabilities used to assign feature significance.





□ DeepERA: deep learning enables comprehensive identification of drug-target interactions via embedding of heterogeneous data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525827v1

DeepERA identifies drug-target interactions based on heterogeneous data. This model assembles three independent feature embedding modules which each represent different attributes of the dataset and jointly contribute to the comprehensive predictions.

DeepERA specified three embedding components based on the formats and properties of the corresponding data: protein sequences and drug SMILES strings are processed by a CNN and a whole-graph GNN, respectively, in the intrinsic embedding component.





□ GRN-VAE: A Simplified and Stabilized SEM Model for Gene Regulatory Network Inference

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525733v1

GRN-VAE which stabilizes the results of DeepSEM by only restricting the sparsity of the adjacency matrix at a later stage. GRN-VAE improves stability and efficiency while maintaining accuracy by delayed introduction of the sparse loss term.

GRN-VAE uses a Dropout Augmentation, to improve model robustness by adding a small amount of simulated dropout to the data. To minimize the negative impact of dropout in single-cell data, GRN-VAE trains on non-zero data.





□ GraphGPSM: a global scoring model for protein structure using graph neural networks

>> https://www.biorxiv.org/content/10.1101/2023.01.17.524382v1

GraphGPSM uses an equivariant graph neural network (EGNN) architecture and a message passing mechanism is designed to update and transmit information between nodes and edges of the graph. The global score of the protein model is output through a multilayer perceptron.

Node and edge features include atomic-level backbone features encoded by Gaussian radial basis functions, residue-level ultrafast shape recognition (USR), Rosetta energy terms, distances and orientations, one-hot encoding of sequences, and sinusoidal position encoding of residues.
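
For illustration, encoding distances with Gaussian radial basis functions typically looks like the following (the centers and width here are assumptions):

```python
import numpy as np

def rbf_encode(distances, n_centers=16, d_max=20.0, gamma=0.5):
    """Expand distances (in Angstroms) into Gaussian radial basis features,
    one smooth bump per center."""
    centers = np.linspace(0.0, d_max, n_centers)
    d = np.asarray(distances)[..., None]            # (..., 1)
    return np.exp(-gamma * (d - centers) ** 2)      # (..., n_centers)
```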





□ G3DC: a Gene-Graph-Guided selective Deep Clustering method for single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.15.524109v1

G3DC incorporates a graph loss based on existing gene network, together with a reconstruction loss to achieve both discriminative and informative embedding. This method is well adapted to the sparse and zero-inflated scRNA-seq data with the l2,1-norm involved.

G3DC utilizes the Laplacian matrix of the gene-gene interaction graph to make adjacent genes have similar weights, and hence guides the feature selection, reconstruction, and clustering. G3DC offers high clustering accuracy with regard to agreement with true cell types.





□ GM-lncLoc: LncRNAs subcellular localization prediction based on graph neural network with meta-learning

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-09034-1

GM-lncLoc is based on the initial information extracted from the lncRNA sequence, and also combines the graph structure information to extract high level features of lncRNA. GM-lncLoc combines GCN and MAML in predicting lncRNA subcellular localization.

GM-lncLoc predicts lncRNA subcellular localization more effectively than GCN alone. GM-lncLoc is able to extract information from the perspective of non-Euclidean space, which is the most different from previous methods based on Euclidean space data.





□ scMaui: Decoding Single-Cell Multiomics: scMaui - A Deep Learning Framework for Uncovering Cellular Heterogeneity in Presence of Batch Effects and Missing Data

>> https://www.biorxiv.org/content/10.1101/2023.01.18.524506v1

scMaui (Single-cell Multiomics Autoencoder Integration), a stacked VAE-based single-cell multiomics integration model, and showed its capability of extracting essential features from extremely high-dimensional information in varied single-cell multiomics datasets.

scMaui can handle multiple batch effects accepting both discrete and continuous values, as well as provides varied reconstruction loss functions. scMaui encodes given data into a reduced dimensional latent space after processing each assay in parallel via separated encoders.





□ DESP: Demixing Cell State Profiles from Dynamic Bulk Molecular Measurements

>> https://www.biorxiv.org/content/10.1101/2023.01.19.524460v1

DESP, a novel algorithm that leverages independent readouts of cellular proportions, such as from single-cell RNA-seq or cell sorting, to resolve the relative contributions of cell states to bulk molecular measurements, most notably quantitative proteomics, recorded in parallel.

DESP’s mathematical model is designed to circumvent the poor mRNA-protein correlation. DESP accurately reconstructs cell state signatures from bulk-level measurements of both the proteome and transcriptome providing insights into transient regulatory mechanisms.
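
At its core, this kind of demixing can be framed as solving bulk ≈ proportions × signatures; a minimal non-negative least-squares sketch (DESP's actual model is more elaborate):

```python
import numpy as np
from scipy.optimize import nnls

def demix_states(bulk, proportions):
    """bulk: (n_samples, n_molecules) bulk measurements; proportions: (n_samples, n_states)
    cell-state proportions. Solves bulk ~ proportions @ signatures column by column with
    non-negative least squares and returns the (n_states, n_molecules) signatures."""
    n_states = proportions.shape[1]
    signatures = np.zeros((n_states, bulk.shape[1]))
    for j in range(bulk.shape[1]):
        signatures[:, j], _ = nnls(proportions, bulk[:, j])
    return signatures
```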





□ KOMPUTE: Imputing summary statistics of missing phenotypes in high-throughput model organism data

>> https://www.biorxiv.org/content/10.1101/2023.01.12.523855v1

Using conditional distribution properties of multivariate normal, KOMPUTE estimates association Z-scores of unmeasured phenotypes for a particular gene as a conditional expectation given the Z-scores of measured phenotypes.
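
The conditional expectation in question is the standard multivariate-normal formula; a minimal numpy sketch (KOMPUTE's estimator may add refinements such as shrinkage of the covariance):

```python
import numpy as np

def impute_zscores(z_obs, Sigma, obs_idx, miss_idx):
    """E[Z_miss | Z_obs] for a zero-mean multivariate normal with covariance Sigma
    (the phenotype-phenotype correlation): Sigma_mo @ Sigma_oo^{-1} @ z_obs."""
    Sigma_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    Sigma_mo = Sigma[np.ix_(miss_idx, obs_idx)]
    return Sigma_mo @ np.linalg.solve(Sigma_oo, z_obs)
```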

The KOMPUTE method demonstrated superior performance compared to the singular value decomposition (SVD) matrix completion method across all simulation scenarios.





□ Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data

>> https://www.biorxiv.org/content/10.1101/2023.01.14.524081v1

GSS converts the gene-level data into gene set-level information; gene sets contain genes representing distinct biological processes (e.g., same Gene Ontology annotation) or pathways (e.g., MSigDB). They conducted in-depth evaluation on the impact of different GA tools on GSS.

GSS helps to decipher single-cell heterogeneity and cell-type-specific variability by incorporating prior knowledge from functional gene sets or pathways. The pipeline for evaluating GSS tools involves an additional preprocessing step -- imputation of dropout peaks.





□ SVhound: detection of regions that harbor yet undetected structural variation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05046-6

SVhound is a framework to predict regions that harbour so far unidentified genotypes of Structural Variations. It uses a population size VCF file as input and reports the probabilities and regions across the population.

SVhound counts the number of different SV-alleles that occur in a sample of n genomes. SVhound predicts regions that can potentially harbor new structural variants (clairvoyant SV, clSV) by estimating the probability of observing a new SV-allele.





□ node2vec+: Accurately modeling biased random walks on weighted networks using node2vec

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad047/6998205

node2vec+, an improved version of node2vec that is more effective for weighted graphs by taking into account the edge weight connecting the previous vertex and the potential next vertex.

node2vec+ is a natural extension of node2vec; when the input graph is unweighted, the resulting embeddings of node2vec+ and node2vec are equivalent in expectation. Moreover, when the bias parameters are set to neutral, node2vec+ recovers a first-order random walk.





□ Gos: a declarative library for interactive genomics visualization in Python

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad050/6998203

Gos supports remote and local genomics data files as well as in-memory data structures. Gos integrates seamlessly within interactive computational environments, containing utilities to host and display custom visualizations within Jupyter, JupyterLab, and Google Colab notebooks.

Datasets are transformed to visual properties of marks via the Gos API to build custom interactive genomics visualizations. The field name / data type for an encoding may be specified w/ a simplified syntax (e.g., “peak:Q” denotes the “peak” variable w/ a quantitative data type).





□ CONTRABASS: Exploiting flux constraints in genome-scale models for the detection of vulnerabilities

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad053/7000333

CONTRABASS is a tool for the detection of vulnerabilities in metabolic models. The main purpose of the tool is to compute chokepoint and essential reactions by taking into account both the topology and the dynamic information of the model.

CONTRABASS can compute essential genes, compute and remove dead-end metabolites, compute different sets of growth-dependent reactions, and update the flux bounds of the reactions according to the results of Flux Variability Analysis.





□ PolyAMiner-Bulk: A Machine Learning Based Bioinformatics Algorithm to Infer and Decode Alternative Polyadenylation Dynamics from bulk RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.23.523471v1

PolyAMiner-Bulk utilizes an attention-based machine learning architecture and an improved vector projection-based engine to infer differential APA dynamics. PolyAMiner-Bulk can take either the raw read files in fastq format or the mapped alignment files in bam format as input.

PolyAMiner-Bulk not only identifies differential APA genes but also generates (i) read proportion heatmaps and (ii) read density visualizations of the corresponding bulk RNA-seq tracks and pseudo-3’UTR-seq tracks, allowing users to appreciate the differential APA dynamics.
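
As a conceptual illustration of proportion-based differential APA reasoning (not PolyAMiner-Bulk's attention or vector-projection engine), the sketch below compares per-gene polyA-site usage vectors between two conditions.

import numpy as np

def apa_shift(counts_ctrl, counts_case):
    """counts_*: reads assigned to each polyA site of one gene (proximal to distal).
    Returns the usage proportions per condition and a simple shift magnitude,
    the Euclidean distance between the two proportion vectors."""
    p_ctrl = np.asarray(counts_ctrl, float); p_ctrl /= p_ctrl.sum()
    p_case = np.asarray(counts_case, float); p_case /= p_case.sum()
    return p_ctrl, p_case, float(np.linalg.norm(p_case - p_ctrl))

# a gene with three polyA sites whose usage shifts toward the distal site in cases
print(apa_shift([120, 40, 20], [60, 50, 90])[2])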





□ ICARUS v2.0: Delineation of complex gene expression patterns in single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.01.23.525100v1

ICARUS v2.0 enables gene co-expression analysis with Multiscale Embedded Gene Co-expression Network Analysis (MEGENA), transcription factor regulated network identification w/ SCENIC, trajectory analysis with Monocle3, and characterisation of cell-cell communication w/ CellChat.

ICARUS v2.0 introduces cell cluster labelling with ScType, an ultra-fast unsupervised method for cell type annotation using compiled cell markers from the CellMarker database, and also provides the SingleR supervised cell-type assignment algorithm.





□ PPLasso: Identification of prognostic and predictive biomarkers in high-dimensional data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05143-0

PPLasso is particularly well suited to high-dimensional omics data in which the biomarkers are highly correlated, a setting that has not been thoroughly investigated yet.

PPLasso takes into account the correlations between biomarkers that can otherwise degrade biomarker selection accuracy. It consists in transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso.
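
A minimal sketch of the decorrelate-then-penalize idea, assuming an inverse-square-root whitening of the empirical correlation matrix followed by an ordinary Lasso; PPLasso's actual transformation and its generalized Lasso formulation differ in the details.

import numpy as np
from sklearn.linear_model import Lasso

def decorrelate_then_lasso(X, y, alpha=0.1, eps=1e-6):
    """Whiten the columns of X with the inverse square root of their empirical
    correlation matrix, then fit an L1-penalized model on the transformed design."""
    Xc = (X - X.mean(0)) / (X.std(0) + eps)
    corr = np.corrcoef(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(corr + eps * np.eye(corr.shape[0]))
    W = vecs @ np.diag(1.0 / np.sqrt(np.clip(vals, eps, None))) @ vecs.T
    model = Lasso(alpha=alpha).fit(Xc @ W, y)
    return model.coef_   # coefficients live in the decorrelated space

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 30))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=80)      # two strongly correlated biomarkers
y = 2 * X[:, 0] + rng.normal(size=80)
print(decorrelate_then_lasso(X, y)[:3])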





□ nf-core/circrna: a portable workflow for the quantification, miRNA target prediction and differential expression analysis of circular RNAs

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05125-8

nf-core/circrna offers a differential expression module to detect differentially expressed circRNAs and model changes in circRNA expression relative to its host gene guided by the phenotype.csv file provided by the user.

nf-core/circrna is the first portable workflow capable of performing the quantification, miRNA target prediction and differential expression analysis of circRNAs in a single execution.





□ FastContext: A tool for identification of adapters and other sequence patterns in next generation sequencing (NGS) data

>> https://vavilov.elpub.ru/jour/article/view/3582

The FastContext algorithm parses FastQ files (single-end / paired-end), searches each read / read pair for user-specified patterns, and generates a human-readable representation of the search results. FastContext gathers statistics on the frequency of occurrence of each read structure.

FastContext performs the search based on exact full matches, so a pattern sequence carrying even a single sequencing error is skipped as an unrecognized sequence. This matters for long patterns, which are underrepresented due to their higher cumulative frequency of sequencing errors.
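
A toy version of the full-match read-structure search, assuming patterns are plain substrings; FastContext's actual pattern syntax, statistics, and output format are richer.

def read_structure(seq, patterns):
    """Return the ordered list of user-defined patterns found in a read by exact
    substring search; unmatched stretches are reported as 'unknown'."""
    hits = sorted(
        (i, name) for name, pat in patterns.items()
        for i in range(len(seq)) if seq.startswith(pat, i)
    )
    structure, last = [], 0
    for pos, name in hits:
        if pos > last:
            structure.append("unknown")
        structure.append(name)
        last = pos + len(patterns[name])
    if last < len(seq):
        structure.append("unknown")
    return structure

print(read_structure("ACGTAGATCGGAAGAGCTTTT",
                     {"adapter": "AGATCGGAAGAGC", "polyT": "TTTT"}))
# ['unknown', 'adapter', 'polyT']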





□ SeqPanther: Sequence manipulation and mutation statistics toolset

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525629v1

SeqPanther is a Python application that provides the user with a suite of tools to interrogate the circumstances under which mutations occur and to modify the consensus as needed, for non-segmented bacterial and viral genomes in which reads are mapped to a reference.

SeqPanther generates detailed reports of mutations identified within a genomic segment or positions of interest, incl. visualization of the genome coverage and depth. SeqPanther features a suite of tools that perform various functions including codoncounter, cc2ns, and nucsubs.





□ r-pfbwt: Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525723v1

r-pfbwt is an algorithm for building the SA sample and RLBWT of MONI in a manner that removes the construction's dependency on the parse produced by prefix-free parsing.

This reduces the memory required by 2.7 times on large collections of chromosome 19 sequences. On full human genomes the reduction was even more pronounced, and r-pfbwt was the only method able to index 400 diploid human genome sequences.

Although the dictionary scales nicely (sub-linearly) with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of MONI, the parse therefore has to be removed from the construction of the RLBWT and the suffix array.
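
A toy prefix-free parsing decomposition, assuming CRC-based trigger windows; r-pfbwt's contribution is to recurse on the parse itself so the RLBWT and SA sample can be built without ever materializing the full parse, which this sketch does not attempt.

import zlib

def prefix_free_parse(text, w=4, p=17):
    """A width-w window is a 'trigger' when its CRC is 0 mod p; each phrase runs
    from one trigger to the next, and consecutive phrases overlap by that trigger.
    The dictionary holds the distinct phrases; the parse is the sequence of their ranks."""
    text = "$" * w + text + "$" * w
    is_trigger = lambda i: zlib.crc32(text[i:i + w].encode()) % p == 0
    cuts = [0] + [i for i in range(1, len(text) - w) if is_trigger(i)] + [len(text) - w]
    phrases = [text[a:b + w] for a, b in zip(cuts, cuts[1:])]
    dictionary = sorted(set(phrases))
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]
    return dictionary, parse

d, parse = prefix_free_parse("ACGTACGTTAGGACCATGACGTACGT" * 8)
print(len(d), len(parse))   # on repetitive input the dictionary stays small while the parse grows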





□ The Ontology of Biological Attributes (OBA) - Computational Traits for the Life Sciences

>> https://www.biorxiv.org/content/10.1101/2023.01.26.525742v1

The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role.

The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. OBA provides semantic links and data integration across specialised research community boundaries, thereby breaking silos.





□ DGAN: Improved downstream functional analysis of single-cell RNA-sequence data

>> https://www.nature.com/articles/s41598-023-28952-y

DGAN (Deep Generative Autoencoder Network) is an evolved variational autoencoder designed to robustly impute data dropouts in scRNA-seq data manifested as a sparse gene expression matrix.

DGAN learns a representation of the gene expression data and reconstructs an imputed matrix. It principally models the count distribution and the data sparsity with a Gaussian model, and exploits cell-to-cell dependencies to detect and exclude outlier cells during imputation.
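
For orientation only, a bare-bones PyTorch autoencoder that imputes dropouts by reconstructing the matrix from its observed entries; DGAN's variational architecture, count model, and outlier handling are considerably richer.

import torch
import torch.nn as nn

class TinyImputer(nn.Module):
    """Minimal autoencoder over log-normalized expression (conceptual only)."""
    def __init__(self, n_genes, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent))
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, n_genes))

    def forward(self, x):
        return self.dec(self.enc(x))

def train_imputer(logcounts, epochs=50, lr=1e-3):
    """logcounts: cells x genes tensor. The loss is computed only on non-zero
    entries, so zeros (candidate dropouts) get filled in by the reconstruction."""
    model = TinyImputer(logcounts.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    mask = (logcounts > 0).float()
    for _ in range(epochs):
        opt.zero_grad()
        loss = (((model(logcounts) - logcounts) ** 2) * mask).sum() / mask.sum()
        loss.backward()
        opt.step()
    return model(logcounts).detach()   # the imputed matrix

imputed = train_imputer(torch.rand(200, 500) * (torch.rand(200, 500) > 0.7))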





□ HAT: de novo variant calling for highly accurate short-read and long-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525940v1

Hare-And-Tortoise (HAT) is a de novo variant caller for sequencing data from short-read WES, short-read WGS, and long-read WGS in parent-child sequenced trios. HAT is important for generating de novo variant (DNV) calls for use in studies of mutation rates and the identification of disease-relevant DNVs.

The general HAT workflow consists of three main steps: GVCF generation, family-level genotyping, and filtering of variants to get final DNVs. The genotyping step is done with GLnexus.
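
A minimal sketch of the final trio-filtering logic, assuming genotype records have already been parsed from the GLnexus joint VCF into simple per-sample dicts; the thresholds and fields are placeholders rather than HAT's actual filters.

def is_candidate_dnv(child, mother, father, min_dp=10, min_gq=20):
    """Keep a site as a candidate de novo variant when the child is heterozygous
    for an allele absent from both parents and all three calls are well supported.
    Each argument looks like {"GT": (0, 1), "DP": 32, "GQ": 99}."""
    trio = (child, mother, father)
    if any(s["DP"] < min_dp or s["GQ"] < min_gq for s in trio):
        return False
    if set(child["GT"]) != {0, 1}:                  # child must be 0/1
        return False
    return mother["GT"] == (0, 0) and father["GT"] == (0, 0)

print(is_candidate_dnv({"GT": (0, 1), "DP": 35, "GQ": 99},
                       {"GT": (0, 0), "DP": 40, "GQ": 99},
                       {"GT": (0, 0), "DP": 38, "GQ": 99}))   # True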





□ demuxmix: Demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525961v1

demuxmix’s probabilistic classification framework provides error probabilities for droplet assignments that can be used to discard uncertain droplets and inform about the quality of the HTO data and the demultiplexing success.

demuxmix utilizes the positive association between detected genes in the RNA library and HTO counts to explain parts of the variance in the HTO data resulting in improved droplet assignments.
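
A stripped-down two-component Poisson mixture fitted by EM to a single HTO's counts, returning per-droplet posterior probabilities of being truly tagged; demuxmix instead fits negative binomial regression mixtures that use the number of detected genes as a covariate, which is omitted here.

import numpy as np
from scipy.stats import poisson

def hto_posteriors(counts, iters=100):
    """EM for a background/signal Poisson mixture on one HTO's counts."""
    counts = np.asarray(counts)
    lam = np.array([np.percentile(counts, 25) + 1.0, np.percentile(counts, 75) + 1.0])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        like = np.stack([pi[k] * poisson.pmf(counts, lam[k]) for k in (0, 1)], axis=1) + 1e-300
        resp = like / like.sum(axis=1, keepdims=True)            # E-step
        pi = resp.mean(axis=0)                                    # M-step
        lam = (resp * counts[:, None]).sum(axis=0) / resp.sum(axis=0)
    return resp[:, 1]   # posterior of the high-count ("tagged") component

counts = np.concatenate([np.random.poisson(5, 500), np.random.poisson(80, 500)])
print((hto_posteriors(counts) > 0.9).sum())   # roughly the 500 tagged droplets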





□ PACA: Phenotypic subtyping via contrastive learning

>> https://pubmed.ncbi.nlm.nih.gov/36711575/

Phenotype Aware Components Analysis (PACA) is a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation.

PACA learns a gradient of variation unique to the cases in a given dataset, while leveraging control samples to account for variation and for imbalances of biological and technical confounders between cases and controls.





□ DecontPro: Decontamination of ambient and margin noise in droplet-based single cell protein expression data

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525964v1

DecontPro, a novel hierarchical Bayesian model that can decontaminate ADT data by estimating and removing contamination from ambient and margin sources. DecontPro was able to preserve the native markers in known cell types while removing contamination from the non-native markers.

DecontPro outperforms other decontamination tools in removing aberrantly expressed ADTs while retaining native ADTs and in improving clustering specificity after decontamination. DecontPro can be incorporated into CITE-seq workflows to improve the quality of downstream analyses.





□ SMURF: embedding single-cell RNA-seq data with matrix factorization preserving self-consistency

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad026/7008800

SMURF embeds cells and genes into latent space vectors using matrix factorization with a mixture Poisson-Gamma divergence as the objective, while preserving self-consistency. With these latent vectors, SMURF showed effective cell subpopulation discovery.

SMURF can reduce the cell embedding to a one-dimensional oval space to recover the time course of the cell cycle. It also showed the most robust gene expression recovery, with low root mean square error and high Pearson correlation.
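
As a sketch of the factorization idea only, the snippet below runs multiplicative-update NMF under a Poisson (generalized KL) objective; SMURF's mixture Poisson-Gamma divergence and its self-consistency term are not reproduced.

import numpy as np

def poisson_nmf(X, k=10, iters=200, eps=1e-9):
    """Factorize X ~ W @ H with the Lee-Seung updates that minimize the generalized
    KL divergence (equivalently, a Poisson likelihood). Rows of W embed cells,
    columns of H embed genes."""
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H

W, H = poisson_nmf(np.random.poisson(2.0, size=(300, 100)).astype(float), k=8)
print(W.shape, H.shape)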





□ Uvaia: Scalable neighbour search and alignment

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526458v1

Uvaia is a program for pairwise reference-based alignment and subsequent search against an aligned database. The alignment uses the WFA library implemented by Santiago Marco-Sola, and the database search is based on score distances from the author's biomcmc-lib library.

Earlier versions used Heng Li's kseq.h library for reading FASTA files, but Uvaia currently relies on the general compression libraries available in biomcmc-lib; in particular, all functions should work with XZ-compressed files for optimal compression.





□ MoP2: DSL2 version of Master of Pores: Nanopore Direct RNA Sequencing Data Processing and Analysis using MasterOfPores

>> https://link.springer.com/protocol/10.1007/978-1-0716-2962-8_13

MoP2, an open-source suite of pipelines for processing and analyzing direct RNA Oxford Nanopore sequencing data. The MoP2 relies on the Nextflow DSL2 framework and Linux containers, thus enabling reproducible data analysis in transcriptomic and epitranscriptomic studies.

MoP2 starts w/ the pre-processing of raw FAST5 files, which incl. basecalling, read quality control, demultiplexing, filtering, mapping, estimation of per-gene/per-transcript abundances, and transcriptome assembly, w/ GPU support for basecalling and read demultiplexing.





□ Sequoia: A Framework for Visual Analysis of RNA Modifications from Direct RNA Sequencing Data

>> https://link.springer.com/protocol/10.1007/978-1-0716-2962-8_9

Sequoia, a visual analytics application that allows users to interactively analyze signals originating from nanopore sequencers and can readily be extended to both RNA and DNA sequencing datasets.

Sequoia combines a Python-based backend with a multi-view graphical interface that allows users to ingest raw nanopore sequencing data in Fast5 format, cluster sequences based on electric-current similarities, and drill-down onto signals to find attributes of interest.




Ultima Genomics

>> https://www.genomeweb.com/sequencing/ny-genome-center-team-harnesses-ultima-genomics-platform-high-sensitivity-ctdna

Thanks to @nygenome and @landau_lab for their great work demonstrating the power of genomics at scale! This is an example of where the field is headed and what the Ultima platform makes possible.








Memoirs.

2023-01-31 23:07:11 | Photography

How you lived, what you did, and whom you loved: none of it proves anything, any more than a fairy tale would. There is not even a bystander to witness it. Life is only "something like life", and love is only "something like love". All of it sounds like a story heard somewhere before. The "day" you pictured never arrives. And yet, looking back, you always realize that those days themselves were the fairy tale.