lens, align.

Long is the time, but what is true comes to pass.

Light of Day.

2022-05-05 05:06:07 | Science News




□ INTERSTELLAR: A universal sequencing read interpreter

>> https://www.biorxiv.org/content/10.1101/2022.04.16.488535v1.full.pdf

INTERSTELLAR (interpretation, scalable transformation, and emulation of large-scale sequencing reads) extracts data values encoded in theoretically any type of sequencing read and translates them into sequencing reads of any chosen structure.

INTERSTELLAR can also translate a more complex read structure with a higher-order optimal space. A read pool from a multiplexed 10X Chromium experiment is translated into a hypothetical structure w/ multi-layered parental-local segment allocations and then translated back to the 10X read structure.






□ PanGenie: Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes

>> https://www.nature.com/articles/s41588-022-01043-w

PanGenie is a new algorithm that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation, a process referred to as genome inference.

PanGenie genotypes a large fraction of variants not typable by mapping-based approaches. PanGenie bypasses read mapping and is entirely based on k-mers, which allows it to rapidly proceed from the input short reads to a final callset including SNPs, indels and SVs.
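
To make the k-mer idea concrete, here is a minimal Python sketch of counting k-mers directly from reads, with no mapping step. The function name and the toy reads are illustrative assumptions; PanGenie's actual counting and genotyping model are not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


reads = ["ACGTAC", "CGTACG"]  # toy reads, not real data
kmer_counts = count_kmers(reads, k=4)
```

A genotyper built on this idea would then compare the counts of k-mers that are unique to each allele in the pangenome.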





□ Stardust: improving spatial transcriptomics data analysis through space aware modularity optimization based clustering.

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489655v1.full.pdf

The spaceWeight parameter defines how much weight spatial proximity receives relative to transcriptional similarity. By configuring this single parameter, the user controls how much the space-based measure contributes to the overall measure.

Stardust computes the Louvain edge weights through a linear formulation and requires a fixed a priori parameter. Stardust* uses a dynamic non-linear formulation that changes the spatial weight according to the transcriptomics values in the surrounding space.
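
A linear formulation of this kind might look like the sketch below. It assumes that spatial distances are min-max rescaled to the range of the expression distances before being added with a fixed weight; that rescaling, and all function names, are assumptions of this sketch rather than Stardust's documented implementation.

```python
def rescale(values, new_min, new_max):
    """Min-max rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


def combined_distances(expr_dist, spatial_dist, space_weight):
    """Linear blend: expression distance + space_weight * rescaled spatial distance."""
    scaled = rescale(spatial_dist, min(expr_dist), max(expr_dist))
    return [e + space_weight * s for e, s in zip(expr_dist, scaled)]


expr = [0.2, 0.8, 0.5]     # toy transcriptional distances
phys = [10.0, 30.0, 20.0]  # toy physical distances between the same spot pairs
no_space = combined_distances(expr, phys, 0.0)    # space ignored
full_space = combined_distances(expr, phys, 1.0)  # space weighted equally
```

With a weight of 0 the measure reduces to the transcriptional distance alone; larger weights pull spatially close spots together.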





□ CellSpace: Scalable sequence-informed embedding of single-cell ATAC-seq data

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490310v1.full.pdf

CellSpace captures meaningful latent structure in scATAC-seq datasets, including cell subpopulations and developmental hierarchies, and scores the activity of transcription factors in single cells based on proximity to binding motifs embedded in the same space.

CellSpace employs a latent embedding algorithm from natural language processing called StarSpace. The latent semantic embedding of entities in StarSpace has also been reformulated as a graph embedding problem.

CellSpace learns a joint embedding of k-mers and cells so that cells will be embedded close to each other in the latent space not simply due to shared accessible events but based on the shared DNA sequence content of their accessible events.





□ Airpart: Interpretable statistical models for analyzing allelic imbalance in single-cell datasets

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac212/6564225

Airpart identifies differential cell type-specific allelic imbalance (CTS AI) from single-cell RNA-sequencing (scRNA-seq) data, or from other spatially or time-resolved datasets. Airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation.

Airpart uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model. Airpart identified differential AI (DAI) patterns across cell states and could be used to define trends of AI signal over spatial or time axes.




□ scAEGAN: Unification of Single-Cell Genomics Data by Adversarial Learning of Latent Space Correspondences

>> https://www.biorxiv.org/content/10.1101/2022.04.19.488745v1.full.pdf

scAEGAN is a hybrid architecture that uses an autoencoder (AE) network together with adversarial learning by a cycleGAN (cGAN) network. The core insight is that the AE respects each sample's uniqueness, whereas the cGAN exploits the distributional data similarity in the latent space.

scAEGAN outperforms Seurat 3 in library integration, is more robust against data sparsity, and beats Seurat 4 in integrating paired data from the same cell. Furthermore, in predicting one data modality from another, scAEGAN outperforms Babel.





□ GeneVector: Identification of transcriptional programs using dense vector representations defined by mutual information.

>> https://www.biorxiv.org/content/10.1101/2022.04.22.487554v1.full.pdf

GeneVector is a scalable framework for dimensionality reduction implemented as a vector space model using mutual information. It identifies metagenes that correspond to cell-specific transcriptional processes, including canonical phenotype and cell type-specific interferon-activated gene expression.

The GeneVector model provides a framework for identifying metagenes within a gene similarity graph built from the cosine distance between gene vectors, and for relating these metagenes back to each cell using latent space arithmetic.
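
For reference, the cosine distance between two gene vectors can be computed as in this small sketch. The vectors are made-up toy values, and the function names are assumptions; how GeneVector trains its gene vectors is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    return 1.0 - cosine_similarity(u, v)


gene_a = [1.0, 2.0]  # toy gene vectors
gene_b = [2.0, 4.0]  # same direction as gene_a
gene_c = [2.0, -1.0] # orthogonal to gene_a
```

Genes whose vectors point in the same direction have a distance near 0 and would sit close together in a similarity graph.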





□ Poisson VAE: Modeling fragment counts improves single-cell ATAC-seq analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490536v1.full.pdf

scATAC-seq data can be treated quantitatively, and useful information is lost through binarization of the counts. Fragment counts, but not read counts, can be approximately modeled with the Poisson distribution.

Modeling DNA accessibility in single nuclei quantitatively, rather than as a binary state, is consistent with the fact that to access DNA, transcription factors, just like transposases, have to diffuse through the nucleus, likely reaching distinct chromosome territories.

PeakVI is adapted to model Poisson-distributed data, giving a Poisson VAE. The Poisson VAE significantly outperformed PeakVI in reconstructing binarized counts as measured by average precision (NeurIPS: adjusted P = 1.2 × 10^-7; Satpathy et al.: adjusted P = 6.9 × 10^-8).
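
The difference between binarized and count-based modeling can be seen in a few lines. The sketch below shows only the Poisson log-likelihood term and a binarization step, not the VAE; the function names and toy counts are assumptions.

```python
import math


def poisson_log_pmf(k, rate):
    """log P(K = k) for a Poisson distribution with the given rate (> 0)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def binarize(counts):
    """Collapse fragment counts to open/closed, discarding magnitude."""
    return [1 if c > 0 else 0 for c in counts]


fragment_counts = [0, 1, 3]  # toy fragment counts for three peaks in one cell
binary = binarize(fragment_counts)
log_lik = sum(poisson_log_pmf(k, rate=1.5) for k in fragment_counts)
```

After binarization, peaks with 1 and 3 fragments become indistinguishable, while the Poisson likelihood still scores them differently.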





□ Clair3-Trio: high-performance Nanopore long-read variant calling in family trios with Trio-to-Trio deep neural networks

>> https://www.biorxiv.org/content/10.1101/2022.05.03.490460v1.full.pdf

The MCVLoss (Mendelian Inheritance Constraint Violation Loss) function is designed to improve variant calling in trios by leveraging the explicit encoding of the priors of the Mendelian inheritance in trios.

Clair3-Trio is the first variant caller tailored for family trio data from Nanopore long reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to take the trio sequencing information as input and output all of the trio’s predicted variants within a single model.
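
The Mendelian constraint that MCVLoss encodes can be illustrated with a plain consistency check. This is not the loss function itself: it is a minimal sketch for diploid autosomal genotypes that ignores de novo mutations, and the function name and tuple encoding are assumptions.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child's two alleles can come one from each parent.

    Genotypes are pairs of allele indices, e.g. (0, 1) for a heterozygous site.
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


ok = is_mendelian_consistent((0, 1), (0, 0), (1, 1))         # one allele from each parent
violation = is_mendelian_consistent((1, 1), (0, 0), (0, 1))  # mother cannot pass allele 1
```

A trio-aware loss can penalize predicted genotype combinations for which a check like this fails.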





□ UniTVelo: temporally unified RNA velocity reinforces single-cell trajectory inference

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489808v1.full.pdf

UniTVelo is a statistical framework that models the full dynamics of gene expression with a radial basis function (RBF) and quantifies RNA velocity in a top-down manner. It also introduces a unified latent time across the whole transcriptome.

UniTVelo supports a gene-independent mode that assigns the latent time to each gene independently, similar to scVelo. The unified mode aggregates information across all genes, reinforcing the directionality in the trajectory inference, for example in data with weak kinetics or complex branches.





□ BiWFA: Optimal gap-affine alignment in O(s) space

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488380v1.full.pdf

The bidirectional Wavefront Alignment algorithm (BiWFA) is the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining the WFA’s time complexity of O(ns).

BiWFA’s time complexity is O((m+n)s). BiWFA computes the WFA alignment of two sequences in the forward and reverse direction until they meet. BiWFA answers the pressing need for sequence alignment methods capable of scaling to genome-scale alignments / full pangenomes.





□ Minigraph-0.17 (r524)

>> https://github.com/lh3/minigraph/releases/tag/v0.17

Minigraph-0.17 gives more accurate graph alignment and generally simpler graph topology. Note that minigraph still focuses on structural variations and does not generate base-level graphs. For end users, minigraph remains similar feature-wise.

Minigraph-0.17 attempts to connect linear chains with the graph wavefront alignment algorithm (GWFA) and produces the final alignment with miniwfa under the 2-piece gap penalty. Graph generation also considers base alignment.





□ One Cell At a Time (OCAT): a unified framework to integrate and analyze single-cell RNA-seq data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02659-1

OCAT employs the local anchor embedding (LAE) algorithm to further optimize the edge weights from each single cell to the remaining most similar “ghost” cells, such that the resulting sparsified weights can most effectively reconstruct the transcriptomic features.

OCAT constructs a bipartite graph b/n all single cells and the “ghost” cell set using similarities as edge weights. OCAT captures the cell similarities through message passing b/n the “ghost” cells, which maps the sparsified weights of all single cells to the global latent space.





□ plotsr: Visualising structural similarities and rearrangements between multiple genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac196/6569079

Plotsr generates high-quality visualisation of synteny and structural rearrangements between multiple genomes. For this, it uses the genomic structural annotations between multiple chromosome-level assemblies.

Plotsr can be used to compare genomes on chromosome level or to zoom in on any selected region. In addition, plotsr can augment the visualisation with regional identifiers (e.g. genes or genomic markers) or histogram tracks for continuous features.





□ UMINT: Unsupervised Neural Network For Single Cell Multi-Omics Integration

>> https://www.biorxiv.org/content/10.1101/2022.04.21.489041v1.full.pdf

UMINT (Unsupervised neural network for single cell Multi-omics INTegration) serves as a promising model for integrating a variable number of high-dimensional single cell omics layers, and provides a substantial reduction in the number of parameters.

The UMINT-generated latent embedding has been shown to produce better clustering than AE. Even without batch integration, UMINT can extract the most relevant features from the data, which can act as input to further downstream investigations.





□ CONCERT: Genome-wide prediction of sequence elements that modulate DNA replication timing

>> https://www.biorxiv.org/content/10.1101/2022.04.21.488684v1.full.pdf

CONCERT (CONtext-of-sequenCEs for Replication Timing) unifies (i) modeling of long-range spatial dependencies across different genomic loci and (ii) detection of a subset of genomic loci that are predictive of the target genomic signals over large-scale spatial domains.

CONCERT integrates two functionally cooperative modules: a selector, which performs importance estimation-based sampling to detect predictive sequence elements, and a predictor, which incorporates bidirectional recurrent neural networks and a self-attention mechanism.





□ Generative Moment Matching Networks for Genotype Simulation

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488350v1.full.pdf

Generative Moment Matching Networks (GMMNs) require training only a single network (the generator), and do not need to observe the data directly, but instead can observe “sketches” that capture the statistical properties of the database as a whole.

GMMN architecture uses a linear layer of dimension 5000 × 4096, followed by a ReLU and a batch norm, followed by another linear layer of dimension 4096 × 5000, finishing w/ a binary quantizer. The random features are implemented w/ a random linear layer of dimension 5000 × 50000.





□ RODAN: a fully convolutional architecture for basecalling nanopore RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04686-y

The RODAN architecture is composed of 22 convolutional blocks and contains around 10M parameters. RODAN gradually incorporates surrounding information for each position in the signal by increasing the kernel size with each successive convolutional block.

The RODAN architecture increases the number of channels and the kernel size used in each layer, up to 768 channels and a kernel size of 100 in the final layer. The convolutional block includes a pointwise expansion to increase the number of channels before the depthwise convolution.





□ A hybrid unsupervised approach for accurate short read clustering and barcoded sample demultiplexing in nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2022.04.13.488186v1.full.pdf

An unsupervised hybrid approach achieves accurate short read clustering for Nanopore sequencing: a nucleobase-based greedy algorithm obtains the initial clusters, and raw signal information guides their continuous optimization.

The Dynamic Time Warping (DTW) algorithm is accelerated on GPU, which keeps the clustering time acceptable. A block-wise acceleration strategy is proposed to fully utilize GPU blocks, which enables the simultaneous launch of millions of DTW calculation threads.





□ deepCNNvalid: Validation of genetic variants from NGS data using Deep Convolutional Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.12.488021v1.full.pdf

The validation of genetic variants can be improved using a machine learning approach resting on a Convolutional Neural Network, trained using existing human annotation. Contextual data from sequencing tracks can also be included in the automated assessment.

The idea of including additional context tracks to handle library-specific artefacts translates analogously to this case, so that sequencing data of unrelated samples with the same library preparation would be added along the depth dimension.




□ DiMeLo-seq: a long-read, single-molecule method for mapping protein–DNA interactions genome wide

>> https://www.nature.com/articles/s41592-022-01475-6

DiMeLo-seq combines elements of antibody-directed protein–DNA mapping approaches to deposit methylation marks near a specific target protein, then uses long-read sequencing to read out these exogenous methylation marks directly.

DiMeLo-seq’s long sequencing reads often overlap multiple heterozygous sites, enabling phasing and measurement of haplotype-specific protein–DNA interactions. Finally, long reads enable mapping of protein–DNA interactions within highly repetitive regions of the genome.





□ BioNE: Integration of network embeddings for supervised learning

>> https://www.biorxiv.org/content/10.1101/2022.04.26.489560v1.full.pdf

The BioNE framework integrates embeddings from different embedding methods, enabling the assessment of whether the combined embeddings offer complementary information with regard to the input network features and thus better performance on prediction tasks.

The BioNE pipeline consists of three steps: network preparation, network embedding, and link prediction. BioNE’s network embedding step takes the prepared input and applies network embedding methods to learn low-dimensional vector representations for each node on the network.





□ METACLUSTERplus - an R package for probabilistic inference and visualization of context-specific transcriptional regulation of biosynthetic gene clusters

>> https://www.biorxiv.org/content/10.1101/2022.04.11.487835v1.full.pdf

METACLUSTERplus, a probabilistic framework that integrates gene expression compendia, context-specific annotations, biosynthetic gene cluster definitions, as well as gene regulatory network architectures.

METACLUSTERplus redefines the transcriptional activity inference in order to compensate for a potential weakness in the original framework. It further augments TA analysis by another layer, that is the simultaneous inference of context specific transcriptional regulation.





□ scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics

>> https://www.biorxiv.org/content/10.1101/2022.04.17.488600v1.full.pdf

scTour simultaneously infers the developmental pseudotime, transcriptomic vector field and latent space of cells, with all these inferences unaffected by batch effects inherent in the datasets.

scTour predicts the transcriptomic properties and dynamics of unseen cellular states. The inference of a low-dimensional latent space that combines the intrinsic transcriptome and extrinsic time information provides richer information for reconstructing a finer cell trajectory.





□ PhenoComb: A discovery tool to assess complex phenotypes in high-dimension, single-cell datasets

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487335v1.full.pdf

PhenoComb uses signal intensity thresholds to assign markers to discrete states (e.g. negative, low, high) and then counts the number of cells per sample from all possible marker combinations in a memory-safe manner.

PhenoComb counts the number of cells that have a given phenotype for all possible phenotypes. This is done by first counting cells for all full-length phenotypes, and generating all other phenotypes with neutral states by summing up the cells counted in the full-length ones.
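
The counting scheme described above can be sketched as follows. Representing the neutral state as None, and the toy marker states, are assumptions of this sketch; PhenoComb's memory-safe implementation is not reproduced here.

```python
from collections import Counter
from itertools import product

NEUTRAL = None  # assumed placeholder for "marker state not considered"


def count_full_length(cells):
    """Count cells per full-length phenotype (one state per marker)."""
    return Counter(tuple(cell) for cell in cells)


def count_with_neutral(full_counts, n_markers):
    """Derive counts for every phenotype, including neutral states,
    by summing the full-length counts that match it."""
    totals = Counter()
    for phenotype, n in full_counts.items():
        # Each mask chooses which markers are neutralized.
        for mask in product([False, True], repeat=n_markers):
            key = tuple(NEUTRAL if hide else state
                        for hide, state in zip(mask, phenotype))
            totals[key] += n
    return totals


cells = [("high", "neg"), ("high", "low"), ("high", "neg"), ("neg", "neg")]
full = count_full_length(cells)
every = count_with_neutral(full, n_markers=2)
```

Here `every[("high", None)]` counts all cells that are high for the first marker regardless of the second, without rescanning the cells.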





□ Towards a robust out-of-the-box neural network model for genomic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04660-8

DeepRAM, especially its recurrent version (RNN), outperforms all other models in terms of prediction accuracy, overfitting, and robustness across datasets. DeepRAM models are more robust, transferable and generalizable across genomic datasets with varied characteristics.

An LSTM autoencoder model (LSTM-AE) aims to represent a sequence by a dense vector that can be converted back to the original sequence. The encoder reads an encoded DNA sequence as input and outputs a dense vector as the embedding for this sequence, whose length is a hyperparameter to tune.

LSTM-AE+NN adds a simple fully connected neural network for the prediction of class labels, containing two dense layers with size shrinking by a factor of 2 and a dropout layer in between. The size of the first dense layer is adjusted, as a rule of thumb, to match 1 to 4 times the embedding dimension.





□ DeepCOLOR: Single-cell colocalization analysis using a deep generative model

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487815v1.full.pdf

DeepCOLOR segregates cell populations defined by the colocalization relationships and predicts cell-cell interactions between colocalized single cells. DeepCOLOR is typically applicable to studying cell-cell interactions in any spatial niche.

DeepCOLOR was used to build a continuous neural network map from latent cell state space to each spot in the spatial transcriptome in order to enhance consistent mapping profiles between single cells with similar molecular profiles.





□ rox: A statistical model for regression with missing values

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488427v1.full.pdf

rox, “rank order with missing values(X)”, a flexible, non-parametric approach for regression analysis of a dependent variable with missing values and continuous, ordinal, or binary explanatory variables.

rox utilizes the knowledge that missing values represent low concentrations due to a limit of detection (LOD) effect, w/o requiring any actual imputation steps. Although rox relies on the assumption of an LOD effect at its core, it flexibly generalizes to data with other missingness mechanisms.





Karen Miga

>> https://www.nature.com/articles/s41586-022-04601-8

The Human Pangenome Reference Consortium #HPRC aims to create a more complete human reference genome with a graph-based, #T2T representation of global genomic diversity. Exciting perspective from the team released today in @Nature





□ SeATAC: a tool for exploring the chromatin landscape and the role of pioneer factors

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489439v1.full.pdf

SeATAC can be extended to model scATAC-seq data and to investigate the V-plot dynamics. SeATAC uses a conditional variational autoencoder (CVAE) model to learn the latent representation of ATAC-seq V-plots, and to estimate the statistically differential chromatin accessibility.





□ DeepRepeat: direct quantification of short tandem repeats on signal data from nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02670-6

DeepRepeat accurately detects STRs directly from nanopore electric signals, without using synthetic signals. DeepRepeat is based on the notion that directly adjacent STR units share similar nanopore signal distribution.

DeepRepeat feeds repeat / non-repeat images into a convolutional neural network followed by a fully connected network. Based on alignment of all long reads for an STR locus, the information from multiple long reads for that locus is summed using a Gaussian mixture distribution.





Oxford Nanopore

>> https://www.nature.com/articles/s41565-022-01116-1

Our own R&D teams complement their work through partnerships with academic collaborators. Here, our collaborators at @ucl demonstrate how antibodies can be detected using designed DNA origami nanopores embedded in MinION Flow Cells.





□ GraphPred: An approach to predict multiple DNA motifs from ATAC-seq data using graph neural network and coexisting probability

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490240v1.full.pdf

GraphPred employs a two-layer GNN. The first layer learns the embedding of k-mer nodes from the similarity graph and the coexisting graph; the second layer learns the embedding of sequence nodes from the inclusive graph.

GraphPred calculates the coexisting probability of k-mers using the coexisting edges of the heterogeneous graph and finds multiple motifs from an ATAC-seq dataset. GraphPred can capture the important nodes and edges via their weights.





□ PAUSE: Principled Feature Attribution for Unsupervised Gene Expression Analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.03.490535v1.full.pdf

PAUSE (principled attribution for unsupervised gene expression analysis) combines biologically-constrained autoencoders with principled attributions to improve the unsupervised analysis of gene expression data.

Biologically-constrained “interpretable” autoencoders use prior knowledge to define sparse connections in a deep autoencoder, such that latent variables correspond to the activity of biological pathways.





Virtus Patientiae.

2022-05-05 05:05:05 | Science News

(“Virtus Patientiae” by Del)




□ WFA-GPU: Gap-affine pairwise alignment using GPUs

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488374v1.full.pdf

WFA-GPU is a CPU-GPU co-design capable of performing inter- and intra-sequence parallel alignment of multiple sequences, combined with a succinct backtrace encoding that reduces the overall memory consumption of the original WFA (the Wavefront Alignment algorithm).


WFA-GPU makes asynchronous kernel launches, allowing overlapping data transfers. While the GPU is computing the alignments for a given batch, the sequences of the following batch are being copied to the device. Latencies due to transfer times are effectively hidden / overlapped.





□ GLUE: Multi-omics single-cell data integration and regulatory inference with graph-linked embedding

>> https://www.nature.com/articles/s41587-022-01284-4

GLUE (graph-linked unified embedding) integrates unpaired single-cell multi-omics data and infers regulatory interactions simultaneously. By modeling the regulatory interactions across omics layers explicitly, GLUE bridges the gaps b/n various omics-specific feature spaces.

GLUE enables effective triple-omics integration. The GLUE alignment successfully revealed a shared manifold of cell states across the 3 omics layers. The GLUE regulatory inference can be seen as a posterior estimate, which can be continuously refined on the arrival of new data.





□ DELAY: Depicting pseudotime-lagged causality across single-cell trajectories for accurate gene-regulatory inference

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489377v1.full.pdf

Granger causality-based methods can be error-prone when genes display nonlinear or cyclic interactions. Deep learning-based methods make no assumptions about the temporal relationships or connectivity b/n genes in complex regulatory networks.

DELAY (Depicting Lagged Causality) learns gene-regulatory interactions from discrete joint-probability matrices of paired, pseudotime-lagged gene-expression trajectories. DELAY can overcome certain limitations of Granger causality-based methods of gene-regulatory inference.
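
A discrete joint-probability matrix for a pseudotime-lagged gene pair can be built as in the sketch below. It assumes cells are already ordered by pseudotime and uses equal-width bins over the observed range; the binning scheme and all names are assumptions, and the neural network that consumes such matrices is not shown.

```python
def bin_index(value, lo, hi, n_bins):
    """Equal-width bin of value within [lo, hi]; the top edge goes in the last bin."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def lagged_joint_probability(expr_a, expr_b, lag, n_bins):
    """Joint histogram of gene A at position t and gene B at position t + lag,
    normalized so that all entries sum to 1."""
    n = len(expr_a) - lag
    a_vals = expr_a[:n]
    b_vals = expr_b[lag:lag + n]
    a_lo, a_hi = min(a_vals), max(a_vals)
    b_lo, b_hi = min(b_vals), max(b_vals)
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(a_vals, b_vals):
        i = bin_index(a, a_lo, a_hi, n_bins)
        j = bin_index(b, b_lo, b_hi, n_bins)
        matrix[i][j] += 1.0 / n
    return matrix


gene_a = [0.0, 1.0, 2.0, 3.0]  # toy expression along pseudotime
gene_b = [9.0, 0.0, 1.0, 2.0]  # follows gene_a with a lag of one cell
joint = lagged_joint_probability(gene_a, gene_b, lag=1, n_bins=3)
```

In this toy case the mass lies on the diagonal at lag 1, which is the kind of pattern a lagged regulatory relationship would produce.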





□ Cue: A deep learning framework for structural variant discovery and genotyping

>> https://www.biorxiv.org/content/10.1101/2022.04.30.490167v1.full.pdf

Cue is a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes, including complex and somatic subclonal SVs.

Cue converts sequence alignments to multi-channel images that capture multiple SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype, and genomic locus of the SVs captured in each image.





□ Echtvar: Compressed variant representation for rapid annotation and filtering of SNPs and indels

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488439v1.full.pdf

Echtvar efficiently encodes population variants and annotation fields into a compressed archive that can be used for rapid variant annotation and filtering. Echtvar is faster and uses less space than existing tools, and it can effectively reduce the number of candidate variants.

Echtvar encodes small variants into integers by partitioning bits: each value is assigned its own range of bits, which results in a 32-bit integer. The genomic position determines the 1,048,576-base bin and the corresponding directory within the echtvar archive for a given query variant.
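
Bit partitioning of this kind can be sketched as below. The bin size of 1,048,576 (2^20) comes from the text; the specific bit layout (20 bits of offset, then 2 bits each for a single reference and alternate base) is a hypothetical layout for SNVs only, not echtvar's actual encoding.

```python
BIN_BITS = 20
BIN_SIZE = 1 << BIN_BITS  # 1,048,576 positions per bin
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def genomic_bin(pos):
    """Index of the bin that contains a genomic position."""
    return pos >> BIN_BITS


def encode_snv(pos, ref, alt):
    """Pack offset-within-bin, ref base and alt base into one integer.

    Hypothetical layout: bits 0-19 offset, bits 20-21 ref, bits 22-23 alt.
    """
    offset = pos & (BIN_SIZE - 1)
    return offset | (BASE_CODE[ref] << 20) | (BASE_CODE[alt] << 22)


def decode_snv(bin_index, value):
    """Inverse of encode_snv, given the bin the value was stored in."""
    offset = value & (BIN_SIZE - 1)
    ref = CODE_BASE[(value >> 20) & 3]
    alt = CODE_BASE[(value >> 22) & 3]
    return (bin_index << BIN_BITS) | offset, ref, alt


pos = 3_000_123
packed = encode_snv(pos, "A", "G")
restored = decode_snv(genomic_bin(pos), packed)
```

Because the bin index is implied by where the value is stored, only the offset within the bin has to fit into the integer.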





□ DeepVelo: Deep Learning extends RNA velocity to multi-lineage systems with cell-specific kinetics

>> https://www.biorxiv.org/content/10.1101/2022.04.03.486877v1.full.pdf

DeepVelo generalizes RNA velocity to cell populations containing time-dependent kinetics and multiple lineages, which are common in developmental and pathological systems.

DeepVelo infers time-varying cellular rates of transcription and degradation. DeepVelo models RNA velocities for dynamics of high complexity, and exceeds the capacity of existing models with cell-agnostic rates in realistic single-cell datasets w/ multiple trajectories/lineages.





□ MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02661-7

MAVE-NN, a neural-network-based Python package that implements a broadly applicable information-theoretic framework for learning genotype-phenotype maps—including biophysically interpretable models—from MAVE datasets.

MAVE-NN is based on the use of latent phenotype models, which assume that each assayed sequence has a well-defined latent phenotype (specified by the G-P map), of which the MAVE experiment provides a noisy indirect readout.





□ scTagger: Fast and accurate matching of cellular barcodes across short- and long-reads of single-cell RNA-seq experiments

>> https://www.biorxiv.org/content/10.1101/2022.04.21.489097v1.full.pdf

scTagger uses a trie-based data structure to efficiently match the identified barcodes in the SRs to the LRs while allowing for non-zero edit distance matching. scTagger has accuracy on par with an exact but computationally intensive dynamic programming-based matching approach.

scTagger exploits a priori knowledge about the template of the LRs and uses the alignment of the fixed Illumina adapter sequence to each LR segment. The time complexity for querying the trie in the matching stage of scTagger is O(M ε^e (L + e)^(e+1)).
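
Matching against a trie with a bounded edit distance can be sketched with the classic row-by-row Levenshtein search. This is a generic illustration of the technique, not scTagger's implementation; the class, function names and toy barcodes are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the node that ends a stored barcode


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def search(root, query, max_dist):
    """Return sorted (barcode, distance) pairs within max_dist edits of query."""
    hits = []

    def walk(node, ch, prev_row):
        # One Levenshtein DP row per trie node, computed from the parent's row.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,
                           prev_row[j] + 1,
                           prev_row[j - 1] + (query[j - 1] != ch)))
        if node.word is not None and row[-1] <= max_dist:
            hits.append((node.word, row[-1]))
        if min(row) <= max_dist:  # prune branches that cannot recover
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(hits)


trie = build_trie(["ACGT", "ACGA", "TTTT"])  # toy barcodes
exact_or_close = search(trie, "ACGT", max_dist=1)
with_deletion = search(trie, "ACG", max_dist=1)
```

Shared prefixes are scored once, which is what makes a trie cheaper than comparing the query to every barcode separately.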





□ Algorithm for DNA sequence assembly by quantum annealing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04661-7

Using the Genomic Signal Processing approach, the algorithm detects overlaps between DNA reads by calculating the Pearson correlation coefficient and formulates the assembly problem as an optimization task.

The linear-complexity parts of this algorithm run on a CPU, the parts with higher complexity on a quantum annealer. The problem of repeated regions in DNA sequences still needs to be solved, e.g. by appropriate methods of filtering out erroneous reads.
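
The overlap detection step can be illustrated by mapping bases to numbers and correlating the end of one read with the start of another. The numeric mapping below is a hypothetical choice for this sketch, and the optimization / annealing stage is not shown.

```python
import math

# Hypothetical numeric representation of bases (an assumption of this sketch).
NUMERIC = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant signal: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def overlap_score(left_read, right_read, k):
    """Correlate the last k bases of left_read with the first k of right_read."""
    suffix = [NUMERIC[c] for c in left_read[-k:]]
    prefix = [NUMERIC[c] for c in right_read[:k]]
    return pearson(suffix, prefix)


good = overlap_score("TTACGT", "ACGTGG", k=4)  # suffix ACGT vs prefix ACGT
bad = overlap_score("TTACGT", "TGCAGG", k=4)   # suffix ACGT vs prefix TGCA
```

A score near 1 flags a candidate overlap; the pairwise scores can then serve as weights in the assembly optimization.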




□ scSGL: Kernelized Signed Graph Learning for Single-Cell Gene Regulatory Network Inference

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac288/6572335

scSGL is a novel signed graph learning (GL) approach that learns GRNs based on the assumption of smoothness and non-smoothness of gene expressions over activating and inhibitory edges.

scSGL is formulated as a non-convex optimization problem and solved using an efficient ADMM framework. scSGL is extended with kernels to account for non-linearity of co-expression and for effective handling of highly occurring zero values.





□ scDeconv: an R package to deconvolve bulk DNA methylation data with scRNA-seq data and paired bulk RNA-DNA methylation data

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac150/6572659

scDeconv solves the reference deficiency problem of DNAm data and deconvolves them from scRNA-seq data in a trans-omics manner. It assumes that paired samples have similar cell compositions, so the cell content information deconvolved from the scRNA-seq and paired RNA data can be transferred to the paired DNAm samples.

scDeconv contains other functions such as refDeconv to deconvolve bulk data using a reference from the same omics, celldiff to select cell-type-specific inter-group differential features, and enrichwrapper to annotate differential DNAm features using a correlation-based method.





□ AGC: Compact representation of assembled genomes

>> https://www.biorxiv.org/content/10.1101/2022.04.07.487441v1.full.pdf

AGC (Assembled Genomes Compressor) is a highly efficient compression method for collections of assembled genome sequences of the same species. AGC offers fast access to the requested contigs or samples without the need to decompress other sequences.

AGC uses splitters to divide each contig into segments. Segments are grouped by their pairs of terminating splitters, so that segments similar to each other land in the same group. To extract a sequence, AGC decompresses the reference segments and, partially, the necessary blocks.





□ miniwfa: another reimplementation of the wavefront alignment algorithm (WFA) in low memory.

>> https://github.com/lh3/miniwfa

Miniwfa is a reimplementation of the WaveFront Alignment algorithm (WFA) with 2-piece affine gap penalty. When reporting base alignment for megabase-long sequences, miniwfa is sometimes a few times faster and tends to use less memory in comparison to WFA2-lib and wfalm.

Miniwfa approximately uses (20qs^2/p+ps) bytes of memory. s is the optimal alignment penalty, p is the distance b/n stripes and q=max(x, o1+e1, o2+e2) is the maximal penalty between adjacent entries. The time complexity is O(n(s+p)) where n is the length of the longer sequence.
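
The memory formula is easy to evaluate directly. In the sketch below the penalty values are illustrative placeholders rather than miniwfa's defaults, and the "best" stripe distance is simply what calculus gives for the quoted formula (setting the derivative with respect to p to zero yields p = sqrt(20qs)); it is not a statement about how miniwfa chooses p.

```python
import math


def max_adjacent_penalty(x, o1, e1, o2, e2):
    """q = max(x, o1 + e1, o2 + e2) from the formula above."""
    return max(x, o1 + e1, o2 + e2)


def miniwfa_memory_bytes(s, p, q):
    """Approximate memory in bytes: 20*q*s^2/p + p*s."""
    return 20 * q * s ** 2 / p + p * s


def stripe_distance_minimizing_memory(s, q):
    """p that minimizes 20*q*s^2/p + p*s, i.e. sqrt(20*q*s)."""
    return math.sqrt(20 * q * s)


q = max_adjacent_penalty(x=2, o1=4, e1=2, o2=15, e2=1)  # illustrative penalties
mem = miniwfa_memory_bytes(s=1000, p=100, q=q)
best_p = stripe_distance_minimizing_memory(s=1000, q=q)
```

With these placeholder values, q is 16 and the estimate at p = 100 is 3,300,000 bytes.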





□ Metacell-2: a divide-and-conquer metacell algorithm for scalable scRNA-seq analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02667-1

The Metacell-2 algorithm (MC2) supports practically unlimited scaling, using an iterative divide-and-conquer approach. The algorithm uses a new graph partition score to avoid time-consuming resampling and directly control metacell sizes.

Metacell-2 implements a new adaptive outlier detection module, and employs a rare-gene-module detector. MC2 constructs metacells by partitioning the constructed graph, independently in parallel for each pile of cells in the data or recursively over groups of metacells.





□ NN-MM: Extend mixed models to multilayer neural networks for genomic prediction including intermediate omics data

>> https://academic.oup.com/genetics/advance-article-abstract/doi/10.1093/genetics/iyac034/6536967

NN-MM models the multiple layers of regulation from genotypes to intermediate omics features, then to phenotypes, by extending conventional linear mixed models (“MM”) to multilayer artificial neural networks (“NN”).

NN-MM incorporates intermediate omics features by adding middle layers b/n genotypes and phenotypes. Linear mixed models can be used to model genetic values on the intermediate omics features, and activation functions in NN are used to capture the nonlinear relationships b/n intermediate omics features and phenotypes.





□ STIX: Searching thousands of genomes to classify somatic and novel structural variants

>> https://www.nature.com/articles/s41592-022-01423-4

STIX is built on top of the GIGGLE genome search engine. STIX searches the raw alignments across thousands of samples. For a given deletion, duplication, inversion or translocation, STIX reports a per-sample count of every alignment that supports the variant.

STIX extracts and tracks all discordant alignments from each sample’s genome. STIX searches the index using the left coordinate and only retains alignments that also overlap the right coordinate and have a strand configuration that matches the given SV type.





□ GenMPI: Cluster Scalable Variant Calling for Short/Long Reads Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.04.01.486779v1.full.pdf

GenMPI is portable and flexible, meaning it can be deployed to any private or public cluster/cloud infrastructure. Any alignment or variant calling application can be used with minimal adaptation.

GenMPI is the first-ever cluster-scale implementation of any long-read aligner. GenMPI integrates the Minimap2 aligner and three different variant callers (DeepVariant, DeepVariant with WhatsHap for phasing (PacBio), and Clair3).





□ Synthetic Approaches to Complex Organic Molecules in the Cold Interstellar Medium

>> https://www.frontiersin.org/articles/10.3389/fspas.2021.789428/full

A review of the diverse suggestions made to explain the formation of Complex Organic Molecules (COMs) in the low-temperature interstellar medium. Granular mechanisms include both diffusive and nondiffusive processes.

A granular explanation is strengthened by experiments at 10 K that indicate that the synthesis of large molecules on granular ice mantles under space-like conditions is exceedingly efficient, with and without external radiation.

The bombardment of carbon-containing ice mantles in the laboratory by cosmic rays, which are mainly high-energy protons, can lead to organic species even at low temperatures.





□ Orbit: A Python Package for Bayesian Forecasting

>> https://github.com/uber/orbit

Orbit is a Python package for Bayesian time series forecasting and inference. It provides a familiar and intuitive initialize-fit-predict interface for time series tasks, while utilizing probabilistic programming languages under the hood.

In the Kernel-based Time-varying Regression (KTR) model, the coefficient curves are approximated with Gaussian kernels having positive values of knots. The levels are also included in the process with a vector of ones as the covariates.
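
The idea of approximating a coefficient curve with Gaussian kernels over knots can be sketched in a few lines. This is a generic kernel-smoothing illustration, not Orbit's implementation: the function names, the row-normalization of the kernel weights, and the example knot values are all assumptions.

```python
import math


def gaussian_kernel_weights(t, knot_times, bandwidth):
    """Normalized Gaussian kernel weights of each knot at time t (assumed scheme)."""
    raw = [math.exp(-0.5 * ((t - k) / bandwidth) ** 2) for k in knot_times]
    total = sum(raw)
    return [w / total for w in raw]


def coefficient_curve(times, knot_times, knot_values, bandwidth):
    """Time-varying coefficient as a kernel-weighted average of the knot values."""
    curve = []
    for t in times:
        weights = gaussian_kernel_weights(t, knot_times, bandwidth)
        curve.append(sum(w * b for w, b in zip(weights, knot_values)))
    return curve


# Hypothetical positive knot values placed at three time points.
curve = coefficient_curve(range(5), knot_times=[0, 2, 4],
                          knot_values=[0.5, 1.0, 2.0], bandwidth=1.0)
print([round(c, 3) for c in curve])
```

Because the weights are normalized, the curve stays within the range of the knot values, so positive knots give a positive coefficient at every time point.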





□ MCDP: Markov chains improve the significance computation of overlapping genome annotations

>> https://www.biorxiv.org/content/10.1101/2022.04.07.487119v1.full.pdf

MCDP computes the p-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively.





□ SpaTalk: Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2022.04.12.488047v1.full.pdf

SpaTalk relies on a graph network and knowledge graph to model and score the ligand-receptor-target signaling network between spatially proximal cells, decomposed from ST data through a non-negative linear model and spatial mapping between single-cell RNA-sequencing and ST data.

SpaTalk was then applied to STARmap, Slide-seq, and 10X Visium data, revealing the in-depth communicative mechanisms underlying normal and disease tissues with spatial structure.

SpaTalk can uncover spatially resolved cell-cell communications for single-cell and spot-based ST data universally, providing new insights into spatial inter-cellular dynamics.





□ Vaeda computationally annotates doublets in single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488440v1.full.pdf

Vaeda (Variational Auto-Encoder for Doublet Annotation) integrates a variational auto-encoder and Positive-Unlabeled learning to produce doublet scores and binary doublet calls.

Vaeda uses a VAE to derive a low-dimensional representation of the input data. A combination of a cluster-aware AE, homotypic doublet exclusion, PU learning w/ a logistic regression type classifier, and including the neighborhood doublet fraction as a feature yielded the best results.





□ CellDrift: Inferring Perturbation Responses in Temporally-Sampled Single Cell Data

>> https://www.biorxiv.org/content/10.1101/2022.04.13.488194v1.full.pdf

CellDrift, a generalized linear model-based functional data analysis method capable of identifying covarying temporal patterns of various cell types in response to perturbations.

CellDrift first captures cell type specific perturbation effects by adding an interaction term in the Generalized Linear Model (GLM) and then utilizes predicted coefficients to calculate contrast coefficients, which represent perturbation effects.





□ SageNet: Supervised spatial inference of dissociated single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488419v1.full.pdf

SageNet, a method that reconstructs latent cell positions by probabilistically mapping cells from a dissociated scRNA-seq query dataset to non-overlapping partitions of a spatial molecular reference.

SageNet estimates a gene interaction network (GIN), which then forms the scaffold for a GNN. SageNet outputs a probabilistic mapping of dissociated cells to spatial partitions, an estimated cell-cell spatial distance matrix, as well as a set of spatially informative genes (SIGs).





□ TITAN: A Toolbox for Information-Theoretic Analysis of Molecular Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488630v1.full.pdf

TITAN, a toolbox in MATLAB and Octave for the reconstruction and graph analysis of molecular networks. Using an information-theoretical approach TITAN reconstructs networks from transcriptional data, revealing the topological structure of correlations in biological systems.

TITAN uses mutual information (MI) / variation of information (VI) to find correlations in molecular data and construct a network. TITAN can be extended to analyze each target as a hub by calculating the betweenness centrality, which is defined as the fraction of all shortest paths that go through a particular node.
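
The betweenness definition above can be made concrete with a small pure-Python sketch (TITAN itself is a MATLAB/Octave toolbox, so this is not its code). It assumes an undirected, unweighted graph given as an adjacency dict and averages over connected node pairs; the function names are hypothetical.

```python
from collections import deque


def shortest_path_counts(graph, source):
    """BFS returning (distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count


def betweenness_fraction(graph, node):
    """Average, over connected pairs (s, t) excluding node, of the share of
    shortest s-t paths that pass through node."""
    dist_v, count_v = shortest_path_counts(graph, node)
    others = [n for n in graph if n != node]
    through, pairs = 0.0, 0
    for i, s in enumerate(others):
        dist_s, count_s = shortest_path_counts(graph, s)
        for t in others[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected
            pairs += 1
            on_path = (node in dist_s and t in dist_v
                       and dist_s[node] + dist_v[t] == dist_s[t])
            if on_path:
                through += count_s[node] * count_v[t] / count_s[t]
    return through / pairs if pairs else 0.0


# Toy network: gene "b" bridges "a" and "c".
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_fraction(toy, "b"))
```

A hub-like node scores close to 1, while peripheral nodes score 0.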





□ iSFun: an R package for integrative dimension reduction analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac281/6571144

Sparse PCA (SPCA), PLS (SPLS), and CCA (SCCA) can possess many strengths of their dense counterparts, while being more stable and more interpretable by having sparse loadings.

Minimax Concave Penalty (MCP)-based penalization is adopted. Group MCP and composite MCP are adopted to tailor different settings. iSFun contains magnitude- and sign-based penalties to promote qualitative similarity of the estimates from multiple datasets.
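
For reference, the scalar MCP has a simple closed form. The sketch below shows only this basic penalty, not the group or composite variants or the similarity penalties in iSFun; `lam` (penalty level) and `gamma` (concavity, greater than 1) are the usual parameter names and are assumptions here.

```python
def mcp_penalty(t, lam, gamma):
    """Minimax Concave Penalty for a single coefficient t.

    Behaves like the lasso penalty lam*|t| near zero, then flattens out
    so that large coefficients are not penalized further.
    """
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


for value in (0.0, 0.5, 1.0, 3.0, 10.0):
    print(value, mcp_penalty(value, lam=1.0, gamma=3.0))
```

The flat region beyond gamma*lam is what reduces the bias on large loadings compared with a plain L1 penalty.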





□ BiocMAP: A Bioconductor-friendly, GPU-Accelerated Pipeline for Bisulfite-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488947v1.full.pdf

The first BiocMAP module performs speedy alignment to a reference genome by Arioc, and requires GPU resources. Methylation extraction and remaining steps are performed in the second module, optionally on a different computing system where GPUs need not be available.

BiocMAP counts the number of reads aligned to each version of the lambda genome, and call these counts for the original and bisulfite-converted versions. This contrasts with the more conventional approach, which involves directly aligning reads to the lambda reference genome.





□ SpaGene: Scalable and model-free detection of spatial patterns and colocalization

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488961v1.full.pdf

SpaGene is built upon a simple intuition that spatially variable genes have uneven spatial distribution, meaning that cells/spots with high expression tend to be more spatially connected than random.

SpaGene uses neighborhood graphs to represent spatial connections, making it more robust to non-uniform cellular densities common in tissues. SpaGene is very flexible and can tune neighborhood search spaces automatically based on the data sparsity.





□ HisCoM-Kernel: Kernel-based hierarchical structural component models for pathway analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac276/6572812

HisCoM-Kernel (Hierarchical structural CoMponent analysis using Kernel), a new approach to model complex effects. HisCoM-Kernel models nonlinear associations between biomarkers and phenotype by extending the kernel machine regression and analyzes entire pathways.





□ Parameter estimation and uncertainty quantification using information geometry

>> https://royalsocietypublishing.org/doi/10.1098/rsif.2021.0940

Exploring the use of techniques from information geometry, including geodesic curves and Riemann scalar curvature, to supplement typical techniques for uncertainty quantification, such as Bayesian methods, profile likelihood, asymptotic analysis and bootstrapping.

The Fisher information defines a Riemann metric on the statistical manifold. Where the Fisher information is not available, the sample-based observed information, computed as the negative Hessian of the log-likelihood function or via Monte Carlo methods, can be used instead.
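
As a one-parameter illustration of the observed information, the negative second derivative of a log-likelihood can be approximated by central finite differences. This is a minimal sketch under assumed choices (a normal model with known standard deviation, a made-up data set, and hypothetical function names), not the paper's code.

```python
import math


def normal_loglik(mu, data, sigma=1.0):
    """Log-likelihood of i.i.d. normal data with unknown mean mu and known sigma."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))


def observed_information(loglik, theta, h=1e-3):
    """Negative second derivative of loglik at theta (central finite difference)."""
    second = (loglik(theta + h) - 2 * loglik(theta) + loglik(theta - h)) / (h * h)
    return -second


data = [1.2, 0.7, 2.1, 1.6, 0.9]  # made-up observations
info = observed_information(lambda m: normal_loglik(m, data, sigma=2.0), theta=1.3)
print(info)  # for this model the exact value is n / sigma^2
```

With several parameters the same idea yields a matrix (the negative Hessian), which then plays the role of the metric.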




□ Detecting epistatic interactions in genomic data using Random Forests

>> https://www.biorxiv.org/content/10.1101/2022.04.26.488110v1.full.pdf

Most Random Forests based methods that claim to detect interactions rely on different forms of variable importance measures that suffer when the interacting variables have very small or no marginal effects.





□ SvAnna: efficient and accurate pathogenicity prediction of coding and regulatory structural variants in long-read genome sequencing

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01046-6

Structural variant Annotation and analysis (SvAnna) assesses all classes of SVs and their intersection with transcripts and regulatory sequences, relating predicted effects on gene function with clinical phenotype data.

SvAnna assesses each variant in the context of its genomic location. SvAnna integrates annotation and prioritization of SVs called in LRS data starting from variant call format (VCF) files produced by LRS SV callers such as pbsv, sniffles, and SVIM.





□ KnotAli: informed energy minimization through the use of evolutionary information

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04673-3

KnotAli takes a multiple RNA sequence alignment as input and uses covariation and thermodynamic energy minimization to predict possibly pseudoknotted secondary structures for each individual sequence in the alignment.

KnotAli first identifies a set of intermediary base pairs utilizing a noise adjusted mutual information metric (MIp). Using the coupling of covariation and thermodynamics, KnotAli is capable of finding possibly pseudoknotted structures in O(Nn^3) time and O(n^2) space.





□ CDHGNN: Identifying disease-associated circRNAs based on edge-weighted graph attention and heterogeneous graph neural network

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490565v1.full.pdf

CDHGNN, a model based on edge-weighted graph attention and heterogeneous graph neural networks for predicting probable circRNA-disease correlations. CDHGNN can find molecular connections and the relevant pathways in pathogenesis.

A unique edge-weighted graph attention network grasps node features since edge weights convey the relevance of associations between nodes. CDHGNN learns contextual information and assigns attention weights on the meta-path in the heterogeneous network.





□ Bi-CCA: Bi-order multimodal integration of single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02679-x

bi-CCA, a novel mathematical solution named bi-order canonical correlation analysis which extends the widely used CCA approach to iteratively align the rows and the columns between data matrices.

Bi-CCA is generally applicable to combinations of any two single-cell modalities. bi-CCA utilizes the full feature information and enables accurate alignment of bipolar cell subtypes between RNA and ATAC data.






Kavka.

2022-05-05 05:04:05 | Science News

(Artwork by Pak)




□ deepSimDEF: deep neural embeddings of gene products and Gene Ontology terms for functional analysis of genes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac304/6583182

deepSimDEF, a deep learning method to automatically learn functional similarity (FS) estimation of gene pairs given a set of genes and their GO annotations. deepSimDEF can be run in two settings: single channel considering sub-ontologies separately, and multi-channel with sub-ontologies combined.

deepSimDEF’s key novelty is its ability to learn low-dimensional embedding vector representations of GO terms and gene products, and then calculate FS using these learned vectors.





□ Statistical correction of input gradients for black box models trained with categorical input features

>> https://www.biorxiv.org/content/10.1101/2022.04.29.490102v1.full.pdf

A new source of noise in input gradients when the input features have a geometric constraint set by a probabilistic interpretation, such as one-hot-encoded DNA sequences. All data lives on a lower-dimensional manifold – a simplex within a higher-dimensional space.

This randomness can introduce unreliable gradient components in directions off the simplex, thereby affecting explanations from gradient-based attribution. A simple correction to input gradients which minimizes the impact of off-simplex-derived gradient noise.
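
One way to remove gradient components that point off the simplex is to project each position's gradient onto the plane where the components sum to zero, which amounts to subtracting the per-position mean across nucleotides. The sketch below illustrates that projection under this assumption; the function name and the toy gradient values are hypothetical.

```python
def correct_gradients(grads):
    """Project per-position input gradients onto the simplex plane.

    grads: list of positions, each a list of gradient components
           (one per category, e.g. A, C, G, T for one-hot DNA).
    Returns gradients whose components sum to zero at every position.
    """
    corrected = []
    for position in grads:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected


# Toy gradients for a 2-nucleotide sequence (made-up numbers).
raw = [[1.0, 2.0, 3.0, 6.0], [0.5, 0.5, 0.5, 0.5]]
print(correct_gradients(raw))
```

In the second position all four components are equal, so the gradient lies entirely off the simplex and is removed completely.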





□ eQTLsingle: Discovering single-cell eQTLs from scRNA-seq data only

>> https://www.sciencedirect.com/science/article/abs/pii/S0378111922003390

Paired sequencing technologies are still immature, and the genome coverage of current single-cell pair-sequencing data is too shallow for effective eQTL analysis. Several previous studies have shown that mutations in gene regions can be reliably detected from RNA-seq data.

eQTLsingle detects mutations from scRNA-seq data and models gene expression of different genotypes with the zero-inflated negative binomial (ZINB) model to find associations between genotypes and phenotypes at the single-cell level.
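
The ZINB distribution mixes a point mass at zero with a negative binomial count distribution. The sketch below writes out its probability mass function under the common mean/dispersion parameterization; the parameter names (`mu`, `theta`, `pi`) and example values are assumptions, and this is not eQTLsingle's fitting code.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
            + theta * math.log(theta / (theta + mu))
            + k * math.log(mu / (theta + mu)))


def zinb_pmf(k, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    if k == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Probability of observing zero counts for a gene with 30% dropout-like zeros.
print(zinb_pmf(0, mu=3.0, theta=2.0, pi=0.3))
```

Comparing fitted parameters between genotype groups is then one way to test for an association, for example with a likelihood-ratio test.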





□ SPCS: a spatial and pattern combined smoothing method for spatial transcriptomic expression

>> https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac116/6563417

Spatial and Pattern Combined Smoothing (SPCS) is a novel two-factor smoothing technique that employs the k-nearest neighbor technique to utilize associations from the transcriptome and Euclidean space of the Spatial Transcriptomic (ST) data.

The SPCS smoothing method produces greater silhouette scores than MAGIC and SAVER. The SPCS method generates a higher ARI score than existing one-factor methods, which means a more accurate histopathological partition can be acquired by performing the two-factor SPCS method.





□ scGraph: a graph neural network-based approach to automatically identify cell types

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac199/6565313

scGraph is a GNN-based automatic cell identification algorithm leveraging gene interaction relationships to enhance the performance of cell type identification.

scGraph automatically learns the gene interaction relationships from biological data, and the pathway enrichment analysis shows findings consistent with previous analyses, providing insights into the analysis of regulatory mechanisms.





□ RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04648-4

RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services.

RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user.





□ scROSHI - robust supervised hierarchical identification of single cells

>> https://www.biorxiv.org/content/10.1101/2022.04.05.487176v1.full.pdf

single cell Robust Supervised Hierarchical Identification of cell types (scROSHI), which utilizes a-priori defined cell type-specific gene sets and does not require training or the existence of annotated data.

Because scROSHI utilizes the hierarchical nature of cell identities, it can outperform its competitor when a sample contains similar cell types that derive from different branches of the lineage tree.





□ BFF and cellhashR: Analysis Tools for Accurate Demultiplexing of Cell Hashing Data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac213/6565315

Bimodal Flexible Fitting (BFF) demultiplexing algorithms BFFcluster and BFFraw, a novel class of algorithms that rely on the single inviolable assumption that barcode count distributions are bimodal.

cellhashR, a new R package that provides integrated QC and a single command to execute and compare multiple demultiplexing algorithms. BFFcluster demultiplexing is both tunable and insensitive to issues with poorly-behaved data that can confound other algorithms.





□ QuasiFlow: a bioinformatic tool for genetic variability analysis from next generation sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.04.05.487169v1.full.pdf

QuasiFlow, a workflow based on well-established software that extracts reliable mutations and recombinations, even at low frequencies (~10^–4), provided that at least 250 million nucleotides are analysed.

To present a robust and accurate assessment of mutation and recombination frequencies, the QuasiFlow/QuasiComparer analysis must rely on the whole genetic variability, and this is clearly dependent on the number of reads for the low-frequency SNVs.





□ Genotype error biases trio-based estimates of haplotype phase accuracy

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487354v1.full.pdf

A method for estimating the genotype error rate from parent-offspring trios and a method for estimating the bias in the observed switch error rate that is caused by genotype error.

Genotype error inflates the observed switch error rate, and the relative bias increases with sample size. The observed switch error rate in the trio offspring is 2.4 times larger than the true switch error rate, and the average distance b/n phase errors is 64 megabases.





□ DeepPerVar: a multimodal deep learning framework for functional interpretation of genetic variants in personal genome

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487809v1.full.pdf

DeepPerVar is essentially a multi-modal DNN, which considers both personal genome and personal traits, as well as their interactions in the model training, to quantitatively predict epigenetic signals and evaluate the functional consequence of genetic variants on an individual level.

DeepPerVar uses the Adam algorithm to minimize the mean square error. Validation loss is evaluated at the end of each training epoch to monitor convergence. The weights of convolutional and dense layers are initialized randomly from the Xavier uniform distribution.
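
Xavier (Glorot) uniform initialization draws each weight from a uniform distribution whose bound depends on the layer's fan-in and fan-out. The following dependency-free sketch shows the rule for a dense layer; the function name and layer sizes are illustrative, and deep learning frameworks ship their own initializers for this.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng=None):
    """Weight matrix (fan_in x fan_out) drawn from U(-a, a), a = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or random.Random()
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hypothetical dense layer with 4 inputs and 2 outputs; seeded for repeatability.
weights = xavier_uniform(4, 2, rng=random.Random(0))
print(weights)
```

Scaling the bound by the layer size keeps the variance of activations roughly constant from layer to layer at the start of training.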





□ BANKSY: A Spatial Omics Algorithm that Unifies Cell Type Clustering and Tissue Domain Segmentation

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488259v1.full.pdf

BANKSY (Building Aggregates with a Neighbourhood Kernel and Spatial Yardstick), an algorithm that unifies cell type clustering and domain segmentation by constructing a product space of cell and neighbourhood transcriptomes, representing cell state and microenvironment.

BANKSY can solve the distinct problems of cell type clustering and tissue domain segmentation within a unified feature augmentation framework. BANKSY is seamlessly inter-operable with the widely used bioinformatics pipelines Seurat, SingleCellExperiment, and Scanpy.





□ A spectral algorithm for polynomial-time graph isomorphism testing

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488296v1.full.pdf

A spectral algorithm to infer quadratic permutations mapping tuples of isomorphic graphs in O(n^4) time. Robustness to degeneracy and multiple isomorphisms are achieved through low dimensional eigenspace projections and iterative perturbations respectively.

The graph isomorphism algorithm identified a correct solution in each experiment. Algorithmic vulnerability to numerical instability was identified in some experiments, necessitating the imposition of numerical tolerances during equality checking operations.





□ Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation

>> https://www.nature.com/articles/s41592-022-01445-y

Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller’s internal score.

Merfin increased the precision of genotyped calls, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from PacBio HiFi and continuous long reads or Oxford Nanopore reads, incl. the first complete human genome.





□ ScisorWiz: Visualizing Differential Isoform Expression in Single-Cell Long-Read Data

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488347v1.full.pdf

ScisorWiz, a streamlined tool to visualize isoform expression differences across single-cell clusters in an informative and easily-communicable manner. ScisorWiz visualizes pre-processed single-cell long-read RNA sequencing data.

ScisorWiz generates a file for all single-cell long reads that can be inspected on the UCSC Genome Browser. ScisorWiz can be run on output generated by scisorseqr or a similarly formatted dataset, which, in turn, can be based on diverse mappers including STAR and minimap2.




□ SOAR: a spatial transcriptomics analysis resource to model spatial variability and cell type interactions

>> https://www.biorxiv.org/content/10.1101/2022.04.17.488596v1.full.pdf

SOAR (Spatial transcriptOmics Analysis Resource), an extensive and publicly accessible resource of spatial transcriptomics data. SOAR is a comprehensive database hosting a total of 1,633 samples from 132 datasets, which were uniformly processed using a standardized workflow.

SOAR provides interactive web interfaces for users to visualize spatial gene expression, evaluate gene spatial variability across cell types, and assess cell-cell interactions.





□ Read2Tree: scalable and accurate phylogenetic trees from raw reads

>> https://www.biorxiv.org/content/10.1101/2022.04.18.488678v1.full.pdf

Read2Tree, a novel approach to infer species trees, which works by directly processing raw sequencing reads into groups of corresponding genes—bypassing genome assembly, annotation, or all-versus-all sequence comparisons.

Read2Tree can also provide accurate trees and species comparisons using only low coverage (0.1x) data sets as well as RNA vs. genomic sequencing, and operates on long or short reads.





□ Hi-LASSO: High-performance Python and Apache spark packages for feature selection with high-dimensional data

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489133v1.full.pdf

High-Dimensional LASSO (Hi-LASSO) is a linear regression-based feature selection model that produces outstanding performance in both prediction and feature selection on high-dimensional data, by theoretically improving Random LASSO.

Hi-LASSO alleviates bias introduced from bootstrapping, refines importance scores, improves the performance by taking advantage of the global oracle property, and provides a statistical strategy to determine the number of bootstraps.





□ Statistical analysis of spatially resolved transcriptomic data by incorporating multi-omics auxiliary information

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489194v1.full.pdf

OrderShapeEM is a generic multiple comparison procedure with auxiliary information that is applicable to many types of omics data. OrderShapeEM calculates the Lfdr based on an empirical Bayesian two-group mixture model.

This framework can annotate each peak with the closest gene and use the corresponding p-values as the auxiliary covariate. One caveat is that this integrative analysis is a marginal based approach and does not incorporate dependence information such as linkage disequilibrium.
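
The local false discovery rate (Lfdr) of a two-group mixture model is the posterior probability that a test comes from the null group. The sketch below uses a deliberately simplified version with a standard normal null and a normal alternative on z-scores; OrderShapeEM's actual estimator (which works with p-values and auxiliary ordering information) is not reproduced here, and all names and parameter values are assumptions.

```python
import math


def normal_pdf(z, mean=0.0, sd=1.0):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def local_fdr(z, pi0, alt_mean, alt_sd=1.0):
    """Lfdr(z) = pi0 * f0(z) / (pi0 * f0(z) + (1 - pi0) * f1(z))."""
    null_part = pi0 * normal_pdf(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)


# Hypothetical setting: 90% nulls, alternatives centered at z = 3.
for z in (0.0, 2.0, 4.0):
    print(z, round(local_fdr(z, pi0=0.9, alt_mean=3.0), 4))
```

Auxiliary information enters such a model by letting the null proportion vary across tests instead of using a single pi0.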





□ The COPILOT Raw Illumina Genotyping QC Protocol

>> https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.373

COPILOT (Containerised wOrkflow for Processing ILlumina genOtyping daTa) has been successfully used to transform raw Illumina genotype intensity data into high-quality analysis-ready data that have been genotyped on a variety of Illumina genotyping arrays.

The COPILOT QC protocol consists of two distinct tandem procedures to process raw Illumina genotyping data. It automates an array of complex bioinformatics analyses to improve data quality through a secondary clustering algorithm and to automatically identify typical GWAS issues.





□ Hist2ST: Spatial Transcriptomics Prediction from Histology jointly through Transformer and Graph Neural Networks

>> https://www.biorxiv.org/content/10.1101/2022.04.25.489397v1.full.pdf

Hist2ST, a spatial information-guided deep learning method for spatial transcriptomic prediction from WSIs. Hist2ST consists of three modules: the Convmixer, Transformer, and graph neural network.

Hist2ST explicitly captures the neighborhood relationships through the graph neural network. These learned features are used to predict the gene expression by following the zero-inflated negative binomial (ZINB) distribution.





□ levioSAM2: Improved sequence mapping using a complete reference genome and lift-over

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489683v1.full.pdf

LevioSAM2 lifts mappings from a source reference to a target reference while selectively remapping the subset of reads for which lifting is not appropriate. LevioSAM2 also improved long read mapping, demonstrated by more accurate small- and structural-variant calling.

LevioSAM2 first sorts the aligned segments by position and stores them in a chain interval array, and builds a pair of genome-length succinct bit vectors. LevioSAM2 queries the chain interval array using the index and updates the contig, strand and position information.
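
The lookup step can be illustrated with a much simpler stand-in: a sorted list of intervals searched with binary search instead of succinct bit vectors. Everything in this sketch (the tuple layout, half-open coordinates, the reverse-strand arithmetic, and the function name) is an assumption for illustration and is not levioSAM2's data structure.

```python
from bisect import bisect_right


def lift_position(intervals, pos):
    """Lift a source position to the target reference.

    intervals: sorted, non-overlapping list of
        (source_start, source_end, target_contig, target_start, strand)
        with half-open [source_start, source_end) coordinates.
    Returns (contig, position, strand), or None if pos falls in a gap,
    in which case the read would be a candidate for remapping.
    """
    starts = [iv[0] for iv in intervals]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    s_start, s_end, contig, t_start, strand = intervals[i]
    if pos >= s_end:
        return None
    offset = pos - s_start
    if strand == "+":
        return contig, t_start + offset, strand
    # Reverse strand: mirror the offset within the interval.
    return contig, t_start + (s_end - s_start - 1) - offset, strand


chain = [(0, 100, "chrA", 1000, "+"), (150, 200, "chrA", 1200, "-")]
print(lift_position(chain, 10), lift_position(chain, 120))
```

A production index would avoid rebuilding the list of starts on every query; it is recomputed here only to keep the sketch short.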





□ scProjection: Projecting clumped transcriptomes onto single cell atlases to achieve single cell resolution

>> https://www.biorxiv.org/content/10.1101/2022.04.26.489628v1.full.pdf

While scProjection computes cell type abundance for a set of populations, its primary goal is to distinguish intra-cell type variation by mapping the RNA sample onto the precise cell state within each of the cell type populations that represents the expression profile of cell types.

scProjection uses individual variational autoencoders (VAEs) trained on each cell population within the single cell atlas to model within-cell type expression variation and delineate the landscape of valid cell states, as well as their relative occurrence.





□ HiFine: integrating Hi-C-based and shotgun-based methods to reFine binning of metagenomic contigs

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac295/6575440

HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs.





□ Methylartist: Tools for Visualising Modified Bases from Nanopore Sequence Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac292/6575433

Methylartist, tools for analysing nanopore-derived modified base data. It is an accessible augmentation to the available tools for analysis and visualisation of nanopore-derived methylation data, incl. the non-CpG modification motifs used in chromatin footprinting assays.

The command "methylartist segmeth" aggregates methylation calls over segments into a table of tab-separated values. Category-based methylation data aggregated with "segmeth" can be plotted as strip plots, violin plots, or ridge plots using the "segplot" command.





□ GEInfo: an R package for gene-environment interaction analysis incorporating prior information

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac301/6575887

Extending a “quasi-likelihood + penalization” approach to linear, logistic, and Poisson regressions. Such models are much more popular in practice.

GEInfo can incorporate prior information and is more flexible by not assuming such information is fully correct. GEInfo performs almost as well as CGEInfo and significantly outperforms GEsgMCP.





□ A Pairwise Imputation Strategy for Retaining Predictive Features When Combining Multiple Datasets

>> https://www.biorxiv.org/content/10.1101/2022.05.04.490696v1.full.pdf

A pairwise imputation method to account for differing feature sets across multiple studies when the goal is to combine information across studies to build a predictive model.

Formal notation for the general pairwise imputation framework to impute study-specific missing genes across multiple studies, as well as the specific ‘Core’ and ‘All’ imputation methods.

Both the ‘Core’ and ‘All’ imputation methods will decrease the RMSE of prediction compared to the omitting method, with ‘Core’ imputation demonstrating better performance than the ‘All’ imputation method.





□ An Entropy Approach for Choosing Gene Expression Cutoff

>> https://www.biorxiv.org/content/10.1101/2022.05.05.490711v1.full.pdf

Annotating cell types using single-cell transcriptome data usually requires binarizing the expression data to distinguish between the background noise vs. real expression or low expression vs. high expression cases.

A common approach is choosing a “reasonable” cutoff value, but it remains unclear how to choose it. The authors propose a simple yet effective approach for finding this threshold value: binarizing the data in a way that minimizes the clustering information loss.





□ scSemiAE: a deep model with semi-supervised learning for single-cell transcriptomics

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04703-0

scSemiAE aims to identify cell subpopulations in scRNA-seq data analysis. It leverages partial cells with labels to guide the learning of an autoencoder for the target datasets.

scSemiAE employs a classifier trained on data w/ known cell type labels to annotate cell types for target datasets, selects the predictions that are true w/ high probability, and learns low-dimensional representations of target datasets guided by partial cells with predicted cell types.





□ Gaining insight into the allometric scaling of trees by utilizing 3d reconstructed tree models - a SimpleForest study

>> https://www.biorxiv.org/content/10.1101/2022.05.05.490069v1.full.pdf

The Reverse Branch Order (RBO) of a cylinder is the maximum depth of the subtree of the segment’s node. The RBO denotes the maximal number of branching splits of the sub-branch growing out of the segment.
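
The RBO definition can be illustrated with a small recursive sketch. The tree layout, the segment names, and the convention that a tip segment has RBO 0 are assumptions made for illustration only; they are not taken from SimpleForest.

```python
from typing import Dict, List


def reverse_branch_order(children: Dict[str, List[str]], node: str) -> int:
    """Maximum depth of the subtree below `node`, counted in branching levels."""
    kids = children.get(node, [])
    if not kids:
        return 0  # assumed convention: a tip segment has no further splits
    return 1 + max(reverse_branch_order(children, kid) for kid in kids)


# Hypothetical tree: each key is a segment, each value lists its child segments.
tree = {
    "trunk": ["branch_a", "branch_b"],
    "branch_a": ["twig_a1", "twig_a2"],
    "twig_a1": ["tip_1", "tip_2"],
}

rbo = {name: reverse_branch_order(tree, name)
       for name in ["trunk", "branch_a", "branch_b", "twig_a1", "tip_1"]}
```

In this toy tree the trunk gets the largest RBO because the deepest chain of splits hangs below it, while "branch_b" is a tip and gets 0.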





□ RSNET: inferring gene regulatory networks by a redundancy silencing and network enhancement technique

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04696-w

In the RSNET algorithm, highly dependent nodes are constrained in the model as network enhancement items to enhance real interactions, and the dimension of putative interactions is reduced adaptively to remove weak and indirect connections.

The network inferred by the RSNET method is a directed network. RSNET can identify the direct causal genes by filtering out the indirect and noisy genes. RSNET combines both linear and nonlinear interactions, which overcomes the drawbacks of purely linear or purely nonlinear methods.





□ Depth normalization for single-cell genomics count data

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490859v1.full.pdf

A monotonic transform on the raw counts that results in a fully depth normalized matrix and offers variance stability similar to sqrt. Depth normalization was assessed by plotting, for each cell, the total raw cell counts vs. the total transformed cell counts.





□ SPINNAKER: an R-based tool to highlight key RNA interactions in complex biological networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04695-x

SPINNAKER (SPongeINteractionNetworkmAKER) is the open-source version of the authors' widely established mathematical model for predicting ceRNA crosstalk, released as an exhaustive collection of R functions.

SPINNAKER applies a logarithmic (log2) transformation to the RNA and miRNA expression levels, conducts a pre-processing step to remove genes with too many missing values among the samples, and computes the Pearson correlation coefficient with miRNAs.
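
The three steps above (log2 transform, missing-value filter, Pearson correlation) can be sketched as follows. This is not SPINNAKER's R code: the function names, the pseudocount of 1, and the 50% missing-value threshold are hypothetical choices for illustration.

```python
import math
from typing import Dict, List, Optional

Row = List[Optional[float]]  # None marks a missing value


def log2_transform(row: Row, pseudocount: float = 1.0) -> Row:
    """log2(x + pseudocount); the pseudocount is an assumption, not from the paper."""
    return [None if v is None else math.log2(v + pseudocount) for v in row]


def drop_sparse_genes(expr: Dict[str, Row], max_missing: float = 0.5) -> Dict[str, Row]:
    """Keep genes whose fraction of missing samples is at most `max_missing`."""
    return {gene: row for gene, row in expr.items()
            if sum(v is None for v in row) / len(row) <= max_missing}


def pearson(x: Row, y: Row) -> float:
    """Pearson correlation over samples observed in both rows."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a, _ in pairs))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for _, b in pairs))
    return cov / (sd_x * sd_y)


# Toy data: one informative gene, one gene that is mostly missing, one miRNA.
genes = {"GENE_A": [1.0, 3.0, 7.0, 15.0], "GENE_B": [None, None, None, 4.0]}
mirna = log2_transform([15.0, 7.0, 3.0, 1.0])

kept = drop_sparse_genes({g: log2_transform(r) for g, r in genes.items()})
r_gene_a = pearson(kept["GENE_A"], mirna)
```

A strongly negative gene-miRNA correlation, as for GENE_A here, is the kind of signal a sponge-interaction analysis would look at; constant rows would need a guard against division by zero.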





□ iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkac351/6582173

iFeatureOmega supplies the largest number of feature extraction and analysis approaches for most molecule types compared to other pipelines. It integrates 15 feature analysis methods incl. ten clustering, three dimensionality reduction and two feature normalization algorithms.

iFeatureOmega covers six correlation and covariance measures for individual amino acid sequences, summarized in the ‘autocorrelations’ category. Two sequence order-based features can also be calculated by iFeatureOmega in the ‘quasi-sequence-order’ category.





□ Sparse sliced inverse regression for high dimensional data analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04700-3

Obtaining sparse estimates of the eigenvectors that constitute the basis matrix that is used to construct the indices is desirable to facilitate variable selection, which in turn facilitates interpretability and model parsimony.

A convex formulation that produces simultaneous dimension reduction and variable selection. A group-Dantzig selector type formulation that induces row-sparsity to the sliced inverse regression dimension reduction vectors.





7.

2022-05-05 05:03:05 | Science News




□ scSpace: Reconstruction of the cell pseudo-space from single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.05.07.491043v1.full.pdf

single-cell Spatial Position Associated Co-Embeddings (scSpace), an integrative algorithm to distinguish spatially variable cell subclusters by reconstructing cells onto a pseudo-space with spatial transcriptome references.

scSpace projects single cells into a pseudo-space via a Multi-layer Neural Network model, so that gene expression graph and spatial graph of cells can be embedded jointly for the further spatial reconstruction and space-informed cell clustering with higher accuracy and precision.





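Resource-intensive analysis techniques cannot be applied systematically to high-throughput bulk data. For that reason, simulating predefined pseudo-bulk data with data-driven scaling factors makes it possible to reproduce the statistical features of real data.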



□ SimBu: Bias-aware simulation of bulk RNA-seq data with variable cell type composition

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490889v1.full.pdf

SimBu is a user-friendly and flexible tool for simulating realistic pseudo-bulk RNA-seq datasets serving as in silico gold-standard for assessing cell-type deconvolution methods.

A unique feature of SimBu is the modelling of cell-type-specific mRNA bias using experimentally or data-driven scaling factors. SimBu can use Smart-seq2 or 10x Genomics data to generate pseudo-bulk data that faithfully reflects the statistical features of true bulk RNA-seq data.
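
A minimal sketch of the idea, assuming the pseudo-bulk profile is the sum of sampled single-cell profiles, each multiplied by a cell-type-specific scaling factor. The function name, argument layout, and toy numbers are hypothetical and do not reflect SimBu's actual R interface.

```python
import random
from typing import Dict, List


def simulate_pseudobulk(cells_by_type: Dict[str, List[List[float]]],
                        fractions: Dict[str, float],
                        scaling: Dict[str, float],
                        n_cells: int,
                        seed: int = 0) -> List[float]:
    """Sum scaled expression profiles of cells sampled by cell-type fraction."""
    rng = random.Random(seed)
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    for cell_type, fraction in fractions.items():
        n_sampled = round(fraction * n_cells)
        for _ in range(n_sampled):
            # Sample with replacement from the reference cells of this type.
            profile = rng.choice(cells_by_type[cell_type])
            for gene_index, count in enumerate(profile):
                bulk[gene_index] += scaling[cell_type] * count
    return bulk


# Toy reference: one cell per type, three genes.
reference = {"T cell": [[2.0, 0.0, 1.0]], "B cell": [[0.0, 4.0, 1.0]]}
bulk = simulate_pseudobulk(
    reference,
    fractions={"T cell": 0.75, "B cell": 0.25},
    scaling={"T cell": 1.0, "B cell": 2.0},  # assumed mRNA-content factors
    n_cells=8,
)
```

Because the B cell scaling factor is 2, its genes contribute twice as much per cell as a plain sum would give, which is the kind of bias the scaling factors are meant to model.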





□ RECODE: Resolution of the curse of dimensionality in single-cell RNA sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490246v1.full.pdf

RECODE (resolution of the curse of dimensionality) consistently eliminates COD in relevant scRNA-seq data with unique molecular identifiers. RECODE employs different principles and exhibits superior overall performance in cell-clustering and single-cell level analysis.

RECODE does not involve dimension reduction and recovers expression values for all genes, including lowly expressed genes, realizing precise delineation of cell-fate transitions and identification of rare cells with all gene information.





□ SMURF: embedding single-cell RNA-seq data with matrix factorization preserving selfconsistency

>> https://www.biorxiv.org/content/10.1101/2022.04.22.489140v1.full.pdf

SMURF embeds cells and genes into their latent space vectors utilizing matrix factorization with a mixture of Poisson-Gamma divergence as the objective while preserving self-consistency. SMURF exhibited feasible cell subpopulation discovery efficacy with the latent vectors.

SMURF can embed the cell latent vectors into a 1D-oval and recover the time course of the cell cycle. SMURF showed the most robust gene expression recovery power, with low root mean square error and high Pearson correlation.





□ TopoGAN: Unsupervised manifold alignment of single-cell data

>> https://www.biorxiv.org/content/10.1101/2022.04.27.489829v1.full.pdf

TopoGAN, a topology-preserving multi-modal alignment of two single-cell modalities w/ non-overlapping cells or features. TopoGAN finds topology-preserving latent representations of the different modalities, which are then aligned in an unsupervised way using a topology-guided GAN.

The latent space representations of the two modalities are aligned in a topology-preserving manner. TopoGAN uses a topological autoencoder, which chooses point-pairs that are crucial in defining the topology of the manifold instead of trying to optimize all possible point-pairs.





□ AutoClass: A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

>> https://www.nature.com/articles/s41467-022-29576-y

AutoClass integrates two DNN components, an Autoencoder and a Classifier, to maximize both noise removal and signal retention. AutoClass is distribution agnostic as it makes no assumption on specific data distributions, hence can effectively clean a wide range of noise and artifacts.

AutoClass is robust to key hyperparameter settings, i.e. bottleneck layer size, pre-clustering number and classifier weight. AutoClass does not presume any specific type or form of data distribution, hence has the potential to correct a wide range of noise and non-signal variance.





□ TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles

>> https://www.biorxiv.org/content/10.1101/2022.04.28.489926v1.full.pdf

TAMPA (Taxonomic metagenome profiling evaluation) is a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community.

TAMPA allows users to choose among multiple graph layout formats, including pie, bar, circle and rectangular. TAMPA can illuminate important biological differences between the two tools and the ground truth at the phylum level, as well as at all other taxonomic ranks.





□ Threshold Values for the Gini Variable Importance: A Empirical Bayes Approach

>> https://www.biorxiv.org/content/10.1101/2022.04.06.487300v1.full.pdf

It is highly desirable that RF models be made more interpretable and a large part of that is a better understanding of the characteristics of the variable importance measures generated by the RF. Considering the mean decrease in node “impurity” (MDI) variable importance (VI).

Efron’s “local fdr” approach, calculated from an empirical Bayes estimate of the null distribution. The distribution may be multi-modal, which creates modelling difficulties; the null distribution is not of an obvious form, as it is not symmetric.




□ Weighted Kernels Improve Multi-Environment Genomic Prediction

>> https://www.biorxiv.org/content/10.1101/2022.04.10.487783v1.full.pdf

A flexible GS framework capable of incorporating important genetic attributes to breeding populations and trait variability while addressing the shortcomings of conventional GS models.

Compared to the existing Gaussian Kernel (GK) that assigns a uniform weight to every SNP, this Weighted Kernel (WK) captured more robust genetic relationships among individuals within and across environments by differentiating the contribution of SNPs.
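
The contrast between a uniform and a weighted kernel can be sketched as below. The functional form (a Gaussian of a weight-normalized squared distance), the `weights` vector, and the `bandwidth` parameter are assumptions for illustration; the preprint's exact weighting scheme is not given in these notes.

```python
import math
from typing import List


def weighted_gaussian_kernel(genotypes: List[List[float]],
                             weights: List[float],
                             bandwidth: float = 1.0) -> List[List[float]]:
    """K[i][j] = exp(-bandwidth * weighted mean squared genotype difference)."""
    total_weight = sum(weights)
    n = len(genotypes)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = sum(w * (a - b) ** 2
                           for w, a, b in zip(weights, genotypes[i], genotypes[j]))
            kernel[i][j] = math.exp(-bandwidth * distance / total_weight)
    return kernel


# Three individuals, three SNPs coded 0/1/2 (toy data).
genotypes = [[0, 1, 2], [0, 1, 0], [2, 1, 2]]

gk = weighted_gaussian_kernel(genotypes, weights=[1.0, 1.0, 1.0])  # uniform weights
wk = weighted_gaussian_kernel(genotypes, weights=[0.1, 0.1, 2.8])  # third SNP emphasized
```

With uniform weights, individuals 1 and 2 look equally related to individual 0; once the third SNP is up-weighted, individual 2 (which matches individual 0 at that SNP) becomes the closer relative.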





□ Pangolin: Predicting RNA splicing from DNA sequence

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02664-4

Pangolin can predict the usage of a splice site in addition to the probability that it is spliced. Pangolin improves prediction of the impact of genetic variants on RNA splicing, including common, rare, and lineage-specific genetic variation.

Pangolin’s architecture resembles that used in SpliceAI, which allows modeling of features from up to 5000 base pairs. Pangolin identifies loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense.





□ MISTy: Explainable multiview framework for dissecting spatial relationships from highly multiplexed data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02663-5

MISTy facilitates an in-depth understanding of marker interactions by profiling the intra- and intercellular relationships. MISTy builds multiple views focusing on different spatial or functional contexts to dissect different effects.

MISTy allows for a hypothesis-driven composition of views that fit the application of interest. The views capture functional relationships, such as pathway activities and crosstalk, cell-type-specific relationships, or focus on relations b/n different anatomical regions.





□ wenda_gpu: fast domain adaptation for genomic data

>> https://www.biorxiv.org/content/10.1101/2022.04.09.487671v1.full.pdf

Weighted elastic net domain adaptation exploits the complex biological interactions that exist between genomic features to maximize transferability to a new context.

wenda_gpu uses GPyTorch, which provides efficient and modular Gaussian process inference. Using wenda_gpu, completing the whole prediction task on genome-wide datasets with tens of thousands of features is thus feasible in a single day on a single GPU-enabled computer.





□ CHAPAO: Likelihood and hierarchical reference-based representation of biomolecular sequences and applications to compressing multiple sequence alignments

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0265360

CHAPAO (COmpressing Alignments using Hierarchical and Probabilistic Approach) is a new lossless compression method designed especially for multiple sequence alignments (MSAs) of biomolecular data.

CHAPAO combines likelihood-based analyses of the sequence similarities with graph-theoretic algorithms. CHAPAO achieved more compression on MSAs with a lower average pairwise Hamming distance among the sequences.
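
For reference, the average pairwise Hamming distance of an alignment can be computed as in this small sketch. It is not part of CHAPAO; treating a gap character as an ordinary symbol is an assumption made here.

```python
from itertools import combinations
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of alignment columns in which two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def average_pairwise_hamming(msa: List[str]) -> float:
    """Mean Hamming distance over all unordered pairs of aligned sequences."""
    pairs = list(combinations(msa, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Toy alignment of three sequences; '-' is a gap and counts as a symbol here.
msa = ["ACG-T", "ACGAT", "TCG-T"]
average_distance = average_pairwise_hamming(msa)
```

A low value means the rows are nearly identical, which is the regime where a reference-based representation has the most redundancy to exploit.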





□ Using topic modeling to detect cellular crosstalk in scRNA-seq

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009975

A new method based on Latent Dirichlet Allocation (LDA) for detecting genes that change as a result of interaction. This method does not require prior information in the form of clustering or generation of synthetic reference profiles.

The model has been applied to two datasets of sequenced PICs and a dataset generated by standard 10x Chromium. Its approach assumes there is a reference population that can be used to fit the first LDA; for example this could be populations before an interaction has occurred.





□ DRUMMER—Rapid detection of RNA modifications through comparative nanopore sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac274/6569078

DRUMMER (Detection of Ribonucleic acid Modifications Manifested in Error Rates) utilizes a range of statistical tests and background noise correction to identify modified nucleotides, operates w/ similar sensitivity to signal-level analysis, and correlates very well w/ orthogonal approaches.

DRUMMER can process both genome-level and transcriptome-level alignments. DRUMMER uses sequence read alignments against a genome to predict the location of putative RNA modifications in a genomic context.





□ Neural network approach to somatic SNP calling in WGS samples without a matched control.

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488223v1.full.pdf

A neural network-based approach for calling somatic single nucleotide polymorphism (SNP) variants in tumor WGS samples without a matched normal.

The method relies on recent advances in artefact filtering as well as on state-of-the-art approaches to germline variant removal in single-sample calling. At the core of the method is a neural network classifier trained using 3D tensors consisting of piled-up variant reads.





□ BioAct: Biomedical Knowledge Base Construction using Active Learning

>> https://www.biorxiv.org/content/10.1101/2022.04.14.488416v1.full.pdf

BioAct is based on a partnership between automatic annotation methods (leveraging SciBERT with other machine learning models) and subject matter experts, and uses active learning to create training datasets in the biological domain.

BioAct can be used to effectively increase the ability of a model to construct a correct knowledge base. The labels created using BioAct continuously improve the ability of a model to augment an existing seed knowledge base through many iterations of active learning.





□ Rye: genetic ancestry inference at biobank scale

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488477v1.full.pdf

Rye (Rapid ancestrY Estimation) is a large scale global ancestry inference algorithm that works from principal component analysis (PCA) data. The PCA data (eigenvector and eigenvalue) reduces the massive genomic scale comparison to a much smaller matrix solving problem.

Rye infers GA based on PCA of genomic variant samples from ancestral reference populations and query individuals. The algorithm’s accuracy is powered by Metropolis-Hastings optimization and its speed is provided by non-negative least squares (NNLS) regression.





□ USAT: a Bioinformatic Toolkit to Facilitate Interpretation and Comparative Visualization of Tandem Repeat Sequences

>> https://www.biorxiv.org/content/10.1101/2022.04.15.488513v1.full.pdf

A conversion between sequence-based alleles and length-based alleles (i.e., the latter being the current allele designations in the CODIS system) is needed for backward compatibility purposes.

Universal STR Allele Toolkit (USAT) provides a comprehensive set of functions to analyze and visualize TR alleles, including the conversion between length-based alleles and sequence-based alleles, nucleotide comparison of TR haplotypes and an atlas of allele distributions.





□ Dug: A Semantic Search Engine Leveraging Peer-Reviewed Knowledge to Query Biomedical Data Repositories

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac284/6571145

Dug applies semantic web and knowledge graph methods to improve the FAIR-ness of research data. A key obstacle to leveraging this knowledge is the lack of researcher tools to navigate from a set of concepts of interest towards relevant study variables. In a word, search.

Dug's ingest uses the Biolink upper ontology to annotate knowledge graphs and structure queries used to drive full text indexing and search. It uses Monarch Initiative APIs to perform named entity recognition on natural language prose to extract ontology identifiers.





□ Persistent Memory as an Effective Alternative to Random Access Memory in Metagenome Assembly

>> https://www.biorxiv.org/content/10.1101/2022.04.20.488965v1.full.pdf

PMem is a cost-effective option to extend the scalability of metagenome assemblers without requiring software refactoring, and this likely applies to similar memory-intensive bioinformatics solutions.





□ Fast and robust imputation for miRNA expression data using constrained least squares

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04656-4

A novel, fast method for data imputation using constrained Conjugate Gradient Least Squares (CGLS) borrowing ideas from the imaging and inverse problems literature.

The method will be denoted by Fast Linear Imputation. Reconstructing the missing data via nonnegative constrained regression, but with the further constraint that the regression weights sum to 1.
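
The constraint set described above (nonnegative regression weights that sum to 1) can be illustrated with a small solver. Note the swap: this sketch uses plain projected gradient descent onto the probability simplex, not the constrained CGLS solver from the paper, and all names and toy data are hypothetical.

```python
from typing import List


def project_to_simplex(v: List[float]) -> List[float]:
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    theta = 0.0
    cumulative = 0.0
    for j, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        candidate = (cumulative - 1.0) / j
        if u - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [max(x - theta, 0.0) for x in v]


def simplex_least_squares(columns: List[List[float]], target: List[float],
                          iterations: int = 2000) -> List[float]:
    """Minimize ||A w - target||^2 over the simplex; `columns` are the columns of A."""
    k, m = len(columns), len(target)
    # 1 / ||A||_F^2 never exceeds 1 / (largest eigenvalue of A^T A): a safe step size.
    step = 1.0 / sum(x * x for col in columns for x in col)
    w = [1.0 / k] * k
    for _ in range(iterations):
        residual = [sum(w[c] * columns[c][r] for c in range(k)) - target[r]
                    for r in range(m)]
        gradient = [sum(col[r] * residual[r] for r in range(m)) for col in columns]
        w = project_to_simplex([wi - step * gi for wi, gi in zip(w, gradient)])
    return w


# Observed features of three reference samples and of the sample to impute.
references = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
observed = [0.3, 0.7, 1.3, 2.4]  # equals 0.3 * ref0 + 0.7 * ref1

weights = simplex_least_squares(references, observed)
# Impute the missing feature as the same convex combination of reference values.
missing_feature_in_refs = [10.0, 20.0, 30.0]
imputed = sum(w * v for w, v in zip(weights, missing_feature_in_refs))
```

Because the weights form a convex combination, the imputed value always lies between the smallest and largest reference values for that feature.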





□ SCAMPP: Scaling Alignment-based Phylogenetic Placement to Large Trees

>> https://ieeexplore.ieee.org/document/9763324/

SCAMPP (SCAlable alignMent-based Phylogenetic Placement), a technique to extend the scalability of these likelihood-based placement methods to ultra-large backbone trees.





□ Spycone: Systematic analysis of alternative splicing in time course data

>> https://www.biorxiv.org/content/10.1101/2022.04.28.489857v1.full.pdf

Spycone uses gene or isoform expression as an input. Spycone features a novel method for IS detection and employs the sum of changes of all isoforms' relative abundances (total isoform usage) across time points.
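
A rough sketch of the total isoform usage idea for a single gene. Using absolute differences between consecutive time points is an assumption made here; Spycone's exact definition may differ, and the function names are hypothetical.

```python
from typing import Dict, List


def relative_abundance(isoform_expr: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Each isoform's share of the gene's total expression at every time point."""
    n_timepoints = len(next(iter(isoform_expr.values())))
    totals = [sum(values[t] for values in isoform_expr.values())
              for t in range(n_timepoints)]
    return {iso: [values[t] / totals[t] if totals[t] > 0 else 0.0
                  for t in range(n_timepoints)]
            for iso, values in isoform_expr.items()}


def total_isoform_usage(isoform_expr: Dict[str, List[float]]) -> List[float]:
    """Sum over isoforms of |change in relative abundance| between adjacent time points."""
    rel = relative_abundance(isoform_expr)
    n_timepoints = len(next(iter(rel.values())))
    return [sum(abs(r[t + 1] - r[t]) for r in rel.values())
            for t in range(n_timepoints - 1)]


# Toy gene with two isoforms whose dominance switches over three time points.
gene = {"iso1": [8.0, 5.0, 2.0], "iso2": [2.0, 5.0, 8.0]}
usage = total_isoform_usage(gene)
```

A gene whose isoform proportions stay constant gets a usage of 0 at every step, even if its overall expression changes.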

Spycone provides downstream analysis such as clustering by total isoform usage, i.e. grouping genes that are most likely to be coregulated, and network enrichment, i.e. extracting subnetworks or pathways that are over-represented by a list of genes.





□ scMOO: Imputing dropouts for single-cell RNA sequencing based on multi-objective optimization

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac300/6575885

scMOO is different from existing methods, which assume that the underlying data has a preconceived structure and impute the dropouts according to the information learned from such structure.

scMOO instead assumes that the data combines three types of latent structures: the horizontal structure (genes are similar to each other), the vertical structure (cells are similar to each other), and the low-rank structure.

The combination weights and latent structures are learned using multi-objective optimization. The weighted average of the observed data and the imputation results learned from the three types of structures is taken as the final result.





□ Improving the RNA velocity approach using long-read single cell sequencing

>> https://www.biorxiv.org/content/10.1101/2022.05.02.490352v1.full.pdf

Region velocity is a multi-platform and multi-model parameter to project cell state, which is based on long-read scRNA-seq.

Region velocity is primarily observed through the spindle-shaped relationship between the number of exons and introns in different genes, representing a steady-state model of the original RNA velocity parameter, and their correlation level varies in different genes.





□ GEMmaker: process massive RNA-seq datasets on heterogeneous computational infrastructure

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04629-7

GEMmaker is an nf-core compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets.

GEMmaker ensures results are highly reproducible through the use of versioned containerized software that can be executed on a single workstation, institutional compute cluster, Kubernetes platform or the cloud.





□ Hierarch: Analyzing nested experimental designs—A user-friendly resampling method to determine experimental significance

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010061

Hierarch can be used to perform hypothesis tests that maintain nominal Type I error rates and generate confidence intervals that maintain the nominal coverage probability without making distributional assumptions about the dataset of interest.



□ HGGA: hierarchical guided genome assembler

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04701-2

HGGA is a method for assembling read data with the help of genetic linkage maps. In the reported comparisons, HGGA produces either more misassemblies than Kermit or a similar number, and in both cases fewer than miniasm.

HGGA does not do scaffolding, the process of ordering the contigs into scaffolds where contigs are separated by gaps. A scaffolding method could be run after HGGA to further increase the contiguity of the assembly. HGGA is inherently easy to parallelize beyond a single machine.





□ CSREP: A framework for summarizing chromatin state annotations within and identifying differential annotations across groups of samples

>> https://www.biorxiv.org/content/10.1101/2022.05.08.491094v1.full.pdf

CSREP takes as input chromatin state annotations for a group of samples and then probabilistically estimates the state at each genomic position and derives a representative chromatin state map for the group.

CSREP uses an ensemble of multi-class logistic regression classifiers to predict the chromatin state assignment of each sample given the state maps from all other samples.





□ Limited overlap of eQTLs and GWAS hits due to systematic differences in discovery

>> https://www.biorxiv.org/content/10.1101/2022.05.07.491045v1.full.pdf

eQTLs cluster strongly near transcription start sites, while GWAS hits do not. Genes near GWAS hits are enriched in numerous functional annotations, are under strong selective constraint and have a complex regulatory landscape across different tissue/cell types.





□ TT-Mars: structural variants assessment based on haplotype-resolved assemblies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02666-2

TT-Mars takes advantage of the recent production of high-quality haplotype-resolved genome assemblies by providing false discovery rates for variant calls based on how well their call reflects the content of the assembly, rather than comparing calls themselves.

TT-Mars inherently provides a rough estimate of sensitivity because it does not fit into the paradigm of comparing inferred content, and requires variants to be called.

This estimate simply considers false negatives as variants detected by haplotype-resolved assemblies that are not within the vicinity of the validated calls, and one should consider a class of variant that may have multiple representations when reporting results.





□ MuSiC2: cell type deconvolution for multi-condition bulk RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2022.05.08.491077v1.full.pdf

MuSiC2 is an iterative algorithm that aims to improve cell type deconvolution for bulk RNA-seq data using scRNA-seq data as reference when the bulk data are generated from samples with multiple clinical conditions where at least one condition is different from the scRNA-seq reference.

MuSiC2 takes two datasets as input: an scRNA-seq dataset generated from one clinical condition, and a bulk RNA-seq dataset collected from samples with multiple conditions in which one or more is different from the single-cell reference data.





□ Assembly-free discovery of human novel sequences using long reads

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490971v1.full.pdf

An Assembly-Free Novel Sequence (AF-NS) approach performs quick identification of novel sequences without an assembly step. The novel sequences detected by AF-NS covered over 90% of Illumina novel sequences and contained more DNA information missing from the Illumina data.

A single read can decipher large structural variations, which guarantees the feasibility of AF-NS to discover novel sequences at read level. All ONT long reads were aligned to references using minimap2 (version 2.17-r941), and reads with unmapped fragments longer than 300bp were selected.
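
One way to picture the read-selection step is to scan the CIGAR string of each alignment for long soft-clipped ends. Treating soft clips as the "unmapped fragments" is an assumption of this sketch; the notes do not say how AF-NS identifies them, and the helper names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def longest_soft_clip(cigar: str) -> int:
    """Length of the longest soft-clipped ('S') segment in a CIGAR string."""
    return max((int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S"),
               default=0)


def has_unmapped_fragment(cigar: str, min_length: int = 300) -> bool:
    """True if some soft-clipped segment is longer than `min_length` bases."""
    return longest_soft_clip(cigar) > min_length


# Hypothetical alignments: read name -> CIGAR string.
alignments = {
    "read_1": "350S1000M",
    "read_2": "100S1000M200S",
    "read_3": "1000M",
    "read_4": "500M301S",
}
selected = [name for name, cigar in alignments.items() if has_unmapped_fragment(cigar)]
```

The 300 bp default mirrors the threshold quoted above; reads that are entirely unaligned would need separate handling, since they carry no CIGAR operations.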





□ MAGNETO: an automated workflow for genome-resolved metagenomics

>> https://www.biorxiv.org/content/10.1101/2022.05.06.490992v1.full.pdf

MAGNETO is an automated workflow dedicated to MAG reconstruction. It includes a fully-automated co-assembly step informed by optimal clustering of metagenomic distances, and implements complementary genome binning strategies for improving MAG recovery.



□ The limitations of the theoretical analysis of applied algorithms

>> https://arxiv.org/pdf/2205.01785.pdf

Merge sort runs in O(n log n) worst-case time, which formally means that there exists a constant c such that for any large-enough input of n elements, merge sort takes at most cn log n time.
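
The bound can be checked empirically with a small sketch that counts comparisons as a stand-in for running time. For this top-down variant the comparison count never exceeds n·⌈log2 n⌉, so c = 1 works when time is measured in comparisons; the input size and random seed are arbitrary choices.

```python
import math
import random
from typing import List, Tuple


def merge_sort(items: List[int]) -> Tuple[List[int], int]:
    """Return the sorted list and the number of element comparisons performed."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, left_count = merge_sort(items[:mid])
    right, right_count = merge_sort(items[mid:])
    merged: List[int] = []
    comparisons = left_count + right_count
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest is appended without further comparisons.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


rng = random.Random(42)
data = [rng.randrange(10_000) for _ in range(1000)]
sorted_data, comparisons = merge_sort(data)
bound = len(data) * math.ceil(math.log2(len(data)))
```

Comparison counts ignore memory traffic and cache effects, which is part of why a worst-case bound can say little about measured running time.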

The case study is bioinformatics, an inter-disciplinary field that uses algorithms to extract biological meaning from genome sequencing data. The paper demonstrates two concrete examples of how theoretical analysis has failed to achieve its goals, but also gives one encouraging example of success.





Genuine.

2022-05-03 03:05:03 | Music20


Valravn / “Genuine”

@ LavaStudios, Copenhagen

Composer: Christopher Juul, Anna Katrin Egilstrød
Feat: Anders Ådin, Jonas Bleckman
Lyricist: Anna Katrin Egilstrød
Producer: Christopher Juul
Language: English


Somewhere other than here

We coexist being a part

A picture framed

Sane

Illuminated by a flame

Playing shadow games

Humans remain 
A mystery
 Unexplained


Somewhere it might be here

I throw my self against the wall

A free fall
 And 
Re-collect the faltered parts

Shooting darts

Hunting hearts

My inside is mine alone

I'm coming undone



Simply being a part of

Traveling alone from the start

Simply being a part of

Traveling alone from the start

and somehow cohere



I want to see you while I'm here



Simply being a part of 

Traveling alone from the start

Simply being a part of

Traveling alone from the start

and somehow cohere


ODESZA - “Light of Day” (feat. Ólafur Arnalds)

2022-05-03 03:03:03 | Music20


□ ODESZA - “Light of Day” (feat. Ólafur Arnalds)

>> https://odesza.com/

I'm looking forward to hearing Icelandic composer Ólafur Arnalds on ODESZA's album “The Last Goodbye”, due out in July 😌✨ If I get to hear Ólafur's melancholy strings over ODESZA's borderless, cool dance tracks, I might just cry 😳🔊