lens, align.

Long is the time, but the true comes to pass.

Ostiarius.

2022-07-31 23:57:37 | Science News




□ OmegaFold: High-resolution de novo structure prediction from primary sequence

>> https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1.full.pdf

OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and antibodies that tend to have noisy MSAs due to fast evolution.

OmegaFold combines a large pretrained language model for sequence modeling and a geometry-inspired transformer. It learns single- and pairwise-residue embeddings. A stack of Geoformer layers then iteratively updates these embeddings to improve their geometric consistency.





□ HYFA: Hypergraph factorisation for multi-tissue gene expression imputation

>> https://www.biorxiv.org/content/10.1101/2022.07.31.502211v1.full.pdf

HYFA (Hypergraph Factorisation), a parameter-efficient graph representation learning approach for joint multi-tissue and cell-type GE imputation. Through transfer learning on a paired single-nucleus RNA-seq dataset (GTEx-v9), HYFA resolves cell-type signatures from bulk GE.

HYFA imputes tissue-specific GE via a specialised graph neural network operating on a hypergraph of metagenes. HYFA is genotype-agnostic, supports a variable number of collected tissues, and imposes strong inductive biases to leverage the shared regulatory architecture.





□ HiCoEx: Prediction of Gene Co-expression from Chromatin Contacts with Graph Attention Network

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac535/6656345

HiCoEx, a novel machine learning framework based on graph neural networks. HiCoEx is able to automatically capture important patterns for the prediction of co-expression from chromosomal contacts between genes, and visualize the gene-gene interactions for mechanistic exploration.

HiCoEx calculates topological properties including Clustering Coefficient, Jaccard Index and Shortest path length. The Pearson Correlation Coefficient (PCC) for each topological property is computed between the genes and their neighborhoods in the embedding space.





□ GIANT: A unified analysis of atlas single cell data

>> https://www.biorxiv.org/content/10.1101/2022.08.06.503038v1.full.pdf

GIANT integrates multi-modality and multi-tissue data. GIANT first converts datasets from different modalities into gene graphs, and then recursively embeds genes in the graphs into a latent space without additional alignment.

A dendrogram is then built to connect the gene graphs in a hierarchy. In recursive projection, a dendrogram is used to enforce similarity constraints across graphs while still allowing genes with multiple functions to be projected to different locations in the embedding space.





□ Exact polynomial-time isomorphism testing in directed graphs through comparison of vertex signatures in Krylov subspaces.

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501884v1.full.pdf

Graph Krylov subspaces, which contain products of vectors and exponentiated adjacency matrices, are closely related to the tensor of eigenprojections, presenting a related avenue for isomorphism research.

Recursive exponentiation may also cause either vanishing or explosive growth of Krylov matrix elements. This problem may be addressed in some cases by normalising vectors.

A “vertex signature” is defined by initialising a Krylov matrix with a binary vector indicating the vertex position. The isomorphic mapping may be constructed iteratively in O(n^5) time by building a set of vertex analogies sequentially.
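As a rough illustration of the vertex-signature idea, the sketch below builds a Krylov matrix from the indicator vector of a vertex and successive powers of the adjacency matrix. The function names, the example graph, and the choice of summarising the matrix by its sorted rows are assumptions made for this example, not the paper's exact construction, and no normalisation is applied.

```python
def krylov_matrix(adj, v, depth=None):
    """Rows are vertices; column k is A^k applied to the indicator vector of v."""
    n = len(adj)
    depth = n if depth is None else depth
    vec = [1 if i == v else 0 for i in range(n)]
    columns = [vec]
    for _ in range(depth - 1):
        vec = [sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        columns.append(vec)
    return [[columns[k][i] for k in range(depth)] for i in range(n)]


def vertex_signature(adj, v):
    # Assumption: sorting the rows gives a summary that ignores vertex labels.
    return tuple(sorted(tuple(row) for row in krylov_matrix(adj, v)))


def relabel(adj, perm):
    """Return the adjacency matrix after renaming vertex i to perm[i]."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[perm[i]][perm[j]] = adj[i][j]
    return out


# Hypothetical directed graph: a 3-cycle with one extra outgoing edge.
adj = [[0] * 4 for _ in range(4)]
for a, b in [(0, 1), (1, 2), (2, 0), (2, 3)]:
    adj[a][b] = 1
perm = [2, 0, 3, 1]
adj_relabelled = relabel(adj, perm)
```

In this toy case the signature of vertex v in the original graph equals the signature of perm[v] in the relabelled graph. Matching signatures are only a necessary condition in this sketch; it does not reproduce the sequential construction of vertex analogies.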





□ Hierarchical Interleaved Bloom Filter: Enabling ultrafast, approximate sequence queries

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502266v1.full.pdf

The HIBF data structure has enormous potential. It can be used on its own like in the tool Raptor, or can serve as a prefilter to distribute more advanced analyses such as read mapping.

Since the build time is more than two orders of magnitude lower than that of comparable tools like Mantis and Bifrost, the HIBF can easily be rebuilt even for huge data sets.

The HIBF builds an index up to 211 times faster, uses up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129. This can be considered a quantum leap that opens the door to indexing complete sequence.





□ ZetaSuite: computational analysis of two-dimensional high-throughput data from multi-target screens and single-cell transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02729-4

Zeta is Z-based estimation of global splicing regulators. Zeta statistics can maximally segregate high-quality cells from damaged ones while minimizing unwanted artifacts. ZetaSuite is a computational framework initially developed to process the data from a siRNA screen.

ZetaSuite generates a Z-score for each AS event against each targeting RNA in the data matrix and then computes the number of hits at each Z-score cutoff from low to high and in both directions to separately quantify induced exon skipping or inclusion events.
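The Z-score and hit-counting steps described above can be sketched as follows. This is a minimal illustration, assuming a matrix with one row per targeting RNA and one column per AS event, and assuming each event is Z-scored across all targeting RNAs; the function names and cutoffs are hypothetical and are not ZetaSuite's actual interface.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Z-score every column (AS event) across rows (targeting RNAs)."""
    stats = [(mean(col), pstdev(col)) for col in zip(*matrix)]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, (m, s) in zip(row, stats)]
        for row in matrix
    ]


def hit_counts(z_row, cutoffs):
    """Count events beyond each cutoff, separately for both directions."""
    up = [sum(z >= c for z in z_row) for c in cutoffs]
    down = [sum(z <= -c for z in z_row) for c in cutoffs]
    return up, down


# One hypothetical targeting RNA with four Z-scored events.
up, down = hit_counts([2.5, -1.2, 0.3, -3.0], cutoffs=[1, 2, 3])
```

Here `up` and `down` give the number of events passing each cutoff in the positive and negative direction, which correspond to the two classes of induced events quantified separately.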





□ Tensor Decomposition Discriminates Tissues Using scATAC-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.04.502875v1.full.pdf

Tensor decomposition (TD) is applied to an scATAC-seq data set, and the obtained embedding can be used as input to UMAP. The resulting UMAP embedding can differentiate the tissues from which the scATAC-seq data were retrieved.

UPGMA (unweighted pair group method with arithmetic mean) is applied to negatively signed correlation coefficients. TD can deal with large sparse data sets binned into approximately 200 bp intervals; the number of intervals can be as high as 13,627,618, because the data can be stored in a sparse matrix format.





□ CIARA: a cluster-independent algorithm for the identification of markers of rare cell types from single-cell RNA seq data

>> https://www.biorxiv.org/content/10.1101/2022.08.01.501965v1.full.pdf

CIARA (Cluster Independent Algorithm for the identification of markers of RAre cell types) identifies potential marker genes of rare cell types by exploiting their property of being highly expressed in a small number of cells with similar transcriptomic signatures.

CIARA ranks genes based on their enrichment in local neighborhoods defined from a K-nearest neighbors (KNN) graph. The top-ranked genes have, thus, the property of being “highly localized” in the gene expression space.





□ ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac541/6655687

ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of biological process, and signaling pathway activity for transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms.

ASURAT creates sign-by-sample matrices (SSMs). SSM is analogous to a read count table, where the rows represent signs with biological meaning instead of individual genes and the values contained are “sign scores” instead of read counts.

Since ASURAT can create multivariate data (i.e., SSMs) from multiple signs, ranging from cell types to biological functions, it will be valuable to consider graphical models of signs.

A non-Gaussian Markov random field theory is one of the most promising approaches to address this problem, although it requires a large number of samples for achieving true graph edges.





□ Metheor: Ultrafast DNA methylation heterogeneity calculation from bisulfite read alignments

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500893v1.full.pdf

The main algorithmic advantage of Metheor comes from the fact that it reads through the entire BAM file only once. Reduced representation bisulfite sequencing (RRBS) predominantly targets the CpG-dense regions. This read-centric approach iterates through aligned reads.

Metheor produces methylation heterogeneity levels accurately. Metheor supports computation of local pairwise methylation discordance (LPMD). LPMD is defined as a fraction of CpG pairs within a given range of genomic distance. LPMD does not depend on the length of the sequencing read.
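A read-centric LPMD computation could look roughly like the sketch below. It assumes that the counted fraction is the share of CpG pairs with differing methylation states (as the word "discordance" suggests), and it works on in-memory lists rather than a BAM file; the data layout and function name are hypothetical.

```python
def lpmd(reads, min_dist, max_dist):
    """Fraction of same-read CpG pairs, min_dist..max_dist bp apart,
    whose methylation states differ (assumed meaning of 'discordance')."""
    discordant = 0
    total = 0
    for cpgs in reads:  # single pass over the reads
        cpgs = sorted(cpgs)  # (position, state) tuples, state is 0 or 1
        for i in range(len(cpgs)):
            for j in range(i + 1, len(cpgs)):
                dist = cpgs[j][0] - cpgs[i][0]
                if dist > max_dist:
                    break  # positions are sorted, later pairs are farther
                if dist >= min_dist:
                    total += 1
                    discordant += cpgs[i][1] != cpgs[j][1]
    return discordant / total if total else None


# Two hypothetical reads covering overlapping CpGs.
example_reads = [
    [(100, 1), (104, 0), (110, 1)],
    [(104, 1), (110, 1)],
]
example_lpmd = lpmd(example_reads, min_dist=2, max_dist=8)
```

Because pairs are restricted to a fixed distance window, the result does not change when reads get longer, which is consistent with the read-length independence noted above.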





□ Asteroid: a new minimum balanced evolution supertree algorithm robust to missing data

>> https://www.biorxiv.org/content/10.1101/2022.07.22.501101v1.full.pdf

Asteroid, a novel supertree method that infers an unrooted species tree from a set of unrooted gene trees. Asteroid is more robust to missing data than ASTRAL and ASTRID, while being several orders of magnitude faster than ASTRAL for datasets that contain thousands of genes.

Asteroid computes for each input gene tree a distance matrix based on the gene internode distance. Then, it computes a species tree from this set of distance matrices under the minimum balanced evolution principle.
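The first step, one distance matrix per gene tree, can be sketched as below. The sketch assumes that the internode distance between two leaves is the number of internal nodes on the path connecting them; the tree encoding and function name are hypothetical, and the minimum balanced evolution search itself is not shown.

```python
from collections import deque


def internode_distances(tree):
    """tree: dict mapping each node to its neighbours (unrooted tree).
    Returns {(leaf_a, leaf_b): number of internal nodes between them}."""
    leaves = sorted(node for node, nbrs in tree.items() if len(nbrs) == 1)
    distances = {}
    for src in leaves:
        edges = {src: 0}  # edge count from src, filled by breadth-first search
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nbr in tree[node]:
                if nbr not in edges:
                    edges[nbr] = edges[node] + 1
                    queue.append(nbr)
        for dst in leaves:
            if dst != src:
                # A path with k edges passes through k - 1 internal nodes.
                distances[(src, dst)] = edges[dst] - 1
    return distances


# Hypothetical quartet gene tree ((A,B),(C,D)) with internal nodes u and v.
quartet = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
quartet_distances = internode_distances(quartet)
```

Gene trees that lack some species simply yield smaller matrices, which is where robustness to missing data becomes a concern for the downstream species tree step.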





□ scMTNI: Inference of cell type-specific gene regulatory networks on cell lineages from single cell omic datasets

>> https://www.biorxiv.org/content/10.1101/2022.07.25.501350v1.full.pdf

scMTNI (single-cell Multi-Task Network Inference), a multi-task learning framework that integrates the cell lineage structure, scRNA-seq and scATAC-seq measurements to enable joint inference of cell type-specific GRNs.

scMTNI uses a novel probabilistic prior to incorporate the lineage structure and outputs GRNs for each cell type on a cell lineage. The output networks of scMTNI are analyzed using two dynamic network analysis methods: edge-based k-means clustering and topic models.





□ HAlign 3: fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences

>> https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msac166/6653123

HAlign 3 improves the time efficiency and the alignment quality. The suffix tree data structure is specifically modified to fit the nucleotide sequence: Left-child right-sibling is replaced by a K-ary tree to build the suffix tree to reach a higher common substring searching efficiency.

A global substring selection algorithm combining directed acyclic graphs with dynamic programming is adopted to screen out the unsatisfactory common substrings. These improvements make HAlign 3 a specialized program to deal with ultra-large numbers of similar DNA/RNA sequences.





□ MGREML: Multivariate estimation of factor structures of complex traits using SNP-based genomic relationships

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04835-3

MGREML estimates multivariate factor structures and performs inferences on factor models at low computational cost. It enables simple structural equation modeling, allowing users to specify, estimate, and compare genetic factor models of their choosing using SNP data.

MGREML calculates the contribution of any given block in O(T^2) time. MGREML transforms and reorders the data so that the variance matrix is block diagonal. Using a Broyden–Fletcher–Goldfarb–Shanno algorithm, it balances computational complexity and rate of convergence across iterations.





□ GE-Impute: graph embedding-based imputation for single-cell RNA-seq data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac313/6651303

GE-Impute learns the neural graph representation for each cell and reconstructs the cell–cell similarity network accordingly, which enables better imputation of dropout zeros based on the more accurately allocated neighbors in the similarity network.

GE-Impute constructs a raw cell-cell similarity network based on Euclidean distance. For each cell, it simulates a random walk of fixed length using BFS and DFS strategies.

Next, a graph embedding-based neural network is employed to train the embedding matrix for each cell based on the sampled walks. The similarity among cells can then be re-calculated from the embedding matrix to predict new link-neighbors and reconstruct the cell-cell similarity network.
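A fixed-length walk that mixes BFS-like and DFS-like behaviour is commonly implemented in the node2vec style, and the sketch below follows that scheme as an assumption; the summary above does not specify GE-Impute's exact sampling rule. The parameters `p` and `q` and the function name are hypothetical.

```python
import random


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """node2vec-style walk: small q favours moving outward (DFS-like),
    large q keeps the walk near the previous node (BFS-like)."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph[current])
        if not neighbours:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == previous:
                weights.append(1.0 / p)  # step back to the previous cell
            elif nxt in graph[previous]:
                weights.append(1.0)      # stay in the local neighbourhood
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


# Hypothetical cell-cell similarity network as an adjacency dict of sets.
cell_graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
example_walk = biased_walk(cell_graph, start=0, length=10, p=2.0, q=0.5,
                           rng=random.Random(0))
```

Walks sampled this way would then serve as the "sentences" for training cell embeddings, in the same way node2vec feeds walks to a skip-gram model.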





□ DeepST: A versatile graph contrastive learning framework for spatially informed clustering, integration, and deconvolution of spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.08.02.502407v1.full.pdf

Spatial contrastive self-supervised learning enables the learned spatial spot representation to be more informative and discriminative by minimizing the embedding distance between spatially adjacent spots and vice versa.

DeepST learns a mapping matrix to project the scRNA-seq data into the ST space based on their learned features via a contrastive learning mechanism where the similarities of spatially neighboring spots are maximized and those of spatially non-neighboring spots are minimized.





□ Exploring Phylogenetic Classification and Further Applications of Codon Usage Frequencies

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500846v1.full.pdf

GridSearchCV was used to search over hyperparameters. Using the sparse categorical crossentropy loss function, the Adam optimizer, 5-fold CV, 15 epochs, and a validation split of 0.1, the code chose the number of layers, the number of neurons in each layer, and the L2 penalty for regularization.





□ A quaternion model for single cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501020v1.full.pdf

Quaternions are four dimensional hypercomplex numbers that, along with real numbers, complex numbers and octonions, represent one of the four normed division algebras.

The quaternion associated with each cell represents a vector in R^3 with vector length capturing sequencing depth and vector direction capturing the relative expression profile.

The proposed scRNA-seq quaternion model enables the spectral analysis of scRNA-seq data relative to a single variable (e.g., pseudo-time) or two variables to be performed on a genome-wide basis by using a one- or two-dimensional hypercomplex Fourier transformation.





□ MCPNet: A parallel maximum capacity-based genome-scale gene network construction framework

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500603v1.full.pdf

MCP Score, a novel maximum-capacity-path based metric to quantify the relative strengths of direct and indirect gene-gene interactions. MCPNet, an efficient, parallelized GRN reconstruction software that can scale to hundreds of cores.

The maximum capacity of all s-t length-L paths can be computed via recursive path bisection. Recursive path bisection allows the maximum capacity to be computed in O(|V| log2 L) time for a single gene-gene pair, and the long-range DPI scores for all gene pairs to be computed in O(|V|^3 log2 L) time.
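The bisection idea can be illustrated with a max-min matrix product: the best length-L path from s to t splits at some midpoint k into two shorter paths, and its capacity is the smaller of the two halves. The sketch below is a plain, single-threaded version under that assumption; function names are hypothetical, and paths are allowed to revisit nodes, which may differ from MCPNet's definition.

```python
def maxmin_product(a, b):
    """C[s][t] = max over k of min(a[s][k], b[k][t])."""
    n = len(a)
    return [
        [max(min(a[s][k], b[k][t]) for k in range(n)) for t in range(n)]
        for s in range(n)
    ]


def max_capacity(weights, length):
    """Maximum, over all paths with `length` edges, of the minimum edge weight."""
    cache = {1: weights}

    def solve(l):
        if l not in cache:
            half = l // 2
            # Bisect the path: first `half` edges, then the remaining edges.
            cache[l] = maxmin_product(solve(half), solve(l - half))
        return cache[l]

    return solve(length)


# Hypothetical symmetric edge weights (e.g. mutual information) for 3 genes.
W = [
    [0.0, 0.9, 0.2],
    [0.9, 0.0, 0.5],
    [0.2, 0.5, 0.0],
]
cap2 = max_capacity(W, 2)
cap3 = max_capacity(W, 3)
```

In the example, the direct edge between genes 0 and 2 has weight 0.2, while the two-edge path through gene 1 has capacity 0.5. Only O(log L) distinct sub-lengths are ever solved, which is where the log2 L factor comes from.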





□ LanceOtron: a deep learning peak caller for genome sequencing experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac525/6648462

LanceOtron combines deep learning for recognizing peak shape with multifaceted enrichment calculations for assessing significance. In benchmarking ATAC-seq, ChIP-seq, and DNase-seq, LanceOtron outperforms long-standing peak callers through its near perfect sensitivity.

LanceOtron uses the relationship between the number of overlapping reads and their relative positions at all 2,000 points, returning a shape score. A multilayer perceptron combines the CNN and logistic regression models to produce an overall peak quality metric called Peak Score.





□ SpatialSort: A Bayesian Model for Clustering and Cell Population Annotation of Spatial Proteomics Data

>> https://www.biorxiv.org/content/10.1101/2022.07.27.499974v1.full.pdf

SpatialSort can account for the affinities of cells of different types to neighbour in space. By incorporating prior information about expected cell populations, SpatialSort is able to improve clustering accuracy and perform automated annotation of clusters.

SpatialSort models cell labels using a Hidden Markov Random Field (HMRF). SpatialSort takes the cell location and neighbour relations to construct sample-specific cell connectivity graphs that link cells that are spatially proximal.





□ Deep R-looper Discriminant: Cell-type-specific aberrant R-loop accumulation regulates target gene and confers cell-specificity

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500727v1.full.pdf

Deep R-looper Discriminant, a deep neural network-based framework for extracting features automatically from epigenetic marks in genome bins around TSS and TTS and identifying aberrant R-loops against normal R-loops.

Deep R-looper Discriminant adopts GridSearchCV to automate the tuning of hyperparameters for these baseline models and finally obtains optimized k-nearest neighbors (KNN), linear discriminant analysis (LDA), logistic regression (LR), naive Bayes (NB), and random forests (RF).





□ HAT: Haplotype Assembly Tool using short and error-prone long reads

>> https://www.biorxiv.org/content/10.1101/2022.07.20.500775v1.full.pdf

HAT, a haplotype assembly tool that exploits short and long reads along with a reference genome to reconstruct haplotypes. HAT tries to take advantage of the accuracy of short reads and the length of the long reads to reconstruct haplotypes.

HAT comprises 3 components - initialization, iteration and assembly. Initialization creates the first phased blocks. The iteration expands the phased blocks and finds alleles of all haplotypes. Then, HAT clusters the reads, and assembles haplotypes using these clustered reads.





□ scDEC-Hi-C: Deep generative modeling and clustering of single cell Hi-C data

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500573v1.full.pdf

scDEC-Hi-C is a novel end-to-end deep learning framework for analyzing single cell Hi-C data using a multi-stage model. scDEC-Hi-C consists of a chromosome-wise autoencoder (AE) model and a cell-wise deep embedding and clustering model (scDEC).

Note that all baseline methods are only able to learn the embedding for each single cell and require additional clustering methods (e.g., K-means), while scDEC-Hi-C simultaneously learns cell embeddings and assigns clustering labels to each cell.





□ Accelerating genomic workflows using NVIDIA Parabricks

>> https://www.biorxiv.org/content/10.1101/2022.07.20.498972v1.full.pdf

Up to 65x acceleration was achieved, bringing HaplotypeCaller runtime down from 36 hours to 33 minutes on AWS, 35 minutes on GCP, and 24 minutes on the NVIDIA DGX.

Alternatively, somatic variant callers achieved speedups up to 56.8x for the Mutect2 algorithm, but surprisingly, did not scale linearly with the number of GPUs, emphasizing the need for algorithmic benchmarking before embarking on large-scale projects.







□ BiGCARP: Deep self-supervised learning for biosynthetic gene cluster detection and product classification

>> https://www.biorxiv.org/content/10.1101/2022.07.22.500861v1.full.pdf

Biosynthetic Gene CARP (BiGCARP) represents BGCs as chains of functional protein domains, and uses ESM-1b, a protein masked language model, to obtain pretrained embeddings of functional protein domains with amino acid-level context.

A convolutional masked language model is trained on these domains to develop meaningful learned representations of BGCs and their constituent domains. BiGCARP-random is initialized with a random Pfam embedding.





□ BWA-MEME: BWA-MEM emulated with a machine learning approach

>> https://academic.oup.com/bioinformatics/article-abstract/38/9/2404/6543607

BWA-MEME, the first full-fledged short read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding.

BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that solves the challenges in utilizing learned indices for SMEM search which is extensively used in the seeding phase.





□ ATAC-STARR-seq reveals transcription factor-bound activators and silencers across the chromatin accessible human genome

>> https://genome.cshlp.org/content/early/2022/07/18/gr.276766.122

A new workflow that substantially expands the capabilities of ATAC-STARR-seq to extract and measure gene regulatory information. This workflow identifies both activators and silencers, while simultaneously profiling chromatin accessibility and performing TF footprinting.

A modified tagmentation protocol (Omni-ATAC) is adapted to remove mitochondrial DNA from the DNA fragment pool.

The re-isolation of plasmid DNA recovers only the ATAC-STARR-seq plasmids that were successfully transfected, thus providing a more accurate representation of the “input” sample than sequencing without transfection.





□ SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac510/6651099

The pivotal blocks in the SECEDO pipeline are a Bayesian filtering strategy for efficient identification of relevant loci and derivation of a global cell-to-cell similarity matrix utilizing both the structure of reads and the haplotype phasing.





□ epiConv: Joint analysis of scATAC-seq datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04858-w

epiConv is capable of aligning low-depth scATAC-Seq from co-assay data (simultaneous profiling of transcriptome and chromatin) onto high-quality ATAC-seq reference and increasing the resolution of chromatin profiles of co-assay data.

epiConv directly calculates the similarities between cells without embedding them into the latent feature space. epiConv can be used to integrate cells from different biological conditions, which reveals hidden cell populations that would otherwise be undetectable.





□ BMRF: Probabilistic Edge Inference of Gene Networks with Bayesian Markov Random Field Modelling

>> https://www.biorxiv.org/content/10.1101/2022.07.30.501645v1.full.pdf

This method combines the Bayesian Markov Random field model and conditional autoregressive model for the relationship between gene nodes. This analysis can evaluate the relative strength of the edges and further prioritize the edges of interest.

The proposed BMRF model was compared with M&B, Glasso, SPACE, and CLIME, as well as with the Bayesian approach BDgraph using the Bayesian model averaging procedure (denoted as BD_BMA) or the maximum a posteriori probability procedure.





□ HiCAT: A tool for automatic annotation of centromere structure

>> https://www.biorxiv.org/content/10.1101/2022.08.07.502881v1.full.pdf

HiCAT, a generalizable automatic centromere annotation tool, based on hierarchical tandem repeat mining and maximization of tandem repeat coverage to facilitate decoding of centromere architecture.

HiCAT transforms a centromere DNA sequence into a block list based on an input monomer template. HiCAT defines a similarity score based on the block edit distance to obtain a block similarity matrix. HiCAT detects LN-HORs using the Hierarchical Tandem Repeat Mining.





Stiria.

2022-07-31 23:55:57 | Science News




□ TrEMOLO: Accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches

>> https://www.biorxiv.org/content/10.1101/2022.07.21.500944v1.full.pdf

Transposable Element MOvement detection using LOng-reads (TrEMOLO) combines the advantages offered by LR sequencing (i.e., highly contiguous assembly and unambiguous mapping) to identify TE insertion (and deletion) variations, for TE detection and frequency estimation.

TrEMOLO's accuracy in TE identification and its TSD detection system allow the insertion site to be predicted within a 2-base pair window. Assemblers provide the most frequent haplotype at each locus, and thus an assembly represents just the "consensus" of all haplotypes at each locus.





□ Causal identification of single-cell experimental perturbation effects with CINEMA-OT

>> https://www.biorxiv.org/content/10.1101/2022.07.31.502173v1.full.pdf

CINEMA-OT (Causal INdependent Effect Module Attribution + Optimal Transport) separates confounding sources of variation from perturbation effects to obtain an optimal transport matching that reflects counterfactual cell pairs.

The algorithm is based on a causal inference framework for modeling confounding signals and conditional perturbation. CINEMA-OT can attribute divergent treatment effects to either explicit confounders, or latent confounders by cluster-wise coarse-graining of the matching matrix.






□ AIFS: A novel perspective, Artificial Intelligence infused wrapper based Feature Selection Algorithm on High Dimensional data analysis

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501053v1.full.pdf

AIFS creates a Performance Prediction Model (PPM) using artificial intelligence (AI) which predicts the performance of any feature set and allows wrapper-based methods to predict and evaluate the feature subset model performance without building the actual model.

AIFS can identify both marginal features and interaction terms without using interaction terms in PPM, which could be critical in reducing the feature space an algorithm has to process.





□ MVCPM: Multiview clustering of multi-omics data integration by using a penalty model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04826-4

MVCPM has the highest silhouette score for common clusters and the average silhouette score. MVCPM provides more detailed information within each data type, is better for integrating different types of omics data and simultaneously has consistent and differential cluster patterns.

MVCPM can be considered the best approach for integration and clustering. MVCPM uses k-NN to assign patients that are originally clustered into different clusters into one cluster and computes silhouette scores. MVCPM determines the significance of the difference in survival times.





□ Hybrid Rank Aggregation (HRA): A novel rank aggregation method for ensemble-based feature selection

>> https://www.biorxiv.org/content/10.1101/2022.07.21.501057v1.full.pdf

The ensemble-based feature selection (EFS) approach relies on using a single RA algorithm to pool feature performance and select features. However, a single RA algorithm may not always give optimal performance across all datasets.

A novel hybrid rank aggregation (HRA) method allows creation of an RA matrix which contains feature performance or importance in each RA technique, followed by an unsupervised learning-based selection of features based on their performance/importance in the RA matrix.





□ Benchmarking long-read RNA-sequencing analysis tools using in silico mixtures

>> https://www.biorxiv.org/content/10.1101/2022.07.22.501076v1.full.pdf

ONT long reads from pure RNA samples were used for isoform detection using bambu, FLAIR, FLAMES, SQANTI3, StringTie2 and TALON. Both pure RNA samples and in silico mixture samples were mapped against the GENCODE human annotation and sequins annotation.

This in silico mixture strategy provides extra levels of ground truth without extra cost. The transcript-level count matrix was used as input to downstream steps such as DTE (DESeq2, EBSeq, edgeR, limma, NOISeq) and DTU (DEXSeq, DRIMSeq, edgeR, limma and satuRn).





□ ccImpute: an accurate and scalable consensus clustering based algorithm to impute dropout events in the single-cell RNA-seq data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04814-8

ccImpute has a polynomial runtime that compares favorably to imputation algorithms with polynomial (DrImpute, DCA, DeepImpute) and exponential runtime (scImpute).

ccImpute relies on a consensus matrix to approximate how likely a given pair of cells is to be clustered together and considered to be of the same type. It could also benefit from applying mini-batch K-means and from using a more efficient centroid selection scheme than random restarts.
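A consensus matrix of this kind can be sketched as the fraction of clustering runs in which two cells receive the same label. The sketch assumes the cluster labels from several runs are already available; how ccImpute generates those runs and how it uses the matrix for imputation is not shown, and the function name is hypothetical.

```python
def consensus_matrix(labelings):
    """labelings: list of clustering runs, each a list of cluster labels
    (one per cell). Entry [i][j] is the share of runs that co-cluster i and j."""
    n_cells = len(labelings[0])
    n_runs = len(labelings)
    return [
        [
            sum(labels[i] == labels[j] for labels in labelings) / n_runs
            for j in range(n_cells)
        ]
        for i in range(n_cells)
    ]


# Three hypothetical clustering runs over three cells.
runs = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]
consensus = consensus_matrix(runs)
```

Entries close to 1 mark pairs of cells that are consistently clustered together and are therefore plausible donors of information for each other's dropout values.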





□ CMIC: an efficient quality score compressor with random access functionality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04837-1

CMIC (classification, mapping, indexing and compression), an adaptive compressor for lossless compression that supports random access. In terms of random access speed, CMIC is faster than LCQS.

The algorithm realizes the parallelization of the compression process by using SIMD. CMIC makes full use of the correlation between adjacent quality scores and improves the efficiency of context modeling entropy encoding.





□ orsum: a Python package for filtering and comparing enrichment analyses using a simple principle

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04828-2

Filtering in orsum is based on a simple principle: a term is discarded if there is a more significant term that annotates at least the same genes; the remaining more significant term becomes the representative term for the discarded term.

The inputs for orsum are enrichment analysis results containing term IDs ordered by statistical significance and a Gene Matrix Transposed (GMT) file. This makes it possible to use the same annotations as the ones used in the enrichment analysis.
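The filtering principle can be sketched in a few lines, assuming the ranked term IDs and the term-to-gene annotations (as they would be parsed from a GMT file) are already loaded. The function name and data are hypothetical and this is not orsum's actual code.

```python
def filter_terms(ranked_terms, annotations):
    """ranked_terms: term IDs, most significant first.
    annotations: dict term ID -> set of annotated genes.
    Returns {representative term: [terms it replaces]}."""
    representatives = {}
    for term in ranked_terms:
        genes = annotations[term]
        for rep in representatives:
            # A more significant term covers at least the same genes.
            if genes <= annotations[rep]:
                representatives[rep].append(term)
                break
        else:
            representatives[term] = []
    return representatives


# Hypothetical enrichment result and annotations.
ranked = ["T1", "T2", "T3", "T4"]
gene_sets = {
    "T1": {"a", "b", "c"},
    "T2": {"a", "b"},
    "T3": {"c", "d"},
    "T4": {"d"},
}
summary = filter_terms(ranked, gene_sets)
```

In this example T2 is folded into T1 and T4 into T3, so only T1 and T3 remain as representative terms.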





□ dRFEtools: Dynamic recursive feature elimination for omics

>> https://www.biorxiv.org/content/10.1101/2022.07.27.501227v1.full.pdf

Dynamic recursive feature elimination (RFE) decreases computational time compared to the current RFE function available with scikit-learn, while maintaining high accuracy in simulated data for both classification and regression models.

Dynamic RFE analysis is based on the random forest algorithm with Out-of-Bag scoring and 100 estimators (n_estimators), as with the simulation data. StratifiedKFold is used to generate cross-validation folds for all scenarios to maintain even distribution of patient diagnosis across folds.





□ McAN: an ultrafast haplotype network construction algorithm

>> https://www.biorxiv.org/content/10.1101/2022.07.23.501111v1.full.pdf

McAN, a minimum-cost arborescence based haplotype network construction algorithm that considers mutation spectrum history (mutations in an ancestral haplotype should be contained in the descendant haplotype), node size and sampling time.

McAN calculates distances between adjacent haplotypes instead of any two haplotypes. All haplotypes are sorted by mutation count and sequence count in descending order and the earliest sampling time in ascending order. The closest ancestor is determined for each haplotype by minimizing this distance.
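The sorting and ancestor assignment can be sketched as follows. This is a simplified illustration, assuming each haplotype is given as a mutation set with a sequence count and an earliest sampling time, and assuming the distance is the number of extra mutations in the descendant; it scans all later haplotypes rather than only adjacent ones, so it does not reflect McAN's speed optimisations. All names are hypothetical.

```python
def closest_ancestors(haplotypes):
    """haplotypes: dict name -> (mutation set, sequence count, earliest time).
    Returns the sorted order and a child -> closest ancestor mapping."""
    order = sorted(
        haplotypes,
        key=lambda h: (-len(haplotypes[h][0]), -haplotypes[h][1], haplotypes[h][2]),
    )
    parent = {}
    for idx, child in enumerate(order):
        child_mutations = haplotypes[child][0]
        best = None
        # Later entries have no more mutations than the child.
        for candidate in order[idx + 1:]:
            cand_mutations = haplotypes[candidate][0]
            if cand_mutations < child_mutations:  # proper subset = possible ancestor
                distance = len(child_mutations - cand_mutations)
                if best is None or distance < best[0]:
                    best = (distance, candidate)
        parent[child] = best[1] if best else None
    return order, parent


# Hypothetical haplotypes: (mutations, sequence count, earliest sampling day).
haps = {
    "H0": (set(), 50, 1),
    "H1": ({"A1T"}, 20, 2),
    "H2": ({"A1T", "C5G"}, 5, 3),
    "H3": ({"G9A"}, 8, 4),
}
hap_order, hap_parent = closest_ancestors(haps)
```

The resulting parent links form a rooted tree over haplotypes, which is the shape an arborescence-based network would take.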





□ SparkGC: Spark based genome compression for large collections of genomes

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04825-5

SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression.

Because SparkGC is a lossless genome compression method, the auxiliary data of the to-be-compressed sequence cannot be lost.

The compression algorithm is deployed on the master node, but the scheduling mechanism of Spark migrates the computing tasks to the nodes closest to the data, so the compression tasks are scheduled to worker nodes.





□ ColocQuiaL: A QTL-GWAS colocalization pipeline

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac512/6650620

ColocQuiaL automates the execution of COLOC to perform colocalization analyses between GWAS signals for any trait of interest and single-tissue eQTL and sQTL signals.

The input loci to ColocQuiaL can be a single GWAS locus, a list of GWAS loci of interest, or just the summary statistics across the entire genome.





□ Canary: an automated tool for the conversion of MaCH imputed dosage files to PLINK files

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04822-8

Canary uses singularity container technology to allow users to automatically convert these MaCH files into PLINK compatible files. Canary is a singularity container which comes w/ many preinstalled software, incl. dose2plink.c, which allows users to use directly on any system.

The convert-mac module of Canary deals with a single sub-study at a time. Canary combines the consent groups by merging their per-chromosome dose files, i.e., consent group 1 chromosome 1 with consent group 2 chromosome 1.





□ Haisu: Hierarchically supervised nonlinear dimensionality reduction

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010351

Haisu is a generalizable extension to nonlinear dimensionality reduction for visualization that incorporates an input hierarchy to influence a resulting embedding.

Haisu mirrors the limitations of the integrated NLDR approach spatially and temporally. Haisu formulates a direct relationship between the distance of two graph nodes in the hierarchy and the resulting pairwise distance in high-dimensional space.





□ CGAN-Cmap: protein contact map prediction using deep generative adversarial neural networks

>> https://www.biorxiv.org/content/10.1101/2022.07.26.501607v1.full.pdf

CGAN-Cmap is constructed via integration of a modified squeeze excitation residual neural network (SE-ResNet), SE-Concat, and a conditional GAN.

CGAN-Cmap uses a dynamic weighted binary cross-entropy (BCE) loss function, which assigns a dynamic weight for classes based on the ratio of the uncontacted class to the contacted class in each iteration.
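A minimal sketch of a dynamically weighted BCE, assuming the weight is applied to the contacted (positive) class and recomputed from each batch as the ratio of uncontacted to contacted labels. The function name and the exact weighting scheme are assumptions for illustration and may differ from the loss used in CGAN-Cmap.

```python
import math


def dynamic_weighted_bce(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy with a per-batch weight on the positive class."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Weight is recomputed on every call, i.e. for every batch/iteration.
    pos_weight = n_neg / max(n_pos, 1)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

With one contact among four residue pairs, the single positive term is weighted three times as heavily as each negative term, which counteracts the sparsity of contacts in a contact map.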





□ JBrowse 2: A modular genome browser with views of synteny and structural variation

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501447v1.full.pdf

JBrowse 2 retains the core features of the open-source JavaScript genome browser JBrowse while adding new views for synteny, dotplots, breakpoints, gene fusions, and whole-genome overviews.

JBrowse 2 features several specialized synteny views, incl. the Dotplot View and the Linear Synteny View. These views can display data from Synteny Tracks, which themselves can load data from formats including MUMmer, minimap2, MashMap, UCSC chain files, and MCScan.





□ HyMSMK: Integrate multiscale module kernel for disease-gene discovery in biological networks

>> https://www.biorxiv.org/content/10.1101/2022.07.28.501869v1.full.pdf

HyMSMK is a novel family of hybrid methods for disease-gene discovery that integrates a multiscale module kernel (MSMK) derived from a multiscale module profile (MSMP).

HyMSMK extracts the MSMP, which carries local-to-global structural information, by multiscale modularity optimization with exponential sampling. It then constructs the MSMK by using the MSMP as a feature matrix, combined with the relative information content of features and kernel sparsification.





□ Graphia: A platform for the graph-based visualisation and analysis of high dimensional data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010310

Graph layout is an iterative process. Many programs only display the results of a layout algorithm after it has run a defined number of iterations. With Graphia, the layout is shown live, such that graphs ‘unfold’ in real time.

Core to Graphia’s functionality is support for the calculation of correlation matrices from any tabular matrix of continuous or discrete values, whereupon the software is designed to rapidly visualise the often very large graphs that result in 2D or 3D space.





□ Cookie: Selecting Representative Samples From Complex Biological Datasets Using K-Medoids Clustering

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.954024/full

Cookie can efficiently select the most representative samples from a massive single-cell population with diverse properties. The method vectorizes all given properties and quantifies the relationships/similarities among samples using their Manhattan distances.

Cookie determines an appropriate sample size by evaluating the coverage of key properties across multiple candidate sizes. It then applies k-medoids clustering to group the samples into clusters and selects the center of each cluster as its most representative sample.
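The clustering step can be sketched as a small k-medoids loop over Manhattan distances. This is a simplified illustration rather than Cookie's own code: the function names, the deterministic "first k samples" initialisation, and the alternating update rule are assumptions made for the example.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two property vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, medoids):
    """Assign every sample index to its nearest medoid index."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: manhattan(p, points[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points, k, max_iter=100):
    """Return the indices of k representative samples (medoids)."""
    medoids = list(range(k))  # assumed initialisation: the first k samples
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep an empty cluster's medoid unchanged
                new_medoids.append(m)
                continue
            # The new medoid minimises the total distance to its cluster members.
            new_medoids.append(min(
                members,
                key=lambda c: sum(manhattan(points[c], points[j]) for j in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)
```

Unlike k-means centroids, the returned medoids are actual samples, which is what makes them usable as "representatives" of the population.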





□ FLAIR-fusion: Detection of alternative isoforms of gene fusions from long-read RNA-seq

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502364v1.full.pdf

FLAIR-fusion can detect simulated fusions and their isoforms with high precision and recall even with error-prone reads. This tool is able to do splice site correction of all reads, gather chimeric reads, and then apply a number of specific filters to identify true fusion reads.

FLAIR-fusion identifies the isoforms at each locus involved in a fusion, then combines those to identify full-length fusion isoforms matched across the fusion breakpoint.





□ sc-SHC: Significance Analysis for Clustering with Single-Cell RNA-Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2022.08.01.502383v1.full.pdf

Over-clustering can be particularly insidious because clustering algorithms will partition data even in cases where there is only uninteresting random variation present.

sc-SHC extends Significance of Hierarchical Clustering (SHC), a method for Gaussian data, into a model-based hypothesis testing approach that incorporates significance analysis into the clustering algorithm and permits statistical evaluation of clusters as distinct cell populations.





□ SPA: Optimal Sparsity Selection Based on an Information Criterion for Accurate Gene Regulatory Network Inference

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.855770/full

SPA, a sparsity selection algorithm that is inspired by the AIC and BIC in terms of introducing a penalty term to the goodness of fit, but is developed particularly for GRN inference to identify the most mathematically optimal and accurate GRN.


SPA takes a set of inferred GRNs with varying sparsities, the measured gene expression in fold changes, and the perturbation design as input. It then uses the GRN Information Criterion (GRNIC) and identifies the GRN that minimizes GRNIC as the best GRN.





□ EI: Integrating multimodal data through interpretable heterogeneous ensembles

>> https://www.biorxiv.org/content/10.1101/2020.05.29.123497v3.full.pdf

Existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. Early approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information.

Ensemble Integration (EI) infers local predictive models from the individual data modalities using appropriate algorithms, and uses effective heterogeneous ensemble algorithms to integrate these local models into a global predictive model.





□ BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02734-7

BASS (Bayesian Analytics for Spatial Segmentation) performs multi-scale transcriptomic analyses in the form of joint cell type clustering and spatial domain detection, with the two analytic tasks carried out simultaneously within a Bayesian hierarchical modeling framework.

BASS is capable of multi-sample analysis that jointly models multiple tissue sections/samples, facilitating the integration of spatial transcriptomic data across tissue samples.





□ Cogito: automated and generic comparison of annotated genomic intervals

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04853-1

Cogito “COmpare annotated Genomic Intervals TOol” provides a workflow for an unbiased, structured overview and systematic analysis of complex genomic datasets consisting of different data types (e.g. RNA-seq, ChIP-seq) and conditions.

Cogito can visualize key information from genomic or epigenomic interval-based data. Within Cogito, gene expression in reads per kilobase per million mapped reads (RPKM) from RNA-seq and Homer ChIP-seq peak scores were interpreted as rational values.





□ DBFE: Distribution-based feature extraction from structural variants in whole-genome data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac513/6656344

The core contributions of DBFE include: (1) strategies for determining features using variant length binning, clustering, and density estimation; (2) a programming library for automating distribution-based feature extraction in machine learning pipelines.

DBFE uses an approach based on Kernel Density Estimation. DBFE can be applied to other variant types (e.g., small insertions/deletions). One would possibly need to limit the range of lengths taken into account and analyze distributions on a linear rather than a logarithmic scale.
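As a rough illustration of the idea, a Gaussian kernel density estimate over log-scaled variant lengths can be evaluated on a fixed grid to give a fixed-length feature vector per sample. The function names, the bandwidth, and the grid-evaluation step are assumptions for this sketch and are not DBFE's actual API.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from 1-D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def length_features(lengths, grid, bandwidth=0.3):
    """Evaluate the KDE of log10(variant length) at each grid point."""
    logs = [math.log10(length) for length in lengths]
    density = gaussian_kde(logs, bandwidth)
    return [density(g) for g in grid]
```

For small insertions/deletions, the `math.log10` transform could be dropped so that the density is estimated on a linear scale, as the paragraph above suggests.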





□ ChromTransfer: Transfer learning reveals sequence determinants of regulatory element accessibility

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502903v1.full.pdf

The ENCODE rDHSs were assembled using consensus calling from 93 million DHSs called across a wide range of human cell lines, cell types, cellular states, and tissues, and are therefore likely capturing the great majority of possible sequences associated with human open chromatin.

ChromTransfer is a transfer learning scheme for single-task modeling of the DNA sequence determinants of regulatory element activities. ChromTransfer uses a cell-type agnostic model of open chromatin regions across human cell types to fine-tune models for specific tasks.





□ Detecting boolean asymmetric relationships with a loop counting technique and its implications for analyzing heterogeneity within gene expression datasets

>> https://www.biorxiv.org/content/10.1101/2022.08.04.502792v1.full.pdf

The paper presents a very general method for detecting biclusters within gene-expression data that involve subsets of genes enriched for ‘boolean-asymmetric’ relationships (BARs).

This strategy can use any method that finds BSR-biclusters; for demonstration, the authors use the LCLR method. They combine the column-splitting technique with the LCLR algorithm to form what they call the Loop Counting Asymmetric algorithm.





□ matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502985v1.full.pdf

matchRanges is a propensity score-based covariate matching method for the efficient generation of matched null ranges from a set of background ranges. The matchRanges function takes as input a “focal” set of data to be matched and a “pool” set of background ranges to select from.

matchRanges performs subset selection based on the provided covariates and returns a null set of ranges whose covariate distributions match those of the focal set. This allows for an unbiased comparison between features of interest in the focal and matched sets without confounding by the matched covariates.





□ RNA-Bloom2: Reference-free assembly of long-read transcriptome sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.08.07.503110v1.full.pdf

RNA-Bloom2 extends support for reference-free transcriptome assembly of bulk RNA long sequencing reads. RNA-Bloom2 offers both memory- and time-efficient assembly by utilizing digital normalization of long reads with strobemers.

RNA-Bloom2 assemblies have higher BUSCO completeness than input reads and a RATTLE assembly. A portion of our assembled transcripts have split alignments across genome scaffolds, but the majority of them are supported by paired-end short reads.





□ Improved prediction of gene expression through integrating cell signalling models with machine learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04787-8

An approach to integration is to augment ML with similarity features computed from cell signalling models. Each set of features was in turn used to learn multi-target regression models. All the features have significantly improved accuracy over the baseline model.

The baseline model is a random forest model trained as Multi-target regressor stacking (MTRS) without the extra features generated from graph processing. This implementation directly combines the predictions without using an extra meta model.





□ Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.910439/full

The method uses transfer learning together with Kullback-Leibler (KL) divergence to train DNNs that complete DNA methylome profiles with extremely low coverage by leveraging profiles with higher coverage.

It employs a hybrid network architecture adapted from DeepCpG, a mixture of a convolutional neural network and a recurrent neural network. The CNN learns predictive DNA sequence patterns and the RNN exploits the known methylation state of neighboring CpGs in the target profile.





□ PWCoCo: Pair-wise Conditional and Colocalisation: An efficient and robust tool for colocalisation

>> https://www.biorxiv.org/content/10.1101/2022.08.08.503158v1.full.pdf

PWCoCo performs conditional analyses to identify independent signals for the two tested traits in a genomic region and then conducts colocalisation of each pair of conditionally independent signals for the two traits using summary-level data.

This allows the stringent single-variant assumption to hold for each pairwise colocalisation analysis. The computational efficiency of PWCoCo is better than that of colocalisation with Sum of Single Effects Regression using Summary Stats, with greater gains in efficiency for analysis.








Hotel Monterey.

2022-07-31 23:52:44 | Hotels


□ Hotel Monterey

>> https://www.hotelmonterey.co.jp/en/sendai/

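Stayed at Hotel Monterey. The interior and furnishings, themed on Prague, the old Central European capital, are full of atmosphere; the hotel even has a natural hot spring, and the price is reasonable. It is one of the hotels I would definitely like to revisit.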

Hotel Monterey's Wedding Chapel, ”Vaceslav”, is a dignified Romanesque-style space that faithfully recreates the architecture of Prague. We have decided to hold our wedding here 😇















MacBook Air (M2) - Midnight.

2022-07-31 23:51:01 | Digital / Internet


□ MacBook Air (M2) - Midnight

>> https://www.apple.com/macbook-air-m2/

System configuration
Apple M2 chip with 8-core CPU, 10-core GPU, and 16-core Neural Engine
16GB unified memory
35W Dual USB-C Port Compact Power Adapter
512GB SSD storage

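The CTO (configure-to-order) model of the M2 MacBook Air - Midnight (16GB unified memory, 512GB SSD storage) has arrived. I have been an Air user for a long time, so it did not quite strike me as "light as a feather!", but it is reassuring that this chassis can power through heavy workloads on the M2.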











DENON DHT-S517

2022-07-31 07:13:31 | Digital / Internet


□ 『DENON DHT-S517』

>> https://www.denon.jp/ja-jp/shop/denonapac-hometheatresystems_ap/dhts517

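Bought a sound bar (3.1.2ch) with built-in Dolby Atmos enabled speakers. It is apparently so popular that it is hard to get hold of, and as a replacement for my broken HomePod it is more than sufficient in features and quality. It can also output music in Dolby Atmos from the Apple TV 4K. I am already enveloped in blissful sound 🔉😇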



□ ENIGMA / “The Platinum Collection” 【Dolby Atmos】

>> https://music.apple.com/album/the-platinum-collection/713169072

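ENIGMA's music can be heard in Dolby Atmos! ”The Platinum Collection” on Apple Music is the only Enigma release that has been remastered in Dolby Atmos. The immersive sound of the DENON DHT-S517 and ENIGMA's musical world make an excellent combination. The PURE mode, which bypasses the DSP, also delivers outstanding sonic resolution. 🔉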





Jurassic World Dominion.

2022-07-31 01:10:37 | Movies


□ 『Jurassic World Dominion』

>> https://www.jurassicworld.com

Release Year: 2022
Directed by Colin Trevorrow
Cast: Chris Pratt, Bryce Dallas Howard


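『Jurassic World Dominion』 (IMAX 3D). As befits the final installment, the setting abruptly scales up.
That said, the car chase with the raptors in Malta
is one of the series' greatest highlights, unlike anything experienced before.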

Extinction or coexistence.

The warnings of Dr. Malcolm, Crichton's mouthpiece,
carry even more weight in our present day.

















Conjugate.

2022-07-17 19:13:37 | Science News




□ LANTERN: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive power

>> https://www.pnas.org/doi/10.1073/pnas.2114021119

LANTERN is a hierarchical Bayesian model that distills genotype–phenotype landscape (GPL) measurements into a low-dimensional feature space. LANTERN captures the nonlinear effects of epistasis through a multidimensional, nonparametric Gaussian Process model.

LANTERN predicts the position of a variant in the latent mutational effect space as a linear combination of mutation effect vectors with an unknown matrix. LANTERN facilitates discovery of fundamental mechanisms in GPLs, while extrapolating to unexplored regions of genotypic space.





□ psupertime: supervised pseudotime analysis for time-series single-cell RNA-seq data

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i290/6617492

psupertime is a supervised pseudotime approach based on a regression model. It identifies genes that vary coherently along a time series, and it provides pseudotime values for individual cells as well as a classifier that can be used to estimate labels for new data with unknown or differing labels.

psupertime is based on penalized ordinal regression, a statistical technique used where data have categorical labels that follow a sequence. A pseudotime value for each individual cell is obtained by multiplying the log gene expression values by the vector of coefficients.
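The last step, turning fitted coefficients into a per-cell pseudotime, is just a dot product. The sketch below assumes the expression values are already log-transformed and that the coefficient vector has been fitted elsewhere; the function name and the toy numbers are illustrative only.

```python
def pseudotime(log_expression, coefficients):
    """Project one cell's log gene expression onto the coefficient vector."""
    return sum(beta * x for beta, x in zip(coefficients, log_expression))


# Toy example: three genes, two cells (hypothetical values).
coefficients = [0.5, 0.0, -1.0]   # a zero coefficient drops that gene
cells = {
    "cell1": [1.0, 0.0, 2.0],
    "cell2": [3.0, 1.0, 0.5],
}
ordering = sorted(cells, key=lambda c: pseudotime(cells[c], coefficients))
```

Because the regression is penalized, many coefficients are expected to be zero, so only a subset of genes contributes to the ordering.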





□ scDREAMER: atlas-level integration of single-cell datasets using deep generative model paired with adversarial classifier

>> https://www.biorxiv.org/content/10.1101/2022.07.12.499846v1.full.pdf

scDREAMER can overcome critical challenges including the presence of skewed cell types among batches, nested batch effects, large number of batches and conservation of development trajectory across different batches.

scDREAMER employs a novel adversarial variational autoencoder for inferring the latent cellular embeddings from the high-dimensional gene expression matrices from different batches. scDREAMER is trained using evidence lower bound and Bhattacharyya loss.





□ scSTEM: clustering pseudotime ordered single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02716-9


scSTEM uses one of several metrics to summarize the expression of genes and assigns a p-value to clusters enabling the identification of significant profiles and comparison of profiles across different paths.

scSTEM generates summary time series data using several different approaches for each of the paths. This data is then used as input for STEM and clusters are determined for each path in the trajectory.





□ scMMGAN: Single-Cell Multi-Modal GAN architecture resolves the ambiguity created by only stating a distribution-level loss in learning a mapping.

>> https://www.biorxiv.org/content/10.1101/2022.07.04.498732v1.full.pdf

Single-Cell Multi-Modal GAN (scMMGAN) integrates data from multiple modalities into a unified representation in the ambient data space for downstream analysis, using a combination of adversarial learning and data geometry techniques.

scMMGAN achieves multi-modality and specifies a generally applicable correspondence loss: the geometry-preserving loss. It enforces that the diffusion geometry, computed with a new kernel designed to pass gradients better than the Gaussian kernel, is preserved throughout the mapping.





□ VeloVAE: Bayesian Inference of RNA Velocity from Multi-Lineage Single-Cell Data

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499381v1.full.pdf

VeloVAE uses variational Bayesian inference to estimate the posterior distribution of latent time, latent cell state, and kinetic rate parameters for each cell.

VeloVAE addresses key limitations of previous methods by inferring a global time and cell state; modeling the emergence of multiple cell types; incorporating prior information such as time point labels; using scalable minibatch optimization; and quantifying parameter uncertainty.





□ TCSW: Directed Shortest Walk on Temporal Graphs

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499368v1.full.pdf

The Time Conditioned Shortest Walk (TCSW) problem takes on a similar flavor to the Condition Shortest Path problem. The input is a series of ordered networks Gt with ordered conditions {1, ..., T} representing discrete measurements of time, as well as a pair of nodes (a∈G1, b∈GT).

Extending the Condition setting to TCSW, a single global shortest-path problem with the temporal walk constraint, makes the problem hard to solve. An integer linear program solves a generalized version of TCSW. It finds optimal solutions to the generalized k-TCSW problem in feasible time.





□ GeneTrajectory: Gene Trajectory Inference for Single-cell Data by Optimal Transport Metrics

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499404v1.full.pdf

GeneTrajectory unravels gene trajectories associated with distinct biological processes. GeneTrajectory computes a cell-cell graph that preserves the manifold structure of the cells.

GeneTrajectory constructs a gene-gene graph in which the affinities between genes are based on the Wasserstein distances between their distributions on the cell graph. Each trajectory is associated with a specific biological process and reveals the pseudo-temporal order.





□ CTSV: Identification of Cell-Type-Specific Spatially Variable Genes Accounting for Excess Zeros

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac457/6632658

In fact, the spatial information can be incorporated into the Gaussian process in two ways—the spatial effect on the mean vector or the spatial dependency induced by the covariance matrix.

CTSV explicitly incorporates the cell type proportions of spots into a zero-inflated negative binomial distribution and models the spatial effects through the mean vector, whereas existing SV gene detection approaches either do not directly utilize cellular compositions or do not account for excess zeros.





□ SeCNV: Resolving single-cell copy number profiling for large datasets

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac264/6633647

SeCNV successfully processes large datasets (>50 000 cells) within 4 min, while other tools fail to finish within the time limit, i.e. 120 h.

SeCNV adopts a local Gaussian kernel to construct a matrix, depth congruent map (DCM), capturing the similarities between any two bins along the genome. Then, SeCNV partitions the genome into segments by minimizing the structural entropy.





□ BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac448/6633304

BubbleGun is considerably faster than vg especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes.

BubbleGun detects superbubbles in a given input graph by implementing the algorithm, which is an average-case linear time algorithm. The algorithm iterates over all nodes s in the graph and determines whether there is another node t that satisfies the superbubble rules.





□ treeArches: Single-cell reference mapping to construct and extend cell type hierarchies

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499109v1.full.pdf

treeArches is a framework to automatically build and extend reference atlases while enriching them with an updatable hierarchy of cell type annotations across different datasets. treeArches enables data-driven construction of consensus, atlas-level cell type hierarchies.

treeArches builds on scArches and single-cell Hierarchical Progressive Learning (scHPL). treeArches maps new query datasets to the latent space learned from the reference datasets using architectural surgery.





□ Detection of cell markers from single cell RNA-seq with sc2marker

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04817-5

sc2marker is based on the maximum margin to select markers for flow cytometry. sc2marker finds an optimal threshold α (or margin) with maximal distances to true positives (TP) and true negatives (TN) and low distances to false positives (FP) and false negatives (FN).

Hypergate uses a non-parametric score statistic to find markers in scRNA-seq data that distinguish different cell types. sc2marker reimplements the Hypergate criteria to rank all markers. sc2marker allows users to explore the COMET database using the option “category=FlowComet”.





□ Verkko: telomere-to-telomere assembly of diploid chromosomes

>> https://www.biorxiv.org/content/10.1101/2022.06.24.497523v1.full.pdf

To resolve the most complex repeats, this project relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio HiFi reads.

Verkko is an iterative, graph-based pipeline for assembling complete, diploid genomes. Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and progressively simplifies this graph via the integration of ultra-long reads and haplotype paths.






□ ResMiCo: increasing the quality of metagenome-assembled genomes with deep learning

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497335v1.full.pdf

Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data.

The Residual neural network for Misassembled Contig identification (ResMiCo) is a deep convolutional neural network with skip connections between non-adjacent layers. ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods.





□ Bookend: precise transcript reconstruction with end-guided assembly

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02700-3

Bookend is a generalized framework for identifying RNA ends in sequencing data and using this information to assemble transcript isoforms as paths through a network accounting for splice sites, transcription start sites (TSS), and polyadenylation sites (PAS).

Bookend takes RNA-seq reads from any method as input and after alignment to a reference genome, reads are stored in an ELR format that records all RNA boundary features. The Overlap Graph is iteratively traversed to resolve an optimal set of Greedy Paths from TSSs to PASs.






□ Lokatt: A hybrid DNA nanopore basecaller with an explicit duration hidden Markov model and a residual LSTM network

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499873v1.full.pdf

The duration of any state with a self transition in a Bayesian state-space model is always geometrically distributed. This is inconsistent with the dwell-times reported for both polymerase and helicase, two popular candidates for ratcheting enzymes.

Lokatt combines an explicit duration hidden Markov model with a residual-LSTM network. It uses an explicit duration HMM (EDHMM) with an additional duration state that models the dwell-time of the dominating k-mer.
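The geometric dwell-time property mentioned above follows directly from the self-transition probability: staying d-1 times and then leaving has probability p^(d-1)(1-p). A short sketch, with `p_stay` as an illustrative parameter name:

```python
def duration_pmf(p_stay, d):
    """P(duration == d) for a state with self-transition probability p_stay."""
    if d < 1:
        raise ValueError("duration must be at least 1")
    return p_stay ** (d - 1) * (1 - p_stay)


p_stay = 0.8
pmf = [duration_pmf(p_stay, d) for d in range(1, 2001)]
mean_duration = sum(d * p for d, p in zip(range(1, 2001), pmf))
```

The probability mass is always largest at d = 1 and decays monotonically, so a plain HMM cannot represent dwell-time distributions that peak at longer durations; an explicit duration state removes that restriction.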





□ scDART: integrating unmatched scRNA-seq and scATAC-seq data and learning cross-modality relationship simultaneously

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02706-x

scDART (single cell Deep learning model for ATAC-Seq and RNA-Seq Trajectory integration) is a scalable deep learning framework that embeds data modalities into a shared low-dimensional latent space that preserves cell trajectory structures in the original datasets.

scDART learns a joint latent space for both data modalities that preserves the cell developmental trajectories well. Even though scDART and scDART-anchor were designed for cells that form continuous trajectories, they can also work for cells that form discrete clusters.





□ Duet: SNP-Assisted Structural Variant Calling and Phasing Using Oxford Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2022.07.04.498779v1.full.pdf

Duet is an SV detection tool optimized for SV calling and phasing using ONT data. The tool uses novel features integrated from both SV signatures and single-nucleotide polymorphism (SNP) signatures, which can accurately distinguish an SV haplotype from a false signal.

Duet can perform accurate SV calling, SV genotyping and SV phasing using low-coverage ONT data. Duet will use the haplotype and the prediction confidence of the reads. Duet employs GNU Parallel to allow parallel processing of all chromosomes.





□ MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac435/6619554

MSRCall comprises a multi-scale structure, recurrent layers, a fusion block, and a CTC decoder. To better identify both short-range and long-range dependencies, the recurrent layer is redesigned to capture various time-scale features with a multi-scale structure.

MSRCall fuses convolutional layers to manipulate multi-scale downsampling. These back-to-back convolutional layers aim to capture features with receptive fields at different levels of complexity.





□ Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac423/6618524

The single-cell generalized trend model (scGTM) captures a gene’s expression trend, which may be monotone, hill-shaped, or valley-shaped, along cell pseudotime.

scGTM uses the particle swarm optimization algorithm to find the constrained maximum likelihood estimates. A natural extension is to split a multiple-lineage cell trajectory into single lineages and fit the scGTM to each lineage separately.





□ scGET-seq: Dimensionality reduction and statistical modeling

>> https://www.biorxiv.org/content/10.1101/2022.06.29.498092v1.full.pdf

scGET-seq is a technique that uses a Hybrid Transposase (tnH) along with the canonical enzyme (tn5) to profile closed and open chromatin together in a single experiment.

scGET-seq uses Tensor Train Decomposition. This allows the data to be represented as a single tensor, which can be factorized to obtain a low-dimensional embedding. scGET-seq overcomes the limitations of chromatin velocity and allows robust identification of cell trajectories.





□ GAVISUNK: Genome assembly validation via inter-SUNK distances in Oxford Nanopore reads

>> https://www.biorxiv.org/content/10.1101/2022.06.17.496619v1.full.pdf

GAVISUNK is a method of validating HiFi-driven assemblies with orthogonal ONT sequence. It specifically assesses the contiguity of regions, flagging potential haplotype switches or misassemblies.

GAVISUNK may be applied to any region or genome assembly to identify misassemblies and potential collapses, and is valuable for validating the integrity of regions. It can be applied at fine scale to closely examine regions of interest across multiple haplotype assemblies.





□ Bcmap: fast alignment-free barcode mapping for linked-read sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.06.20.496811v1.full.pdf

Bcmap is accurate and an order of magnitude faster than full read alignment. Bcmap uses k-mer hash tables and window minimizers to swiftly map barcodes to the reference whilst calculating a mapping score.

Bcmap calculates all minimizers of the reads labeled with the same barcode and looks them up in the k-mer reference index. The index is constructed in a way that allows one to look up the frequency of a minimizer before accessing all associated positions.
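A toy version of window minimizers and a frequency-checked lookup is sketched below. It assumes lexicographic ordering of k-mers instead of a hash, and a plain dictionary instead of Bcmap's hash tables; all function names and the `max_freq` cutoff are hypothetical.

```python
def window_minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer of each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Lexicographically smallest k-mer; ties go to the leftmost position.
        pos = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((pos, kmers[pos]))
    return sorted(picked)


def build_index(reference, k, w):
    """Map each reference minimizer to the list of positions where it occurs."""
    index = {}
    for pos, kmer in window_minimizers(reference, k, w):
        index.setdefault(kmer, []).append(pos)
    return index


def lookup(index, reads, k, w, max_freq):
    """Collect reference positions hit by the minimizers of one barcode's reads."""
    hits = []
    for read in reads:
        for _, kmer in window_minimizers(read, k, w):
            positions = index.get(kmer, [])
            # Check the frequency first and skip overly repetitive minimizers.
            if 0 < len(positions) <= max_freq:
                hits.extend(positions)
    return sorted(hits)
```

Checking the frequency before touching the position list is what keeps repetitive minimizers from dominating the work per barcode.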





□ CCC: An efficient not-only-linear correlation coefficient based on machine learning

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496326v1.full.pdf

The Clustermatch Correlation Coefficient (CCC) reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.

CCC has a single parameter that limits the maximum complexity of relationships found. CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.





□ JASPER: a fast genome polishing tool that improves accuracy and creates population-specific reference genomes

>> https://www.biorxiv.org/content/10.1101/2022.06.14.496115v1.full.pdf

JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction) gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus.

JASPER can use these k-mer counts to “correct” a human genome assembly so that it contains all homozygous variants that are common in the population from which the reads were drawn.
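The alignment-free idea can be illustrated with a plain k-mer counter: assembly k-mers that are rare or absent in the reads point to likely consensus errors. JASPER itself builds its database with Jellyfish; the `Counter`-based stand-in, the function names, and the `min_count` threshold here are assumptions for the sketch.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(assembly, counts, k, min_count=2):
    """Start positions of assembly k-mers that the reads barely support."""
    return [
        i
        for i in range(len(assembly) - k + 1)
        if counts[assembly[i:i + k]] < min_count
    ]
```

A single wrong base makes every k-mer overlapping it unsupported, so errors show up as runs of consecutive flagged positions; a corrector would then try substitutions that restore well-supported k-mers.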





□ Uncertainty quantification of reference based cellular deconvolution algorithms

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496235v1.full.pdf

An accuracy metric that quantifies the CEll TYpe deconvolution GOodness (CETYGO) score of a set of cellular heterogeneity variables derived from a genome-wide DNA methylation profile for an individual sample.

While theoretically the CETYGO score can be used in conjunction with any reference-based deconvolution method, this package only contains code to calculate it in combination with Houseman's algorithm.





□ SAE-IBS: Hybrid Autoencoder with Orthogonal Latent Space for Robust Population Structure Inference

>> https://www.biorxiv.org/content/10.1101/2022.06.16.496401v1.full.pdf

SAE-IBS combines the strengths of traditional matrix decomposition-based (e.g., principal component analysis) and more recent neural network-based (e.g., autoencoders) solutions.

SAE-IBS generates a robust ancestry space in the presence of relatedness. SAE-IBS yields an orthogonal latent space enhancing dimensionality selection while learning non-linear transformations.





□ Analyzing single-cell bisulfite sequencing data with scbs

>> https://www.biorxiv.org/content/10.1101/2022.06.15.496318v1.full.pdf

scbs prepare parses methylation files produced by common bisulfite sequencing mappers and stores their contents in a compressed format optimised for efficient access to genomic intervals.

To obtain a methylation matrix, similar to the count matrices used in scRNA-seq, the user must first decide in which genomic intervals methylation should be quantified. The methylation matrix can be used for downstream analysis such as cell clustering / dimensionality reduction.





□ NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac396/6609768

NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format.

NetRAX uses a greedy hill climbing approach to search for network topologies. It deploys an outer search loop to iterate over different move types and an inner search loop to search for the best-scoring network using a specific move type.
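
The nested-loop structure can be sketched generically. This is a toy under stated assumptions, not NetRAX code: `greedy_search`, `score` and the move functions are hypothetical, and real move types are network rearrangements rather than arithmetic steps.

```python
def greedy_search(start, score, move_types):
    """Greedy hill climbing: cycle over move types (outer loop) and
    exhaust each one (inner loop) until no move improves the score."""
    best, best_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for neighbours in move_types:        # outer loop: one move type at a time
            while True:                      # inner loop: best candidate of this type
                candidates = list(neighbours(best))
                if not candidates:
                    break
                top = max(candidates, key=score)
                if score(top) <= best_score:
                    break
                best, best_score = top, score(top)
                improved = True
    return best, best_score
```

With integers as stand-in "topologies", a coarse move type followed by a fine one reaches the optimum that neither finds as quickly alone.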





□ PolyAtailor: measuring poly(A) tail length from short-read and long-read sequencing data

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac271/6620877

PolyAtailor provides two core functions for measuring poly(A) tails, namely Tail_map and Tail_scan, which can be used for profiling tails with or without using a reference genome.

PolyAtailor can identify all potential tails in a read, providing users with detailed information such as tail position, tail length, tail sequence and tail type.

PolyAtailor integrates rich functions for poly(A) tail and poly(A) site analyses, such as differential poly(A) length analysis, poly(A) site identification and annotation, and statistics and visualization of base composition in tails.





□ Patchwork: alignment-based retrieval and concatenation of phylogenetic markers from genomic data

>> https://www.biorxiv.org/content/10.1101/2022.07.03.498606v1.full.pdf

Patchwork, a new method for mining phylogenetic markers directly from an assembled genome. Homologous regions are obtained via an alignment search, followed by a “hit-stitching” phase, in which adjacent or overlapping regions are concatenated together.
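
A rough sketch of what "hit-stitching" could look like when reduced to interval merging. Patchwork itself is written in Julia and works on alignments; the Python function below, its name and the `max_gap` parameter are illustrative assumptions.

```python
def stitch_hits(hits, max_gap=0):
    """Merge (start, end) hits, end exclusive, that overlap or lie
    within max_gap positions of each other."""
    stitched = []
    for start, end in sorted(hits):
        if stitched and start <= stitched[-1][1] + max_gap:
            prev_start, prev_end = stitched[-1]
            stitched[-1] = (prev_start, max(prev_end, end))   # extend the current stitch
        else:
            stitched.append((start, end))                     # open a new stitch
    return stitched
```

For example, `stitch_hits([(10, 20), (0, 5), (5, 8), (18, 25), (40, 50)])` joins the adjacent and overlapping hits and leaves the distant one alone.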

Patchwork utilizes the sequence aligner DIAMOND, and is written in the programming language Julia. A novel sliding window technique is used to trim non-coding regions from the alignments.





□ A Draft Human Pangenome Reference

>> https://www.biorxiv.org/content/10.1101/2022.07.09.499321v1.full.pdf

A draft pangenome that captures known variants and haplotypes, reveals novel alleles at structurally complex loci, and adds 119 million base pairs of euchromatic polymorphic sequence and 1,529 gene duplications relative to the existing reference, GRCh38.






□ UnpairReg: Integration of single-cell multi-omics data by regression analysis on unpaired observations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02726-7

UnpairReg attempts to perform linear regression on the unpaired data. UnpairReg provides an accurate estimation of cell gene expression where only chromatin accessibility data is available. The cis-regulatory network inferred from UnpairReg is highly consistent with eQTL mapping.

UnpairReg uses a fast linear approximation algorithm. UnpairReg transforms the linear regression problem into a regression on the covariance matrix. It is based on the assumption that the expression of different genes is independent given the accessibility of the regulatory elements (REs).





□ On the importance of data transformation for data integration in single-cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500522v1.full.pdf

A re-investigation employing different data transformation methods for preprocessing revealed that large performance gains can be achieved by a properly chosen optimal data transformation method. Transfer learning might not have significant benefits when preprocessing steps are well optimized.












Tessellate.

2022-07-17 19:07:07 | Science News




□ Storing and analyzing a genome on a blockchain

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02699-7

Nebula Genomics uses Ethereum Smart Contracts to facilitate communication between nodes, and Blockstack to facilitate data storage, but Blockstack stores the data off-chain, either on a local drive or in the cloud.

SAMchain is the first framework to store raw genomic reads on a blockchain, on-chain. The algorithm searches through the binned streams to obtain the SAM data. Each private blockchain network corresponds to a single genome owned by the individual to which the genome belongs.





□ Genozip Dual-Coordinate VCF format enables efficient genomic analyses and alleviates liftover limitations

>> https://www.biorxiv.org/content/10.1101/2022.07.17.500374v1.full.pdf

Dual Coordinate VCF (DVCF), a file format that records genomic variants against two different reference genomes simultaneously and is fully compliant with the current VCF specification.

Using DVCF files, researchers can alternate between coordinate systems according to their needs – without creating duplicate VCF files. Importantly the DVCF file format is independent of its implementation in Genozip.





□ PolarMorphism enables discovery of shared genetic variants across multiple traits from GWAS summary statistics

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i212/6617483

PolarMorphism, a new approach to identify pleiotropic SNPs that is more efficient, identifies the same number of pleiotropic SNPs as PLACO, but can be applied to more than two traits. This enables the identification of SNPs that have an effect on numerous traits.

PolarMorphism enables construction of a trait network showing which traits share SNPs. PolarMorphism identifies more pleiotropic SNPs than the standard intersection method and than PRIMO. PolarMorphism finished analysis of 1 million SNPs in less than 20 s.





□ ResPAN: a powerful batch correction model for scRNA-seq data through residual adversarial networks

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac427/6623406

ResPAN is a light-structured Residual autoencoder and mutual nearest neighbor Pairing guided Adversarial Network for scRNA-seq batch correction.

ResPAN is based on Wasserstein Generative Adversarial Network (WGAN) combined with random walk mutual nearest neighbor pairing and fully skip-connected autoencoders to reduce the differences among batches.





□ scFates: a scalable python package for advanced pseudotime and bifurcation analysis from single cell data

>> https://www.biorxiv.org/content/10.1101/2022.07.09.498657v1.full.pdf

scFates is fully compatible with the scanpy ecosystem by using the anndata format, and provides GPU and multicore accelerated functions for faster and more scalable inference.

Using the SimplePPT algorithm, where each cell is assigned a probability to each principal point, scFates can generate several pseudotime mappings. scFates provides functions for selecting specific portions of the tree, by selecting starting and endpoints, or by using pseudotime.





□ scVIDE: Designing Single-Cell RNA-Sequencing Experiments for Learning Latent Representations

>> https://www.biorxiv.org/content/10.1101/2022.07.08.499284v1.full.pdf

scVIDE determines statistical power for detecting cell group structure in a lower-dimensional representation. scVIDE starts with a cell by gene count matrix from which a small number of cells are randomly selected and counts are randomly permuted across genes.

Extending scVIDE to deep Boltzmann machines (DBMs), which have been adapted to scRNA-seq data, could be useful because it was previously shown that DBMs could learn from smaller data sets compared to other deep generative models.





□ scDLC: a deep learning framework to classify large sample single-cell RNA-seq data

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08715-1

scDLC is based on long short-term memory recurrent neural networks (LSTMs). This classifier does not require prior knowledge of the scRNA-seq data distribution, and it is a scale-invariant method that does not require a normalization procedure for scRNA-seq data.

scDLC amplifies the features of the selected genes through the first fully connected layer. The output of the 1st fully connected layer is taken as the input of the two-layer long short-term memory network layer, and the weights of all gates are estimated by network calculation.





□ Deep Visualization: Structure-Preserving and Batch-Correcting Visualization Using Deep Manifold Transformation for Single-cell RNA-Seq Profiles

>> https://www.biorxiv.org/content/10.1101/2022.07.09.499435v1.full.pdf

Deep visualization (DV) preserves the inherent structure of data, handles batch effects, and is applicable to a variety of datasets from different application domains and dataset scales.

The method embeds a given dataset into a 2- or 3-dimensional visualization space, with a Euclidean metric for “time-fixed” scRNA-seq data or a hyperbolic metric for “time-evolution” scRNA-seq data, depending on the specified task type and data type.

DV learns a semantic graph to describe the relationships between data samples, transforms the data into visualization space while preserving the geometric structure of the data and correcting batch effects in an end-to-end manner.





□ XSI - A genotype compression tool for compressive genomics in large biobanks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac413/6617346

xSqueezeIt (XSI) - VCF / BCF Genotype Compressor based on sparse representation for rare variants and positional Burrows-Wheeler transform (PBWT) followed by 16-bit Word Aligned Hybrid (WAH) encoding for common variants.

XSI relies on a hierarchical block-based strategy. The blocks hold a small dictionary referencing their content. The sub-blocks are compressed with methods specific to the data type. The PBWT is recomputed from the initial sample ordering for each block, making each block independent.
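
To show why the PBWT helps compression, here is a minimal sketch of the positional prefix ordering. It is not XSI's code: the function name is hypothetical, and the WAH encoding and block layout are left out.

```python
def pbwt_columns(haplotypes):
    """haplotypes: list of equal-length 0/1 lists (one per haplotype).
    Returns, for each site, the allele column read in the order that
    sorts haplotypes by their reversed prefix up to that site."""
    order = list(range(len(haplotypes)))
    n_sites = len(haplotypes[0]) if haplotypes else 0
    columns = []
    for site in range(n_sites):
        columns.append([haplotypes[h][site] for h in order])
        # Stable partition: haplotypes carrying 0 first, then those carrying 1.
        zeros = [h for h in order if haplotypes[h][site] == 0]
        ones = [h for h in order if haplotypes[h][site] == 1]
        order = zeros + ones
    return columns
```

Haplotypes that agree on recent sites end up adjacent, so later columns tend to form long runs of identical alleles, which run-length style encodings then compress well. Restarting `order` from the initial sample ordering at each block is what would make blocks independent.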





□ The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac410/6617344

The Practical Haplotype Graph is a pangenome pipeline, database (PostGRES & SQLite), data model (Java, Kotlin, or R), and Breeding API (BrAPI). At even 0.1X coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction.

The Practical Haplotype Graph is a trellis graph that represents discrete genomic DNA sequences and connections. HMM algorithms, Viterbi and forward-backward, operate on a trellis graph, and organize pangenomes by aligning all of the genomes against a single reference genome.





□ Revelio: Manipulating base quality scores enables variant calling from bisulfite sequencing alignments using conventional bayesian approaches

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-022-08691-6

The double-masking procedure facilitates sensitive and accurate variant calling directly from bisulfite sequencing data using software intended for conventional DNA sequencing libraries.




□ scBalance: A scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data

>> https://www.biorxiv.org/content/10.1101/2022.06.22.497193v1.full.pdf

scBalance, a sparse neural network framework to automatically label rare cell types in scRNA-seq datasets of all scales. By leveraging the newly designed neural network structure, scBalance performs especially well on rare cell type annotation and is robust to batch effects.

scBalance leverages the combination of weight and sparse neural network, whereby rare cell types are informative w/o harming the annotation efficiency of the major cell populations. scBalance is the first auto-annotation tool that scales to a dataset of 1.5 million cells.





□ baseLess: Lightweight detection of sequences in raw MinION data

>> https://www.biorxiv.org/content/10.1101/2022.07.10.499286v1.full.pdf

baseLess is a computational tool for target-detection-only analysis. baseLess makes use of an array of small neural networks, each of which efficiently detects a fixed-size subsequence of the target sequence directly from the electrical signal.

baseLess deduces the presence of a target sequence by detecting squiggle segments corresponding to salient short sequences, k-mers, using an array of convolutional neural networks.

baseLess ranks k-mers by abundance as measured in the reads and compares it to their abundance ranking in the target and background genomes, using the mean squared rank difference (MSRD).
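
The mean squared rank difference can be written down directly. A small sketch, assuming abundances arrive as plain dictionaries and that ties are broken alphabetically (both are assumptions, as are the function names):

```python
def ranks(abundance):
    """Map each k-mer to its rank (0 = most abundant); ties broken alphabetically."""
    ordered = sorted(abundance, key=lambda kmer: (-abundance[kmer], kmer))
    return {kmer: rank for rank, kmer in enumerate(ordered)}


def msrd(read_abundance, genome_abundance):
    """Mean squared rank difference over the k-mers present in both inputs."""
    read_rank, genome_rank = ranks(read_abundance), ranks(genome_abundance)
    shared = read_rank.keys() & genome_rank.keys()
    if not shared:
        raise ValueError("no shared k-mers to compare")
    return sum((read_rank[k] - genome_rank[k]) ** 2 for k in shared) / len(shared)
```

A lower MSRD against the target genome than against the background genome suggests the reads rank k-mers the way the target does.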





□ RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02715-w

RATTLE is competitive at recovering transcript sequences and their abundances despite not using any information from the reference. RATTLE lays the foundation for a multitude of potential new applications of Nanopore transcriptomics.

RATTLE performs a greedy deterministic clustering using a two-step k-mer based similarity measure. RATTLE solves the Longest Increasing Subsequence (LIS) problem to find the longest list of collinear matching k-mers between a pair of reads.
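
The LIS step can be illustrated with a short sketch. It is a simplification, not RATTLE's implementation: the function names are hypothetical, and only k-mers that occur exactly once in each read are used so that every match is unambiguous.

```python
from bisect import bisect_left


def shared_kmer_positions(a, b, k):
    """(pos_in_a, pos_in_b) pairs, sorted by pos_in_a, for k-mers that
    occur exactly once in each read (a simplifying assumption)."""
    def unique_positions(s):
        seen = {}
        for i in range(len(s) - k + 1):
            seen.setdefault(s[i:i + k], []).append(i)
        return {kmer: pos[0] for kmer, pos in seen.items() if len(pos) == 1}
    pa, pb = unique_positions(a), unique_positions(b)
    return sorted((pa[kmer], pb[kmer]) for kmer in pa.keys() & pb.keys())


def longest_collinear_chain(pairs):
    """Length of the longest strictly increasing subsequence of b-positions
    (patience sorting, O(n log n)); pairs must be sorted by a-position."""
    tails = []
    for _, pos_b in pairs:
        i = bisect_left(tails, pos_b)
        if i == len(tails):
            tails.append(pos_b)
        else:
            tails[i] = pos_b
    return len(tails)
```

Matches that increase in both reads at once are collinear, so the LIS length gives a similarity score that tolerates scattered sequencing errors.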





□ Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac492/6633930

Needle, a fast and space-efficient prefilter for estimating the quantification of very large nucleotide sequences. Needle can estimate the quantification of thousands of sequences in a few minutes or even only seconds.

Needle uses the Interleaved Bloom Filter (IBF) with minimizers instead of contiguously overlapping k-mers to efficiently index and query these experiments. Needle splits the count values of one experiment into meaningful buckets and stores each bucket as one IBF.





□ HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010129

HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros.

The key idea relies on the introduction of an indicator variable (the latent variable) denoting structural zeros or otherwise, for which a statistical inference is made based on its posterior probability estimated using Markov chain Monte Carlo (MCMC) samples.
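
Estimating the posterior probability from MCMC draws amounts to averaging the sampled indicator. A minimal sketch, assuming the draws are already available as 0/1 lists; the function names, the `burn_in` handling and the 0.5 threshold are illustrative choices, not taken from HiCImpute.

```python
def posterior_structural_zero(indicator_samples, burn_in=0):
    """Posterior probability that a bin pair is a structural zero:
    the fraction of retained MCMC draws in which the indicator is 1."""
    kept = indicator_samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    return sum(kept) / len(kept)


def call_structural_zeros(samples_by_pair, threshold=0.5, burn_in=0):
    """Bin pairs whose posterior probability exceeds a (hypothetical) threshold."""
    return {pair for pair, draws in samples_by_pair.items()
            if posterior_structural_zero(draws, burn_in) > threshold}
```

Observed zeros that are not called structural would be treated as dropouts and remain candidates for imputation.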





□ CDSImpute: An ensemble similarity imputation method for single-cell RNA sequence dropouts

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482522004504

CDSImpute (Correlation Distance Similarity Imputation), a novel Single-cell RNA dropout imputation method to retrieve the original gene expression of the genes with excessive zero and near-zero counts.



□ Complete sequence verification of plasmid DNA using the Oxford Nanopore Technologies′ MinION device

>> https://www.biorxiv.org/content/10.1101/2022.06.21.497051v1.full.pdf

A pipeline that generates a high-quality consensus sequence of linearized plasmid using ONT MinION sequencing, leveraging substantial sequencing depth and stringent quality filters to overcome the relatively high error rates associated with nanopore sequencing.




□ Mandalorion: Identifying and quantifying isoforms from accurate full-length transcriptome sequencing reads

>> https://www.biorxiv.org/content/10.1101/2022.06.29.498139v1.full.pdf

The Mandalorion tool has been continuously developed over the last 5 years and identifies and quantifies high-confidence isoforms from accurate full-length transcriptome sequencing reads produced by methods like PacBio Iso-Seq and ONT-based R2C2.

Mandalorion v4 accepts an arbitrary number of fasta/q files containing accurate full-length transcriptome sequencing data. Mandalorion v4 identifies isoforms with very high Recall and Precision when applied to either spike-in or simulated data with known ground-truth isoforms.





□ PyGenePlexus: A Python package for gene discovery using network-based machine learning

>> https://www.biorxiv.org/content/10.1101/2022.07.02.498552v1.full.pdf

The GenePlexus method utilizes pre-processed information from genome-wide molecular networks and gene set collections from the Gene Ontology (GO) and DisGeNet.

PyGenePlexus trains a custom ML model and returns the probability of how associated every gene in the network is to the user supplied gene set, along with the network connectivity of the top predicted genes.





□ Phylovar: toward scalable phylogeny-aware inference of single-nucleotide variations from single-cell DNA sequencing data

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i195/6617481

Phylovar extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar in terms of SNV detection.

Phylovar finds the tree topology and the placement of mutations on ancestral single cells that maximize the likelihood of the erroneous observed read counts given the genotypes.





□ fimpera: drastic improvement of Approximate Membership Query data-structures with counts

>> https://www.biorxiv.org/content/10.1101/2022.06.27.497694v1.full.pdf

fimpera, consisting of a simple strategy for reducing the false-positive rate of any AMQ indexing all k-mers (words of length k) from a set of sequences, along with their abundance information.

fimpera decreases the false-positive rate of a counting Bloom filter by an order of magnitude while reducing the number of overestimated calls, as well as lowering the average difference between the overestimated calls and the ground truth.





□ SIMPA: Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0270043

SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors.

SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region.





□ DeepBend: An Interpretable Model of DNA Bendability

>> https://www.biorxiv.org/content/10.1101/2022.07.06.499067v1.full.pdf

DeepBend, a convolutional neural network model built as a visible neural network where we designed the convolutions to directly capture the motifs underlying DNA bendability and how their periodic occurrences or relative arrangements modulate bendability.

DeepBend is a 3-layered CNN that takes in a one-hot encoded DNA sequence as input and predicts its bendability. Each row of a first-layer filter is a multinomial distribution over the four nucleotides, so these filters are interpretable as biophysical models of sequence motifs.





□ DeepGenGrep: a general deep learning-based predictor for multiple genomic signals and regions

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac454/6633307

DeepGenGrep, a general predictor for the systematic identification of multiple different GSRs from genomic DNA sequences.

DeepGenGrep leverages the power of hybrid neural networks comprising a three-layer convolutional neural network and a two-layer long short-term memory to effectively learn useful feature representations from sequences.





□ pareg: Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression

>> https://www.biorxiv.org/content/10.1101/2022.07.06.498967v1.full.pdf

pareg follows the ideas of GSEA as it requires no stratification of the input gene list, of MGSA as it incorporates term-term relations in a database-agnostic way, and of LRPath as it makes use of the flexibility of the regression approach.

pareg assumes that a linear combination of gene-pathway memberships is driving the overall pathway dysregulation, an assumption which may reduce the algorithm’s applicability in certain biological environments.





□ Porechop_ABI: discovering unknown adapters in ONT sequencing reads for downstream trimming

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499093v1.full.pdf

Porechop_ABI automatically infers adapter sequences from raw reads alone, without any external knowledge or database. This algorithm determines whether the reads contain adapters, and if so what the content of these adapters is.

Porechop_ABI uses techniques from string algorithms: approximate k-mers, full-text indexing and assembly graphs. Porechop_ABI can clean untrimmed reads whose adapter sequences are not documented, and can check whether a dataset has been trimmed or not.





□ MC profiling: a novel approach to analyze DNA methylation heterogeneity in genome-wide bisulfite sequencing data

>> https://www.biorxiv.org/content/10.1101/2022.07.06.498979v1.full.pdf

Methylation Class (MC) profiling approach is built on the concept of MCs, i.e. groups of DNA molecules sharing the same number of methylated cytosines in a sample.
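
The definition of a methylation class translates into a few lines. A sketch only, assuming each read over a region is encoded as a string of `1` (methylated) and `0` (unmethylated) cytosines; the encoding and the function name are assumptions.

```python
from collections import Counter


def methylation_classes(reads):
    """Fraction of molecules in each methylation class, where a class is
    the number of methylated cytosines carried by a molecule."""
    counts = Counter(read.count("1") for read in reads)
    total = sum(counts.values())
    return {n_meth: counts[n_meth] / total for n_meth in sorted(counts)}
```

Two samples with the same average methylation can have very different class profiles, which is the heterogeneity the approach is meant to expose.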

MC profiling identified cell-to-cell differences as the prevalent contributor to DNA methylation heterogeneity, with allele differences emerging in a small fraction of analyzed regions. Moreover, MC profiling led to the identification of signatures of loci undergoing genomic imprinting.





□ MINE is a method for detecting spatial density of regulatory chromatin interactions based on a MultI-modal NEtwork

>> https://www.biorxiv.org/content/10.1101/2022.07.11.499656v1.full.pdf

MINE-Loop is a neural network model that integrates Hi-C, ChIP-seq, and ATAC-seq data to enhance the proportion of detectable regulatory chromatin interactions by reducing noise from non-regulatory interactions.

MINE-Density can be used to calculate the spatial density of regulatory chromatin interactions (SD-RCI) identified by MINE-Loop, and MINE-Viewer facilitates visualization of density and specific interactions with regulatory factors in 3D genomic structures.






□ Interactive analysis of single-cell data using flexible workflows with SCTK2.0

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499900v1.full.pdf

SCTK enables importing data from the following tools: CellRanger, Optimus, DropEst, BUStools, Seqc, STARSolo and Alevin. In all cases, SCTK parses the standard output directory structure from the pre-processing tools and automatically identifies the count files to import.





□ seqQscorer: Batch effect detection and correction in RNA-seq data using machine-learning-based automated assessment of quality

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04775-y

seqQscorer detects batch effects in the data. Taken as a confounding factor to correct the data for the clustering of the samples, the quality evaluation led to results comparable to the reference method that uses the real batch information.

The pearsongamma statistic is the correlation between distances and a 0–1 vector. To compute a design bias representing the agreement of Plow with the biological groups, the authors use Pearson gamma, also called “normalized gamma”. To obtain a positive value between zero and one, they add one and divide the result by two.
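
Written as a formula, the rescaling reads as follows. The symbols γ and γ̃ are notation introduced here for illustration; the range of γ follows from it being a correlation.

```latex
\tilde{\gamma} = \frac{\gamma + 1}{2},
\qquad
\gamma \in [-1, 1] \;\Longrightarrow\; \tilde{\gamma} \in [0, 1]
```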





□ RNA-SSNV: A Reliable Somatic Single Nucleotide Variant Identification Framework for Bulk RNA-Seq Data

>> https://www.frontiersin.org/articles/10.3389/fgene.2022.865313/full

RNA-SSNV is a scalable and efficient analysis method for RNA somatic mutation detection from paired RNA-seq and WES data. It uses Mutect2 as the core caller, combined with a multi-filtering strategy and a machine-learning model, to maximize precision and recall.

Mutations called by RNA-SSNV have a higher functional impact and therapeutic power in known driver genes. Furthermore, VAF (variant allele fraction) analysis revealed that subclones harboring expressed mutations had an evolutionary selection advantage, and that RNA had higher detection power to rescue DNA-omitted mutations.





□ Comparison of Transformations for Single-Cell RNA-Seq Data

>> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v3.full.pdf

The Pearson residuals-based transformation has attractive theoretical properties and, in the benchmarks, performed similarly well as the shifted logarithm transformation. It stabilizes the variance across all genes and is less sensitive to variations of the size factor.
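
For orientation, the two transformations can be sketched per count. These are the commonly used textbook forms, written here under the assumption of a gamma-Poisson model; the function and parameter names are hypothetical and the paper's exact parameterisation may differ.

```python
import math


def shifted_log(count, size_factor, pseudo_count=1.0):
    """Shifted logarithm: log(y / s + y0)."""
    return math.log(count / size_factor + pseudo_count)


def pearson_residual(count, expected, overdispersion):
    """Pearson residual under a gamma-Poisson model:
    (y - mu) / sqrt(mu + alpha * mu^2)."""
    variance = expected + overdispersion * expected ** 2
    return (count - expected) / math.sqrt(variance)
```

In the shifted logarithm the size factor acts on the count before the log is taken, whereas in the residual it only enters through the expected value, in both the numerator and the variance.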

Sanity Distance calculates the mean deviation of the posterior distribution of the logarithmic GE; it calculates all cell-by-cell distances, from which it can find the k-NN. Sanity ignores the inferred uncertainty and returns the maximum of the posterior as the transformed value.





□ Recommendations for clinical interpretation of variants found in non-coding regions of the genome

>> https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-022-01073-3

Recommendations aim to increase the number and range of non-coding region variants that can be clinically interpreted, which, together with a compatible phenotype, can lead to new diagnoses and catalyse the discovery of novel disease mechanisms.

Rethinking the standard ‘coding first’ strategy for genetic testing of many genes and conditions, not only through WGS, but also by expanding the regions captured by targeted panels to incl. standardised community-defined regulatory elements, where these remain more appropriate.





□ FastCAR: Fast Correction for Ambient RNA to facilitate differential gene expression analysis in single-cell RNA-sequencing datasets

>> https://www.biorxiv.org/content/10.1101/2022.07.19.500594v1.full.pdf

Fast Correction for Ambient RNA (FastCAR), a computationally lean and intuitive correction method, optimized for sc-DGE analysis of scRNA-Seq datasets generated by droplet-based methods including the 10XGenomics Chromium platform.








Converge.

2022-07-17 19:06:37 | Science News

(Pak)




□ Universal co-Extensions of torsion abelian groups

>> https://arxiv.org/pdf/2206.08857v1.pdf

An Ab3 abelian category which is Ext-small satisfies the Ab4 condition if, and only if, each object of it is Ext-universal. In particular, this means that there are torsion abelian groups that are not co-Ext-universal in the category of torsion abelian groups.

Naturally, the following question arises for abelian categories that are Ab3 but not Ab4: when does an object V of such a category admit a universal extension of V by every object? The paper characterizes all torsion abelian groups that are co-Ext-universal in the category of torsion abelian groups.





□ Variational Bayes for high-dimensional proportional hazards models with applications within gene expression

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac416/6617825

Using a sparsity-inducing spike-and-slab prior with Laplace slab and Dirac spike, referred to as sparse variational Bayes (SVB).

The method is based on a mean-field variational approximation, overcomes the computational cost of MCMC whilst retaining features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities.





□ Reliable and efficient parameter estimation using approximate continuum limit descriptions of stochastic models

>> https://www.sciencedirect.com/science/article/abs/pii/S0022519322001990

Combining stochastic and continuum mathematical models in the context of lattice-based models of two-dimensional cell biology experiments by demonstrating how to simulate two commonly used experiments: cell proliferation assays and barrier assays.

Simulating a proliferation assay, where the continuum limit model is the logistic ordinary differential equation, as well as a barrier assay where the continuum limit model is closely related to the Fisher–Kolmogorov–Petrovsky–Piskunov partial differential equation.
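
For reference, the standard forms of the two continuum models are given below. The symbols are assumptions introduced here (C for cell density, λ for the proliferation rate, K for the carrying capacity, D for the diffusivity), and since the barrier assay model is only described as closely related to Fisher–KPP, the paper's exact equation may differ.

```latex
\frac{\mathrm{d}C}{\mathrm{d}t} = \lambda C \left(1 - \frac{C}{K}\right),
\qquad
\frac{\partial C}{\partial t} = D \nabla^{2} C + \lambda C \left(1 - \frac{C}{K}\right)
```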





□ From Turing degrees to Lawvere-Tierney topologies on the effective topos

>> https://researchmap.jp/cabinets/cabinet_files/download/815179/bdc501f2005d04fa6041239e58a67484

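“For ∀∃-type theorems of low complexity, Turing reduction of search problems agrees with reverse-mathematical results to a fair degree of accuracy, and can therefore be used as a ‘prediction’ for constructive reverse mathematics.”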





□ SNVformer: An Attention-based Deep Neural Network for GWAS Data

>> https://www.biorxiv.org/content/10.1101/2022.07.07.499217v1.full.pdf

Sparse SNVs can be efficiently used by Transformer-based networks without expanding them to a full genome. It is able to achieve competitive initial performance, with an AUROC of 83% when classifying a balanced test set using genotype and demographic information.

A Transformer-based deep neural architecture for GWAS data, including a purpose-designed SNV encoder, that is capable of modelling gene-gene interactions and multidimensional phenotypes, and which scales to the whole-genome sequencing data standard for modern GWAS.





□ P-smoother: efficient PBWT smoothing of large haplotype panels

>> https://academic.oup.com/bioinformaticsadvances/article/2/1/vbac045/6611715

P-smoother, a positional Burrows-Wheeler transform (PBWT) based smoothing algorithm to actively ‘correct’ occasional mismatches and thus ‘smooth’ the panel.

P-smoother runs a bidirectional PBWT-based panel scanning that flips mismatching alleles based on the overall haplotype matching context, the IBD (identical-by-descent) prior. P-smoother’s scalability is reinforced by benchmarks on panels ranging from 4,000 to 1 million haplotypes.





□ GBZ File Format for Pangenome Graphs

>> https://www.biorxiv.org/content/10.1101/2022.07.12.499787v1.full.pdf

As the GBWTGraph uses a GBWT index for graph topology, it only needs to store a header and node labels. While the in-memory data structure used in Giraffe stores the labels in both orientations for faster access, serializing the reverse orientation is clearly unnecessary.

The libraries use Elias–Fano encoded bitvectors. While GFA graphs have segments with string names, bidirected sequence graphs have nodes w/ integer identifiers. And while the original graph may have segments w/ long labels, it often makes sense to limit the length of the labels.





□ LRBinner: Binning long reads in metagenomics datasets using composition and coverage information

>> https://almob.biomedcentral.com/articles/10.1186/s13015-022-00221-z

LRBinner, a reference-free binning approach that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes.

LRBinner uses a variational auto-encoder to obtain lower dimensional representations by simultaneously incorporating both composition and coverage information of the complete dataset. LRBinner assigns unclustered reads to obtained clusters using their statistical profiles.
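
To make the notion of composition information concrete, here is a minimal sketch that turns a read into a normalized k-mer frequency vector. It assumes k-mer frequencies as the composition signal, with k=3 chosen only for illustration; the function name `composition_profile` is hypothetical, and reverse-complement merging and the coverage features are deliberately left out.

```python
from collections import Counter
from itertools import product


def composition_profile(read, k=3):
    """Normalized frequency of every A/C/G/T k-mer in a read.

    k-mers containing other characters (e.g. N) are ignored.
    """
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[kmer] for kmer in all_kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


profile = composition_profile("ACGTACGTTGCA")
print(len(profile))  # 64 entries for k=3
```

Vectors like this, one per read, are the kind of input a variational auto-encoder could compress into a lower-dimensional representation.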





□ ChromDMM: A Dirichlet-Multinomial Mixture Model For Clustering Heterogeneous Epigenetic Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac444/6628584

ChromDMM, a product Dirichlet-multinomial mixture model that provides a probabilistic method to cluster multiple chromatin-feature coverage signals extracted from the same locus.

ChromDMM learns the shift and flip states more accurately compared to ChIP-Partitioning and SPar-K. Owing to hyper-parameter optimisation, ChromDMM can also regularise the smoothness of the epigenetic profiles across the consecutive genomic regions.





□ Slow5tools: Flexible and efficient handling of nanopore sequencing signal data

>> https://www.biorxiv.org/content/10.1101/2022.06.19.496732v1.full.pdf

SLOW5 was developed to overcome inherent limitations in the standard FAST5 signal data format that prevent efficient, scalable analysis. SLOW5 can be encoded in human-readable ASCII format, or a more compact and efficient binary format (BLOW5).

Slow5tools uses multi-threading, multi-processing and other engineering strategies to achieve fast data conversion and manipulation, including live FAST5-to-SLOW5 conversion during sequencing.





□ RResolver: efficient short-read repeat resolution within ABySS

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04790-z

RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. This larger k step bypasses multiple short k increments.

RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths w/ insufficient support. Any unambiguous paths have their nodes merged, with each path getting its own copy of the repeat sequence.
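
The idea of checking path support against a Bloom filter of reads can be sketched as follows. This is an illustrative toy, not RResolver's implementation: the class `BloomFilter`, the helper `path_support`, the filter size, and the "fraction of supported k-mers" score are all assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (no false negatives, rare false positives)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash functions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def path_support(path_seq, bloom, k):
    """Fraction of the path's k-mers that are present in the read filter."""
    path_kmers = kmers(path_seq, k)
    if not path_kmers:
        return 0.0
    return sum(kmer in bloom for kmer in path_kmers) / len(path_kmers)


# Hypothetical reads; k is kept small only so the example fits on a line.
reads = ["ACGTACGTGA", "CGTGATTACA"]
bloom = BloomFilter()
for read in reads:
    for kmer in kmers(read, 5):
        bloom.add(kmer)

print(path_support("ACGTACGTGATT", bloom, 5))
```

A path through a branching point whose support falls below some threshold would be a candidate for removal; the threshold itself is a tuning choice not specified here.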





□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac404/6613135

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors.

Figbird is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO, and uses an iterative approach based on the expectation-maximization (EM) algorithm for a range of gap lengths.





□ CuteSV: Structural Variant Detection from Long-Read Sequencing Data

>> https://link.springer.com/protocol/10.1007/978-1-0716-2293-3_9

cuteSV, a sensitive, fast, and scalable alignment-based SV detection approach for comprehensive discovery of diverse SVs. cuteSV is suitable for large-scale genome projects owing to its excellent SV yields and ultra-fast speed.

cuteSV employs a stepwise refinement clustering algorithm to process the comprehensive signatures from inter- and intra-alignment, then constructs and screens all possible alleles to complete high-quality SV calling.





□ HoloNet: Decoding functional cell-cell communication events by multi-view graph learning on spatial transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.06.22.496105v1.full.pdf

HoloNet models communication events (CEs) in spatial data as a multi-view network using the ligand and receptor expression profiles, and uses a graph neural network model to predict the expression of specific genes.

HoloNet reveals the holographic cell–cell communication networks, which could help to find specific cells and ligand–receptor pairs that affect the alteration of gene expression and phenotypes. HoloNet interprets the trained neural networks to decode functional CEs (FCEs).





□ A LASSO-based approach to sample sites for phylogenetic tree search

>> https://academic.oup.com/bioinformatics/article/38/Supplement_1/i118/6617489

An artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset.

The grid of penalty parameters is chosen, by default, such that the maximum value in the grid is the minimal penalty which forces all coefficients to equal exactly zero. The method iterates over the penalty grid until it finds a solution matching this criterion.
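
The default grid described here can be sketched with NumPy. The sketch assumes the common LASSO objective (1/(2n))·||y − Xβ||² + λ·||β||₁ with an unpenalized intercept, for which the smallest penalty that zeroes every coefficient is max_j |x_jᵀy| / n on centered data. The function name `lasso_penalty_grid`, the grid length, and the `ratio` between the smallest and largest penalty are illustrative assumptions, not the authors' settings.

```python
import numpy as np


def lasso_penalty_grid(X, y, n_penalties=10, ratio=1e-3):
    """Log-spaced penalties from lambda_max down to ratio * lambda_max.

    lambda_max is the minimal penalty at which all LASSO coefficients
    are exactly zero (under the objective scaling assumed above).
    """
    n_samples = X.shape[0]
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    lambda_max = np.max(np.abs(X_centered.T @ y_centered)) / n_samples
    return lambda_max * np.logspace(0.0, np.log10(ratio), n_penalties)


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2.0 * X[:, 0] + rng.normal(size=50)
print(lasso_penalty_grid(X, y, n_penalties=5))
```

Walking the grid from the largest penalty downward admits more and more non-zero coefficients, so a search can stop at the first penalty that selects the desired number of sites.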





□ SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02695-x

SeqScreen uses a multimodal approach combining conventional alignment-based tools, machine learning, and expert sequence curation to produce a new paradigm for novel pathogen detection tools, which is beneficial to synthetic DNA manufacturers.

SeqScreen provides an advantage in that it also reports the most likely taxonomic assignments and protein-specific functional information for each sequence, incl. GO terms / FunSoCs, to identify pathogenic sequences without relying solely on taxonomic markers.





□ AvP: a software package for automatic phylogenetic detection of candidate horizontal gene transfers.

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497291v1.full.pdf

AvP (Alienness vs Predictor) automates the robust identification of HGTs at high throughput. AvP facilitates the identification and evaluation of candidate HGTs in sequenced genomes across multiple branches of the tree of life.

AvP extracts all the information needed to produce input files to perform phylogenetic reconstruction, evaluate HGTs from the phylogenetic trees, and combine multiple external information.





□ LowKi: Moment estimators of relatedness from low-depth whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04795-8

Both the kinship coefficient φ and the fraternity coefficient ψ for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual level genotypes cannot be confidently called.

LowKi (Low-depth Kinship), new method-of-moments estimators of both coefficients φ and ψ, calculated directly from genotype likelihoods. LowKi is able to recover the structure of the Full GRM kinship and fraternity matrices.





□ Comprehensive benchmarking of CITE-seq versus DOGMA-seq single cell multimodal omics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02698-8

DOGMA-seq provides unprecedented opportunities to study complex cellular and molecular processes at single cell resolution, but a comprehensive independent evaluation is needed to compare these new trimodal assays to existing single modal and bimodal assays.

Single cell trimodal omics measurements after DIG permeabilization were generally better than after an alternative “low-loss lysis”. DOGMA-seq with optimized DIG permeabilization and its ATAC library provides more information, although its mRNA libraries have slightly inferior quality compared to CITE-seq.





□ ClearCNV: CNV calling from NGS panel data in the presence of ambiguity and noise

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btac418/6617832

CNV calling has not been established in all laboratories performing panel sequencing. At the same time such laboratories have accumulated large data sets and thus have the need to identify copy number variants on their data to close the diagnostic gap.

clearCNV identifies CNVs affecting the targeted regions. clearCNV can cope relatively well with the wide variety of panel types, panel versions and vendor technologies present in typical heterogeneous panel data collections found in rare disease research.





□ SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010163

SCRaPL (Single Cell Regulatory Pattern Learning) achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.

SCRaPL aims to become a useful tool in the hands of practitioners seeking to understand the role of particular genomic regions in the epigenetic landscape. SCRaPL can increase detection rates up to five times compared to standard practices.





□ MenDEL: automated search of BAC sets covering long DNA regions of interest

>> https://www.biorxiv.org/content/10.1101/2022.06.26.496179v1.full.pdf

MenDEL, a web-based DNA design application that provides efficient tools for finding BACs that cover long regions of interest and allows sorting results by multiple user-defined criteria.

Deploying BAC libraries as indexed database tables further speeds up and automates the parsing of these libraries. Candidate BAC sets are organized as N-ary trees, and an important property of these trees is that their depth-first traversals provide a complete list of unique BAC solutions.
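
The depth-first enumeration can be pictured with a small sketch. It is a hypothetical reconstruction rather than MenDEL's code: the function `bac_tilings`, the interval representation, and the rule that each child BAC must overlap the covered region and extend it are all assumptions.

```python
def bac_tilings(bacs, start, end):
    """Enumerate ordered BAC sets covering [start, end] by depth-first search.

    bacs: dict mapping a BAC name to its (bac_start, bac_end) coordinates.
    Each root is a BAC containing `start`; each child overlaps the region
    covered so far and extends it, so every root-to-leaf path is one solution.
    """
    solutions = []

    def extend(path, covered_to):
        if covered_to >= end:
            solutions.append(list(path))
            return
        for name, (bac_start, bac_end) in sorted(bacs.items()):
            if bac_start <= covered_to < bac_end:
                path.append(name)
                extend(path, bac_end)  # coverage strictly grows, so recursion terminates
                path.pop()

    for name, (bac_start, bac_end) in sorted(bacs.items()):
        if bac_start <= start < bac_end:
            extend([name], bac_end)
    return solutions


# Hypothetical BAC coordinates on one chromosome.
library = {"A": (0, 100), "B": (80, 200), "C": (90, 150), "D": (140, 200)}
print(bac_tilings(library, 10, 190))
# [['A', 'B'], ['A', 'C', 'B'], ['A', 'C', 'D']]
```

Because every path in the traversal is distinct, the resulting list contains each solution exactly once and can then be sorted by whatever criteria the user prefers, such as the number of BACs.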




□ Improving Biomedical Named Entity Recognition by Dynamic Caching Inter-Sentence Information

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac422/6618522

Specifically, the cache stores recent hidden representations constrained by predefined caching rules. And the model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context.

Then, an attention-based gated network is adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train the model.





□ WiNGS: Widely integrated NGS platform for federated genome analysis

>> https://www.biorxiv.org/content/10.1101/2022.06.23.497325v1.full.pdf

WiNGS sits at the crossroad of patient privacy rights and the need for highly performant / collaborative genetic variant interpretation platforms. It is a fast, fully interactive, and open source web-based platform to analyze DNA variants in both research / diagnostic settings.





□ Accuracy of haplotype estimation and whole genome imputation affects complex trait analyses in complex biobanks

>> https://www.biorxiv.org/content/10.1101/2022.06.27.497703v1.full.pdf

While phasing accuracy varied both by choice of method and data integration protocol, imputation accuracy varied mostly between data integration protocols. Finally, imputation errors can modestly bias association tests and reduce predictive utility of polygenic scores.





□ MaxHiC: A robust background correction model to identify biologically relevant chromatin interactions in Hi-C and capture Hi-C experiments

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010241

MaxHiC (Maximum Likelihood estimation for Hi-C), a negative binomial model that uses a maximum likelihood technique to correct the complex combination of known and unknown biases in both Hi-C and capture Hi-C libraries.

In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a constant non-zero value. All of the parameters of the model are learned by maximizing the logarithm of likelihood of the observed interactions using the ADAM algorithm.





□ netANOVA: novel graph clustering technique with significance assessment via hierarchical ANOVA

>> https://www.biorxiv.org/content/10.1101/2022.06.28.497741v1.full.pdf

With the netANOVA analysis workflow, we aim to exploit information about structural and dynamical properties of networks to identify significantly different groups of similar networks.

netANOVA workflow accommodates multiple distance measures: edge difference distance, a customized KNC version of k-step random walk kernel, DeltaCon, GTOM and the Gaussian kernel on the vectorized networks.





□ Gene symbol recognition with GeneOCR

>> https://www.biorxiv.org/content/10.1101/2022.07.01.498459v1.full.pdf

GeneOCR (OCR = optical character recognition) employs a state-of-the-art character recognition system to recognize gene symbols.

The errors are mostly due to substitution of optically similar characters, e.g. 1 for I or O for 0. In summary, GeneOCR recognizes or suggests the correct gene symbol in >80% of cases, and the errors in the remaining cases mostly involve single characters.
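
One simple way to recover from such substitutions is to expand an OCR result into all variants over confusable characters and keep those that are known symbols. The sketch below is only an illustration of that idea, not GeneOCR's method; the `CONFUSABLE` table, the function `suggest_symbols`, and the tiny symbol list are assumptions.

```python
from itertools import product

# Optically similar characters (an illustrative, incomplete table).
CONFUSABLE = {"1": "1Il", "I": "I1l", "l": "l1I", "0": "0O", "O": "O0"}


def suggest_symbols(ocr_text, known_symbols):
    """Return known gene symbols reachable by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in ocr_text]
    candidates = {"".join(chars) for chars in product(*options)}
    return sorted(candidates & set(known_symbols))


known = {"IL6", "SOX2", "TP53"}
print(suggest_symbols("1L6", known))   # ['IL6']
print(suggest_symbols("S0X2", known))  # ['SOX2']
```

The number of variants grows exponentially with the count of confusable characters, which stays manageable for short strings such as gene symbols.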





□ SIDEREF: Shared Differential Expression-Based Distance Reflects Global Cell Type Relationships in Single-Cell RNA Sequencing Data

>> https://www.liebertpub.com/doi/10.1089/cmb.2021.0652

SIDEREF modifies a biologically motivated distance measure, SIDEseq, for use in aggregate comparisons of cell types in large single-cell assays. The distance matrix more consistently retains global cell type relationships than commonly used distance measures for scRNA-seq clustering.

Exploring spectral dimension reduction of the SIDEREF distance matrix as a means of noise filtering, similar to principal components analysis applied directly to expression data.





□ Gfastats: conversion, evaluation and manipulation of genome sequences using assembly graphs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac460/6633308

Gfastats is a standalone tool to compute assembly summary statistics and manipulate assembly sequences in fasta, fastq, or gfa [.gz] format. Gfastats stores assembly sequences internally in a gfa-like format. This feature allows gfastats to seamlessly convert fast* to and from gfa [.gz] files.
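
As an example of the kind of summary statistic involved, the sketch below computes N50 from a list of sequence lengths. It is a generic illustration in Python and does not reflect gfastats' own code or command-line interface; the function name `n50` is an assumption.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```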

Gfastats can also build an assembly graph that can in turn be used to manipulate the underlying sequences following instructions provided by the user, while simultaneously generating key metrics for the new sequences.





□ Fast-HBR: Fast hash based duplicate read remover

>> http://www.bioinformation.net/018/97320630018036.pdf

Fast-HBR, a fast and memory-efficient tool that removes duplicate reads de novo, without a reference genome. Fast-HBR is faster and has a smaller memory footprint than state-of-the-art de novo duplicate removal tools.
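
The general idea of hash-based, reference-free duplicate removal can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Fast-HBR's algorithm: it treats only exact (case-insensitive) sequence matches as duplicates, and the function name and the choice of an 8-byte BLAKE2b digest are hypothetical.

```python
import hashlib


def remove_duplicate_reads(reads):
    """Keep the first occurrence of each read sequence, preserving input order.

    Only a short fixed-size digest is stored per read, so memory does not
    grow with read length (at the cost of a tiny collision probability).
    """
    seen = set()
    unique = []
    for read in reads:
        key = hashlib.blake2b(read.upper().encode(), digest_size=8).digest()
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


print(remove_duplicate_reads(["ACGT", "acgt", "TTGA", "ACGT"]))  # ['ACGT', 'TTGA']
```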





□ MKFTM: A novel multiple kernel fuzzy topic modeling technique for biomedical data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04780-1

MKFTM technique uses fusion probabilistic inverse document frequency and multiple kernel fuzzy c-means clustering algorithm for biomedical text mining.

In detail, the proposed fusion probabilistic inverse document frequency method is used to estimate the weights of global terms while MKFTM generates frequencies of local and global terms with bag-of-words.





□ Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2022.07.14.500036v1.full.pdf

Using Pareto MTL for estimation of Pareto front in conjunction with MINE for measurement of batch effect to produce the trade-off curve between conservation of biological variation and removal of batch effect.

To control batch effect, the generative loss of scVI is penalized by the Hilbert-Schmidt Independence Criterion (HSIC). The generative loss of SAUCIE, a sparse autoencoder, is penalized by the Maximum Mean Discrepancy (MMD).
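
For readers unfamiliar with MMD, the following NumPy sketch computes the standard biased estimate of squared MMD between two samples under a Gaussian (RBF) kernel. The kernel choice, the bandwidth, and the function names are assumptions for illustration; how the penalty is weighted inside the SAUCIE loss is not shown.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def mmd_squared(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        rbf_kernel(x, x, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
    )


# Synthetic "batches" of 2-D embeddings, for illustration only.
rng = np.random.default_rng(0)
batch_a = rng.normal(size=(200, 2))
batch_b = rng.normal(size=(200, 2))          # same distribution as batch_a
batch_c = rng.normal(loc=3.0, size=(200, 2)) # shifted distribution

print(mmd_squared(batch_a, batch_b), mmd_squared(batch_a, batch_c))
```

A value near zero indicates that the two batches are hard to tell apart in the embedding, which is what a batch-effect penalty pushes towards.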

MINE is preferable to the more standard MMD measure in the sense that the former produces trade-off points that respect subproblem ordering and are interpretable in surrogate metric spaces.





□ RLSuite: An integrative R-loop bioinformatics framework

>> https://www.biorxiv.org/content/10.1101/2022.07.13.499820v1.full.pdf

R-loops are three-stranded nucleic acid structures containing RNA:DNA hybrids. While R-loop mapping via high-throughput sequencing can reveal novel insight into R-loop biology, the quality control of these data is a non-trivial task for which few bioinformatic tools exist.

RLSuite provides an integrative workflow for R-loop data analysis, including automated pre-processing of R-loop mapping data using a standard pipeline, multiple robust methods for quality control, and a range of tools for the initial exploration of R-loop data.





□ Clair3-trio: high-performance Nanopore long-read variant calling in family trios with trio-to-trio deep neural networks

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac301/6645484

Clair3-Trio employs a Trio-to-Trio deep neural network model, which allows it to input the trio sequencing information and output all of the trio’s predicted variants within a single model to improve variant calling.

MCVLoss is a novel loss function tailor-made for variant calling in trios, leveraging the explicit encoding of the Mendelian inheritance. Clair3-Trio uses independent dense layers to predict each individual’s genotype, zygosity and two INDEL lengths in the last layer.
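
The Mendelian constraint that such a loss builds on can be stated as a small check: a child's genotype must be obtainable by taking one allele from each parent. The sketch below illustrates only that constraint and is not MCVLoss itself; the function name and the tuple encoding of genotypes are assumptions.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child's unphased genotype can arise from one allele per parent.

    Genotypes are 2-tuples of allele codes, e.g. (0, 1) for a heterozygote.
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((paternal, maternal))) == child_sorted
        for paternal, maternal in product(father, mother)
    )


print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))  # True
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))  # False
```

A trio-aware model can penalize predicted genotype combinations for which this check fails, while still allowing for rare de novo variants.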





□ Large-Scale Multiple Sequence Alignment and the Maximum Weight Trace Alignment Merging Problem

>> https://ieeexplore.ieee.org/document/9832784/

MAGUS uses divide-and-conquer: it divides the sequences into disjoint sets, computes alignments on the disjoint sets, and then merges the alignments using a technique it calls the Graph Clustering Method (GCM).

GCM is a heuristic for the NP-hard Maximum Weight Trace, adapted to the Alignment Merging problem. The input to the MWT problem is a set of sequences and weights on pairs of letters from different sequences, and the objective is an MSA that has the maximum total possible weight.








ODESZA / Light Of Day (feat. Ólafur Arnalds)

2022-07-16 23:22:33 | Music20

□ ODESZA - Light Of Day (feat. Ólafur Arnalds)

>> http://odesza.com/

“The track is a cinematic escape that transforms from a futuristic piano-driven ballad into a complex electronic reveal.”

“The track's four-on-the-floor rhythms are beautifully contrasted by the ambient textures.”


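ODESZA's new track featuring the Icelandic composer Ólafur Arnalds. An edgy electronic arrangement radiates a dazzling spectrum, enveloping strings that play a melancholy melody, faint percussion, and the gentle tone of muted keys.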

The vocal part is sampled from Stephen Ambrose's 1972 folk song “Mary”.


ODESZA - The Last Goodbye (feat. Bettye LaVette) - Official Visualizer


□ ODESZA - This Version Of You (feat. Julianna Barwick)

>> https://youtu.be/7Fb4-XC6eWg


□ ODESZA - All My Life

>> https://youtu.be/46toI-ZUGDg

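ODESZA's new album “The Last Goodbye” is a musical time capsule that condenses cutting-edge experimental electro beats, traditional funk/soul vocal sources, and, through its sampled ethnic singing and orchestration, a contemporary update of 1990s-2000s New Age music.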