lens, align.

Long is the time, but the true comes to pass.

Orpheus.

2023-05-15 05:15:05 | Science News
(Art by ekaitsa)





□ ORFeus: A Computational Method to Detect Programmed Ribosomal Frameshifts and Other Non-Canonical Translation Events

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538127v1

ORFeus uses a hidden Markov model to infer translation patterns from ribo-seq data that is inherently noisy and sparse. The model identifies changes in reading frame and additional upstream or downstream reading frames, making it suitable for detection of many alternative translation events.

ORFeus can identify novel or extended ORFs (including uORFs and dORFs) with either canonical or non-canonical start codons, as well as programmed ribosomal frameshifts and stop codon readthrough events. For each transcript, ORFeus returns the most probable state path.
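
The "most probable state path" of a hidden Markov model is conventionally computed with the Viterbi algorithm. The sketch below shows that idea on a toy two-state model; the state names, the probabilities, and the `viterbi` helper are illustrative assumptions, not ORFeus's actual model or code.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (Viterbi in log space)."""
    log = math.log
    # best[t][s]: best log-probability of any path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log(trans_p[p][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best final state back to the start
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (made-up numbers): footprint density is "low" or "high" per position
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
         "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT = {"noncoding": {"low": 0.9, "high": 0.1},
        "coding": {"low": 0.2, "high": 0.8}}

footprints = ["low", "low", "low", "high", "high", "high", "high"]
path = viterbi(footprints, STATES, START, TRANS, EMIT)
print(path)
```

A real model of translation would need many more states (for example one per reading frame and codon position), but the decoding step would have the same shape.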





□ scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

>> https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1

scGPT is a single-cell foundation model built by generative pre-training on over 10 million cells. scGPT uses an in-memory data structure to store hundreds of datasets, allowing fast access. The learned gene embedding maps decode known pathways by grouping together genes that are functionally related.

With zero-shot learning, the pre-trained model is able to reveal meaningful cell clusters on unseen datasets. With finetuning in a few-shot learning setting, the model achieves state-of-the-art performance on a wide range of downstream tasks.

scGPT employs a generative self-supervised objective to iteratively predict gene expression (GE) values of unknown tokens from known tokens in an auto-regressive manner. scGPT's embedding architecture can easily extend to multiple sequencing modalities, batches, and perturbation states.





□ REVNANO: Reverse Engineering DNA Origami Nanostructure Designs from Raw Scaffold and Staple Sequence Lists

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539261v1

REVNANO, a constraint programming solver that recovers the (approximate) staple-scaffold contact map from origami sequences. REVNANO uses graph layout techniques to convert the topological contact map into an approximate geometric origami schematic.

REVNANO leverages the unique physical features of origami nanostructures as heuristics. DNA, RNA or hybrid scaffolded origami are all supported. The quality of a REVNANO solution is quantified by the base Hamming distance between the recovered contact map and the ground-truth contact map.
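
A base-level Hamming distance between two contact maps can be sketched as follows. The representation is an assumption made for illustration: each contact map is a list with one entry per staple base, holding the scaffold position that base pairs with (or `None` if unpaired). REVNANO's own data structures may differ.

```python
def base_hamming_distance(predicted, truth):
    """Count staple bases whose assigned scaffold position differs between two maps."""
    if len(predicted) != len(truth):
        raise ValueError("contact maps must cover the same staple bases")
    return sum(p != t for p, t in zip(predicted, truth))

# Hypothetical 6-base staple: index = staple base, value = paired scaffold position
truth_map = [120, 121, 122, 305, 306, None]
solver_map = [120, 121, 122, 305, 307, None]
distance = base_hamming_distance(solver_map, truth_map)
print(distance)
```

A distance of zero means the recovered map matches the ground truth at every base.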





□ UnitedNet: Explainable multi-task learning for multi-modality biological data analysis

>> https://www.nature.com/articles/s41467-023-37477-x

UnitedNet has an encoder-decoder-discriminator structure and is trained by joint group identification / cross-modal prediction. Its structure does not presume that the data distributions are known; instead, it implicitly approximates the statistical characteristics of each modality.

UnitedNet uses the SHapley Additive exPlanations (SHAP) algorithm to indicate the cell-type-specific relevance between gene expression and DNA accessibility. UnitedNet fuses the modality-specific codes into shared latent codes using an adaptive weighting scheme.





□ AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431517v2

AirLift, a methodology and tool for quickly, comprehensively, and accurately remapping a read data set that had previously been mapped to an older reference genome to a newer reference genome.

AirLift provides BAM-to-BAM remapping results on which downstream analysis can be immediately performed. AirLift Index exploits the similarity between two references to quickly identify candidate locations that a read should be remapped to based on its original mapping.





□ DELVE: Feature selection for preserving biological trajectories in single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540043v1

DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories.

DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes.





□ Designing molecular RNA switches with Restricted Boltzmann machines

>> https://www.biorxiv.org/content/10.1101/2023.05.10.540155v1

Restricted Boltzmann machines (RBM), a simple two-layer machine learning model, capture intricate sequence dependencies induced by secondary and tertiary structure, as well as the switching mechanism, resulting in a model that can be used for the design of allosteric RNA.

The hidden units of the RBM must extract features shared by the data sequences and thus likely to be important for their biological function. Conservation of probability mass implies that regions of sequence space not populated by data sequences must be penalized.

The RBM is able to model complex interactions. After marginalizing over the hidden units configurations, effective interactions arise between the visible units. RBM can represent schematically a three-body interaction, arising from the three connections of the summed hidden unit.
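
A short derivation makes the "effective interactions" concrete. It assumes binary hidden units $h_\mu \in \{0,1\}$ and the generic energy below; the symbols $a_i$, $b_\mu$ and $w_{i\mu}$ are notation introduced here for illustration, and the hidden-unit potentials used in the paper may differ.

```latex
E(v,h) = -\sum_i a_i(v_i) - \sum_\mu b_\mu h_\mu - \sum_{i,\mu} h_\mu\, w_{i\mu}(v_i)

P(v) \propto \sum_{h} e^{-E(v,h)}
     = \exp\Big(\sum_i a_i(v_i)\Big) \prod_\mu \Big(1 + e^{\,b_\mu + I_\mu(v)}\Big),
\qquad I_\mu(v) = \sum_i w_{i\mu}(v_i)

E_{\mathrm{eff}}(v) = -\sum_i a_i(v_i) - \sum_\mu \log\Big(1 + e^{\,b_\mu + I_\mu(v)}\Big)
```

Because the logarithm is nonlinear in $I_\mu(v)$, expanding it in powers of $I_\mu$ produces products of two, three, or more $w_{i\mu}(v_i)$ terms, so a hidden unit connected to three visible units contributes a three-body term among them.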





□ metapaths: similarity search in heterogeneous knowledge graphs via meta paths

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad297/7152274

Once informative meta paths for a given KG have been defined, these meta paths define the semantics of the relationships between nodes in the KG, thereby enabling heterogeneous graph convolutional and graph attention networks for downstream machine learning analyses.

The primitives of the metapaths package identify the neighbors of a specified node with a given type by querying either an edge list or, for efficiency, an adjacency list precomputed from the edge list.

The meta path traversal function accepts an origin node, a destination node, and a specified meta path; then, via the neighbor identification functions, it starts at the origin node and recursively expands the sequence of node types until the destination node is reached.
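
The traversal described above can be sketched in a few lines. This is a minimal illustration, not the metapaths package API: the function names, the toy graph, and the choice to represent a meta path as a list of node types are all assumptions.

```python
from collections import defaultdict

def build_adjacency(edge_list):
    """Precompute an adjacency list from (source, target) edges."""
    adjacency = defaultdict(list)
    for source, target in edge_list:
        adjacency[source].append(target)
    return adjacency

def neighbors_of_type(adjacency, node_types, node, wanted_type):
    """Neighbors of `node` whose type equals `wanted_type`."""
    return [n for n in adjacency.get(node, []) if node_types[n] == wanted_type]

def traverse(adjacency, node_types, origin, destination, meta_path):
    """All paths from origin to destination whose node types follow meta_path."""
    if node_types[origin] != meta_path[0]:
        return []
    if len(meta_path) == 1:
        return [[origin]] if origin == destination else []
    paths = []
    for nxt in neighbors_of_type(adjacency, node_types, origin, meta_path[1]):
        for tail in traverse(adjacency, node_types, nxt, destination, meta_path[1:]):
            paths.append([origin] + tail)
    return paths

# Toy knowledge graph with made-up nodes
node_types = {"drug_a": "Drug", "gene_1": "Gene", "gene_2": "Gene", "disease_x": "Disease"}
edges = [("drug_a", "gene_1"), ("drug_a", "gene_2"), ("gene_2", "disease_x")]
adjacency = build_adjacency(edges)
paths = traverse(adjacency, node_types, "drug_a", "disease_x", ["Drug", "Gene", "Disease"])
print(paths)
```

Precomputing the adjacency list once avoids rescanning the edge list at every recursion step, which is the efficiency trade-off the paragraph above mentions.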






□ EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02941-w

Random transformation of DNA sequences can potentially alter their function in unknown ways. EvoAug pretrains sequence-based deep learning models for regulatory genomics data with evolution-inspired augmentations, followed by finetuning on the original, unperturbed sequence data.

EvoAug data augmentations introduce a modeling bias to learn invariances of the (un)natural symmetries generated by the augmentations.

Random insertions and deletions assume that the distance between motifs is not critical, whereas random inversions and translocations promote invariances to motif strand orientation and the order of motifs.
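
The four augmentations can be sketched on plain DNA strings. This is an illustration under stated assumptions, not the EvoAug library API: the function names are hypothetical, sequences are kept at a fixed length by trimming or padding with random bases, and a translocation is modelled as a circular shift.

```python
import random

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def random_dna(length, rng):
    return "".join(rng.choice(BASES) for _ in range(length))

def random_deletion(seq, length, rng):
    """Delete a random segment, then pad the end with random bases to keep the length."""
    start = rng.randrange(len(seq) - length + 1)
    return seq[:start] + seq[start + length:] + random_dna(length, rng)

def random_insertion(seq, length, rng):
    """Insert a random segment, then trim the end to keep the length."""
    start = rng.randrange(len(seq) + 1)
    return (seq[:start] + random_dna(length, rng) + seq[start:])[:len(seq)]

def random_inversion(seq, length, rng):
    """Replace a random segment with its reverse complement."""
    start = rng.randrange(len(seq) - length + 1)
    segment = seq[start:start + length].translate(COMPLEMENT)[::-1]
    return seq[:start] + segment + seq[start + length:]

def random_translocation(seq, rng):
    """Circularly shift the sequence, which changes the order of motifs."""
    shift = rng.randrange(len(seq))
    return seq[shift:] + seq[:shift]

rng = random.Random(0)  # seeded so the example is reproducible
seq = "ACGTTGCAAC"
augmented = [
    random_deletion(seq, 3, rng),
    random_insertion(seq, 3, rng),
    random_inversion(seq, 4, rng),
    random_translocation(seq, rng),
]
print(augmented)
```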





□ ProteinSGM: Score-based generative modeling for de novo protein design

>> https://www.nature.com/articles/s43588-023-00440-3

ProteinSGM, a continuous-time score-based generative model that generates high-quality de novo proteins. ProteinSGM learns to generate four matrices that fully describe a protein's backbone, which are used as smoothed harmonic constraints in the Rosetta minimization protocol.

ProteinSGM generates variable-length structures with a mean < -3.9 REU per residue, indicative of native-like structures. It provides an alternative approach that uses MinMover for backbone minimization, and ProteinMPNN and OmegaFold for sequence design and structure prediction.





□ CEBRA: Learnable latent embeddings for joint behavioural and neural analysis

>> https://www.nature.com/articles/s41586-023-06031-6

CEBRA is a nonlinear dimensionality reduction method newly developed to explicitly leverage auxiliary (behaviour) labels and/or time to discover latent features in time series data—in this case, latent neural embeddings.

CEBRA can be used for supervised and self-supervised analysis, thereby directly facilitating hypothesis- and discovery-driven science. It produces consistent embeddings across subjects and can find the dimensionality of neural spaces that are topologically robust.





□ The categorical basis of dynamical entropy

>> https://arxiv.org/abs/2301.09205

The focus of topological dynamical systems theory is to derive properties of the system. The objects usually under consideration are invariant behavior such as attractors, invariant sets and omega-limit sets, and asymptotic properties such as invariant measures and entropy.

This work takes a category-theoretic view of topological dynamical entropy, which reveals that the common limit is a consequence of the structural assumptions on these notions. One of the key tools developed is that of a qualifying pair of functors, which ensures a limit-preserving property.

The diameter and Lebesgue number of open covers of a compact space form a qualifying pair of functors. The various notions of complexity are expressed as functors, and natural transformations between these functors lead to their joint convergence to the common limit.





□ A draft human pangenome reference

>> https://www.nature.com/articles/s41586-023-05896-x

Flagger detects different types of misassemblies within a phased diploid assembly. The pipeline works by mapping the HiFi reads to the combined maternal and paternal assembly in a haplotype-aware manner.

Flagger identifies coverage inconsistencies within these read mappings. Coverage is calculated across the genome and a mixture model is fit to account for reliably assembled haploid sequence and various classes of unreliably assembled sequence.





□ Squigulator: simulation of nanopore sequencing signal data with tunable noise parameters

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539953v1

Squigulator generates simulated nanopore signal data based on an input reference genome or transcriptome sequence, or directly from a set of basecalled reads.

Squigulator uses an idealised 'pore model' that specifies the predicted current signal reading associated with every possible DNA or RNA k-mer, as appropriate to the specific nanopore protocol being emulated.

Squigulator generates sequential signal values corresponding to sequential k-mers in the provided reference sequence. Squigulator transforms the data using Gaussian noise functions in both the time and amplitude domains to produce realistic, rather than ideal, signal reads.
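
The two-step idea (look up an ideal current level per k-mer, then add Gaussian noise in time and amplitude) can be sketched as below. Everything numeric here is a made-up assumption: the toy 3-mer pore model, the dwell time, and the noise parameters are not Squigulator's, and `simulate_signal` is a hypothetical helper.

```python
import random
from itertools import product

# Toy pore model: one made-up current level per 3-mer
PORE_MODEL = {"".join(kmer): 60.0 + 2.5 * i
              for i, kmer in enumerate(product("ACGT", repeat=3))}

def simulate_signal(sequence, pore_model, k, mean_dwell=10.0, dwell_sd=2.0,
                    amp_sd=1.5, seed=0):
    """Emit noisy current samples for each successive k-mer of `sequence`."""
    rng = random.Random(seed)
    signal = []
    for i in range(len(sequence) - k + 1):
        level = pore_model[sequence[i:i + k]]
        # Time-domain noise: number of samples spent on this k-mer
        dwell = max(1, round(rng.gauss(mean_dwell, dwell_sd)))
        # Amplitude-domain noise: each sample scatters around the ideal level
        signal.extend(rng.gauss(level, amp_sd) for _ in range(dwell))
    return signal

noisy = simulate_signal("ACGTAC", PORE_MODEL, k=3)
ideal = simulate_signal("ACGTAC", PORE_MODEL, k=3, dwell_sd=0.0, amp_sd=0.0)
print(len(noisy), len(ideal))
```

Setting both standard deviations to zero recovers the ideal step-shaped signal, which is a convenient sanity check when tuning noise parameters.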





□ Ariadne: Synthetic Long Read Deconvolution Using Assembly Graphs

>> https://www.biorxiv.org/content/10.1101/2021.05.09.443255v3

Ariadne, a novel assembly graph-based SLR deconvolution algorithm, that can be used to extract single-species read-clouds from SLR datasets to improve the taxonomic classification and de novo assembly of complex populations, such as metagenomes.

Ariadne leverages the linkage information encoded in the full de Bruijn-based assembly graph generated by a de novo assembly tool such as cloudSPAdes to generate up to 37.5-fold more read clouds containing only reads from a single fragment.





□ Merizo: a rapid and accurate domain segmentation method using invariant point attention

>> https://www.biorxiv.org/content/10.1101/2023.02.19.529114v2

Network inputs to the IPA encoder are the single and pairwise representations and backbone frames in the style of AlphaFold2. The IPA encoder comprises six weight-shared blocks, each containing a single IPA block with RoPE positional encoding, and a bi-GRU transition block.

In the masked transformer decoder, learnable domain mask embeddings are concatenated to the single representation and passed through a 10-layer MHA stack with ALiBi positional encoding.

The predicted domain mask tensor is split according to the predicted domain and is passed through a two-layer biGRU, followed by projection into one dimension to produce a single pIoU value for each domain. Here, ndom denotes the number of predicted domains.





□ Evolutionary graph theory on rugged fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539435v1

A unifying theory of how heterogeneous structure shapes evolutionary dynamics. Even a simple extension to a two-mutational landscape can exhibit evolutionary dynamics not observed in deme-based models and that cannot be predicted using single-mutation results.

This model can be applied to understand the evolutionary trajectory of cellular systems with complex architectures. Heterogeneous structure can affect fitness landscape crossing by allowing intermediate mutants to persist for longer, until the final beneficial mutation occurs.





□ The Compositional Structure of Bayesian Inference

>> https://arxiv.org/abs/2305.06112

A compositional Bayesian inversion of Markov kernels in isolation, using a suitable axiomatisation of a category of Markov kernels. It builds categories whose morphisms are pairs of a Markov kernel and an associated 'Bayesian inverter', which is itself built compositionally.

Symmetric monoidal categories with compatible families of copy and delete morphisms have been identified as an expressive language for synthetically representing concepts from probability theory.

A categorical translation of Bayes allows for a general definition of a Bayesian inverse to a morphism in a Markov category. The category of Bayesian lenses is constructed as a fibred category that is closely related to the families fibration, in the semantics of dependent types.





□ CoCoNat: a novel method based on deep-learning for coiled-coil prediction

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539816v1

CoCoNat encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement.

CoCoNat makes use of residue embeddings obtained with large-scale protein Language Models (pLMs) to represent proteins in training and testing sets. CoCoNat takes as input a 15-residue-long sliding window, where each residue is represented by a 2304-feature vector.





□ snATAK: Assessing the multimodal tradeoff

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v2

snATAK incorporates kallisto and other tools in a workflow that facilitates the preprocessing of snATAC-seq data from numerous technologies in minimal computing environments. snATAK can be used for allele-specific analysis of multimodal data, even in the absence of genotype data.

snATAK first maps reads to a reference genome using Minimap2. snATAK identifies putative open chromatin regions with Genrich. A kallisto pseudoalignment index is made and reads are remapped using kallisto. The snATAK output is compatible with Signac and ArchR.





□ GenPhys: From Physical Processes to Generative Models

>> https://arxiv.org/abs/2304.02637

GenPhys (Generative Models from Physical Processes), a framework that can convert physical partial differential equations (PDEs) to generative models. Diffusion models and Poisson flow generative models leverage the diffusion equation and the Poisson equation.

There exist non-s-generative models that can also provide useful generative modeling, such as the case in quantum machine learning with dynamics based on the Schrödinger equation and quantum circuits.





□ Learning Decision Trees with Gradient Descent

>> https://arxiv.org/abs/2305.03515

Gradient-based decision trees (GDTs), a novel approach for learning hard, axis-aligned Decision Trees (DTs) with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation to jointly optimize all tree parameters.

GDTs are less prone to overfitting. GDT optimizes the gradient descent algorithm by exploiting common stochastic gradient descent techniques, including mini-batch calculation and momentum using the Adam optimizer with weight averaging.





□ LatentDiff: A Latent Diffusion Model for Protein Structure Generation

>> https://arxiv.org/abs/2305.04120

LatentDiff generates a novel protein backbone structure. It first samples multivariate Gaussian noise and uses the learned latent diffusion model to generate 3D positions and node embeddings in the latent space.

LatentDiff uses a pre-trained equivariant 3D autoencoder to transform protein backbones into a more compact latent space, and models the latent distribution with an equivariant latent diffusion model.





□ Sequence UNET: High-throughput deep learning variant effect prediction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02948-3

Sequence UNET is trained to directly predict variant frequency or to classify low frequency variants, as a proxy for deleteriousness, and then fine-tuned for pathogenicity prediction.

Sequence UNET uses a fully convolutional architecture. Convolutional kernels also naturally integrate information from nearby amino acids. The model outputs a matrix of per position features and can therefore be trained to predict various positional properties.





□ aaHash: recursive amino acid sequence hashing

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539909v1

aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.

aaHash builds on ntHash, a rolling hash algorithm for DNA/RNA sequences, and adapts it for amino acid sequences. aaHash also supports using different levels of hashes together to create a multi-level pattern, mimicking the functionality of spaced seeds.
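
The rolling (recursive) idea inherited from ntHash can be sketched as below: each residue has a 64-bit seed, a k-mer hash is the XOR of rotated seeds, and the hash of the next k-mer is derived from the previous one in constant time. The seed values here are random placeholders, not the published aaHash seeds, and the multi-level hashing is omitted.

```python
import random

MASK = (1 << 64) - 1
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
_seed_rng = random.Random(42)
SEED = {aa: _seed_rng.getrandbits(64) for aa in AMINO_ACIDS}  # placeholder seeds

def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK

def hash_kmer(kmer):
    """Direct hash: XOR of each residue seed rotated by its distance from the end."""
    h = 0
    for i, aa in enumerate(kmer):
        h ^= rol(SEED[aa], len(kmer) - 1 - i)
    return h

def rolling_hashes(seq, k):
    """Yield the hash of every k-mer, updating in O(1) per step."""
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        # Rotate the old hash, remove the outgoing residue, add the incoming one
        h = rol(h, 1) ^ rol(SEED[seq[i - k]], k) ^ SEED[seq[i]]
        yield h

protein = "MKTAYIAKQRQISFVKSHFSRQ"
rolled = list(rolling_hashes(protein, 5))
direct = [hash_kmer(protein[i:i + 5]) for i in range(len(protein) - 5 + 1)]
print(rolled == direct)
```

The speed advantage over generic string hashing comes from that constant-time update, since adjacent k-mers share all but one residue.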





□ BGWAS: Bayesian variable selection in linear mixed models with nonlocal priors for genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05316-x

BGWAS uses a novel nonlocal prior for linear mixed models (LMMs). The screening step fits as many LMMs as the number of SNPs using a mixture of a Dirac delta at zero and a nonlocal prior, and estimates the probability of the Dirac delta component.

BGWAS uses a pMOM nonlocal prior for LMMs that uses the full Fisher information matrix. BGWAS either uses complete enumeration or searches the model space with a genetic algorithm.





□ AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad310/7160912

AIONER, a new NER tagger that takes full advantage of various existing datasets for recognizing multiple entities simultaneously, despite their inherent differences in scope and quality, through a novel all-in-one (AIO) scheme.

The AIO scheme utilizes a small dataset recently annotated with multiple Entity types as a bridge to integrate multiple datasets annotated with a subset of entity types, thereby recognizing multiple entities at once, resulting in improved accuracy and robustness.





□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad311/7160911

The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries built on musl are available for broad compatibility.

Phasius visualizes the results of read phasing, showing in a dynamic genome-browser style the length of and interruptions between contiguously phased blocks from a large number of individuals, together with genome annotation such as segmental duplications.





□ copMEM2: Robust and scalable maximum exact match finding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad313/7160910

copMEM2 is a multi-threaded MEM-finding tool that targets execution speed and reduced memory use, and incorporates an improvement that speeds up processing by orders of magnitude when the pair of genomes is highly similar.

copMEM2 computes all MEMs of minimum length 50 between the human and mouse genomes in 59s, using 10.40 GB of RAM and 12 threads, at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the results are 324s and 16.57 GB.
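
For readers unfamiliar with the term, a maximal exact match (MEM) is an exact match between two sequences that cannot be extended to the left or to the right. The brute-force sketch below only defines the output; it is quadratic and is not how copMEM2 works. The function name is hypothetical.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (pos_in_a, pos_in_b, length) for every MEM of at least min_len."""
    mems = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as possible
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems

mems = maximal_exact_matches("ACGTACGT", "TTACGTAA", min_len=4)
print(mems)
```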





□ Integration of a multi-omics stem cell differentiation dataset using a dynamical model

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010744

A hierarchical dynamical model allowed the authors to integrate all data sets. This model was able to explain mRNA-protein discordance for most genes and identified instances of potential microRNA-mediated regulation.

Overexpression or depletion of microRNAs identified by the model, followed by RNA sequencing and protein quantification, was used to follow up on the predictions of the model.





□ Improving variant calling using population data and deep learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05294-0

Population-aware DeepVariant models with a new channel encoding allele frequencies. These models reduce variant calling errors, improving both precision and recall in single samples, and reduce rare homozygous and pathogenic ClinVar calls cohort-wide.

The relative advantage of the population-aware models increases at lower coverage, suggesting that population information is most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling.





□ DeSide: A unified deep learning approach for cellular decomposition of bulk tumors based on limited scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540466v1

The DeSide architecture considers only non-cancerous cells during the training process, indirectly calculating the proportion of cancerous cells.

DeSide avoids directly handling the often more variable heterogeneity of cancerous cells, and instead leverages scRNA-seq data from three different cancer types to empower the DNN model with a robust generalization capability across diverse cancers.





□ A Superior Thumb Drive: Optimizing DNA Stability for DNA Data Storage

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540302v1

While methods to achieve DNA stability for hundreds of years or even millennia are possible, they call for completely enclosing DNA inside a silica matrix.

For instance, for an Archival Storage system whose DNA is enclosed in silica, the probability of strand loss or breakage is much lower, thereby enabling the use of longer DNA strands and higher information densities.

Conversely, for Working or Short-Term Storage systems, shorter strand lengths and lower information density requirements would be more appropriate due to the higher likelihood of strand loss.




Tranquility.

2023-05-15 05:13:05 | Science News

(Art by ekaitza)




□ scSpace: Reconstruction of the cell pseudo-space from single-cell RNA sequencing data

>> https://www.nature.com/articles/s41467-023-38121-4

scSpace (single-cell spatial position associated co-embeddings), an integrative method that uses ST data as a spatial reference to reconstruct the pseudo-space. A space-informed clustering is conducted to identify spatially variable cell subpopulations within the scRNA-seq data.

scSpace uses transfer component analysis (TCA), which eliminates the batch effect between single-cell and ST data and extracts the shared latent features. TCA projects the scRNA-seq and spatial transcriptomics data into a Reproducing Kernel Hilbert Space.





□ DEGAP: Dynamic Elongation of a Genome Assembly Path

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538224v1

DEGAP (Dynamic Elongation of a Genome Assembly Path), a novel gap-filling software that can resolve gap regions in genomes. DEGAP optimizes HiFi reads by identifying the differences between reads and provides ‘GapFiller’ or ‘CtgLinker’ modes to eliminate or shorten gaps in genomes.

DEGAP elongates all contigs with supplied HiFi data and assesses potentially neighboring contigs. DEGAP adopts a cyclic elongation strategy that automatically and dynamically adjusts parameters according to the complexity of the sequences and selects the optimal extension path.





□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538975v1

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) learns latent factors that disentangle condition effects from batch effects, enabling it to simultaneously perform: batch effect removal, CKG detection, and perturbation prediction.

The disentangled latent space allows scDisInFact to perform the CKG detection and perturbation prediction, and to overcome the limitation of existing methods for each task. scDisInFact can remove batch effect while keeping the condition effect in gene expression data.





□ scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics

>> https://www.nature.com/articles/s41587-023-01772-1

The scDesign3 model is flexible enough to incorporate cell covariates (such as cell type, pseudotime, and spatial coordinates) via the use of generalized additive models, making the scDesign3 model fit well to various single-cell and spatial omics data, a property confirmed by scDesign3's realistic simulation.

scDesign3 has a model alteration functionality enabled by its transparent probabilistic modeling: given the scDesign3 model parameters estimated on real data, users can alter the model parameters to reflect a hypothesis and generate the corresponding synthetic data that bear real data characteristics.





□ CellTypist v2.0: Automatic cell type harmonization and integration across Human Cell Atlas datasets

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538994v1

CellTypist v2.0 accurately quantifies cell-cell transcriptomic similarities and enables robust and efficient cross-dataset meta-analyses. Cell types are placed into a relationship graph that hierarchically defines shared and novel cell subtypes.

CellTypist uses PCT, a multi-target regression tree algorithm. CellTypist defines semantic relationships among cell types / captures their underlying hierarchies, which are further leveraged to guide the downstream data integration at different levels of annotation granularities.





□ GATE: Moving Fast With Broken Data

>> https://arxiv.org/pdf/2303.06094.pdf

GATE, the Partition Summarization (PS) approach to data validation. The method creates a vector of statistics for each time step and performs a k-nearest neighbor algorithm against historical vectors to label the current time step's vector as anomalous or acceptable.
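
The summarize-then-compare idea can be sketched as follows. The chosen statistics, the distance metric, `k`, and the alert threshold are all assumptions for illustration; the sketch is not GATE itself and leaves out its feature-clustering component.

```python
import math
import statistics

def summarize(values):
    """Summary vector for one partition: (missing fraction, mean, std of present values)."""
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values)
    return (missing_fraction, statistics.fmean(present), statistics.pstdev(present))

def knn_score(vector, history, k=3):
    """Mean Euclidean distance to the k nearest historical summary vectors."""
    distances = sorted(math.dist(vector, h) for h in history)
    return sum(distances[:k]) / k

def is_anomalous(vector, history, k=3, threshold=1.0):
    return knn_score(vector, history, k) > threshold

# Made-up history: one summary vector per past time step
history = [summarize(day) for day in
           ([1, 2, 3, 4], [1, 2, 3, 5], [2, 2, 3, 4], [1, 2, 4, 4])]

ok_today = summarize([1, 2, 3, 4])
broken_today = summarize([100, None, None, 300])  # missing values and a unit change
print(is_anomalous(ok_today, history), is_anomalous(broken_today, history))
```

In practice the statistics would be scaled before computing distances so that no single one dominates; that step is skipped here for brevity.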

GATE significantly outperforms other methods in terms of mitigating false positives when ML pipelines have many correlated features because of GATE's clustering component, which only triggers an alert when an entire group of correlated features is anomalous.





□ ATOMRefine: Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad298/7152976

ATOMRefine, a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses a SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph.

ATOMRefine enables the network to leverage sequence-based and spatial information from the entire protein structures to update node and edge features and catch the global and local structural variation from the initial model to the native structure iteratively.





□ Restrander: rapid orientation and QC of long-read cDNA data

>> https://www.biorxiv.org/content/10.1101/2023.05.02.539165v1

Restrander was faster than Oxford Nanopore Technologies’ existing tool Pychopper, and correctly restranded more reads due to its strategy of searching for polyA/T tails in addition to primer sequences from the reverse transcription and template-switch steps.

Each read from the reverse strand is replaced with its reverse complement, ensuring all reads in the output have the same orientation as the original transcripts. Restrander classifies artefactual reads for QC and ensures only high-quality reads are taken for downstream processing.
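
The polyA/T part of that strategy can be sketched as below. It is a simplified illustration and not Restrander's code: the tail length, the search window, and the `restrand` helper are assumptions, and the primer search that Restrander also performs is left out.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def restrand(read, tail_length=12, window=50):
    """Return (read in transcript orientation, inferred strand)."""
    # A polyA tail near the 3' end suggests the read is already forward
    if "A" * tail_length in read[-window:]:
        return read, "+"
    # A polyT run near the 5' end suggests a reverse-strand read
    if "T" * tail_length in read[:window]:
        return reverse_complement(read), "-"
    return read, "?"  # unknown orientation, left unchanged

forward_read = "ACGTGCA" + "A" * 15
reverse_read = reverse_complement(forward_read)
print(restrand(forward_read), restrand(reverse_read), restrand("ACGTACGT"))
```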





□ ROptimus: a parallel general-purpose adaptive optimisation engine

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad292/7152277

ROptimus, a general-purpose optimisation engine in R that can be plugged into any modelling initiative, simple or complex, through a few lucid interfacing functions, to perform a seamless optimisation with rigorous parameter sampling.

ROptimus features simulated annealing and replica exchange implementations equipped with adaptive thermoregulation to drive the Monte Carlo optimisation process in a flexible manner, through constrained acceptance frequency but unconstrained adaptive pseudo temperature regimens.





□ Unifilar Machines and the Adjoint Structure of Bayesian Models

>> https://arxiv.org/abs/2305.02826

There is an adjunction between ‘dynamical’ and ‘epistemic’ models of a hidden Markov process. Concepts such as Bayesian filtering and conjugate priors arise as natural consequences of this adjunction.

Strongly representable Markov categories include BorelStoch (whose objects are standard Borel spaces and whose morphisms are Markov kernels) and the Kleisli category of the (real-valued) distribution monad, which is called Dist.

Unifilar machines are machines whose outputs are stochastic but whose state updates are deterministic. The state space of the epistemic model consists of probability distributions over the hidden states of the system, and its dynamics are given by Bayesian updating.
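
For a finite hidden Markov model, "dynamics given by Bayesian updating" amounts to the familiar forward-filter step: the belief over hidden states is a deterministic function of the previous belief and the observed output. The sketch below assumes a transition-then-emission convention and dictionary-based probability tables; both are choices made here for illustration.

```python
def bayes_filter_step(belief, observation, trans_p, emit_p):
    """Deterministically update a belief over hidden states after one output."""
    states = list(belief)
    # Predict: push the belief through the hidden-state transition
    predicted = {t: sum(belief[s] * trans_p[s][t] for s in states) for t in states}
    # Update: condition on the observed output and renormalise
    unnormalised = {t: predicted[t] * emit_p[t][observation] for t in states}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {t: w / total for t, w in unnormalised.items()}

# Toy example with a static hidden state, so only the conditioning step matters
belief = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 1.0, "B": 0.0}, "B": {"A": 0.0, "B": 1.0}}
EMIT = {"A": {"x": 0.8, "y": 0.2}, "B": {"x": 0.2, "y": 0.8}}

posterior = bayes_filter_step(belief, "x", TRANS, EMIT)
print(posterior)
```

The outputs remain random, but given an output the next belief is fixed, which is the unifilar property described above.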




□ StarCoder: A State-of-the-Art LLM for Code

>> https://huggingface.co/blog/starcoder

15B LLM with 8k context
Trained on permissively-licensed code
Acts as tech assistant
80+ programming languages
Open source and data
Online demos
VSCode plugin
1 trillion tokens





□ A Bayesian Noisy Logic Model for Inference of Transcription Factor Activity from Single Cell and Bulk Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539308v1

NLBayes: A noisy Boolean logic Bayesian model for TF activity inference from differential gene expression data and causal graphs. This approach provides a flexible framework to incorporate biologically motivated TF-gene regulation logic models.

NLBayes incorporates the prior information on causal regulatory interactions and makes posterior adjustments to further account for noise and determine the context-specific posterior network structure and active regulators through a Gibbs sampling procedure.





□ Dawnn: single-cell differential abundance with neural networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539427v1

Dawnn uses a deep neural network model that has been trained to estimate the relative abundance of cells from each sample or condition in a cell’s neighbourhood. Dawnn predicts the probability with which each cell was drawn from a given sample or condition using simulated datasets.

Dawnn controls the false discovery rate (FDR), the proportion of cells incorrectly classified as belonging to regions exhibiting DA, using the Benjamini-Yekutieli procedure, a variant of the Benjamini-Hochberg procedure that does not assume independence between hypotheses.
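
The Benjamini-Yekutieli procedure itself is short enough to write out. It is the Benjamini-Hochberg step-up rule with the threshold shrunk by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m. The function below is a generic sketch, not Dawnn's code, and the p-values are made up.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # correction for arbitrary dependence
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank whose p-value is under its threshold
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.5, 0.004, 0.9]
rejected = benjamini_yekutieli(p_values, alpha=0.05)
print(rejected)
```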





□ Ribotin: rDNA consensus sequence builder

>> https://github.com/maickrau/ribotin

Ribotin takes HiFi or duplex reads, and optionally ultralong ONT reads, as input. It extracts rDNA-specific reads based on k-mer matches to a reference rDNA sequence or based on a verkko assembly.

Ribotin builds a DBG out of them, extracts the most covered path as a consensus and bubbles as variants. Optionally assembles highly abundant rDNA morphs using the ultralong ONT reads.





□ Aggregating network inferences: towards useful networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539529v1

The authors suggest combining edge frequencies directly to reconstruct the network. This approach ensures that only robust and reproducible edges are included in the consensus network.

The first consensus step relies on selecting edges w/ high inclusion frequency in the networks reconstructed from resampled data. The 2nd aggregation step is the inference of a consensus network that takes each method's advantages into account and counterbalances each estimation's shortcomings.
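
The first consensus step (keeping edges with high inclusion frequency across networks inferred from resampled data) can be sketched as below. The function name, the undirected-edge convention and the 0.8 threshold are assumptions for illustration, not values taken from the paper.

```python
from collections import Counter


def consensus_edges(networks, min_frequency=0.8):
    """Keep edges that appear in at least `min_frequency` of the networks.

    `networks` is a list of edge collections, one per resampled dataset.
    Edges are treated as undirected here (an assumption of this sketch).
    """
    counts = Counter()
    for edges in networks:
        # Normalise (a, b) and (b, a) to one key and count an edge at most
        # once per network.
        counts.update({tuple(sorted(edge)) for edge in edges})
    n = len(networks)
    return {edge: c / n for edge, c in counts.items() if c / n >= min_frequency}


resampled = [
    {("A", "B"), ("B", "C")},
    {("B", "A")},
    {("A", "B"), ("C", "D")},
]
stable = consensus_edges(resampled, min_frequency=0.6)
```

Only the A-B edge survives in this toy input because it is the only one recovered in every resample.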





□ Foldseek: Fast and accurate protein structure search

>> https://www.nature.com/articles/s41587-023-01773-0

Foldseek discretizes the query structures into sequences over the 3Di alphabet and uses a pre-trained 3Di substitution matrix to search through the 3Di sequences of the target structures using the double-diagonal k-mer-based prefilter and gapless alignment prefilter modules.

Foldseek uses vectorized Smith–Waterman local alignment combining 3Di and amino acid substitution scores. Alternatively, a global alignment is computed with a 1.7-times accelerated TM-align.





□ ProteinGenerator: Joint Generation of Protein Sequence and Structure with RoseTTAFold Sequence Space Diffusion

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539766v1

Beginning from random amino acid sequences, ProteinGenerator generates sequence and structure pairs by iterative denoising, guided by any desired sequence and structural protein attributes.

ProteinGenerator readily generates sequence-structure pairs satisfying the input conditioning criteria, and experimental validation showed that the designs were monomeric by size exclusion chromatography and had the desired secondary structure content by circular dichroism.





□ Improving de novo protein binder design with deep learning

>> https://www.nature.com/articles/s41467-023-38328-5

The physically based Rosetta approach frames both the folding and binding problems in energetic terms; for the approach to succeed, the designed sequence must have as its lowest energy state in isolation the designed monomer structure.

ProteinMPNN, a novel deep learning-augmented de novo protein binder design protocol. It shows retrospectively and prospectively that this improved protocol has nearly 10-fold higher success rate than the original energy-based method.





□ HMMerge: an ensemble method for multiple sequence alignment

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad052/7126611

HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble.

HMMerge builds a new ‘merged’ HMM from the ensemble, and then uses that merged HMM to align the query sequences. The authors show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments.





□ Correcting gradient-based interpretations of deep neural networks for genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02956-3

Even though DNNs can learn a function everywhere in Euclidean space, one-hot encoded DNA is a categorical variable that lives on a lower-dimensional simplex.

Random off-simplex function behavior can introduce a random gradient component orthogonal to the simplex, which manifests as spurious noise in the input gradients.

This proposed gradient correction (subtracting from the gradient components at each position their mean across components) is general for all data with categorical inputs, including DNA, RNA, and protein sequences.
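
The correction described above is a per-position mean subtraction. Here is a minimal sketch, assuming the gradients are given as a list of positions with one value per category (e.g. A, C, G, T); the function name and input layout are illustrative choices.

```python
def correct_gradients(grads):
    """Subtract, at every position, the mean gradient across categories.

    `grads` is a list of positions; each position is a list with one
    gradient value per category (4 for one-hot DNA). The layout is an
    assumption of this sketch.
    """
    corrected = []
    for position in grads:
        mean = sum(position) / len(position)
        # Removing the mean removes the component orthogonal to the simplex,
        # so the corrected values at each position sum to zero.
        corrected.append([g - mean for g in position])
    return corrected


raw = [[1.0, 2.0, 3.0, 6.0], [0.5, 0.5, 0.5, 0.5]]
fixed = correct_gradients(raw)
```

A position whose gradient is identical for all categories (the second one here) carries no information about which nucleotide matters, and the correction maps it to all zeros.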





□ GKLOMLI: a link prediction model for inferring miRNA–lncRNA interactions by using Gaussian kernel-based method on network profile and linear optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05309-w

GKLOMLI, a novel link prediction model based on Gaussian kernel-based method and linear optimization algorithm for inferring miRNA–lncRNA interactions. The Gaussian kernel-based method was employed to output two similarity matrices of miRNAs and lncRNAs.

Based on the integrated matrix combined with similarity matrices and the observed interaction network, a linear optimization-based link prediction model was trained for inferring miRNA–lncRNA interactions.
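
One common way to build a Gaussian kernel similarity from network profiles is the Gaussian interaction profile kernel, sketched below. Whether GKLOMLI uses exactly this bandwidth normalisation is an assumption; the function name and the toy profiles are hypothetical.

```python
import math


def gaussian_profile_similarity(profiles):
    """Gaussian kernel similarity between binary interaction profiles.

    `profiles[i]` is the 0/1 interaction row of entity i (e.g. one miRNA
    against all lncRNAs). The bandwidth gamma = 1 / mean squared profile
    norm is a common convention and an assumption of this sketch.
    """
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq = sum(sq_norms) / n
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    return [
        [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


mirna_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
mirna_similarity = gaussian_profile_similarity(mirna_profiles)
```

Running the same function on the transposed interaction matrix would give the second (lncRNA) similarity matrix.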





□ Estimating the mean in the space of ranked phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539790v1

A simulation study validates the method and compares it to other tree summary approaches such as the Maximum Clade Credibility (MCC) method. They assess suitability of a treespace for statistical analyses, e.g. its "smoothness" w/ respect to probability distributions over trees.

The RNNI space is a treespace of ranked phylogenetic trees, which are rooted binary trees where internal nodes are ordered according to times of the corresponding evolutionary events, assuming no co-occurrence.

The RNNI space is then defined as a graph where vertices are ranked trees and edges represent either a rank or an NNI move that transforms one tree into another.

The CENTROID algorithm minimizes the sum of squared (SoS) distances b/n a summary tree and a given tree sample and stops when it finds a locally optimal tree, approximating a centroid tree. The algorithm proceeds iteratively by computing the SoS values for all neighbors.





□ Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05304-1

A Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation.

A novel model selection procedure inspired by cross-validation to determine the number of signatures. It uses the Kullback–Leibler divergence which would favor the Poisson model. This means that a direct comparison b/n the cost values for Po-NMF / NBN-NMF is not feasible.





□ STAGEs: A web-based tool that integrates data visualization and pathway enrichment analysis for gene expression studies

>> https://www.nature.com/articles/s41598-023-34163-2

STAGEs (Static and Temporal Analysis of Gene Expression studies) is a web-based and high-throughput analysis pipeline with an intuitive user interface that allows systematic characterisation of static and temporal transcriptomic data.

STAGEs converts the ratio values to log2-transformed fold change values at the backend, and the correlation matrix is generated by performing pairwise correlations of the log2-transformed fold changes between the different experimental conditions.
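
The two steps above (ratio to log2 fold change, then pairwise correlation between conditions) can be sketched as follows. The use of Pearson correlation, the function names and the toy ratios are assumptions; STAGEs' own implementation may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(ratio_table):
    """Map {condition: [ratio per gene]} to {(cond_a, cond_b): correlation}."""
    # Step 1: ratios to log2 fold changes (a ratio of 2 becomes +1, 0.5 becomes -1).
    log2fc = {cond: [math.log2(r) for r in ratios] for cond, ratios in ratio_table.items()}
    # Step 2: pairwise correlations between conditions over the same genes.
    conditions = list(log2fc)
    return {(a, b): pearson(log2fc[a], log2fc[b]) for a in conditions for b in conditions}


ratios = {"day1": [2, 4, 8], "day3": [4, 16, 64], "day7": [8, 4, 2]}
corr = correlation_matrix(ratios)
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which is why the transformation comes before the correlation step.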





□ Insights from a genome-wide truth set of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539588v1

By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample.

This approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation.

The Synthetic Diploid (SynDip) Benchmark provides genotypes for 5,182,765 SNV, insertion and deletion variants, as well as a set of high-confidence regions spanning 2.71 gigabases where genotypes are highly accurate.





□ Butt-seq: a new method for facile profiling of transcription

>> https://genesdev.cshlp.org/content/early/2023/05/10/gad.350434.123.abstract

Butt-seq (bulk analysis of nascent transcript termini sequencing), which can produce libraries from purified nascent RNA in 6 h and from as few as 10,000 cells—an improvement of at least 10-fold over existing techniques.

Butt-seq shows that inhibition of the superelongation complex (SEC) causes promoter-proximal pausing to move upstream in a fashion correlated with subnucleosomal fragments.





□ NGBO: Introducing -omics metadata to biobanking ontology

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539725v1

NGBO is based on available genomics standards (e.g., Minimum information about a microarray experiment (MIAME)), the College of American Pathologists (CAP) laboratory accreditation requirements, and the Open Biological and Biomedical Ontologies Foundry principles.

NGBO fills the need for semantically enabling the discovery and integration of omics datasets and realization of FAIR data representation, which will impact the efficiency of finding, integrating, and re-using biobanking data of interest.





□ Robust discovery of causal gene networks via measurement error estimation and correction

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540002v1

A new framework for causal discovery that is robust against measurement noise by extending an established statistical approach CIT (Causal Inference Test).

RCD (Robust Causal Discovery) estimates measurement error from gene expression data and then incorporates it to get consistent parameter estimates that can be used with appropriately extended versions of the statistical tests of correlation or mediation done in the original CIT.





□ Simple Tidy GeneCoEx: A gene co-expression analysis workflow powered by tidyverse and graph-based clustering in R

>> https://acsess.onlinelibrary.wiley.com/doi/10.1002/tpg2.20323

Simple Tidy GeneCoEx detects co-expression modules enriched in specific cell types, which were used to discover candidate genes in a biosynthetic pathway for complex plant natural products.

Simple Tidy GeneCoEx detects modules that are, on average, equivalently tight or tighter than those detected by WGCNA. A potential reason for the differences in module tightness is the choice of module detection method.

By default, WGCNA uses hierarchical clustering followed by tree cutting to detect modules. Simple Tidy GeneCoEx uses the Leiden algorithm to detect modules, which returns modules that are highly interconnected.





□ Fulgor: A fast and compact k-mer index for large-scale matching and color queries

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539895v1

Fulgor is a colored compacted de Bruijn graph index for large-scale matching and color queries, powered by SSHash. Fulgor has a generic intersection algorithm that can work over any compressed color sets, provided that an iterator over each color supports two primitives - Next and NextGEQ(x).
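
To make the two primitives concrete, here is a minimal sketch of a generic sorted-set intersection driven only by Next and NextGEQ(x). The class, the plain sorted Python lists standing in for compressed color sets, and the `END` sentinel are all assumptions of this illustration, not Fulgor's actual data structures.

```python
import bisect

END = float("inf")  # sentinel value reported by an exhausted iterator


class ColorSetIterator:
    """Iterator over a sorted list of color ids (stand-in for a compressed set)."""

    def __init__(self, sorted_colors):
        self._colors = sorted_colors
        self._pos = 0

    def value(self):
        return self._colors[self._pos] if self._pos < len(self._colors) else END

    def next(self):
        """Advance by one element and return the new current value."""
        self._pos += 1
        return self.value()

    def next_geq(self, x):
        """Advance to the first element >= x (never moving backwards)."""
        self._pos = bisect.bisect_left(self._colors, x, lo=self._pos)
        return self.value()


def intersect(iterators):
    """Intersect any number of sorted sets using only next / next_geq."""
    if not iterators:
        return []
    result = []
    candidate = iterators[0].value()
    while candidate != END:
        agreed = True
        for it in iterators[1:]:
            v = it.next_geq(candidate)
            if v != candidate:
                # Mismatch: jump the first iterator forward to the larger value.
                candidate = iterators[0].next_geq(v)
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate = iterators[0].next()
    return result


sets = [[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]]
common = intersect([ColorSetIterator(s) for s in sets])
```

Because the algorithm touches the sets only through these two calls, any compressed encoding that can implement them efficiently can be plugged in without changing the intersection logic.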

Themisto, an index for alignment-free matching that substantially outperforms these prior methods in the context of indexing and mapping against large collections of genomes. Compared to Bifrost, Themisto uses practically the same space, but is faster to build and query.

Compared to the fastest variant of Metagraph, Themisto offers similar query performance, but is much more space-efficient; on the other hand, Themisto is much faster to query than Metagraph-BRWT, the most space-efficient variant of Metagraph.





□ RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad312/7160137

A new method, random projection-based identical-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes.

By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites.





□ CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses

>> https://www.nature.com/articles/s41588-023-01392-0

CARMA, a Bayesian model for fine-mapping that includes flexible specification of the prior distribution of effect sizes, joint modeling of summary statistics and functional annotations and accounting for discrepancies b/n summary statistics and external linkage disequilibrium in meta-analyses.

CARMA has higher power and lower false discovery rate (FDR) when including functional annotations, and higher power, lower FDR and higher coverage for credible sets in meta-analyses.





□ DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540424v1

DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method which directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space.

DeCOIL can be used to generate a designed library for screening based on computational predictors (ZS scores or ML models) at many possible points along the route to engineering a protein. DeCOIL enables protein engineering using ftMLDE with comparable outcomes.





□ moscot: Mapping cells through time and space

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540374v1

moscot supports multimodal data throughout the framework by exploiting joint cellular representations. moscot improves scalability by adapting and demonstrating the applicability of recent methodological innovations to atlas-scale datasets.

moscot unifies previous single-cell applications of OT in the temporal and spatial domain and introduces a novel spatiotemporal application. All of this is achieved with a robust and intuitive API that interacts with the broader scverse ecosystem.







Equanimity.

2023-05-15 05:10:05 | Science News
(Art by ekaitza)






Mark

>> https://www.vastspace.com/roadmap

Very exciting timeline from Haven-1 in 2025 on F9 to 2030 Starship class space station/modules to 100m spinning station in the 2040’s.

Excellent plan and realistic timeline.




□ NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538352v1

NextPolish2 can fix base errors in “highly accurate” draft assemblies without introducing overcorrections, even in regions with highly repetitive elements. Through the built-in phasing module, it can not only correct the error bases, but also maintain the original haplotype consistency.

NextPolish2 follows the Kmer Score Chain (KSC) algorithm of its previous version to perform an initial rough correction, and detect low-quality positions (LQPs) where the chosen alleles account for ≤ 0.95 of the total during a traceback procedure.

NextPolish2 repeats the above procedure until all conflict communities are resolved (the number of iterations can be adjusted according to user settings) and then uses the KSC algorithm to generate a draft consensus sequence.





□ CODEC: Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

>> https://www.nature.com/articles/s41588-023-01376-0

CODEC (Concatenating Original Duplex for Error Correction), a hybrid method that combines the massively parallel nature of NGS and the resolution of single-molecule sequencing by reading both strands of each DNA duplex with single NGS read pairs.

The CODEC structure can be built by replacing a typical adapter duplex with the CODEC adapter quadruplex, containing all elements required for NGS.

CODEC physically concatenates the Watson strand with the reverse complement of the Crick strand into a single strand without forming a prohibitive hairpin or inverted repeat structure from two complementary sequences.





□ TRASH: Tandem Repeat Annotation and Structural Hierarchy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad308/7159186

TRASH (Tandem Repeat Annotation and Structural Hierarchy) is a tool that identifies and maps tandem repeats in nucleotide sequence, without prior knowledge of repeat composition.

TRASH analyses a fasta assembly file, identifies regions occupied by repeats and then precisely maps them and their higher order structures.

TRASH searches for continuous, highly similar, tandemly arranged DNA repeats of a similar unit size. This excludes transposable elements and interspersed repeats from analysis and allows for precise definition of tandemly arranged repeats.





□ GraNA: Supervised biological network alignment with graph neural networks

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538184v1

GraNA, a deep learning framework for the supervised NA paradigm for the pairwise network alignment problem. GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence.

GraNA integrates sequence similarity edges as additional anchor links to guide the alignment and pre-computed network embeddings as node features to better encode the topological roles of network nodes.





□ Riboformer: A Deep Learning Framework for Predicting Context-Dependent Translation Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538053v1

Riboformer uses a transformer architecture that detects long-range dependencies in the regulation of elongation. Riboformer models the context-dependent changes in ribosome dynamics at codon resolution.

The transformer block consists of self-attention layers that gather the impact of distant codons based on their sequence representations, in contrast to convolutional neural networks, which rely on convolution operators to detect local sequence motifs.

Riboformer can be combined with in silico mutagenesis analysis to identify sequence motifs that contribute to ribosome stalling. It also utilizes a reference input to prevent the learning of noninformative signals due to the experimental bias.





□ CellANOVA: Signal recovery in single cell batch integration

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539614v1

CellANOVA utilizes a “pool-of-controls”, applicable across diverse settings, to separate unwanted variation from biological variation. CellANOVA allows the recovery of subtle biological signals and corrects, to a large extent, the data distortion introduced by integration.

A control-pool is a set of samples whereby variation beyond what is preserved by the existing integration is not of interest to the study. The control-pool samples are utilized to estimate a latent linear space that captures cell- and gene-specific unwanted batch variations.

CellANOVA produces a batch corrected GE matrix which can be used for gene-pathway level downstream analyses. By using the control pool in the estimation of the batch variation space, CellANOVA recovers any variation in the non-control samples that lies outside this space.





□ ProteiNN: a Transformer-based model for end-to-end single-sequence protein structure prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538026v1

ProteiNN predicts protein secondary and tertiary structures directly from integer-encoded amino acid sequences. The model was trained and evaluated using the SideChainNet dataset, which provides the basis for complete model training.

The input to the module is a sequence of feature vectors mapped to these component spaces via linear transformations. The multi-head mechanism enables the model to learn relationships between amino acids in parallel.

ProteiNN uses a gating mechanism that modulates the information flow between the input and output, allowing the model to emphasize specific relationships and discard irrelevant information selectively.





□ DeepUMQA3: a web server for model quality assessment of protein complexes

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538194v1

Building on DeepUMQA and DeepUMQA2, new features were designed for complex structures, and the lDDT of each residue and the accuracy of interface residues were predicted using an improved deep neural network.

At the level of the overall complex, the complex is regarded as a large monomer structure. DeepUMQA3 provides fast and accurate interface residue accuracy prediction and per-residue lDDT prediction services for protein complexes.





□ ecpc: an R-package for generic co-data models for high-dimensional prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05289-x

ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data were handled by adaptive discretisation, potentially inefficiently modelling and losing information.

An extension to the method for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation.




□ MaxKAT: A maximum kernel-based association test to detect the pleiotropic genetic effects on multiple phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad291/7146028

MaxKAT reduces computational intensity greatly while maintaining high accuracy. Extensive simulations demonstrate that MaxKAT can properly control type I error rates and obtain remarkably higher power than KAT under most of the considered scenarios.

A generalized extreme value distribution is employed to calculate the statistical significance of MaxKAT under the null hypothesis. In addition, the proposed test can accommodate high-dimensional data and yield high power against various alternative hypotheses.





□ SeqImprove: Machine Learning Assisted Creation of Machine Readable Sequence Information

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538300v1

SeqImprove is designed to aid authors in creating machine readable sequence data with complete metadata. It consists of a user-interface that was built using modular code. It can be reused by others to work as the front-end for their curation software.

As input, SeqImprove takes in a sequence file in the Synthetic Biology Open Language (SBOL) format or a link to a sequence stored in SynBioHub. It makes the information machine readable by using existing ontologies to structure the metadata.





□ CNV-ClinViewer: Enhancing the clinical interpretation of large copy-number variants online

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad290/7146044

CNV-ClinViewer enables real-time interactive exploration of large CNV datasets in a user-friendly designed interface and facilitates semi-automated clinical CNV interpretation following the ACMG guidelines by integrating the ClassifCNV tool.

The CNV-ClinViewer allows analysis of single or multiple CNVs, regardless of the method used to identify them. Minimal required information for each CNV, including whole chromosome trisomies and monosomies, is the chromosome, start, end and CNV type.





□ OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad313/7146343

OrthoVenn3 provides gene family contraction and expansion analysis to help researchers better understand the evolutionary history of gene families, as well as collinearity analysis to detect conserved and variable genomic structures.

OrthoVenn3 offers multiple outputs, including the UpSet table, occurrence table, phylogenetic tree, and collinearity graph, which provide users with various perspectives on their data.





□ ELVAR: Cell-attribute aware community detection improves differential abundance testing from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538653v1

ELVAR uses cell attribute aware clustering when inferring differentially enriched communities within the single-cell manifold. ELVAR can detect disease relevant DA-shifts in other cell-types and biological conditions.

The improved sensitivity to detect DA-shifts, as displayed by ELVAR, was also seen when benchmarked against an analogous clustering-based DA-method that uses Louvain in place of EVA.





□ xQTLbiolinks: a comprehensive and scalable tool for integrative analysis of molecular QTLs

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538654v1

xQTLbiolinks is an end-to-end bioinformatic tool for efficient mining and analyzing public and user-customized xQTLs data for the discovery of disease susceptibility genes.

xQTLbiolinks allows users to conveniently retrieve xQTLs data and metainformation for further analysis through gene names/IDs, tissue names, or genomic regions of interest.





□ Combining LIANA and Tensor-cell2cell to decipher cell-cell communication across multiple samples

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538731v1

Integrating LIANA and Tensor-cell2cell, which combined can deploy multiple existing methods and resources, to enable the robust and flexible identification of cell-cell communication programs across multiple samples.

In this protocol, the integration of the tools facilitates the choice of method to infer cell-cell communication and subsequently perform an unsupervised deconvolution to obtain and summarize biological insights.





□ Signed distance correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad210/7151065

SiDCo is a GUI-platform for calculation of distance correlation in omics data, measuring linear and non-linear dependences between variables, as well as correlation between vectors of different lengths, e.g., different sample sizes.

Distance correlations can be selected as one-to-one / one-to-all correlations, showing relationships b/n each / all other features one at a time. SiDCo uses partial distance correlation, calculated using the Gaussian Graphical model approach adapted to distance covariance.
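
For readers unfamiliar with the statistic, the sample distance correlation of two univariate variables can be sketched as below. This is the standard double-centring definition written in plain Python for clarity; the function names are illustrative and this is not SiDCo's code (the signed and partial variants are not covered).

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centred by row, column and grand mean."""
    n = len(values)
    d = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length samples."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar2_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [-2, -1, 0, 1, 2]
linear = distance_correlation(x, [2 * v + 1 for v in x])
quadratic = distance_correlation(x, [v * v for v in x])
```

The quadratic case is the interesting one: Pearson correlation between `x` and `x**2` is zero on this symmetric sample, while the distance correlation stays clearly positive, which is the non-linear sensitivity the text refers to.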





□ ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05305-0

ERStruct enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, ERStruct achieves significant improvements in the speed of matrix operations for large-scale data.

In GOE.py, a Monte Carlo method is used to obtain the null distribution of the proposed ERStruct test statistic; it starts by generating multiple replications of high-dimensional Gaussian Orthogonal Ensemble matrices.





□ PascalX: a python library for GWAS gene and pathway enrichment tests

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad296/7151067

PascalX allows for scoring genes and annotated gene sets for enrichment signals based on data from both single GWAS and pairs of GWAS. The gene scores take into account the correlation pattern between SNPs.

They are based on the cumulative distribution function of a linear combination of χ2 distributed random variables, which can be calculated either approximately or exactly to high precision.
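
To give a feel for the quantity involved, the sketch below estimates the distribution function of a weighted sum of independent χ2 variables (one degree of freedom each) by plain Monte Carlo. This is a deliberately simple stand-in, not the approximate or exact algorithms PascalX implements; the function name, weights and sample count are assumptions.

```python
import random


def weighted_chi2_cdf(weights, x, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(sum_i w_i * Z_i**2 <= x), Z_i ~ N(0, 1).

    Illustrative only: accuracy is limited by `n_samples`, so very small
    tail probabilities need an analytic method instead.
    """
    rng = random.Random(seed)
    below = 0
    for _ in range(n_samples):
        total = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if total <= x:
            below += 1
    return below / n_samples


# With two unit weights the sum is chi-squared with 2 degrees of freedom,
# whose distribution function is 1 - exp(-x / 2).
estimate = weighted_chi2_cdf([1.0, 1.0], 2.0)
```

A gene-level p-value is the upper tail, one minus this value. The Monte Carlo error shrinks only with the square root of the sample count, which is why methods that evaluate the distribution to high precision matter for genome-wide significance levels.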





□ CZ CELLxGENE Discover Census

>> https://chanzuckerberg.github.io/cellxgene-census/

The Census provides efficient computational tooling to access, query, and analyze all single-cell RNA data from CZ CELLxGENE Discover.

Using a new access paradigm of cell-based slicing and querying, you can interact with the data through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus accelerating your research by significantly minimizing data harmonization.





□ kimma: flexible linear mixed effects modeling with kinship covariance for RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad279/7152273

kimma supports DEG analyses incl. covariance random effects. kimma is an open-source R package that provides flexible linear mixed effects modeling for bulk RNA-seq data including univariate, multivariate, random, and covariance random effects as well as gene-level weights.

kimma utilizes a single function, kmFit, for modeling, ensuring consistent syntax, inputs, and outputs. Moreover, kimma provides post-hoc pairwise tests, model fit metrics like AIC, and fit warnings on a per gene basis.





□ CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05311-2

CAGECAT has been designed to provide rapid interoperability between these functions, where homologous clusters of interest can be selected to be used in subsequent analysis.

CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query. The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model searches.





□ cellsnake: a user-friendly tool for single cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539204v1

cellsnake allows parallelization and readily utilizes high performance computing (HPC) platforms. cellsnake provides metagenome analysis capabilities if unmapped reads are available.

cellsnake can utilize different scRNA-seq algorithms to simplify tasks such as automatic mitochondrial gene trimming, selection of optimal clustering resolution, doublet filtering, visualization of marker genes, enrichment analysis and pathway analysis.





□ Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539448v1

Defining read-based methodologies as those requiring alignment of individual sequencing reads to a reference genome and applying specific read-based variant-calling algorithms to these alignments to identify variants.

Assembly-based methods first generate ab initio a whole-genome assembly from LRS reads without guidance from a particular reference genome, and then proceed analogously by aligning this assembly to a reference genome to call variants using assembly-based calling algorithms.





□ HiPhase: Jointly phasing small and structural variants from HiFi sequencing

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539241v1

HiPhase jointly phases SNVs, indels, and structural variants called from PacBio HiFi sequencing on diploid organisms. HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase offers additional benefits: no down-sampling, multi-allelic variation, logic to span coverage gaps with supplementary alignments, innate multi-threading, built-in statistics gathering, and assigning aligned reads to a haplotype (“haplotagging”) while phasing.





□ scMayoMap: an easy-to-use tool for cell type annotation in single-cell RNA-sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.538463v1

ScMayoMap takes the standard cluster marker gene list as input and returns the cell type prediction results in a plot and the mapped gene list. scMayoMap allows assignment of multiple cell types to the same cluster if their evidence is similar.

scMayoMap can predict PBMC cell types with small errors, suggesting that the marker-based approach remains promising when applied properly.
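
The general shape of marker-based annotation can be sketched as follows. This is not scMayoMap's scoring method: the overlap score, the `tolerance` parameter, and the three-entry marker database are assumptions made up for illustration. It only shows how several cell types can be reported for one cluster when their evidence is similar.

```python
def annotate_cluster(cluster_markers, marker_db, tolerance=0.05):
    """Return (best-supported cell types, all scores) for one cluster.

    Score = fraction of a cell type's reference markers found among the
    cluster's marker genes (a hypothetical rule chosen for this sketch).
    """
    observed = set(cluster_markers)
    scores = {
        cell_type: len(observed & markers) / len(markers)
        for cell_type, markers in marker_db.items()
        if markers
    }
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return [], scores  # no evidence for any cell type
    # Keep every cell type whose score is within `tolerance` of the best.
    winners = sorted(ct for ct, sc in scores.items() if best - sc <= tolerance)
    return winners, scores


# Toy marker database (hypothetical, far smaller than a real one).
marker_db = {
    "T cell": {"CD3D", "CD3E", "CD2"},
    "NK cell": {"NKG7", "GNLY", "KLRD1"},
    "B cell": {"CD79A", "MS4A1", "CD19"},
}
cluster_markers = ["CD3D", "CD3E", "NKG7", "GNLY", "IL7R"]

winners, scores = annotate_cluster(cluster_markers, marker_db)
print(winners, scores)
```

Here the T cell and NK cell scores tie, so both labels are returned for the cluster.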





□ DeepGNN: Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05303-2

DeepGNN applies semi-supervised learning, exploiting not only labeled sequences (e.g. human genome with ChIP-seq experiment) but also unlabeled sequences, which are available in much larger amounts.

In parallel, the model takes as a secondary input the graph matrix connecting homologous sequences between species. An improvement would be to infer the homology matrix from the sequence embedding itself during training.





□ Challenges and considerations for reproducibility of STARR-seq assays

>> https://genome.cshlp.org/content/early/2023/05/02/gr.277204.122.long

A strong advantage of STARR-seq is its ability to screen random fragments of DNA from any source for enhancer activity. To this effect, DNA can be sourced from commercially available DNA repositories or from specific populations carrying non-coding mutations or SNPs to be assayed.

Cloning strategies such as In-Fusion HD, Gibson assembly, and NEBuilder HiFi DNA Assembly allow for fast, one-step reactions that use complementary overhang sequences on the inserts and the vector.

Highlighting the different challenges in performing STARR-seq, a particularly long and difficult assay with huge potential to identify detailed enhancer landscapes and validate enhancer function.





□ STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad302/7156836

STEMSIM (short-term evolutionary mutations simulator), which can generate mutations incl. SNV and InDel with various frequency distributions within strains in raw metagenomic sequencing data under a specified nucleotide substitution model.

STEMSIM directly takes the output of CAMISIM as input data. Next, the raw sequencing reads are mapped to the original reference genomes to obtain the alignment files (sam/bam) by Bowtie2.

Then, the details of mutations are generated according to the specified parameters, such as the number of nucleotide substitutions, and the distribution and trajectory of allele frequency.





□ scDist: Robust identification of perturbed cell types in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.06.539326v1

scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. scDist can recover biologically relevant between-group differences while also controlling for sample-level variability.

scDist is based on a linear mixed-effects model of single-cell GE counts. scDist uses an approximation for the between-group differences, based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur.





□ crosshap: Local haplotype visualization for trait association analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.07.539781v1

crosshap performs density-based clustering of variants based on their linkage profiles to capture haplotype structures in local genomic regions. Tightly linked variants are clustered into MGs, and individuals are grouped into local haplotypes by shared allelic combinations.

Visualization tools are provided by crosshap for choosing optimal clustering parameters and producing intuitive crosshap figures that present information on the complex relationships between linked variants, haplotype combinations, and phenotypic/metadata traits of individuals.





□ SpatialData: an open and universal data framework for spatial omics

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539647v1

SpatialData, a framework that establishes a unified and extensible multi-platform file-format, lazy representation of larger-than-memory data, transformations, and alignment to common coordinate systems.

SpatialData facilitates spatial annotations and cross-modal aggregation and analysis, the utility of which is illustrated via multiple vignettes including integrative analysis on a multi-modal Xenium and Visium breast cancer study.





Ophanim.

2023-05-15 05:05:05 | Art & Culture


“Their entire bodies, including their backs, hands, and wings, were full of eyes all around, as were their four wheels.” (Ezekiel 10:12)

□ Biblically Accurate Angels Would Actually Be Pretty Scary

>> https://www.historydefined.net/biblically-accurate-angels-would-actually-be-pretty-scary/





□ nanoSHAPE: Direct detection of RNA modifications and structure using single-molecule nanopore sequencing

>> https://www.cell.com/cell-genomics/fulltext/S2666-979X(22)00014-3

nanoSHAPE combines long-read, direct RNA sequencing with a new SHAPE reagent that, by virtue of its high reactivity and small adduct size, enables full-length probing of structure in long RNAs.

The nanoSHAPE centroid structure includes fewer long-range base pairs and has larger loop sizes than the SHAPE-MaP-based structure. nanoSHAPE features a 3′-to-5′ read direction and may over-detect reactivity at loop-closing base pairs.





□ GraphCPLMQA: Assessing protein model quality based on deep graph coupled networks using protein language model

>> https://www.biorxiv.org/content/10.1101/2023.05.16.540981v1

The GraphCPLMQA consists of a graph encoding module and a transform-based convolutional decoding module. The underlying relational representations of sequence and high-dimensional geometry structure are extracted by protein language models with Evolutionary Scale Modeling.

The mapping connection between structure and quality is inferred from the representations and low-dimensional features. The triangular location and residue-level contact order features are designed to enhance the association between the local structure and the overall topology.






□ The emergence of clusters in self-attention dynamics

>> https://arxiv.org/abs/2305.05465

Characterizing clustered representations of a trained Transformer by studying the asymptotic behavior of a sequence of tokens (x_1(t), ..., x_n(t)) as they evolve through the layers of a transformer architecture under the self-attention dynamics.

Particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. The type of limiting object that emerges depends on the spectrum of the value matrix. In the one-dimensional case, the self-attention matrix converges to a low-rank Boolean matrix.





□ The edge of chaos: quantum field theory and deep neural networks

>> https://scipost.org/10.21468/SciPostPhys.12.3.081

The edge of chaos is determined by the point at which the largest Lyapunov exponent becomes positive, which yields precisely the criticality condition.

They compute both the O(1) corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading O(T/N) corrections due to finite-width effects.

These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent.

This analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.





□ MIDAS: Protein-metabolite interactomics of carbohydrate metabolism reveal regulation of lactate dehydrogenase

>> https://www.science.org/doi/10.1126/science.abm3452

MIDAS (mass spectrometry integrated with equilibrium dialysis for the discovery of allostery systematically) was used to probe interactions for 33 enzymes from human carbohydrate metabolism and identified 830 protein-metabolite interactions.

MIDAS can detect many new interactions, incl. regulation of lactate dehydrogenase by ATP and long-chain acyl coenzyme A, which may help to explain known physiological relations between fat and carbohydrate metabolism in different tissues.





□ EVRC: Reconstruction of chromosome 3D structure models using Error-Vector Resultant algorithm with Clustering coefficient

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540436v1

The EVRC algorithm utilizes Hi-C experimental data to reconstruct the 3D structure of chromatin. EVRC relies on the co-clustering coefficient and the error-vector resultant. The algorithm begins by calculating the co-clustering coefficient between chromatin fragments.

In the 3D structure, the reciprocal of the space distance between each two points is taken as the interaction frequency between the two points, and the interaction matrix of the structure is obtained in this way.
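
As a minimal sketch of that conversion, the function below builds an interaction matrix from a list of 3D points by taking the reciprocal of each pairwise Euclidean distance. The function name and toy coordinates are hypothetical, and setting the diagonal to 0.0 is an assumption, since a point's distance to itself has no reciprocal.

```python
import math


def interaction_matrix(points):
    """Simulated interaction frequencies: reciprocal of pairwise distance.

    Assumes all points are distinct; coincident points would raise
    ZeroDivisionError. The diagonal is left at 0.0 by convention here.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(points[i], points[j])
            matrix[i][j] = matrix[j][i] = 1.0 / distance
    return matrix


# Toy curve of three points (made-up coordinates).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
for row in interaction_matrix(points):
    print([round(value, 3) for value in row])
```

Points that sit closer together in space get a higher simulated interaction frequency, mirroring the trend seen in Hi-C contact maps.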

The single-chain structure generates only the interaction matrix within a single curve (simulating a single chromosome), while the double-chain structure generates the interaction matrix within each of the two curves and the interaction matrix between them.





□ Cactus: a user-friendly and reproducible ATAC-Seq and mRNA-Seq analysis pipeline for data preprocessing, differential analysis, and enrichment analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540110v1

Cactus conducts preprocessing on raw sequencing reads, followed by differential analysis between conditions. Results are split into Differential Analysis Subsets (DASs) based on significance threshold, direction of change, annotated genomic regions, and experiment type.





□ Baldur: Bayesian hierarchical modeling for label-free proteomics exploiting gamma dependent mean-variance trends

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540411v1

Baldur, a novel Bayesian regression model to characterize local mean-variance trends in the data to describe measurement uncertainty and to estimate the decision model hyperparameters.

Baldur vastly improves over state-of-the-art methods (Limma-Trend and t-test) in several spike-in datasets by having competitive performance in detecting true positives while showing superiority by greatly reducing false positives.





□ MethPhaser: methylation-based haplotype phasing of human genomes

>> https://www.biorxiv.org/content/10.1101/2023.05.12.540573v1

MethPhaser, a tool that operates on a set of already phased variants based on SNVs from, e.g., WhatsHap or Hapcut2. MethPhaser then utilizes the heterozygous methylation information across the autosomes to connect phaseblocks together and thus improve the overall phasing.

MethPhaser improves variant-based phasing with minimal impact on phasing errors. MethPhaser is able to combine both phaseblocks and thus generate a single larger block by leveraging the heterozygous methylation signal in this region.





□ Resolving the unsolved: Comprehensive assessment of tandem repeats at scale

>> https://www.biorxiv.org/content/10.1101/2023.05.12.540470v1

Tandem Repeat Genotyping Tool (TRGT), a novel method for repeat analysis of long reads, as well as a companion method for Tandem Repeat Visualization (TRVZ). TRGT makes it possible to analyze structurally complex tandem repeats.

TRVZ affords a visual inspection of repeat alleles. TRGT reports haplotype-resolved germline variation together with methylation status across simple and complex TRs, and can detect mosaic mutations.





□ Towards Computing Attributions for Dimensionality Reduction Techniques

>> https://www.biorxiv.org/content/10.1101/2023.05.12.540592v1

An efficient implementation for the gradient computation for this dimensionality reduction technique. We show that our explanations identify significant features using novel validation methodology; using synthetic datasets and the popular MNIST benchmark dataset.

Inspecting the gradients of t-SNE in the same manner as one would look at gradients with respect to their inputs in relation to supervised classifiers trained also via Stochastic Gradient Descent.





□ TDbasedUFE and TDbasedUFEadv: bioconductor packages to perform tensor decomposition based unsupervised feature extraction

>> https://www.biorxiv.org/content/10.1101/2023.05.14.540687v1

TDbasedUFE and TDbasedUFEadv are easy for a person who is not familiar with the concept of tensors to use. Since the matrix can be regarded as a two-mode tensor, TDbasedUFE and TDbasedUFEadv are also used to apply PCA-based unsupervised FE to the dataset.

TDbasedUFE focuses only on two popular functions among those possible by TD-based unsupervised FE, since TD-based unsupervised FE can perform numerous applications, not all of which are required by the majority of people.

TDbasedUFE and TDbasedUFEadv accept a multiple omics profile dataset formatted as a tensor to which TD is applied. They employed Tucker decomposition as a TD using the higher-order singular value decomposition (HOSVD) algorithm.





□ SCSC: Considering Zeros in Single Cell Sequencing Data Correlation Analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.13.540566v1

SCSC is versatile and logically adaptable to single-cell multi-omic measures. It can be used for assessing gene-gene co-expression and genetic feature-gene expression correlation. Four strategies (conventional, non-zero, dropout-weighted, imputation) are supported.

Filtering out zeros reduces the MAE of correlation estimation compared to using all original or imputed data, in almost all scenarios with varying drop-out rates, expression levels, total number of cells, single-cell library sizes, overdispersion scales and variations.
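
The difference between the conventional and non-zero strategies can be illustrated with a small sketch. It assumes that "non-zero" means keeping only cells where both features are non-zero, which is one plausible reading rather than SCSC's documented rule; the function names and toy counts are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation; returns NaN when it is undefined."""
    n = len(x)
    if n < 2:
        return float("nan")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def nonzero_pearson(x, y):
    """Correlation over cells where both values are non-zero (assumed rule)."""
    pairs = [(a, b) for a, b in zip(x, y) if a != 0 and b != 0]
    return pearson([a for a, _ in pairs], [b for _, b in pairs])


# Toy counts for two genes across six cells; zeros mimic dropouts.
gene_a = [0, 2, 4, 0, 6, 8]
gene_b = [1, 0, 2, 0, 3, 4]

print("conventional:", round(pearson(gene_a, gene_b), 3))
print("non-zero:    ", round(nonzero_pearson(gene_a, gene_b), 3))
```

In this toy case the zeros pull the conventional estimate below the perfect linear relationship seen among the cells where both genes were detected.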





□ uORF4u: a tool for annotation of conserved upstream open reading frames

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad323/7162684

A tool for conserved uORF annotation in 5ʹ upstream sequences of a user-defined protein of interest or a set of protein homologues. It can also be used to find small conserved ORFs within a set of nucleotide sequences.

The output includes publication-quality figures with multiple sequence alignments, sequence logos and locus annotation of the predicted conserved uORFs in graphical vector format.

For identified potential frames, the tool searches for conserved ORFs using a greedy algorithm: uORF4u iterates through sequences and tries to maximise the sum of pairwise alignment scores between uORFs.





□ LENS: Landscape of Effective Neoantigens Software

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad322/7162685

LENS (Landscape of Effective Neoantigens Software) predicts tumor-specific/associated antigens from single nucleotide variants, insertions and deletions, fusion events, splice variants, cancer testis antigens, overexpressed self-antigens, viruses, and endogenous retroviruses.

LENS includes phasing and germline variant information in epitope identification, and it harmonizes variant RNA expression across genomic sources to provide a more usable relative expression ranking for each peptide epitope.





□ scAnno: a deconvolution strategy-based automatic cell type annotation tool for single-cell RNA-sequencing data sets

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad179/7161854

scAnno (scRNA-seq data annotation), an automated annotation tool for scRNA-seq data sets primarily based on the single-cell cluster levels, using a joint deconvolution strategy and logistic regression.

scAnno offers a possibility to obtain genes with high expression and specificity in a given cell type as cell type-specific genes (marker genes) by combining co-expression genes with seed genes as a core.





□ SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models

>> https://www.biorxiv.org/content/10.1101/2023.05.14.540736v1

SonicParanoid2 performs de novo orthology inference using a novel graph-based algorithm that halves the execution time by using an AdaBoost classifier to avoid unnecessary alignments. SonicParanoid2 conducts domain-based orthology inference using Doc2Vec neural network models.

SonicParanoid2 uses fast profile searches on Pfam to infer the domain architectures of the input proteins and converts them into "phrases", where "words" are the annotated functional domains and the amino-acid lengths of the inter-domain regions.

The orthologs predicted by these two algorithms are merged and input into the Markov cluster algorithm (MCL) to infer the ortholog groups (OGs) for the N input proteomes.





□ Transcription factor exchange enables prolonged transcriptional bursts

>> https://www.biorxiv.org/content/10.1101/2023.05.15.540758v1

A tracking system to understand how transient TF binding relates to transcription activation. Gal4 is naturally of low abundance and has only 15 genomic binding sites; the shared GAL1-GAL10 promoter is the promoter with the most binding sites.

To measure Gal4 binding and GAL10 transcription simultaneously, we developed a tracking algorithm to track the GAL10 TS in a single plane in the z direction with an active feedback loop.

The main kinetic mode of activation of GAL10 is that Gal4 molecules show cooperative binding / exchange during a burst. A general mechanism for TF-mediated regulation, where TF cooperative binding and TF exchange at multiple binding sites enable prolonged transcriptional bursts.





□ Speed reading the epigenome and genome

>> https://www.nature.com/articles/s41587-023-01757-0

Dual sequencing of the epigenome and genome could have broad implications in oncology.





□ The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

>> https://www.biorxiv.org/content/10.1101/2023.05.15.540865v1

A framework to systematically characterize and quantify the diversity between the detected transcripts from each gene by computing a summary gene triplet, which is related to but distinct from transcript triplets.

Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3’ processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms.





□ Celloscope: a probabilistic model for marker-gene-driven cell type deconvolution in spatial transcriptomics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02951-8

Celloscope, a novel Bayesian probabilistic graphical model of gene expression in ST data, which deconvolutes cell type composition in ST spots, and a method to infer model parameters based on an MCMC algorithm.

Celloscope implements a semi-automatic procedure of marker selection. Celloscope considers that the expression levels of marker genes in each cell type are unknown and are modeled using hidden variables Λ.





□ fimpera: drastic improvement of Approximate Membership Query data-structures with counts

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad305/7169157

fimpera enables the improvement of any cAMQ's performance. Applied to counting Bloom filters, the proposed algorithm reduces the false positive rate by two orders of magnitude and improves the precision of the reported abundances.

fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time.
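
For readers unfamiliar with the underlying structure, here is a minimal counting Bloom filter of the kind fimpera is applied to. It is a generic textbook sketch, not fimpera's algorithm: the class name, default sizes, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter; estimates can over-count, never under-count."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # One salted hash per hash function (assumed hashing scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.size

    def add(self, item):
        for position in self._positions(item):
            self.counters[position] += 1

    def count(self, item):
        # The smallest counter is the least inflated by collisions.
        return min(self.counters[p] for p in self._positions(item))


def index_kmers(sequence, k, bloom):
    """Insert every k-mer of `sequence` into the filter."""
    for i in range(len(sequence) - k + 1):
        bloom.add(sequence[i:i + k])


bloom = CountingBloomFilter()
index_kmers("ACGTACGTACGT", 4, bloom)
print(bloom.count("ACGT"), bloom.count("TTTT"))
```

A non-zero count for a k-mer that was never inserted is a false positive, which is the error rate the entry above says fimpera reduces.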





□ RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02961-6

RabbitTClust, an efficient clustering toolkit based on MinHash sketch distance measurement for large-scale genome datasets. Fast sketching (an approximate, compact summary of the original data) is used to compute similarities among genomes with a small memory footprint.
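
The idea of comparing genomes through sketches can be illustrated with a bottom-s MinHash estimator of Jaccard similarity. This is a generic sketch under stated assumptions, not RabbitTClust's implementation: the hash function, k-mer size, sketch size, and function names are chosen only for the example.

```python
import hashlib


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(sequence, k=5, size=16):
    """Bottom-s sketch: the `size` smallest hashes of the distinct k-mers."""
    kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return sorted(hash64(kmer) for kmer in kmers)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate Jaccard similarity from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Toy sequences (made up); real genomes would use much larger k and sketches.
seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGT"
seq_b = "ACGTTGCATGCCGATAGCTAGGCTTTCGA"
print(jaccard_estimate(sketch(seq_a), sketch(seq_b)))
```

Only the fixed-size sketches need to be kept per genome, which is what keeps the memory footprint small.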

clust-mst relies on a graph-based linear space clustering algorithm based on minimum spanning tree (MST) computation to perform single-linkage hierarchical clustering.

This MST construction relies on dynamically generating and merging partial clustering results without storing the whole distance matrix, which in turn allows for both memory reduction and efficient parallelization.
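
The link between a minimum spanning tree and single-linkage clustering can be sketched as follows: build the MST, then drop edges longer than a distance threshold, and the remaining connected components are the clusters. Unlike the streaming construction described above, this toy version materializes every pairwise distance; the function names and the 1-D toy data are hypothetical.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def minimum_spanning_tree(n, dist):
    """Kruskal's algorithm over all pairs; returns a list of (d, i, j) edges."""
    edges = sorted((dist(i, j), i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))
    tree = []
    for d, i, j in edges:
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:  # edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((d, i, j))
    return tree


def single_linkage_clusters(n, tree, threshold):
    """Components left after removing MST edges longer than `threshold`."""
    parent = list(range(n))
    for d, i, j in tree:
        if d <= threshold:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for item in range(n):
        groups.setdefault(find(parent, item), []).append(item)
    return sorted(groups.values())


# Toy 1-D "genomes"; distance is the absolute difference.
values = [0.0, 0.1, 0.15, 1.0, 1.05]
tree = minimum_spanning_tree(len(values), lambda i, j: abs(values[i] - values[j]))
print(single_linkage_clusters(len(values), tree, threshold=0.2))
```

Because the tree has only n - 1 edges, re-clustering at a different threshold needs only the stored tree, not the distances.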





□ GbyE: A New Genome Wide Association and Prediction Model based on Genetic by Environmental Interaction

>> https://www.biorxiv.org/content/10.1101/2023.05.17.541129v1

GbyE, a new genotype design model program for genome-wide association and prediction using Kronecker product, which can enhance the statistical power of GWAS and GS by utilizing the interaction effects of multiple environments or traits.

The GbyE model improves the prediction accuracy of the three Bayesian models BRR, BayesA, and BayesLASSO using information from GEI and increases the prediction accuracy by 9.4%, 9.1%, and 11%, respectively, relative to the Mean value method.





□ scNanoATAC-seq: a long-read single-cell ATAC sequencing method to detect chromatin accessibility and genetic variants simultaneously within an individual cell

>> https://www.nature.com/articles/s41422-022-00730-x

scNanoATAC-seq is a TGS platform-based long-read single-cell ATAC sequencing method that can be applied in various biological fields. It can detect chromatin accessibility and genetic variants (including SVs, SNPs, and CNVs) within an individual cell simultaneously.

scNanoATAC-seq provides direct evidence of co-accessibility between neighboring peaks: the chromatin accessibility of two sites in the same single cell, and in fact on the same allele, is detected simultaneously by a single long read.





□ Sequencing accuracy and systematic errors of nanopore direct RNA sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.29.534691v2

Systematic sequencing errors at the single-nucleotide and motif levels are also prevalent in reads basecalled by RODAN. The study also examines the relationship between read quality scores and error rates, and how adaptor detection failure can impact the read quality of short sequences.

While read quality scores approximated error rates at base and read levels, failure to detect DNA adapters may lead to data loss. By comparing distinct basecallers, some sequencing errors are attributable to signal insufficiency rather than algorithmic (base-calling) artefacts.





□ CENTRE: A gradient boosting algorithm for Cell-type-specific ENhancer-Target pREdiction

>> https://www.biorxiv.org/content/10.1101/2023.05.16.541035v1

CENTRE is a machine learning framework that predicts enhancer target interactions in a cell-type-specific manner, using only gene expression and ChIP-seq data for three histone modifications for the cell type of interest.

CENTRE computes CT-specific and generic features for all potential ET pairs. ET feature vectors are then fed to a pre-trained XGBOOST classifier, and a probability of an interaction is assigned to ET pairs. ET pairs w/ higher probability than 0.5 are labeled as interacting pairs.





□ CrossmodalNet: Interpretable modeling of time-resolved single-cell gene-protein expression

>> https://www.biorxiv.org/content/10.1101/2023.05.16.541011v1

CrossmodalNet, an interpretable ML model with customized adaptive loss that learns to translate between modalities of genes and proteins using CITE-seq data while encoding temporal information. CrossmodalNet is capable of elucidating noise-free causal gene-protein relationships.

By combining the interpretability of linear models with the flexibility of non-linear models, CrossmodalNet decomposes transcriptional information of cells into basal and temporal domain, with the latter forming an easy-to-interpret time embedding.





□ Adaptive RAxML-NG: Accelerating Phylogenetic inference under Maximum Likelihood using dataset difficulty

>> https://www.biorxiv.org/content/10.1101/2023.05.15.540873v1

Adaptive RAxML-NG is based on two new mechanisms for faster and more efficient exploration, that is, NNI rounds and the 1% ML convergence interval to terminate the first more superficial phase of topological moves early.

Adaptive RAxML-NG modifies the thoroughness of the tree search strategy, as well as additional heuristic search parameters (e.g., the number of distinct starting trees or the maximum subtree re-insertion radius of SLOW-SPR moves), as a function of the predicted difficulty.





□ MASIv2 enables standardization and integration of multi-modal single-cell and spatial omics data with one general framework

>> https://www.biorxiv.org/content/10.1101/2023.05.15.540808v1

MASIv2 can scale to integrate multiple modalities at once, including gene expression, chromatin accessibility, DNA methylation, and chromatin structure.

MASIv2 uses Louvain community detection to identify a group of key points for each modality. Next, a linear regression model is trained to match two key points of the same cell type but from two modalities. It only adds K * (Q - 1) trainable weights with Q - 1 bias terms to the framework.





□ Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

>> https://www.biorxiv.org/content/10.1101/2023.05.16.540882v1

Minmers are a novel "non-forward" winnowing scheme with a (w, s)-window guarantee. It generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window.
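
The relationship between the two schemes can be illustrated with a naive window sampler: with s = 1 it picks one minimizer per window of w consecutive k-mers, and with s > 1 it picks the s smallest-hash k-mers per window. This is only an approximation for intuition, not the paper's algorithm: it recomputes each window from scratch, samples by position rather than by distinct k-mer, and the hash function and names are assumptions.

```python
import hashlib


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def window_samples(sequence, k, w, s=1):
    """Positions of the s smallest-hash k-mers in every window of w k-mers.

    s = 1 gives classic minimizers; s > 1 gives a minmer-like sample.
    """
    hashed = [(hash64(sequence[i:i + k]), i) for i in range(len(sequence) - k + 1)]
    sampled = set()
    for start in range(len(hashed) - w + 1):
        window = hashed[start:start + w]
        # Ties are broken by position because tuples sort as (hash, position).
        sampled.update(position for _, position in sorted(window)[:s])
    return sorted(sampled)


sequence = "ACGTTGCATGCCGATAGCTAGGCTA"  # toy sequence, made up
print("minimizers:", window_samples(sequence, k=4, w=5, s=1))
print("minmer-like:", window_samples(sequence, k=4, w=5, s=2))
```

By construction every window contains at least s sampled positions, which is the flavor of guarantee the (w, s)-window property refers to.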

Minmers eliminate Jaccard estimator bias and enable new methods to reduce mapping runtime compared to MashMap2. MashMap3 with minmers not only produced unbiased and more accurate predictions of the ANI than Minimap2 and MashMap2, but it did so in a fraction of the time.





□ SPUMONI 2: improved classification using a pangenome index of minimizer digests

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02958-1

SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array.

By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2.





□ Icarust, a real-time simulator for Oxford Nanopore adaptive sampling.

>> https://www.biorxiv.org/content/10.1101/2023.05.16.540986v1

Icarust, a tool enabling more accurate approximations of sequencing runs. Icarust recreates all the required endpoints of MinKNOW to perform adaptive sampling and writes output compatible with current base-callers and analysis pipelines.

Icarust is capable of serving nanopore signal to simulate a MinION or PromethION flow cell experiment from any reference genome using either R9 or R10 pore signal.






TÁR

2023-05-15 04:04:04 | Movies


□ 『TÁR』

Release: 2022
Directed/Written by Todd Field
Music by Hildur Guðnadóttir
Cinematography by Florian Hoffmeister

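A cycle of tension and rupture. The film depicts, as an inevitable retribution, the process by which the lust for control of a person possessed by fierce love and expression is eventually brought to a dynamic equilibrium by the backlash of those around her. It is a story of ruin and of rebirth. Music gives names to emotions, and names become memories. But to whom does an expressed memory belong? A traditional song of the Ucayali people serves as an important key.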



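Although it is a film about music, it tries to capture the essence of music through the grammar of cinema rather than through musical notation. Just as Leonard Bernstein's famous lecture became Tár's point of origin, expression is achieved by constantly confronting the true image of one's own past and future, and that is precisely the score itself, with its "interruptions" written in.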



□ Guðnadóttir: Tár - III. Moderato
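The soundtrack from Deutsche Grammophon includes the avant-garde title piece, occasional music capturing moments from the rehearsals in the film (Tár held that "discovery happens in rehearsal"), the Ucayali song that served as inspiration, and field recordings from the Peruvian forest.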



□ Cura Mente https://youtu.be/JBHph4tSr-I



Pitfalls.

2023-05-15 03:03:03 | Science News


The pitfalls of generative AI can generally be reframed as the problem of deciding where in a procedure to place evaluation steps for unmeasurable noise and bias.

Regulation by high-risk classification is an operational assessment and does not, at present, accurately estimate the technical hurdles. However, tracing back from "facts" to "outcomes" becomes mechanically difficult.