lens, align.

Long is the time, but the true comes to pass.

Chrysanthemum.

2023-04-24 04:44:44 | Science News

(Art by gen_ericai)




□ Genomic language model: Deep learning of genomic contexts predicts protein co-regulation and function

>> https://www.biorxiv.org/content/10.1101/2023.04.07.536042v1

A genomic language model (gLM) learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and appears to encode biologically meaningful and functionally relevant information. gLM is learning co-regulated functional modules.

gLM is based on the transformer architecture. gLM is trained with the masked language modeling objective, with the hypothesis that its ability to attend to different parts of a multi-gene sequence will result in the learning of gene functional semantics and regulatory syntax.





□ GPN: DNA language models are powerful zero-shot predictors of genome-wide variant effects

>> https://www.biorxiv.org/content/10.1101/2022.08.22.504706v2

GPN’s internal representation of DNA sequences can distinguish genomic regions like introns, untranslated regions, and coding sequences. The confidence of GPN’s predictions can help reveal regulatory grammar.

GPN can be employed to calculate a pathogenicity or functionality score for any SNP in the genome using the log-likelihood ratio between the alternate and reference allele. GPN can learn from joint nucleotide distributions across all similar contexts appearing in the genome.

GPN uses the Hugging Face library to one-hot encode the masked DNA sequence and process it through 25 convolutional blocks. Each block contains a dilated convolutional layer, a feed-forward layer, intermediate residual connections, and layer normalization. The embedding is fixed at 512 dimensions.
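The zero-shot scoring described above reduces to a log-likelihood ratio over the model's masked-position nucleotide distribution. A minimal sketch, using an invented probability distribution rather than real GPN output:

```python
import math

def variant_score(p_masked, ref, alt):
    """Zero-shot variant effect score: log-likelihood ratio between the
    alternate and reference allele under a model's masked-position
    nucleotide distribution. Negative scores suggest the alternate
    allele is disfavored in this genomic context."""
    return math.log(p_masked[alt] / p_masked[ref])

# Invented distribution at a strongly conserved site
p = {"A": 0.90, "C": 0.04, "G": 0.03, "T": 0.03}
score = variant_score(p, ref="A", alt="T")  # strongly negative
```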





□ The architecture of information processing in biological systems

>> https://arxiv.org/abs/2301.12812

An archetypal model for sensing that starts from a thermodynamically consistent description. The combined effects of storage and negative feedback promote the emergence of a rich information dynamics shaped by adaptation and finite-time memory.

A chemical information reservoir allows the system to dynamically build up information on an external environment while reducing internal dissipation. Optimal sensing emerges from a dissipation-information trade-off and requires far-from-equilibrium conditions in low-noise regimes.





□ DeepCORE: An interpretable multi-view deep neural network model to detect co-operative regulatory elements

>> https://www.biorxiv.org/content/10.1101/2023.04.19.536807v1

DeepCORE uses a multi-view architecture to integrate genetic and epigenetic profiles in a DNN. It captures short-range and long-range interactions between REs through BiLSTM.

The learnt attention is a vector whose length equals the number of output nodes from the CNN layer, containing the importance score of each genomic region.

DeepCORE then joins the two views by concatenating the decoder outputs from each view and giving it to a fully connected feedforward neural network (FNN) to predict continuous gene transcription levels.





□ An in-depth comparison of linear and non-linear joint embedding methods for bulk and single-cell multi-omics

>> https://www.biorxiv.org/content/10.1101/2023.04.10.535672v1

Non-linear methods developed in other fields generally outperformed the linear and simple non-linear ones at imputing missing modalities. CGVAE and ccVAE did better than PoE and MoE on both bulk and single-cell data, while they typically underperformed in the other tasks.

ccVAE uses a single encoder for the concatenation of both modalities, which might be beneficial for generation coherence, as the latent space is directly and concurrently influenced by matched samples from all modalities.

The architecture of CGVAE is identical to that of MoE and PoE with separate encoders per modality. MOFA+ has the advantage that it provides useful diagnostic messages about the input data as well as the learnt space.





□ Genotyping variants at population scale using DRAGEN gVCF Genotyper

>> https://www.illumina.com/science/genomics-research/articles/gVCF-Genotyper.html

DRAGEN gVCF Genotyper implements an iterative workflow to add new samples to an existing cohort. This workflow allows users to efficiently combine new batches of samples with existing batches without repeated processing.

DRAGEN gVCF Genotyper computes many variant metrics on the fly, among them allele counts. DRAGEN gVCF Genotyper relies on the gVCF input format, which contains both variant information, like a VCF, and a measure of confidence of a variant not existing at a given position.





□ SAVEMONEY: Barcode-free multiplex plasmid sequencing using Bayesian analysis and nanopore sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536413v1

SAVEMONEY (Simple Algorithm for Very Efficient Multiplexing of Oxford Nanopore Experiments for You) guides researchers to mix multiple plasmids and subsequently computationally de-mixes the resultant sequences.

SAVEMONEY involves submitting samples with a mix of multiple different plasmids and deconvolving the obtained sequencing results while maintaining the quality of the analysis. SAVEMONEY leverages plasmid maps, which in most cases are already made prior to plasmid construction.





□ GraphPart: Homology partitioning for biological sequence analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536886v1

GraphPart, an algorithm for homology partitioning, where as many sequences as possible are kept in the dataset, but partitions are defined such that closely related sequences always end up in the same partition.

GraphPart operates on real-valued distance metrics. Sequence identities ranging from 0 to 1 are converted to distances as d(a,b) = 1-identity(a,b). The partitioning threshold undergoes the same conversion. GraphPart can accept any similarity metric and skip the alignment step.
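A toy sketch of the identity-to-distance conversion and the "closely related sequences share a partition" constraint, here realized as connected components over a hypothetical pairwise-identity table (GraphPart's actual algorithm additionally balances partition sizes and labels):

```python
def identity_to_distance(identity):
    """GraphPart's conversion: d(a, b) = 1 - identity(a, b)."""
    return 1.0 - identity

def partition_by_threshold(pairs, n, threshold):
    """Group n sequences so any pair with distance below the threshold
    lands in the same partition (union-find connected components)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), ident in pairs.items():
        if identity_to_distance(ident) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Invented pairwise identities for 5 sequences
pairs = {(0, 1): 0.90, (1, 2): 0.85, (3, 4): 0.95, (2, 3): 0.10}
parts = partition_by_threshold(pairs, 5, threshold=0.3)
```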





□ RASP / FAAST: Assisting and Accelerating NMR Assignment with Restrained Structure Prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536890v1

RASP (Restraints Assisted Structure Predictor) is an architecture derived from the AlphaFold Evoformer and structure module; it accepts abstract or experimental restraints, sparse or dense, to generate structures.

FAAST (iterative Folding Assisted peak ASsignmenT) is an iterative NMR NOESY peak assignment pipeline. Using chemical shift and NOE peak lists as input, FAAST assigns NOE peaks iteratively and generates a structure ensemble.





□ Emergent autonomous scientific research capabilities of large language models

>> https://arxiv.org/abs/2304.05332

An Intelligent Agent system that combines multiple large language models for autonomous design, planning, and execution. The Agent's scientific research capabilities are demonstrated with 3 distinct examples, the most complex being the successful performance of catalyzed cross-coupling reactions.

The Agent calculates the required volumes of all reactants and writes the protocol. Subsequent GC-MS analysis of the reaction mixtures revealed the formation of the target products for both reactions. Agent corrects its own code based on the automatically generated outputs.






□ Generative Agents: Interactive Simulacra of Human Behavior

>> https://arxiv.org/abs/2304.03442

Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day.

An architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior.





□ Many bioinformatics programming tasks can be automated with ChatGPT

>> https://arxiv.org/abs/2303.13528


ChatGPT failed to solve 5 of the exercises within 10 attempts. The paper summarizes the characteristics of these exercises and the complications ChatGPT faced when attempting to solve them.





□ LOCC: a novel visualization and scoring of cutoffs for continuous variables

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536461v1

Luo’s Optimization Categorization Curves (LOCC) helps visualize more information for better cutoff selection and understanding of the importance of the continuous variable against the measured outcome.

The LOCC score is made of three numeric components: a significance aspect, a range aspect, and an impact aspect. The higher the LOCC score, the more critical and predictive the expression is for prognosis.





□ Demultiplex2: robust sample demultiplexing for scRNA-seq

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536275v1

deMULTIplex2, a mechanism-guided classification algorithm for multiplexed scRNA-seq data that successfully recovers many more cells across a spectrum of challenging datasets compared to existing methods.

deMULTIplex2 is built on a statistical model of tag read counts derived from the physical mechanism of tag cross-contamination. Using GLM and expectation-maximization, deMULTIplex2 probabilistically infers the sample identity of each cell and classifies singlets w/ high accuracy.
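As a loose illustration of the EM idea, here is a toy two-component Poisson mixture fitted to invented tag counts, separating high-count "true sample" cells from low-count ambient contamination; deMULTIplex2's actual model is GLM-based and models the tag cross-contamination mechanism explicitly:

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def em_two_poisson(counts, iters=50):
    """Toy EM for a two-component Poisson mixture over tag counts:
    E-step computes per-cell responsibilities, M-step updates the
    mixing weights and component rates."""
    lam = [min(counts) + 1.0, max(counts) + 1.0]  # crude initialization
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each count
        resp = []
        for c in counts:
            w = [pi[j] * poisson_pmf(c, lam[j]) for j in range(2)]
            s = sum(w) or 1e-300
            resp.append([x / s for x in w])
        # M-step: update weights and rates from responsibilities
        for j in range(2):
            rj = sum(r[j] for r in resp)
            pi[j] = rj / len(counts)
            lam[j] = sum(r[j] * c for r, c in zip(resp, counts)) / max(rj, 1e-12)
    return pi, lam

# Invented tag counts: ambient background plus a true-sample population
counts = [1, 2, 1, 0, 2, 1, 2, 30, 28, 35, 31, 29]
pi, lam = em_two_poisson(counts)
```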





□ acorn: an R package for de novo variant analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536422v1

Acorn is an R package that works with de novo variants (DNVs) already called using a DNV caller. The toolkit is useful for extracting different types of DNVs and summarizing characteristics of the DNVs.

Acorn consists of several functions to analyze DNVs. readDNV reads in DNV data and turns it into an R object for use with other functions within acorn. Acorn fills a gap in genomic DNV analyses between the calling of DNVs and ultimate downstream statistical assessment.





□ VIPRS: Fast and accurate Bayesian polygenic risk modeling with variational inference

>> https://www.cell.com/ajhg/fulltext/S0002-9297(23)00093-9

VIPRS, a Bayesian summary statistics-based PRS method that utilizes variational inference techniques to approximate the posterior distribution for the effect sizes.

VIPRS is consistently competitive w/ the state-of-the-art in prediction accuracy while being more than twice as fast as popular MCMC-based approaches. This performance advantage is robust across a variety of genetic architectures, SNP heritabilities, and independent GWAS cohorts.





□ A gene-level test for directional selection on gene expression

>> https://academic.oup.com/genetics/advance-article/doi/10.1093/genetics/iyad060/7111744

Applying the Qx test for polygenic selection to regulatory variants identified using Joint-tissue Imputation (JTI) models to test for population-specific selection on gene regulation in 26 human populations.

The gamma-corrected approach was uniformly more powerful than the permutation approach. Indeed, while the gamma-corrected test approaches a power of 1.0 under regimes with stronger selection, the effect-permuted version never reaches that level.





□ bootRanges: Flexible generation of null sets of genomic ranges for hypothesis testing

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad190/7115835

bootRanges provides fast functions for generation of block bootstrapped genomic ranges representing the null hypothesis in enrichment analysis. bootRanges offers greater flexibility for computing various test statistics leveraging other Bioconductor packages.

Shuffling/permutation schemes may result in overly narrow test statistic null distributions and over-estimation of statistical significance, while creating new range sets w/ a block bootstrap preserves local genomic correlation structure and generates reliable null distributions.
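A minimal sketch of the block-bootstrap idea on a single toy chromosome: ranges are indexed by fixed-length blocks, and each destination block is refilled with the contents of a randomly sampled source block, so within-block clustering survives into the null set (bootRanges itself additionally handles segmentation, strata, and range metadata):

```python
import random

def block_bootstrap(ranges, genome_len, block_len, seed=0):
    """Generate a null set of ranges by resampling whole blocks,
    preserving local genomic correlation structure within each block."""
    rng = random.Random(seed)
    n_blocks = genome_len // block_len
    # index each range by the block containing its start, as an offset
    by_block = [[] for _ in range(n_blocks)]
    for start, end in ranges:
        b = min(start // block_len, n_blocks - 1)
        by_block[b].append((start - b * block_len, end - b * block_len))
    boot = []
    for dest in range(n_blocks):
        src = rng.randrange(n_blocks)  # sample a source block with replacement
        for off_s, off_e in by_block[src]:
            boot.append((dest * block_len + off_s, dest * block_len + off_e))
    return sorted(boot)

# Invented ranges on a 1 kb toy chromosome
ranges = [(5, 15), (20, 30), (250, 260), (505, 515)]
boot = block_bootstrap(ranges, genome_len=1000, block_len=100, seed=1)
```

Note that range widths are preserved exactly, while genomic positions are resampled at the block level.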










□ catELMo: Context-Aware Amino Acid Embedding Advances Analysis of TCR-Epitope Interactions

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536635v1

catELMo, whose architecture is adapted from ELMo (Embeddings from Language Models), a bi-directional context-aware language model. catELMo consists of a charCNN layer and four bidirectional LSTM layers followed by a softmax activation.

catELMo is trained on more than four million TCR sequences collected from ImmunoSEQ in an unsupervised manner, by contextualizing amino acid inputs and predicting the next amino acid token.





□ Streamlining PacBio HiFi assembly and QC with the hifi2genome workflow

>> https://research.arcadiascience.com/pub/resource-hifi2genome

hifi2genome assembles PacBio HiFi reads from a single organism and produces quality-control statistics for the resulting assembly. The products of this pipeline are an assembly, mapped reads, and interactive visualizations reported with MultiQC.

hifi2genome uses Flye to assemble PacBio HiFi reads into contigs, followed by parallel processing steps for generating QC statistics. These steps include assembly QC stats with QUAST, lineage-specific QC stats with BUSCO, and mapping stats using SAMtools and minimap2.





□ disperseNN: Dispersal inference from population genetic variation using a convolutional neural network

>> https://academic.oup.com/genetics/advance-article/doi/10.1093/genetics/iyad068/7117621

disperseNN uses forward-in-time spatial genetic simulations to train a deep neural network to infer the mean per-generation dispersal distance from a single population sample of single nucleotide polymorphism (SNP) genotypes, e.g., whole-genome data or RADseq data.

disperseNN predicts σ from full-spatial test data after simulations w/ 100 generations. Successive layers of data compression, through convolution and pooling, coerce disperseNN to look at the genotypes at different scales and learn the extent of linkage disequilibrium.





□ LinRace: single cell lineage reconstruction using paired lineage barcode and gene expression data

>> https://www.biorxiv.org/content/10.1101/2023.04.12.536601v1

LinRace (Lineage Reconstruction w/ asymmetric cell division model), that integrates the lineage barcode and gene expression data using the asymmetric cell division model and infers cell lineage under a framework combining Neighbor Joining and maximum-likelihood heuristics.

LinRace outputs more accurate cell division trees than existing methods for lineage reconstruction. Moreover, LinRace can output the cell states (cell types) of ancestral cells, which is rarely performed with existing lineage reconstruction methods.





□ Automatic extraction of ranked SNP-phenotype associations from text using a BERT-LSTM-based method

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05236-w

Transformers do not embed positional information through recurrence as recurrent models do; however, they still embody positional information in modeling sentence order. Early stopping is a regularization technique to prevent overfitting when learning iteratively.

Although the linguistic features used could implement a superior association extraction method outperforming the kernel-based counterparts, the BERT-CNN-LSTM-based methods exhibited the best performance.





□ nRCFV: a new, dataset-size-independent metric to quantify compositional heterogeneity in nucleotide and amino acid datasets

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05270-8

nRCFV, a truly normalised Relative Compositional Frequency Variation value. The new metric adds a normalisation constant to each of the different RCFV values (total, character-specific, taxon-specific) to mitigate the effects of increasing numbers of taxa and sequence length.





□ Wearable-ome meets epigenome: A novel approach to measuring biological age with wearable devices.

>> https://www.biorxiv.org/content/10.1101/2023.04.11.536462v1

Aging is a dynamic process and simply utilizing chronological age as a predictor of All-Cause Mortality and disease onset is insufficient. Instead, measuring the organismal state of function, biological age, may provide greater insight.





□ PhenoCellPy: A Python package for biological cell behavior modeling

>> https://www.biorxiv.org/content/10.1101/2023.04.12.535625v1

PhenoCellPy defines Python classes for the Cell Volume (which it subdivides between the cytoplasm and nucleus) and its evolution, the state of the cell and the behaviors the cell displays in each state (called the Phase), and the sequence of behaviors (called the Phenotype).

PhenoCellPy can extend existing modeling frameworks as an embedded model. It integrates with a framework by defining the cell states (Phases), signaling when a state change or division occurs, and receiving information back from the framework.
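The Phase/Phenotype design might be sketched as follows; the class and attribute names here are illustrative, not PhenoCellPy's actual API:

```python
class Phase:
    """One cell state: a name, an expected duration, and whether the
    cell divides when this phase completes (illustrative attributes)."""
    def __init__(self, name, duration, division_at_end=False):
        self.name = name
        self.duration = duration
        self.division_at_end = division_at_end

class Phenotype:
    """An ordered sequence of Phases. time_step advances the clock and
    reports (phase_changed, cell_divides) flags to a host framework."""
    def __init__(self, phases):
        self.phases = phases
        self.index = 0
        self.time_in_phase = 0.0

    @property
    def current_phase(self):
        return self.phases[self.index]

    def time_step(self, dt):
        self.time_in_phase += dt
        if self.time_in_phase >= self.current_phase.duration:
            divides = self.current_phase.division_at_end
            self.index = (self.index + 1) % len(self.phases)  # cycle phases
            self.time_in_phase = 0.0
            return True, divides
        return False, False

# A toy two-phase cell cycle, stepped with dt = 1
cycle = Phenotype([Phase("G1", 5.0), Phase("S-G2-M", 3.0, division_at_end=True)])
events = [cycle.time_step(1.0) for _ in range(8)]
```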





□ scVAEDer: The Power of Two: integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.13.536789v1

scVAEDer, a scalable deep-learning model that combines the power of variational autoencoders and deep diffusion models to learn a meaningful representation which can capture both global semantics and local variations in the data.

scVAEDer combines the strengths of VAEs and Denoising Diffusion Models (DDMs). It incorporates both VAE and DDM priors to more precisely capture the distribution of latent encodings in the data. By using vector arithmetic in the DDM space, scVAEDer outperforms SOTA methods.





□ DeepEdit: single-molecule detection and phasing of A-to-I RNA editing events using nanopore direct RNA sequencing

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02921-0

DeepEdit can identify A-to-I editing events on single nanopore reads and determine the phasing information on transcripts through nanopore direct RNA sequencing.

DeepEdit is a fully connected neural network model which takes advantage of the raw electrical signal features flanking the editing sites. A total of 40,823 I-type reads from FY-ADAR2 and randomly chosen 47,757 HFF1 reads were used as the positive and negative controls.





□ GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02906-z

Genotype Block Compressor (GBC) manages genotypes in Genotype Block (GTB). GTB is a unified data structure to store large-scale genotypes into many highly addressable byte-encoding compression blocks. Then, multiple advanced algorithms were developed for efficient compression.

The AMDO (approximate minimum discrepancy ordering) algorithm is applied on the variant level to sort the variants with similar genotype distributions for improving the compression ratio. The ZSTD algorithm is then adopted to compress the sorted data in each block.
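A greedy toy version of AMDO-style reordering followed by block compression; zlib stands in for ZSTD here, and the genotype data are invented:

```python
import zlib

def amdo_order(variants):
    """Greedy approximate minimum discrepancy ordering: repeatedly
    append the remaining genotype vector with the fewest mismatches
    to the last one, so similar variants become adjacent and the
    block compresses better."""
    rest = list(variants)
    ordered = [rest.pop(0)]
    while rest:
        last = ordered[-1]
        i = min(range(len(rest)),
                key=lambda j: sum(a != b for a, b in zip(last, rest[j])))
        ordered.append(rest.pop(i))
    return ordered

def compress_block(rows):
    """Byte-encode a block of genotype rows and compress it."""
    return zlib.compress(bytes(g for row in rows for g in row), 9)

# Two interleaved genotype patterns across 16 invented variants
a, b = [0, 0, 1] * 20, [2, 2, 0] * 20
variants = [a, b] * 8
ordered = amdo_order(variants)
```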





□ Multivariate Genome-wide Association Analysis by Iterative Hard Thresholding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad193/7126408

Multivariate IHT for analyzing multiple correlated traits. In simulation studies, multivariate IHT exhibits similar true positive rates, significantly lower false positive rates, and better overall speed than linear mixed models and canonical correlation analysis.

In IHT the most computationally intensive operations are the matrix-vector and matrix-matrix multiplications required in computing gradients. To accelerate these operations, SIMD (single instruction, multiple data) vectorization and tiling are employed.
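The core iteration can be sketched for a single trait: a gradient step on the least-squares loss followed by hard thresholding to the k largest-magnitude effects (the paper's multivariate extension and SIMD acceleration are omitted; the data are synthetic):

```python
import random

def iht(X, y, k, step=0.01, iters=300):
    """Iterative hard thresholding, single-trait sketch: gradient step
    on ||y - X beta||^2, then keep only the k largest-magnitude
    coefficients and zero the rest."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # residuals and least-squares gradient
        r = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        grad = [-2.0 * sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        beta = [b - step * g for b, g in zip(beta, grad)]
        # hard threshold: retain the top-k coordinates
        keep = set(sorted(range(p), key=lambda j: abs(beta[j]), reverse=True)[:k])
        beta = [beta[j] if j in keep else 0.0 for j in range(p)]
    return beta

# Noiseless synthetic data with a sparse true signal
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
true = [2.0, 0.0, 0.0, -1.5, 0.0]
y = [sum(x * b for x, b in zip(row, true)) for row in X]
beta = iht(X, y, k=2)
```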





□ moslin: Mapping lineage-traced cells across time points

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536867v1

moslin, a Fused Gromov-Wasserstein-based model to couple matching cellular profiles across time points. moslin leverages both intra-individual lineage relations and inter-individual gene expression similarity.

moslin uses lineage information at two or more time points and includes the effects of cellular growth and stochastic cell sampling. The algorithm combines gene expression with lineage information at all time points to reconstruct precise differentiation trajectories.





□ MolCode: An Equivariant Generative Framework for Molecular Graph-Structure Co-Design

>> https://www.biorxiv.org/content/10.1101/2023.04.13.536803v1

MolCode, a roto-translation equivariant generative framework for Molecular graph-structure Co-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure.

MolCode not only consistently generates valid and diverse molecular graphs/structures with desirable properties, but also generates drug-like molecules with high affinity to target proteins, demonstrating MolCode’s potential applications in material design and drug discovery.





□ AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537157v1

AlcoR addresses the challenge of automatically modeling and distinguishing LCRs. AlcoR enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns.

AlcoR is reference- and alignment-free, providing additional methodologies for testing, incl. a highly flexible simulation method for generating biological sequences with different complexity levels, sequence masking, and automatic computation of LCR maps into ideograms.





□ De novo reconstruction of satellite repeat units from sequence data

>> https://arxiv.org/abs/2304.09729

Satellite Repeat Finder (SRF) is a de novo assembler for reconstructing SatDNA repeat units; it can identify most known HORs and SatDNA in well-studied species without prior knowledge of monomer sequences or repeat structures.

SRF uses a greedy algorithm to assemble SatDNA repeat units, but it may miss a lower-abundance, higher-diversity unit that shares long similar sequences with another. SRF may also reconstruct repeat units similar in sequence; such similar units may be mapped to the same genomic locus.





Cloud nine.

2023-04-24 04:43:44 | Science News





□ Cellcano: supervised cell type identification for single cell ATAC-seq data

>> https://www.nature.com/articles/s41467-023-37439-3

Cellcano adopts a two-round prediction strategy. In the first round, Cellcano trains a Multi-layer Perceptron (MLP) model on reference gene scores with known cell labels. Then, Cellcano uses the trained MLP to predict cell types on target gene scores.

With the predicted probability matrix, entropies are calculated for each cell. Cells with relatively low entropies are selected as anchors to train a Knowledge Distillation (KD) model. The trained KD model is used to predict cell types in the remaining non-anchor cells.
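The entropy-based anchor selection might look like this in outline, with an invented probability matrix standing in for the MLP's first-round output:

```python
import math

def entropy(p):
    """Shannon entropy of one cell's predicted class probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_anchors(prob_matrix, frac=0.5):
    """Pick the fraction of cells with the lowest prediction entropy
    (the most confident first-round calls) as anchors."""
    order = sorted(range(len(prob_matrix)), key=lambda i: entropy(prob_matrix[i]))
    k = max(1, int(len(prob_matrix) * frac))
    return order[:k]

# Invented predicted-probability rows for four cells
probs = [
    [0.98, 0.01, 0.01],  # very confident -> low entropy
    [0.40, 0.30, 0.30],  # uncertain
    [0.85, 0.10, 0.05],  # fairly confident
    [0.34, 0.33, 0.33],  # near-uniform -> high entropy
]
anchors = select_anchors(probs, frac=0.5)
```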





□ Building pangenome graphs

>> https://www.biorxiv.org/content/10.1101/2023.04.05.535718v1

PanGenome Graph Builder (PGGB), a reference-free pipeline to construct unbiased pangenome graphs. Its output presents a base-level representation of the pangenome, including variants of all scales from SNPs to SVs. The graph is unbiased - all genomes are treated equivalently.

PGGB uses an all-to-all alignment of the input sequences. PGGB makes no assumptions about phylogenetic relationships, orthology groups, or evolutionary histories, allowing the data to speak for themselves without risk of implicit bias that may affect inference made on the graph.





□ scMSGL: Kernelized multiview signed graph learning for single-cell RNA sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05250-y

scMSGL is based on recently developed graph signal processing (GSP) based graph learning, where GRNs and gene expressions are modeled as signed graphs and graph signals.

scMSGL learns functional relationships between genes across multiple related classes of single cell gene expression datasets under the assumption that there exists a shared structure across classes.

scMSGL formulates a highly efficient optimization framework that extends the signed graph learning approach to high dimensional datasets with multiple classes. The kernelization trick embedded within the algorithm renders it capable of handling sparse and noisy features.





□ SLAT: Spatial-linked alignment tool for aligning heterogenous slices properly

>> https://www.biorxiv.org/content/10.1101/2023.04.07.535976v1

SLAT (Spatially-Linked Alignment Tool), a graph-based algorithm for efficient and effective alignment of spatial omics data. SLAT is the first algorithm capable of aligning heterogeneous spatial data across distinct technologies and modalities.

By modeling the intercellular relationship as a spatial graph, SLAT adopts graph neural networks and adversarial matching for aligning spatial slices. SLAT calculates a similarity score for each aligned cell pair, making it possible to pinpoint spatially discrepant regions.





□ XClone: detection of allele-specific subclonal copy number variations from single-cell transcriptomic data

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535352v1

XClone comprises two modules: the B-allele frequency (BAF) of heterozygous variants and the sequencing read-depth ratio (RDR) of individual genes, which respectively detect variation states in allelic balance and absolute copy number and are combined to generate the final CNV states.

XClone implements three steps of haplotyping: from individual SNPs to a gene by population-based phasing, from consecutive genes to a gene bin by an expectation-maximization algorithm, and from gene bins to a chromosome arm by dynamic programming.

XClone employs two orthogonal strategies to smooth the CNV-state assignments on BAF/RDR: horizontally w/ hidden Markov models along the genome and vertically w/ a KNN cell-cell connectivity graph, which not only denoises the data but also preserves single-cell resolution.





□ TESA: A Weighted Two-stage Sequence Alignment Framework to Identify DNA Motifs from ChIP-exo Data

>> https://www.biorxiv.org/content/10.1101/2023.04.06.535915v1

TESA constructs a graph, in which vertices represent sequence segments and an unweighted edge connecting two vertices indicates a highly ranked similarity between them among all pairs of sequence segments between two sequences.

TESA identifies dense subgraphs as the seed for graph clustering. Then, TESA performs graph clustering based on seeds, leading to vertex clusters, each of which corresponds to a preliminary motif.

TESA optimizes the lengths of preliminary motifs using a bookend model. The sequence segments corresponding to the assembled clusters are called motif seeds. TESA refines the sequence segments for each motif by scoring them with the motif profile built from the motif seeds.





□ scART: recognizing cell clusters and constructing trajectory from single-cell epigenomic data

>> https://www.biorxiv.org/content/10.1101/2023.04.08.536108v1

scART integrates the MST and DDRTree algorithms used in reversed graph embedding (RGE), a population graph-based pseudotime analysis algorithm from scRNA-seq analysis.

scART predicts the developmental trajectory based on the lower-dimensional space that the cells lie upon and uses a cell-cell graph to describe the structure among cells. scART automatically identifies branch points that mark significant divergences in cellular states.





□ CLAMP: Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

>> https://arxiv.org/pdf/2303.03363.pdf

Scientific language models (SLMs) can utilize both natural language and chemical structure but are suboptimal activity predictors. Large language models have demonstrated great zero- and few-shot capabilities.

The SLMs Galactica and KV-PLM tokenize the SMILES representations of chemical structures and embed those chemical tokens in the same embedding space as language tokens.

CLAMP improves predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. CLAMP uses separate encoders for chemical and natural language data and embeds them into a joint embedding space.





□ LRU: Resurrecting Recurrent Neural Networks for Long Sequences

>> https://arxiv.org/abs/2303.06349

The Deep Linear Recurrent Unit (LRU) architecture is inspired by S4. The model is a stack of LRU blocks with nonlinear projections in between, and also uses skip connections and normalization methods like batch/layer normalization.

Normalizing the hidden activations on the forward pass is important when learning tasks w/ long sequences. While LRU shares similarities w/ modern deep state-space models, its design does not rely on discretization of a latent continuous-time system or on structured transition matrices.
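A single-channel toy of the diagonal linear recurrence with the stability and normalization tricks (an eigenvalue inside the unit disk, plus a gamma factor normalizing the hidden state); the real LRU uses learned complex diagonals with MLP projections between blocks:

```python
import cmath
import math

def lru_scan(u, nu=0.99, theta=0.1):
    """Toy single-channel linear recurrent unit:
    x_t = lam * x_{t-1} + gamma * u_t, with lam = nu * exp(i*theta)
    inside the unit disk (stable for long sequences) and
    gamma = sqrt(1 - |lam|^2) normalizing the hidden activation."""
    lam = nu * cmath.exp(1j * theta)
    gamma = math.sqrt(1.0 - abs(lam) ** 2)
    x, ys = 0j, []
    for u_t in u:
        x = lam * x + gamma * u_t
        ys.append(x.real)  # trivial linear readout: real part
    return ys

# A long constant input: the state stays bounded instead of blowing up
ys = lru_scan([1.0] * 300)
```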





□ K-RET: Knowledgeable Biomedical Relation Extraction System

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad174/7108769

K-RET is a flexible biomedical Relation Extraction (RE) system, allowing the use of any pre-trained BERT-based model (e.g., SciBERT and BioBERT) to inject knowledge in the form of knowledge bases from a single source, multiple sources, and multi-token entities.

K-RET adds a knowledge layer to entities from associations made w/ their domain ontologies. The tokens are flattened into a sequence for token embedding. The embedding and seeing layers are fed to the mask-transformer, a stack of multiple mask-self-attention blocks.





□ mixMVPLN: Finite Mixtures of Matrix Variate Poisson-Log Normal Distributions for Three-Way Count Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad167/7108770

mixMVPLN is an R package for performing model-based clustering of three-way count data using mixtures of matrix variate Poisson-log normal (mixMVPLN) distributions.

mixMVPLN provides three different frameworks: a Markov chain Monte Carlo expectation-maximization (MCMC-EM) algorithm, variational Gaussian approximations (VGAs), and a hybrid approach that combines the variational approximation-based and MCMC-EM-based approaches.





□ ICOR: improving codon optimization with recurrent neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05246-8

ICOR adopts the Bidirectional Long-Short-Term Memory (BiLSTM) architecture because of its ability to preserve temporal information from both the past and future. In a gene, the BiLSTM would theoretically use surrounding synonymous codons to make a prediction.

The ICOR architecture consists of a 12-layer RNN. It serves as the “brain” for the codon optimization tool. By providing the amino acid sequence as an input, ICORnet can output a nucleotide codon sequence that would ideally match the codon biases of the host genome.





□ Reconstruction Set Test (RESET): a computationally efficient method for single sample gene set testing based on randomized reduced rank reconstruction error

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535366v1

RESET quantifies gene set importance at both the sample-level and for the entire data based on the ability of genes in each set to reconstruct values for all measured genes.

RESET is realized using a computationally efficient randomized reduced rank reconstruction algorithm and can effectively detect patterns of differential abundance and differential correlation for both self-contained and competitive scenarios.





□ transXpress: a Snakemake pipeline for streamlined de novo transcriptome assembly and annotation

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05254-8

transXpress supports two popular assembly programs, Trinity and rnaSPAdes, and allows parallel execution on heterogeneous cluster computing hardware. The transXpress pipeline performs parallel execution of the underlying tools whenever possible.

transXpress splits the input datafiles (Trimmomatic / FASTA steps) into multiple partitions (batches) to speed up even single-threaded tasks by parallelization. The partial results files from such split tasks are then merged automatically back into a single output file.





□ bioseq2seq / LFNet: Improving deep models of protein-coding potential with a Fourier-transform architecture and machine translation task

>> https://www.biorxiv.org/content/10.1101/2023.04.03.535488v1

That bioseq2seq can recover potentially translated micropeptides is a proof-of-concept for using machine predictions to explore the cryptic space of the proteome. The Local Filter Network (LFNet) is a computationally efficient network layer based on the short-time Fourier transform.

The LFNet architecture will be of broad utility in biological sequence modeling tasks, w/ frequency-domain multiplication enabling larger-context convolutions than common convolutional architectures and a lower computational complexity of O(N log N) in comparison to transformers.
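The reason frequency-domain multiplication buys large-context convolutions is the convolution theorem: pointwise multiplication of spectra equals circular convolution in the signal domain, at O(N log N) via the FFT. The sketch below illustrates that principle only; it is not the LFNet layer itself.

```python
import numpy as np

def circular_conv_fft(x, h):
    """Circular convolution via frequency-domain multiplication, O(N log N).
    Illustrates the principle LFNet-style layers exploit."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

def circular_conv_direct(x, h):
    """O(N^2) direct reference implementation."""
    n = len(x)
    return np.array([sum(x[j] * h[(i - j) % n] for j in range(n))
                     for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 0.0, -1.0, 0.0])
print(np.allclose(circular_conv_fft(x, h), circular_conv_direct(x, h)))  # True
```

For a filter as long as the sequence, the FFT route costs O(N log N) versus O(N^2) for the direct sum, which is where the claimed efficiency comes from.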





□ SpliceAI-10k calculator for the prediction of pseudoexonization, intron retention, and exon deletion

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad179/7109800

SAI-10k-calc was designed to predict specific types of splicing aberrations, namely: pseudoexonization, partial intron retention, partial exon deletion, (multi)exon skipping, and whole intron retention.

SAI-10k-calc can process SpliceAI scores resulting from SNVs at any exonic or intronic position, but not scores resulting from indels due to the complexity of distance interpretations for such variants.





□ Modular response analysis reformulated as a multilinear regression problem

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad166/7109803

This formulation brings a number of advantages over the classical approach by providing a natural way to model data variability across experimental replicates, or even multiple perturbations at some or all of the modules.

This work dramatically extended the domain of application of MRA to much larger networks of sizes up to 1,000. This is a 100-fold increase compared to MRA with standard linear algebra, which had difficulties going beyond 10-node networks in the experiments.





□ BLAZE: Identification of cell barcodes from long-read single-cell RNA-seq

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02907-y

BLAZE eliminates the requirement for matched short-read scRNA-seq, simplifying long-read scRNA-seq workflows. BLAZE performs well across different sample types, sequencing depths, and sequencing accuracies and outperforms other barcode identification tools such as Sockeye.

BLAZE seamlessly integrates with the existing FLT-seq—FLAMES pipeline which performs UMI calling, read assignment, and mapping to enable the identification and quantification of RNA isoforms and their expression profiles across individual cells and cell types.





□ Genomics to Notebook (g2nb): extending the electronic notebook to address the challenges of bioinformatics analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.04.535621v1

The g2nb environment incorporates multiple bioinformatics software platforms within the notebook interface. A standard Jupyter notebook consists of a sequence of cells, each of which can contain text or executable code.

g2nb provides an interface within the notebook to tools that are hosted on a remote Galaxy or GenePattern server. g2nb presents a form-like interface similar to the web interface of the original platforms, requiring that an investigator provide only the input parameters and data.





□ THAPBI PICT - a fast, cautious, and accurate metabarcoding analysis pipeline

>> https://www.biorxiv.org/content/10.1101/2023.03.24.534090v1

The THAPBI PICT core workflow comprises data reduction to unique marker sequences, often called amplicon sequence variants (ASVs), discard of low abundance sequences to remove noise and artifacts, and classification using a curated reference database.





□ Smmit: Integrating multiple single-cell multi-omics samples

>> https://www.biorxiv.org/content/10.1101/2023.04.06.535857v1

Smmit, a computational pipeline that leverages existing integration methods to simultaneously integrate both samples and modalities and produces a unified representation of reduced dimensions.

Smmit builds upon existing integration methods of Harmony and Seurat. Smmit employs Harmony to integrate multiple samples within each data modality. Smmit applies Seurat’s Weighted Nearest Neighbor function to integrate multiple data modalities and produces a single UMAP space.





□ clusterMaker2: a major update to clusterMaker, a multi-algorithm clustering app for Cytoscape

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05225-z

clusterMaker2 provides new capabilities to use remote servers to execute algorithms asynchronously. clusterMaker2 performs a variety of analyses, incl. Leiden clustering to break the entire network into smaller clusters, hierarchical clustering and dimensionality reduction.





□ GenoVi: an open-source automated circular genome visualizer for bacteria and archaea

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010998

GenoVi automatically calculates the GC content and GC skew from a genome, and unless specified, assigns CDS to COG categories. GenoVi produces histograms, heatmaps and tables of COG categories and frequency, and a table with general information about each contig/replicon.

GenoVi, a Python command-line tool able to create custom circular genome representations for the analysis and visualization of microbial genomes and sequence elements.





□ streammd: fast low-memory duplicate marking using a Bloom filter

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad181/7110893

streammd is implemented as a C++ program running in a single process. A Bloom filter is initialized with k = 10 hash functions and a bit array sized to meet user-specified memory and false-positive requirements. Input is QNAME-grouped SAM records.
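The core of this design is the Bloom filter membership test: a template's signature hashes to k bit positions, and a record is a duplicate (with a tunable false-positive rate) iff all k bits are already set. A minimal hedged sketch of that mechanism in Python (streammd itself is C++ with k = 10):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a bit array.
    Sketch of the mechanism streammd uses, not its implementation."""
    def __init__(self, n_bits: int, k: int = 10):
        self.n_bits, self.k = n_bits, k
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k positions by salting a cryptographic hash; real
        # implementations use cheaper non-cryptographic hashes.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add_and_check(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for p in self._positions(item):
            byte, bit = divmod(p, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter(n_bits=1 << 16)
print(bf.add_and_check("chr1:12345:+"))  # False: first occurrence
print(bf.add_and_check("chr1:12345:+"))  # True: flagged as duplicate
```

The memory cost is the bit array alone, independent of how many distinct signatures are stored, which is what makes the approach low-memory for duplicate marking.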





□ Automatic block-wise genotype-phenotype association detection based on hidden Markov model

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05265-5

A Hidden Markov Model for the classification of influential sites. The states themselves are governed by a Markov process, with a starting state probability vector for the first site and a transition probability matrix when passing from one site to the next.

The algorithm accepts as input a matrix of genotypes and a vector of phenotypes, and alternates between updating the most probable state sequence and updating the model parameters, until finally halting and outputting its best estimate of the most probable state sequence.
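The "most probable state sequence" step in such an alternating scheme is the standard Viterbi decoding. A toy sketch with two states, influential vs. neutral sites; the probabilities are illustrative placeholders, not values estimated from data as the paper's EM-style loop would produce.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state sequence for an HMM (Viterbi algorithm)."""
    # V[t][s] = (best probability of reaching s at t, predecessor state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        V.append({s: max(
            (V[-1][p][0] * trans_p[p][s] * emit_p[s][o], p) for p in states)
            for s in states})
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(V) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

states = ("influential", "neutral")
start_p = {"influential": 0.2, "neutral": 0.8}
trans_p = {"influential": {"influential": 0.7, "neutral": 0.3},
           "neutral": {"influential": 0.1, "neutral": 0.9}}
emit_p = {"influential": {"assoc": 0.8, "none": 0.2},
          "neutral": {"assoc": 0.1, "none": 0.9}}
print(viterbi(["none", "assoc", "assoc"], states, start_p, trans_p, emit_p))
```

The transition matrix is what produces block-wise calls: a sticky "influential" state groups adjacent associated sites rather than flagging them one by one.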





□ scSPARKL: Apache Spark based parallel analytical framework for the downstream analysis of scRNA-seq data.

>> https://www.biorxiv.org/content/10.1101/2023.04.07.536003v1

scSPARKL leverages the power of Apache Spark to enable the efficient analysis of single-cell transcriptomic data. It incorporates six key operations: data reshaping, data preprocessing, cell/gene filtering, data normalization, dimensionality reduction, and clustering.

The dataframe is arranged according to the ranks of the rows. The retained rank matrix is used to reposition the obtained averages at their respective places. This makes it easier to compare values across different distributions, while preserving the actual coherence of the matrix.
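The rank-then-reposition step described above is, in essence, quantile normalization: sort each sample, average across samples at each rank, then place those averages back by each sample's original ranks. A plain-Python sketch of that idea (scSPARKL itself does this on Spark dataframes):

```python
def quantile_normalize(columns):
    """Quantile normalization: replace each value by the mean of the values
    sharing its rank across samples, preserving each sample's rank order.
    Hedged sketch of the rank-based step; ties here resolve by position."""
    n = len(columns[0])
    sorted_cols = [sorted(c) for c in columns]
    # Mean of the i-th smallest value across all samples.
    rank_means = [sum(c[i] for c in sorted_cols) / len(columns)
                  for i in range(n)]
    out = []
    for col in columns:
        ranks = sorted(range(n), key=lambda i: col[i])
        new = [0.0] * n
        for rank, i in enumerate(ranks):
            new[i] = rank_means[rank]
        out.append(new)
    return out

cells = [[5, 2, 3], [4, 1, 4], [3, 4, 6]]
print(quantile_normalize(cells))
```

After the transform, every sample shares the same value distribution, so cross-sample comparisons are no longer confounded by distributional differences.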





□ A modular metagenomics analysis system for integrated multi-step data exploration

>> https://www.biorxiv.org/content/10.1101/2023.04.09.536171v1

Each module is designed to complete a single analytic task (ex. de novo assembly), accepting a standardized input format (ex. CSV of paths to FastQ files) generated by antecedent modules, and generating a standardized output format(s) (ex. CSV of paths to assembled contigs).





□ Regression Transformer enables concurrent sequence regression and generation for molecular language modelling

>> https://www.nature.com/articles/s42256-023-00639-z

The Regression Transformer (RT), a method that abstracts regression as a conditional sequence modelling problem. This introduces a new direction for multitask language models, seamlessly bridging sequence regression and conditional sequence generation.

Despite solely relying on tokenization of numbers and cross-entropy loss, RT can successfully solve regression tasks. The same model can generate text sequences given continuous properties. The authors devise numerical encodings (NEs) to inform the model about the semantic proximity of numerical tokens.
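"Tokenization of numbers" here means splitting a continuous value into per-digit tokens that record each digit's decimal place, so a language model can read and emit numbers like text. A hedged sketch of that scheme; the exact token format below is an illustrative assumption, not RT's vocabulary.

```python
def tokenize_number(value: float, precision: int = 2):
    """Tokenize a number into per-digit tokens carrying their decimal place
    (digit, place), in the spirit of the Regression Transformer's number
    encoding. Token spelling '_d_p_' is an illustrative assumption."""
    text = f"{value:.{precision}f}"
    tokens, place = [], text.index(".") - 1
    for ch in text:
        if ch == ".":
            continue
        tokens.append(f"_{ch}_{place}_")
        place -= 1
    return tokens

print(tokenize_number(3.14))  # ['_3_0_', '_1_-1_', '_4_-2_']
```

Because each token carries its decimal place, the numerical encodings can assign nearby embeddings to tokens of similar magnitude, which is the "semantic proximity" the model exploits.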





□ GALBA: Genome Annotation with Miniprot and AUGUSTUS

>> https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1

GALBA is a fully automated pipeline that takes protein sequences of one or many species and a genome sequence as input, aligns the proteins to the genome with miniprot, trains AUGUSTUS, and then predicts genes with AUGUSTUS using the protein evidence.

GALBA uses miniprothint - an alignment scorer. miniprothint discards the least reliable evidence and separates the remaining evidence into high/low confidence. High-confidence evidence is used to select training gene candidates and is enforced during gene prediction w/ AUGUSTUS.





□ Comparison of transformations for single-cell RNA-seq data

>> https://www.nature.com/articles/s41592-023-01814-1

Variance-stabilizing transformations based on the delta method promise an easy fix for heteroskedasticity if the variance predominantly depends on the mean.

The study considers the acosh transformation, the shifted logarithm with pseudo-count y0 = 1 or y0 = 1 / (4α), and the shifted logarithm applied to CPM values.

The Pearson residuals-based transformation has attractive theoretical properties and, in these benchmarks, performed similarly well to the shifted logarithm transformation. It stabilizes the variance across all genes and is less sensitive to variations of the size factor.
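The delta-method transforms being compared are one-liners. The sketch below shows the shifted logarithm log(y/s + y0) and an acosh-based transform with overdispersion α; the exact acosh parameterization is my reading of the delta-method family and should be checked against the paper before reuse.

```python
import math

def shifted_log(y, size_factor=1.0, y0=1.0):
    """Shifted logarithm transform: log(y/s + y0)."""
    return math.log(y / size_factor + y0)

def acosh_transform(y, size_factor=1.0, alpha=0.05):
    """acosh variance-stabilizing transform for overdispersed counts
    with overdispersion alpha (parameterization is an assumption)."""
    return math.acosh(2.0 * alpha * y / size_factor + 1.0) / math.sqrt(alpha)

counts = [0, 1, 10, 100]
print([round(shifted_log(c), 3) for c in counts])
print([round(acosh_transform(c), 3) for c in counts])
```

Both grow logarithmically for large counts and stay near-linear for small ones, which is what "stabilizing" the mean-dependent variance amounts to.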





□ Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02903-2

This toolbox can be used as benchmark for cross-comparison of existing and future basecallers.

Transformer layers have gained popularity in other fields due to increased performance and speed. However, the top ten models all use RNN (LSTM) layers in their encoders. A direct comparison shows that RNNs outperform Transformer layers in all the metrics.





□ uTR: Decomposing mosaic tandem repeats accurately from long reads

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad185/7114028

uTR estimates a mosaic TR pattern for an input DNA string, but the pattern and the string may have a number of mismatches because of variants in units.





□ FISHFactor: A Probabilistic Factor Model for Spatial Transcriptomics Data with Subcellular Resolution

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad183/7114027

FISHFactor is a non-negative, spatially informed factor analysis model with a Poisson point process likelihood to model single-molecule resolved data, as for example obtained from multiplexed fluorescence in-situ hybridization methods.

FISHFactor integrates multiple cells by jointly inferring cell-specific factors and a weight matrix that is shared between cells. The model is implemented using the deep probabilistic programming language Pyro and the Gaussian process package GPyTorch.





□ methylR: a graphical interface for comprehensive DNA methylation array data analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad184/7114023

methylR, a complete pipeline for the analysis of both 450K and EPIC Illumina arrays which not only offers data visualization and normalization but also provides additional features such as the annotation of the genomic features resulting from the analysis.





□ The scverse project provides a computational ecosystem for single-cell omics data analysis

>> https://www.nature.com/articles/s41587-023-01733-8




FierceBiotech

Nothing good lasts forever. That sentiment held true for private biotech financing in 2022. After two record-setting years, fundraising finally fell, dipping 24% from the highs of 2021. Let’s take a deeper dive into 2022 VC trends.



Raft.

2023-04-24 04:42:44 | Science News
(Art by Beau Wright)



□ PAUSE: principled feature attribution for unsupervised gene expression analysis

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02901-4

PAUSE is a novel, fully unsupervised attribution method; the authors demonstrate how it can be used to identify important pathways in transcriptomic datasets when combined with biologically-constrained autoencoders.

PAUSE uses a pathway module VAE, which is a sparse variational autoencoder model with deep, non-linear encoders / decoders. pmVAE uses sparse masked weight matrices to separate the weights of the encoder and decoder neural networks into non-interacting modules for each pathway.





□ GRACES: Graph Convolutional Network-based Feature Selection for High-dimensional and Low-sample Size Data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad135/7135826

GRACES can select important features for HDLSS data. GRACES exploits latent relations between samples using various overfitting-reducing techniques to iteratively find a set of optimal features that give rise to the greatest decrease in the optimization loss.

GRACES outperforms HSIC Lasso and DNP (and other baseline methods) on both synthetic and real-world datasets. GRACES constructs a dynamic similarity graph based on the selected feature at each iteration; GRACES exploits advanced GCN (i.e., GraphSAGE) to refine sample embeddings.

GRACES is an iterative algorithm w/ 5 components: feature initialization, graph construction, neural network, multiple dropouts, and gradient computation. It involves considering weights along the dimensions corresponding to the selected features in the input weight matrix, w/o a bias vector.





□ Entropy predicts sensitivity of pseudo-random seeds

>> https://www.biorxiv.org/content/10.1101/2022.10.13.512198v2

Although the entropy curves are in general more spread out, the relative distances are relatively well preserved. The relative increase in entropy correlates well with the relative increase in sensitivity. The work also provides 3 new seed constructs: mixedstrobes, altstrobes, and multistrobes.

Pseudo-random seed constructs reduce the high overlap that adjacent k-mers share, because that overlap is removed by subsampling. Since the minimap2 implementation is centered around minimizers, aligners customized for strobemers or other pseudo-random seeds may yield further improvements.






□ Barren plateaus in quantum tensor network optimization

>> https://quantum-journal.org/papers/q-2023-04-13-974/

Analyzing the barren plateau phenomenon in the variational optimization of quantum circuits inspired by matrix product states (qMPS), tree tensor networks (qTTN), and the multiscale entanglement renormalization ansatz (qMERA).

The variance of the cost function gradient decreases exponentially with the distance of a Hamiltonian term from the canonical centre in the quantum tensor network. For qMPS most gradient variances decrease exponentially, while for qTTN as well as qMERA they decrease polynomially.

Focusing on k-local Hamiltonians, i.e. sums of observables which act on at most k qubits. One example of a 2-local Hamiltonian is the transverse-field quantum Ising chain. qMPS avoids the barren plateau problem for a Hamiltonian that is a sum of local terms acting on all qubits.





□ Bayes Hilbert Spaces for Posterior Approximation

>> https://arxiv.org/abs/2304.09053

Bayes Hilbert spaces are studied in functional data analysis in the context where observed functions are probability density functions and their application to computational Bayesian problems is in its infancy.

Exploring Bayes Hilbert spaces and their connection to Bayesian computation, in particular novel connections to Bayesian coreset algorithms, including coresets constructed using the Kullback-Leibler divergence.





□ Categorical Structure in Theory of Arithmetic

>> https://arxiv.org/abs/2304.05477

A categorical analysis of the arithmetic theory 𝐼Σ1. It provides a categorical proof of the classical result that the provably total recursive functions in 𝐼Σ1 are exactly the primitive recursive functions. They construct the category PriM and show it is a pr-coherent category.

The strategy is to construct a coherent theory of arithmetic T and prove that T presents the initial coherent category equipped with a parametrised natural number object. T is the Π2-fragment of 𝐼Σ1, and they conclude the two have the same class of provably total recursive functions.





□ Categories of hypermagmas, hypergroups, and related hyperstructures

>> https://arxiv.org/abs/2304.09273

Investigating the categories of hyperstructures that generalize hypergroups. By allowing hyperoperations w/ possibly empty products, one obtains categories with desirable features such as completeness and cocompleteness, free functors, regularity, and closed monoidal structures.

Unital, reversible hypermagmas -- mosaics -- form a worthwhile generalization of (canonical) hypergroups from the categorical perspective. Notably, mosaics contain pointed simple matroids as a subcategory, and projective geometries as a full subcategory.










□ COVET / ENVI: The covariance environment defines cellular niches for spatial inference

>> https://www.biorxiv.org/content/10.1101/2023.04.18.537375v1

COVET (the covariance environment), a representation that can capture the rich, continuous multivariate nature of cellular niches by capturing the gene-gene covariance structure across cells in the niche, which can reflect the cell-cell communication between them.

ENVI (Environmental variational inference), a conditional variational autoencoder that jointly embeds spatial and single-cell RNA-seq data into a latent space.

ENVI architecture includes a single encoder for both spatial and single-cell genomics data, and two decoder networks—one for the full transcriptome, and the second for the COVET matrix, providing spatial context.





□ LogBTF: Gene regulatory network inference using Boolean threshold network model from single-cell gene expression data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad256/7133738

LogBTF, a novel embedded Boolean threshold network method which effectively infers GRN by integrating regularized logistic regression and Boolean threshold function.

First, the continuous gene expression values are converted into Boolean values and the elastic net regression model is adopted to fit the binarized time series data.

Then, the estimated regression coefficients are applied to represent the unknown Boolean threshold function of the candidate Boolean threshold network as the dynamical equations.

To overcome the multi-collinearity and over-fitting problems, an effective approach is designed to optimize the network topology by adding a perturbation design matrix to the input data and thereafter setting sufficiently small elements of the output coefficient vector to zeros.
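The dynamical object being fitted, a Boolean threshold network, has a simple update rule: gene i switches on iff its weighted input exceeds its threshold. A hedged toy sketch of those dynamics (hand-picked weights, not coefficients estimated by LogBTF's elastic net):

```python
def boolean_threshold_step(state, weights, thresholds):
    """One synchronous update of a Boolean threshold network: gene i turns
    on iff its weighted input exceeds its threshold. Illustrative dynamics
    for the model class LogBTF fits; weights here are hand-picked."""
    n = len(state)
    return [1 if sum(weights[i][j] * state[j] for j in range(n)) > thresholds[i]
            else 0 for i in range(n)]

# Toy 3-gene network: gene 0 activates gene 1; gene 1 represses gene 2,
# which also self-activates.
weights = [[0, 0, 0],
           [1, 0, 0],
           [0, -1, 1]]
thresholds = [0.5, 0.5, 0.0]
state = [1, 0, 1]
print(boolean_threshold_step(state, weights, thresholds))  # [0, 1, 1]
```

In LogBTF the signs and magnitudes of the fitted logistic-regression coefficients play the role of these weights, so the inferred network doubles as a set of dynamical equations.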





□ TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis

>> https://arxiv.org/abs/2210.02186

TimesNet extends the analysis of temporal variations into the 2D space by transforming the 1D time series into a set of 2D tensors. This transformation can embed the intra/inter period-variations into the 2D tensors, allowing the 2D variations to be modeled by 2D kernels.

TimesNet uses TimesBlock as a task-general backbone for time series analysis. TimesBlock can discover the multi-periodicity adaptively and extract the complex temporal variations from transformed 2D tensors by a parameter-efficient inception block.
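The 1D-to-2D transformation itself is just a period-based fold: rows index successive periods and columns index the phase within a period, so intra-period variation runs along rows and inter-period variation down columns. A minimal sketch (TimesNet selects the periods via FFT, which is omitted here):

```python
def fold_by_period(series, period):
    """Fold a 1D series into a 2D grid (rows = periods, cols = phase),
    zero-padding the tail; the core TimesNet 1D-to-2D transformation."""
    padded = series + [0.0] * (-len(series) % period)
    return [padded[i:i + period] for i in range(0, len(padded), period)]

print(fold_by_period([1, 2, 3, 4, 5, 6, 7], period=3))
# [[1, 2, 3], [4, 5, 6], [7, 0.0, 0.0]]
```

Once folded, ordinary 2D convolution kernels see both kinds of variation at once, which is what the inception block in TimesBlock operates on.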





□ GearNet: Protein Representation Learning by Geometric Structure Pretraining

>> https://arxiv.org/abs/2203.06125

GearNet (GeomEtry-Aware Relational Graph Neural Network) a simple yet effective structure-based encoder, which encodes spatial information by adding different types of sequential or structural edges and then performs relational message passing on protein residue graphs.

GearNet uses a sparse edge message passing mechanism to enhance the protein structure encoder, which is the first attempt to incorporate edge-level message passing on GNNs for protein structure encoding.





□ Automatic Gradient Descent: Deep Learning without Hyperparameters

>> https://arxiv.org/abs/2304.05187

The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture.

Automatic gradient descent trains both fully-connected and convolutional networks. This framework is properly placed in the context of existing frameworks such as the majorise-minimise meta-algorithm, mirror descent and natural gradient descent.





□ HyperDB: A hyper-fast local vector database for use with LLM Agents. HyperDB separates relevant from irrelevant documents with the support of hardware-accelerated vector operations

>> https://github.com/jdagdelen/hyperDB




□ RedPajama

>> https://www.together.xyz/blog/redpajama

The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.





□ RNA covariation at helix-level resolution for the identification of evolutionarily conserved RNA structure

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536965v1

R-scape calculates covariation between all pairs of positions in an alignment. However, RNA base pairs do not occur in isolation. The Watson-Crick base pairs stack together forming helices that constitute the scaffold that facilitates the formation of the non-WC base pairs.

Helix-level aggregated covariation increases sensitivity in the detection of evolutionarily conserved RNA structure. To achieve this, a new measure has been introduced that aggregates the covariation significance and power calculated at the base-pair level resolution.





□ SPADE: Spatial Deconvolution for Domain Specific Cell-type Estimation

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536924v1

SPADE (SPAtial DEconvolution) incorporates spatial patterns during cell type decomposition. SPADE utilizes a combination of scRNA-seq data, spatial location information, and histological information to estimate the proportion of cell types present at each spatial location.

The SPADE algorithm formulates the cell type deconvolution task as a constrained nonlinear optimization problem. It aims to minimize the relative error between true and estimated gene expression while adhering to non-negativity and sum-to-one constraints.





□ cogeqc: Assessing the quality of comparative genomics data and results

>> https://www.biorxiv.org/content/10.1101/2023.04.14.536860v1

cogeqc calculates a protein domain-aware orthogroup score that aims at maximizing the number of shared protein domains within the same orthogroup.

The assessment of synteny detection consists of representing anchor gene pairs as a synteny network and analyzing its graph properties, such as clustering coefficient, node count, and scale-free topology fit.





□ Building Block-Based Binding Predictions for DNA-Encoded Libraries

>> https://chemrxiv.org/engage/chemrxiv/article-details/6438943f08c86922ffeffe57

A method for analyzing DNA-encoded library (DEL) selection data at the building block-level, with the goal of gaining insights we can use to design better DELs for subsequent screening rounds.

A simple and interpretable method is developed to predict the behavior of new building blocks, their interactions with known building blocks, and the activity of full compounds.

The method calculates all-by-all similarity matrices for building blocks at each position individually and then evaluates combinatorial effects at a later step in the analysis. Additionally, it mimics considerations involved in library design.





□ DESpace: spatially variable gene detection via differential expression testing of spatial clusters

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537189v1

DESpace, a novel approach to discover spatially variable genes (SVGs). The framework inputs all types of SRT data, summarizes spatial information via spatial clusters, and identifies spatially variable genes by performing differential gene expression testing between clusters.

DESpace displays a higher true positive rate than competitors while controlling false positives and the FDR. DESpace leads to analogous results when inputting spatial clusters estimated from StLearn or BayesSpace, which indicates that DESpace is robust with respect to the spatial clusters provided.





□ Identification of genetic variants that impact gene co-expression relationships using large-scale single-cell data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02897-x

Conducting a co-eQTL meta-analysis across four scRNA-seq peripheral blood mononuclear cell datasets using a novel filtering strategy followed by a permutation-based multiple testing approach.

Part of the variable correlation could be explained by the sparsity of the single-cell data, as higher expressed gene pairs correlated better, but at least a few example cases showed the potential occurrence of Simpson’s paradox.





□ BREADR: An R Package for the Bayesian Estimation of Genetic Relatedness from Low-coverage Genotype Data

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537144v1

BREADR (Biological RElatedness from Ancient DNA in R) leverages the so-called pairwise mismatch rate, calculated on optimally-thinned genome-wide pseudo-haploid sequence data, to estimate genetic relatedness up to the second degree, assuming an underlying binomial distribution.

BREADR also returns a posterior probability for each degree of relatedness, from identical twins/same individual, first-degree, second-degree or "unrelated" pairs, allowing researchers to quantify and report the uncertainty, even for particularly low-coverage data.
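The summary statistic at the heart of this is simple: count mismatching pseudo-haploid calls over overlapping sites, then model that count binomially with a degree-specific expected rate. A hedged sketch of both pieces (the degree-specific rates and priors BREADR uses are omitted):

```python
from math import comb

def pairwise_mismatch_rate(genotypes_a, genotypes_b):
    """Pairwise mismatch rate over overlapping pseudo-haploid sites
    (None = missing); the summary statistic BREADR models."""
    pairs = [(a, b) for a, b in zip(genotypes_a, genotypes_b)
             if a is not None and b is not None]
    mismatches = sum(a != b for a, b in pairs)
    return mismatches, len(pairs), mismatches / len(pairs)

def binomial_likelihood(k, n, p):
    """Binomial likelihood of k mismatches in n sites at expected rate p,
    one ingredient of the posterior over relatedness degrees."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

a = [0, 1, 1, None, 0, 1]
b = [0, 1, 0, 1, 0, None]
k, n, rate = pairwise_mismatch_rate(a, b)
print(k, n, round(rate, 3))  # 1 4 0.25
```

Evaluating the binomial likelihood at each degree's expected mismatch rate and normalizing gives the posterior over degrees that BREADR reports.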





□ NucleosomeDB - a database of 3D nucleosome structures and their complexes with comparative analysis toolkit

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537230v1

NucleosomeDB allows researchers to search, explore, and compare nucleosomes with each other, despite differences in composition and peculiarities of their representation.

By utilizing the information contained within the NucleosomeDB, researchers can gain valuable insights into how nucleosomes interact with DNA and other proteins, assess the implications of mutations and protein binding on nucleosome structure.





□ MARVEL: an integrated alternative splicing analysis platform for single-cell RNA sequencing data

>> https://academic.oup.com/nar/article/51/5/e29/6985826

MARVEL, a comprehensive R package for single-cell splicing analysis applicable to RNA-seq generated from the plate- and droplet-based methods. MARVEL enables systematic and integrated splicing and gene expression analysis of single cells to characterize the splicing landscape.

MARVEL uses a splice junction-based approach to compute PSI values. For MAST, MARVEL computes the number of genes detected per cell (gene detection rate) and includes this variable as a covariate in the zero-inflated regression model.





□ Transcriptome Complexity Disentangled: A Regulatory Elements Approach

>> https://www.biorxiv.org/content/10.1101/2023.04.17.537241v1

By using the prior knowledge of the critical roles of transcription factors and microRNAs in gene regulation, it can establish a low-dimensional representation of cell states and infer the entire transcriptome from a limited number of regulatory elements.

The value of a reduced cell state representation lies in its ability to capture the gene expression distribution not only under normal conditions but also under various perturbations, such as drugs, mutations, or gene knockouts.





□ Benchmarking causal reasoning algorithms for gene expression-based compound mechanism of action analysis

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05277-1

According to statistical analysis (negative binomial model), the combination of algorithm and network most significantly dictated the performance of causal reasoning algorithms, with the SigNet recovering the greatest number of direct targets.

CARNIVAL with the Omnipath network was able to recover the most informative pathways containing compound targets, based on the Reactome pathway hierarchy. CARNIVAL, SigNet and CausalR ScanR all outperformed baseline gene expression pathway enrichment results.





□ PyAGH: a python package to fast construct kinship matrices based on different levels of omic data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05280-6

PyAGH can calculate additive, dominant and epistatic kinship matrices based on genomic data within one population and different additive kinship matrices across multiple populations efficiently.

PyAGH supports construction of kinship matrices using pedigree, microbiome and transcriptome data. In addition, the output of PyAGH can be easily provided to downstream mainstream software, such as DMU, GCTA, GEMMA and BOLT-LMM.





□ StableLM: Stability AI Language Models

>> https://github.com/Stability-AI/StableLM

StableLM-Tuned-Alpha is the fine-tuned model with Stanford Alpaca's procedure, using a combination of five recent datasets for conversational agents: Stanford's Alpaca, Nomic-AI's gpt4all, RyokoAI's ShareGPT52K datasets, Databricks labs' Dolly, and Anthropic's HH.

StableLM-Alpha models are trained on a new dataset that builds on The Pile and contains roughly 3x as many tokens. These models will be trained on up to 1.5 trillion tokens. The context length for these models is 4096 tokens.





□ PyHMMER: A Python library binding to HMMER for efficient sequence analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad214/7131068

PyHMMER provides Python integration of the popular profile Hidden Markov Model software HMMER via Cython bindings. A new parallelization model greatly improves performance when running multithreaded searches, while producing the exact same results as HMMER.

PyHMMER increases flexibility of use, allowing users to create queries directly from Python code, launch searches and obtain results without I/O, and access previously unavailable statistics like uncorrected p-values.





□ Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss

>> https://www.science.org/doi/10.1126/sciadv.adg6175

Grambank - a systematic sample of the structural diversity of the world’s languages. With over 400,000 data points, Grambank covers 2467 languages, grammatical phenomena in 195 features, from word order to verbal tense, nominal plurals, and many other linguistic variables.

Grambank deploys a Bayesian regression model of unusualness. The spatial and phylogenetic effects are both variance covariance (VCV) matrices based on a Brownian motion approach. The spatial data are taken from Glottolog, and the phylogeny is the global language tree.

Grambank uses the Agglomerated Endangerment Scale (AES) and categorizes languages as either non-threatened or threatened. The rest of the analysis differs in that it uses BRMS rather than Bayesian inference for latent Gaussian models (INLA).





□ UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining

>> https://arxiv.org/abs/2304.09151

UNIMAX, a new sampling method that delivers more uniform coverage of head languages while mitigating overfitting on tail languages by explicitly capping the number of repeats over each language’s corpus.

UNIMAX controls the extent of data repeats of any language, providing a direct solution to overfitting on low-resource languages, w/o imposing any reprioritization on higher-resource languages. UNIMAX performs well across several benchmarks and model scales, up to 13 billion parameters.
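The epoch-capping idea can be sketched as a greedy allocation over languages sorted by corpus size (a simplified reading of the procedure; the function name and example numbers are illustrative):

```python
def unimax_allocation(corpus_sizes, total_budget, max_epochs=4):
    """Sketch of UNIMAX-style budget allocation: distribute a token budget
    as uniformly as possible across languages, capping each language at
    max_epochs passes over its corpus to limit repeats on tail languages."""
    # Process languages from smallest corpus to largest.
    order = sorted(corpus_sizes, key=corpus_sizes.get)
    alloc = {}
    remaining_budget = total_budget
    remaining_langs = len(order)
    for lang in order:
        cap = max_epochs * corpus_sizes[lang]
        fair_share = remaining_budget / remaining_langs
        # A tail language takes its cap; a head language takes its fair share.
        alloc[lang] = min(cap, fair_share)
        remaining_budget -= alloc[lang]
        remaining_langs -= 1
    return alloc
```

With a 300-token budget over corpora of 1000, 10, and 5 tokens, the two tail languages are capped at 4 epochs (20 and 40 tokens) and the head language absorbs the rest, rather than dominating the mixture proportionally.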





□ satmut_utils: a simulation and variant calling package for multiplexed assays of variant effect

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02922-z

The satmut_utils “call” workflow is an end-to-end variant caller for MAVEs that supports direct analysis of targeted sequencing data from both (a) amplicon and (b) rapid amplification of cDNA ends (RACE)-like library preparation methods.

The satmut_utils “sim” workflow takes a Variant Call Format (VCF) and alignment (BAM) file with paired reads as input and generates variants in the reads at specified frequencies. Outputs are a VCF of true positive (truth) variants and counts, along with edited reads (FASTQ).

The number of fragments to edit and the read positions to edit are determined for each variant based on specified frequencies in the input VCF. “sim” employs a heuristic to sample reads for editing at each target position while prohibiting variant conversion.
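The frequency-driven choice of how many fragments to edit might look like the following (a hypothetical helper for illustration, not the actual satmut_utils code):

```python
def reads_to_edit(variant_freq, fragment_coverage):
    """Hypothetical helper mirroring the 'sim' logic: the number of
    fragments to edit for a variant is determined by its specified
    frequency in the input VCF and the fragment coverage at the site."""
    n = round(variant_freq * fragment_coverage)
    # Require at least one edited fragment for any nonzero frequency,
    # so the truth VCF never contains an unobservable variant.
    return max(1, n) if variant_freq > 0 else 0
```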





□ pycoMeth: a toolbox for differential methylation testing from Nanopore methylation calls

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02917-w

pycoMeth Meth_Seg, a Bayesian changepoint detection algorithm for multi-read-group segmentation of methylation profiles, designed for the de novo discovery of methylation patterns from multiple (haplotyped) ONT sequenced samples.

pycoMeth Meth_Seg takes into account an arbitrary number of read groups (e.g., biological samples, haplotypes, or individual molecules/reads) to detect a dynamic set of methylation patterns from which it then derives a single consensus segmentation.





□ DFHiC: A dilated full convolution model to enhance the resolution of Hi-C data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad211/7135829

DFHiC has no restrictions on the input size of Hi-C data. The limitation caused by cutting the matrix is eliminated, and the Hi-C matrix no longer needs to be divided into several parts to enhance an entire chromosome.

The dilated convolution effectively explores the global patterns in the overall Hi-C matrix by exploiting information at longer genomic distances.





□ PSGRN: A gene regulatory network inference model based on pseudo-siamese network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05253-9

PSGRN (pseudo-Siamese GRN), a multilevel, multi-structure framework for inferring large-scale GRNs from time-series expression datasets.

Based on the pseudo-Siamese network, a gated recurrent unit captures the temporal features of each TF and target matrix, and the DenseNet framework learns the spatial features of the merged matrices. Finally, a sigmoid function evaluates interactions.







Morph.

2023-04-24 04:40:44 | Science News

(Artwork by ekaitza)




□ D-SPIN constructs gene regulatory network models from multiplexed scRNA-seq data revealing organizing principles of cellular perturbation response

>> https://www.biorxiv.org/content/10.1101/2023.04.19.537364v1

D-SPIN (Dimension-reduced Single-cell Perturbation Integration Network), a mathematical modeling and network inference framework that constructs gene regulatory network models directly from single-cell perturbation-response data.

D-SPIN exploits a natural factoring within the mathematical structure of Markov random fields inference to separate the learning problem into two steps, construction of a unified GRN and inference of how each perturbation interacts w/ the gene programs within the unified network.






□ Dynamic Jacobian Ensemble: Emergent stability in complex network dynamics

>> https://www.nature.com/articles/s41567-023-02020-8

The dynamic Jacobian ensemble allows systematic investigation of the fixed-point dynamics of a range of relevant network-based models. Within this ensemble, complex systems exhibit discrete stability classes, ranging from asymptotically unstable to sensitive.

The asymptotic predictions capture the system’s global stability, but have no bearing on the dynamic stability of small motifs or sub-networks, which may be locally unstable.

Still, in an asymptotically stable system, the global impact of such unstable motifs vanishes in the limit of large N, and hence the system as a whole remains insensitive to these local discrepancies.





□ Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1

GPT-4 can automatically annotate cell types by utilizing marker gene information. GPT-4 annotations fully or partially match manual annotations for at least 75% of cell types, demonstrating GPT-4’s ability to generate cell type annotations comparable to those of human experts.

GPT-4 offers cost-efficiency and seamless integration into existing single-cell analysis pipelines, such as Seurat and Scanpy. For each cell type, reproducibility is defined as the proportion of instances in which GPT-4 generates the most prevalent cell type annotation.





□ Applications of transformer-based language models in bioinformatics: a survey

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad001/6984737

Transformer-based models have pushed SOTA performance by a large margin in most bioinformatics tasks. GeneBERT was pre-trained using large-scale genomic data in a multi-modal and self-supervised manner.

scBERT reused large-scale unlabeled scRNA-seq data to accurately capture the expression information of a single gene and the gene–gene interactions. The accuracy of scBERT in the prediction of novel and known cell types increased by 0.155 and 0.158, respectively.





□ Categories enriched over symmetric closed multicategories

>> https://arxiv.org/abs/2304.11227

Constructing a machine which takes as input a locally small symmetric closed complete multicategory V. Its output is again a locally small symmetric closed complete multicategory V-Cat, the multicategory of small V-categories and multi-entry V-functors.

A complete multicategory V is a multicategory which has all small products and all equalizers. Morphisms are short multilinear maps. The internal hom object is a vector space of multilinear maps. The symmetric multicategory has products and kernels / equalizers.





□ RMV-VAE: Representation Learning to Effectively Integrate and Interpret Omics Data

>> https://www.biorxiv.org/content/10.1101/2023.04.23.537975v1

RMV-VAE (Regularised Multi-View Variational Autoencoder) is composed of two Variational Autoencoders that take datasets as input and generate a regularised low dimensional representation of the data.

RMV-VAE uses a reconstruction loss between the model’s input X and the output Xˆ and a KL divergence between the encoded data and a Normal distribution; the model is thus forced to learn the "real" signal present in the data, prioritising signal over noise.

RMV-VAE formulates an ad-hoc regularisation of the latent space to obtain embeddings where patients with similar expression of fundamental genes are found close together.
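The combined objective described above can be sketched with the closed-form Gaussian KL term (a minimal numpy sketch of a standard VAE loss, not the RMV-VAE implementation, which adds its own latent-space regularisation on top):

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Sketch of a VAE objective of the kind RMV-VAE builds on:
    reconstruction loss between input X and output X^ plus the
    closed-form KL divergence between the encoded Gaussian
    N(mu, sigma^2) and a standard Normal prior."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl
```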





□ Aligning distant sequences to graphs using long seed sketches

>> https://genome.cshlp.org/content/early/2023/04/18/gr.277659.123

The method uses long inexact seeds based on Tensor Sketching; to efficiently retrieve similar sketch vectors, the sketches of nodes are stored in a Hierarchical Navigable Small World (HNSW) index.

The method scales to graphs with 1 billion nodes, with time and memory requirements for preprocessing growing linearly with graph size and query time growing quasi-logarithmically with query length.





□ CCNNs: Topological Deep Learning: Going Beyond Graph Data

>> https://www.researchgate.net/publication/370134352_Topological_Deep_Learning_Going_Beyond_Graph_Data

Combinatorial complexes, a novel type of topological domain. Combinatorial complexes can be seen as generalizations of graphs that maintain certain desirable properties. Similar to hypergraphs, combinatorial complexes impose no constraints on the set of relations.

Combinatorial complexes permit the construction of hierarchical higher-order relations, analogous to those found in simplicial / cell complexes. A general class of message-passing combinatorial complex neural networks (CCNNs) is developed, focusing primarily on attention-based CCNNs.

Combinatorial complexes generalize and combine useful traits of both hypergraphs and cell complexes, which have emerged as two promising abstractions that facilitate the generalization of graph neural networks to topological spaces.





□ Cameron R. Wolfe RT

>> https://twitter.com/cwolferesearch/status/1649476511248818182

Nearly all recently-proposed large language models (LLMs) are based upon the decoder-only transformer architecture. But, is this always the best architecture to use? It depends… 🧵 [1/8]





□ HOTSPOT: Hierarchical hOst predicTion for aSsembled Plasmid cOntigs with Transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad283/7136643

HOTSPOT is based on a phylogenetic tree of plasmids' hosts from phylum to species. By incorporating the Transformer model, in each node’s taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs.

HOTSPOT conducts a hierarchical search from the root node down to lower ranks to predict the taxon. The tree search allows an early stop when the prediction uncertainty based on Monte Carlo Dropout is above a given cutoff, improving prediction accuracy with minimal loss in resolution.

The Transformer block applied in HOTSPOT is the Transformer encoder, which can convert the input sentence into a latent vector w/ a fixed length. The feature vectors output by the 2 Transformer blocks and the Inc one-hot vector will be concatenated for the taxon classification.
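The top-down search with Monte Carlo Dropout early stopping can be sketched as follows (the tree and classifier interfaces here are hypothetical, standing in for HOTSPOT's per-node Transformer classifiers):

```python
def hierarchical_predict(contig, tree, classify, cutoff=0.3, root="root"):
    """Sketch of a HOTSPOT-style top-down tree search. `tree` maps a node
    to its child taxa; `classify(node, contig)` returns the predicted
    child and a Monte Carlo Dropout uncertainty. The search descends from
    the root and stops early when uncertainty exceeds the cutoff,
    trading taxonomic resolution for accuracy."""
    node = root
    path = []
    while tree.get(node):                 # stop at a leaf (species level)
        child, uncertainty = classify(node, contig)
        if uncertainty > cutoff:          # early stop: prediction too unsure
            break
        node = child
        path.append(child)
    return path
```

The returned path is the deepest confidently predicted lineage; a plasmid whose species-level classifier is unsure still receives a genus- or family-level host call.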





□ Scaling Transformer to 1M tokens and beyond with Recurrent Memory Transformer

>> https://arxiv.org/abs/2304.11062

By employing a recurrent approach and memory, the quadratic complexity can be reduced to linear. Furthermore, models trained on sufficiently large inputs can extrapolate their abilities to texts orders of magnitude longer.

While larger models (OPT-30B, OPT-175B) tend to exhibit near-linear scaling on relatively short sequences up to 32,000, they reach quadratic scaling on longer sequences. Smaller models (OPT-125M, OPT-1.3B) demonstrate quadratic scaling even on shorter sequences.

RMT can successfully extrapolate to tasks of varying lengths, including those exceeding 1 million tokens, with linear scaling of the required computation. On sequences w/ 2,048,000 tokens, RMT can run w/ ×29 fewer FLOPs than OPT-175B and ×295 fewer FLOPs than OPT-135M.
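The segment-level recurrence behind the linear scaling can be illustrated schematically (the `step` function below stands in for a full transformer block with memory read/write tokens):

```python
import numpy as np

def rmt_process(tokens, segment_len, mem_size, step):
    """Conceptual sketch of Recurrent Memory Transformer recurrence:
    the input is split into fixed-size segments, and a small set of
    memory vectors is read and written at each step, carrying state
    between segments."""
    memory = np.zeros(mem_size)
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        # Attention cost per step is O(segment_len**2), constant in the
        # total length, so the whole pass is linear in len(tokens).
        memory = step(memory, segment)
    return memory
```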





□ Consequences and opportunities arising due to sparser single-cell RNA-seq datasets

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02933-w

As zeros become more abundant, a binarized expression might be as informative as counts. Using ~1.5 million cells, a strong point-biserial correlation is observed between the normalized expression counts and their respective binarized variant, although differences b/n datasets exist.

This strong correlation implies that the binarized signal already captures most of the signal present in the normalized count data. This strong correlation is primarily explained by the detection rate and the variance of the non-zero counts of a cell.
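The point-biserial correlation for one cell's expression vector against its own binarization can be computed directly (a minimal sketch; it equals the Pearson correlation with a dichotomous variable):

```python
import numpy as np

def point_biserial(counts):
    """Point-biserial correlation between a cell's normalized counts and
    their binarized variant (1 if detected, 0 otherwise)."""
    binarized = (counts > 0).astype(float)
    return float(np.corrcoef(counts, binarized)[0, 1])
```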





□ CellHeap: A scRNA-seq workflow for large-scale bioinformatics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.19.537508v1

CellHeap, a flexible, portable, and robust platform for analyzing large scRNA-seq datasets, with quality control throughout the execution steps, and deployable on platforms that support large-scale data, such as supercomputers or clouds.

Each CellHeap phase can include many computational tools, coupled such that inputs and outputs consume/generate data in a flow that meets the requirements of subsequent phases. It employs quality control to ensure correct results and relies on high-performance parallelization.





□ TopoDoE: A Design of Experiment strategy for selection and refinement in ensembles of executable Gene Regulatory Networks

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537619v1

TopoDoE, an iterative method for the in silico identification of the most informative perturbation, that is, the one eliminating as many incorrect candidate GRNs as possible from the data gathered in one experiment.

GRNs generated by WASABI were defined by a mechanistic model of gene expression based on coupled piecewise-deterministic Markov processes (PDMPs) governing how the mRNA and protein quantities change over time.

When applied as a follow-up step to WASABI’s GRN inference algorithm, the presented network selection strategy first identifies and removes incorrect GRN topologies and then recovers a new GRN that fits the experimental data better than any other candidate.





□ matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad197/7135828

matchRanges computes for each range a propensity score, the probability of assigning a range to focal or background groups, given a chosen set of covariates. It provides 3 methods incl. nearest-neighbor matching, rejection sampling, and stratified sampling for null set selection.

matchRanges provides utilities for accessing matched data, assessing matching quality, and visualizing covariate distributions. The code has been optimized to accommodate genome scale data, matchRanges can efficiently process sets of millions of loci in seconds on a single core.
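Given precomputed propensity scores, the nearest-neighbor matching option can be sketched as a greedy without-replacement search (an illustration of the idea, not the package's implementation):

```python
def nearest_neighbor_match(focal_scores, pool_scores):
    """Sketch of nearest-neighbor propensity matching: for each focal
    range's propensity score, pick the unmatched background range whose
    score is closest, yielding a covariate-matched null set."""
    available = dict(enumerate(pool_scores))
    matched = []
    for s in focal_scores:
        # Greedy nearest neighbor without replacement.
        j = min(available, key=lambda k: abs(available[k] - s))
        matched.append(j)
        del available[j]
    return matched
```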





□ Fibertools: fast and accurate DNA-m6A calling using single-molecule long-read sequencing

>> https://www.biorxiv.org/content/10.1101/2023.04.20.537673v1

Fibertools enables highly accurate (over 90% precision and recall) m6A identification along multi-kilobase DNA molecules with a ~1,000-fold improvement in speed and the capacity to generalize to new sequencing chemistries.

fibertools also substantially reduces the amount of false-negative methylation calls, an improvement primarily driven by enabling m6A calling along multi-kilobase reads with fewer subread passes - a limitation of prior m6A calling tools.





□ scTenifoldXct: A semi-supervised method for predicting cell-cell interactions and mapping cellular communication graphs

>> https://www.cell.com/cell-systems/fulltext/S2405-4712(23)00030-3

scTenifoldXct detects ligand-receptor (LR)-mediated cell-cell interactions and maps cellular communication graphs. Neural networks are employed to minimize the distance between corresponding genes while preserving the structure of gene regulatory networks.

scTenifoldXct is based on manifold alignment, using LR pairs as inter-data correspondences to embed ligand and receptor genes expressed in interacting cells into a unified latent space.





□ AsymmeTrix: Asymmetric Vector Embeddings for Directional Similarity Search

>> https://yoheinakajima.com/asymmetrix-asymmetric-vector-embeddings-for-directional-similarity-search/

By introducing a weighting factor based on a domain-specific asymmetric weighting function, AsymmeTrix is able to capture the inherent directionality of relationships between objects in various application domains.

Asymmetric kernel functions modify standard kernel functions or design custom functions to model asymmetric relationships. AsymmeTrix leverages graph-based structures to capture complex relationships and continuous vector spaces to represent objects in a continuous space.
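The core idea, a symmetric similarity modulated by a domain-specific asymmetric weighting function, can be sketched as follows (the particular weighting function here is purely illustrative):

```python
import numpy as np

def directional_similarity(a, b, weight):
    """Sketch of the AsymmeTrix idea: start from a symmetric cosine
    similarity and multiply by an asymmetric weighting function, so that
    sim(a -> b) need not equal sim(b -> a)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos * weight(a, b)

# Hypothetical weight favoring queries that point at "larger" objects
# (here, larger vector norm), making the relation directional.
weight = lambda a, b: np.linalg.norm(b) / (np.linalg.norm(a) + np.linalg.norm(b))
```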





□ A graph neural network-based interpretable framework reveals a novel DNA fragility–associated chromatin structural unit

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02916-x

A framework that integrates a graph neural network (GNN) to unravel the relationship between 3D chromatin structure and DSBs, using the advanced interpretability technique GNNExplainer.

FaCIN (DNA fragility–associated chromatin interaction network) is a bottleneck-like structure, and it helps to reveal a universal form of how the fragility of a piece of DNA might be affected by the whole genome through chromatin interactions.





□ Read2Tree: Inference of phylogenetic trees directly from raw sequencing reads

>> https://www.nature.com/articles/s41587-023-01753-4

Read2Tree directly processes raw sequencing reads into groups of corresponding genes and bypasses traditional steps in phylogeny inference, such as genome assembly, annotation and all-versus-all sequence comparisons, while retaining accuracy.

Read2Tree is 10–100 times faster than assembly-based approaches and in most cases more accurate—the exception being when sequencing coverage is high and reference species very distant.

Read2Tree is able to also provide accurate trees and species comparisons using only low-coverage (0.1×) datasets as well as RNA versus genomic sequencing and operates on long or short reads.





□ Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02923-y

Transformations of the BAM alignment encodings are critical. This is because while variant calling from aligned DNA sequences data involves analysis of contiguously aligned reads, variant calling from lrRNA-seq alignments must handle reads with gaps representing intronic regions.

flagCorrection ensures all fragments retain the original flag. It enables an increase in recall of DeepVariant and the precision of Clair3’s pileup model (indel calling); Clair3-mix and SNCR + flagCorrection + DeepVariant are among the best-performing pipelines to call indels.





□ TimeAttackGenComp: Fast all versus all genotype comparison using DNA/RNA sequencing data: method and workflow

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05288-y

A Perl tool to rapidly compare genotypes from thousands of samples in an all vs. all manner. All vs. all comparison is an O(n²) problem, and scalability is an issue for larger projects.

An end-to-end Workflow Descriptor Language (WDL)/Cromwell workflow taking FASTQ, BAM, or VCF files as input was developed for reproducibility and ease of use. Memory usage could be further improved with bit-packing, bit-vectors, and the use of lower-level languages.
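The suggested bit-packing optimization can be sketched with Python ints as bit-vectors (illustrative only, not the Perl tool's code):

```python
def pack_genotypes(genotypes):
    """Pack a vector of biallelic genotype calls (0 = hom-ref, 1 = het,
    2 = hom-alt) into two bit-vectors stored as arbitrary-precision ints,
    a sketch of the bit-packing the authors suggest for memory savings."""
    lo = hi = 0
    for i, g in enumerate(genotypes):
        lo |= (g & 1) << i
        hi |= ((g >> 1) & 1) << i
    return lo, hi

def concordance(a, b, n_sites):
    """Fraction of sites where two packed genotype vectors agree,
    computed with XOR and popcount instead of per-site comparison."""
    diff = (a[0] ^ b[0]) | (a[1] ^ b[1])   # a set bit marks a differing call
    return 1 - bin(diff).count("1") / n_sites
```

Packing turns each pairwise comparison into a handful of word-wide bit operations, which is what makes the all vs. all O(n²) loop tractable at scale.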





□ squigualiser: A simple tool to visualise nanopore raw signal-base alignment

>> https://github.com/hiruna72/squigualiser





□ BEERS2: RNA-Seq simulation through high fidelity in silico modeling

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537847v1

BEERS2 takes input transcripts from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts as FASTQ, SAM, or BAM formats with the SAM or BAM formats containing the true alignment to the reference genome.

BEERS2 combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline and is designed to incl. the effects of polyA selection and RiboZero for ribosomal depletion and hexamer priming sequence biases.





□ pyInfinityFlow: Optimized imputation and analysis of high-dimensional Flow Cytometry data for millions of cells

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad287/7142555

pyInfinityFlow is a Python package that enables imputation of hundreds of features from Flow Cytometry using XGBoost regression. It is an adaptation of the original implementation in R, with the goal of optimizing the workflow for large datasets.

The final Infinity Flow object can be stored as sparse data objects (h5ad) or as a data frame stored in a binary feather file format, enabling direct manipulation with Scanpy, or other tools, to identify broad and rare cell populations with Leiden clustering.





□ hipFG: High-throughput harmonization and integration pipeline for functional genomics data

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537695v1

hipFG, an automatically customized pipeline for efficient and scalable normalization of heterogeneous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets, while accounting for FG datatypes.

hipFG includes datatype-specific pipelines to process diverse types of FG data. These FG datatypes are categorized into three groups: annotated genomic intervals, quantitative trait loci (QTLs), and chromatin interactions.





□ Capture-recapture for -omics data meta-analysis

>> https://www.biorxiv.org/content/10.1101/2023.04.24.537481v1

The capture-recapture framework (C-R) statistically formalises the idea of inspecting list overlaps. The C-R model is a consistent estimator for the causal gene number in simple situations.

The estimate from C-R can be biased upwards, if the LD structure is ignored, because the causal signal spreads between linked SNPs which can then tag several different genes.
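In the simplest two-list case, the estimator reduces to Chapman's bias-corrected variant of the Lincoln-Petersen formula (a sketch that ignores LD and capture heterogeneity, which, as noted above, can bias the estimate):

```python
def chapman_estimate(list1, list2):
    """Two-list capture-recapture (Chapman's bias-corrected variant of
    the Lincoln-Petersen estimator) applied to gene lists: the overlap
    between two independent discovery lists estimates the total number
    of causal genes, including those neither study captured."""
    n1, n2 = len(set(list1)), len(set(list2))
    overlap = len(set(list1) & set(list2))
    return (n1 + 1) * (n2 + 1) / (overlap + 1) - 1
```

Two lists of four genes sharing two members, for example, suggest roughly seven causal genes in total, i.e. about one still undiscovered by either study.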





□ scDEED: a statistical method for detecting dubious 2D single-cell embeddings

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537839v1

scDEED (single-cell dubious embedding detector) assigns every cell a “reliability score,” whose large value indicates that the cell’s immediate to mid-range neighbors are well preserved after the embedding.

scDEED offers users the flexibility to optimize hyperparameters in an intuitive and graphical way (users can see which cell embeddings are dubious under each hyperparameter setting), without modifying the embedding method’s algorithm.

scDEED’s definition of dubious cell embeddings distinguishes scDEED from DynamicViz, a method that optimizes hyperparameters by minimizing the variance of cell embeddings’ Euclidean distances across multiple bootstraps.
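A simplified version of such a reliability score (the details differ from scDEED's actual definition) could correlate a cell's neighbor distances before and after embedding:

```python
import numpy as np

def reliability_score(cell, X_high, X_low, frac=0.5):
    """Sketch of an scDEED-style reliability score: correlate a cell's
    distances to its closest mid-range neighbors in the 2D embedding with
    the corresponding distances in the original space. A low correlation
    flags the cell's embedding as dubious."""
    d_low = np.linalg.norm(X_low - X_low[cell], axis=1)
    d_high = np.linalg.norm(X_high - X_high[cell], axis=1)
    # Immediate-to-mid-range neighbors, taken in the embedding space.
    k = int(frac * len(X_low))
    neighbors = np.argsort(d_low)[1:k + 1]
    return float(np.corrcoef(d_low[neighbors], d_high[neighbors])[0, 1])
```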





□ scGBM: Model-based dimensionality reduction for single-cell RNA-seq using generalized bilinear models

>> https://www.biorxiv.org/content/10.1101/2023.04.21.537881v1

scGBM, a novel method for model-based dimensionality reduction of single-cell RNA-seq data. scGBM employs a scalable algorithm to fit a Poisson bilinear model to datasets with millions of cells and quantifies the uncertainty in each cell’s latent position.

scGBM uses an iteratively reweighted singular value decomposition (IRSVD) algorithm. IRSVD is asymptotically faster than Fisher scoring and leverages special properties of Poisson GLMs to obtain vectorized updates for the intercepts.





□ SurVIndel2: improving local CNVs calling from next-generation sequencing using novel hidden information

>> https://www.biorxiv.org/content/10.1101/2023.04.23.538018v1

SurVIndel2 significantly reduces the number of called false positives, while retaining or even improving the sensitivity of the original SurVIndel, and generates precise breakpoints for most of the called CNVs.

SurVIndel2 detects candidate CNVs using split reads, discordant pairs, and a new type of evidence called hidden split reads. Hidden split reads can determine the existence and precise breakpoints of CNVs in repetitive regions.





□ AtlasXplore: a web platform for visualizing and sharing spatial epigenome data

>> https://www.biorxiv.org/content/10.1101/2023.04.23.537969v1.full.pdf

AtlasXplore integrates multiple layers of spatial epigenome data. Through its integration with Celery workers, AtlasXplore has unlimited potential to incorporate other software and functions for interactive exploration of high-dimensional datasets.

AtlasXplore protects private data with Amazon Cognito authentication, and makes published and exemplar data accessible to both guests and registered users. Users can search via PMID/author or filter by research group, type, species, and tissue.





□ cellDancer: A relay velocity model infers cell-dependent RNA velocity

>> https://www.nature.com/articles/s41587-023-01728-5

cellDancer, a scalable deep neural network that locally infers velocity for each cell from its neighbors and then relays a series of local velocities to provide single-cell resolution inference of velocity kinetics. The cellDancer algorithm separately trains a DNN for each gene.

cellDancer assesses the spliced and unspliced mRNA velocities of each cell in a DNN to calculate the cell-specific transcription, splicing and degradation rates (α, β, γ) and to predict the future spliced and unspliced mRNA by the outputted α, β and γ using an RNA velocity model.
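The underlying kinetics are the standard RNA velocity ODEs; one Euler prediction step from the DNN-estimated rates might look like this (a sketch of the model, not cellDancer's code):

```python
def predict_future_mrna(u, s, alpha, beta, gamma, dt=0.001):
    """One Euler step of the RNA velocity kinetics that cellDancer's DNN
    parameterizes per cell: du/dt = alpha - beta*u (transcription minus
    splicing) and ds/dt = beta*u - gamma*s (splicing minus degradation)."""
    u_next = u + (alpha - beta * u) * dt
    s_next = s + (beta * u - gamma * s) * dt
    return u_next, s_next
```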





□ TiDE: a time-series dense encoder for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies.

>> https://ai.googleblog.com/2023/04/recent-advances-in-deep-long-horizon.html

TiDE (Time-series Dense Encoder), a Multi-layer Perceptron (MLP) based encoder-decoder model for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies.

TiDE is more than 10x faster in training compared to transformer-based baselines while being more accurate on benchmarks. Similar gains can be observed in inference as it only scales linearly with the length of the context and the prediction horizon.