lens, align.

Long is the time, but the true comes to pass.

Year of the Dragon.

2024-01-17 23:22:33 | Science News





□ Scalable network reconstruction in subquadratic time

>> https://arxiv.org/abs/2401.01404

A general algorithm applicable to a broad range of reconstruction problems that achieves its result in subquadratic time, with a data-dependent complexity loosely upper bounded by O(N^(3/2) log N), but with a more typical log-linear complexity of O(N log^2 N).

This algorithm relies on a stochastic second neighbor search that produces the best edge candidates with high probability, thus bypassing an exhaustive quadratic search.

This algorithm is many orders of magnitude faster than the quadratic baseline and allows for easy parallelization. The strategy is applicable to algorithms that can be used with non-convex objectives, e.g. stochastic gradient descent or simulated annealing.
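A minimal sketch of the second-neighbor idea, in the spirit of NN-descent rather than the paper's exact procedure: each node keeps its k best-scoring partners, and new candidates are drawn only from the neighbors of its current neighbors, so no exhaustive pairwise scan is needed. The function name, the `score` callback and the `rounds` parameter are hypothetical.

```python
import random

def second_neighbor_search(n, k, score, rounds=10, seed=0):
    """Each node keeps its k best-scoring partners; new candidates come
    only from the neighbors of its current neighbors."""
    rng = random.Random(seed)
    # Start from k random partners per node (the stochastic part of this sketch).
    nbrs = {i: set(rng.sample([j for j in range(n) if j != i], k)) for i in range(n)}
    for _ in range(rounds):
        changed = False
        for i in range(n):
            # Candidates: current neighbors plus their neighbors (second neighbors).
            cand = set(nbrs[i])
            for j in nbrs[i]:
                cand |= nbrs[j]
            cand.discard(i)
            best = set(sorted(cand, key=lambda j: score(i, j), reverse=True)[:k])
            if best != nbrs[i]:
                nbrs[i], changed = best, True
        if not changed:
            break
    return nbrs
```

Because the current neighbors always stay in the candidate pool, the summed score of each node's neighbor set never decreases from one round to the next.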





□ OmniNA: A foundation model for nucleotide sequences

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575543v1

OmniNA represents an endeavor in leveraging foundation models for comprehensive nucleotide learning across diverse species and genome contexts. OmniNA can be fine-tuned to align multiple nucleotide learning tasks with natural language paradigms.

OmniNA employs a transformer-based decoder and is pre-trained with an auto-regressive approach. It was pre-trained on 91.7 million nucleotide sequences encompassing 1076.2 billion bases, spanning a wide range of species and biological contexts.





□ STIGMA: Single-cell tissue-specific gene prioritization using machine learning

>> https://www.sciencedirect.com/science/article/pii/S0002929723004433

STIGMA predicts the disease-causing probability of genes based on their expression profiles across cell types, while considering the temporal dynamics during the embryogenesis of a healthy (wild-type) organism, as well as several intrinsic gene properties.

In STIGMA, supervised machine learning is applied to the single-cell gene expression data as well as intrinsic gene properties on positive and negative classes.

The STIGMA score that each gene receives is based on the cell type-specific temporal dynamics in gene expression and, to a smaller extent, is based on the gene-intrinsic metrics, including the population level constraint metrics.





□ RfamGen: Deep generative design of RNA family sequences

>> https://www.nature.com/articles/s41592-023-02148-8

RfamGen (RNA family sequence generator) is a deep generative model that designs RNA family sequences in a data-efficient manner by explicitly incorporating alignment and consensus secondary structure information.

RfamGen can generate novel and functional RNA family sequences by sampling points from a semantically rich and continuous representation. RfamGen successfully generates artificial sequences with higher activity than natural sequences.





□ SYNTERUPTOR: mining genomic islands for non-classical specialised metabolite gene clusters

>> https://www.biorxiv.org/content/10.1101/2024.01.03.573040v1

SYNTERUPTOR identifies genomic islands in a given genome by comparing its genomic sequence with those of closely related species. SYNTERUPTOR was designed to focus on identifying SMBGC-containing genomic islands.

The SYNTERUPTOR pipeline requires a dataset consisting of genome files selected by the user from species that are related enough to possess synteny blocks.

SYNTERUPTOR proceeds by performing pairwise comparisons between the amino acid sequences of all coding DNA sequences (CDSs) to identify orthologs. Subsequently, it constructs synteny blocks and detects any synteny breaks.





□ ALG-DDI: A multi-scale feature fusion model based on biological knowledge graph and transformer-encoder for drug-drug interaction prediction

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575305v1

ALG-DDI can comprehensively incorporate attribute information, local biological information, and global semantic information. ALG-DDI first employs the Attribute Masking method to obtain the embedding vector of the molecular graph.

ALG-DDI leverages heterogeneous graphs to capture the local biological information between drugs and several highly related biological entities. The global semantic information is also learned from the medicine-oriented large knowledge graphs.

ALG-DDI employs a transformer encoder to fuse the multi-scale drug representations and feeds the resulting drug pair vector into a fully connected neural network for prediction.





□ FAVA: High-quality functional association networks inferred from scRNA-seq and proteomics data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae010/7513163

FAVA (Functional Associations using Variational Autoencoders) compresses high-dimensional data into a low-dimensional space. FAVA infers networks from high-dimensional omics data with much higher accuracy, across a diverse collection of real as well as simulated datasets.

In latent space, FAVA calculates the Pearson correlation coefficient (PCC) for each pair of proteins, resulting in a functional association network. FAVA can process large datasets with over 0.5 million conditions and has predicted 4,210 interactions between 1,039 understudied proteins.
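The pairwise-correlation step can be sketched as follows, assuming each protein already has a latent vector from some encoder. The function names and the `threshold` cutoff are hypothetical and are not part of FAVA's interface.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined for constant vectors (zero variance)

def association_network(latent, threshold=0.7):
    """latent maps protein name -> latent vector; returns scored pairs, best first."""
    edges = []
    for p, q in combinations(sorted(latent), 2):
        r = pearson(latent[p], latent[q])
        if r >= threshold:  # the cutoff is an assumption of this sketch
            edges.append((p, q, r))
    return sorted(edges, key=lambda e: -e[2])
```

For example, `association_network({"A": [1, 2, 3], "B": [2, 4, 6], "C": [3, 1, 2]})` keeps only the A-B pair, since both pairs involving C correlate at -0.5.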





□ FFS: Fractal feature selection model for enhancing high-dimensional biological problems

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05619-z

In fractals, a central tenet posits that patterns recur at differing scales. This principle suggests that when one examines a minuscule segment of a fractal and juxtaposes it with a more significant portion of the same fractal, the patterns observed will bear striking resemblance.

FFS (Fractal Feature Selection) is proof of harmonic convergence of a low-complexity system with remarkable performance. FFS partitions features into blocks, measures similarity using the Root Mean Square Error (RMSE), and determines feature importance based on low RMSE values.

By conceptualizing these attributes as blocks, where each block corresponds to a particular data category, the proposed model finds that blocks with common similarities are often associated with specific data categories.





□ CytoCommunity: Unsupervised and supervised discovery of tissue cellular neighborhoods from cell phenotypes

>> https://www.nature.com/articles/s41592-023-02124-2

CytoCommunity learns a mapping directly from the cell phenotype space to the TCN space using a graph neural network model without intermediate clustering of cell embeddings.

By leveraging graph pooling, CytoCommunity enables de novo identification of condition-specific and predictive TCNs under the supervision of sample labels.

CytoCommunity formulates TCN identification as a community detection problem on graphs and uses a graph minimum cut (MinCut)-based GNN model to identify TCNs.

CytoCommunity directly uses cell phenotypes as features to learn TCN partitions and thus facilitates the interpretation of TCN functions.

CytoCommunity can also identify condition-specific TCNs from a cohort of labeled tissue samples by leveraging differentiable graph pooling and sample labels, which is an effective strategy to address the difficulty of graph alignment.





□ scSNV-seq: high-throughput phenotyping of single nucleotide variants by coupled single-cell genotyping and transcriptomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03169-y

scSNV-seq uses transcribed genetic barcodes to couple targeted single-cell genotyping with transcriptomics to identify the edited genotype and transcriptome of each individual cell rather than predicting genotype from gRNA identity.

scSNV-seq allows us to identify benign variants or variants with an intermediate phenotype which would otherwise not be possible.

The methodology is applicable to any other methods for introducing variation such as HDR, prime editing, or saturation genome editing since it does not rely on gRNA identity to infer genotype.





□ Fragmentstein: Facilitating data reuse for cell-free DNA fragment analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae017/7550024

Fragmentstein is a command-line tool for converting non-sensitive cfDNA-fragmentation data into alignment mapping (BAM) files. Fragmentstein complements fragment coordinates with sequence information from a reference genome to reconstruct BAM files.

Fragmentstein creates alignment files for each sample using only non-sensitive information. The original alignment files and the alignment files generated by Fragmentstein were subjected to fragment length, copy number and nucleosome occupancy analysis.





□ DLemb / BioKG2Vec: Predicting gene disease associations with knowledge graph embeddings for diseases with curtailed information

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575314v1

BioKG2Vec relies on a biased random-walk approach in which the user can prioritize specific connections by assigning a weight to edges. The KG defined in this work uses four node types: drug, protein, function and disease.

DLemb is a shallow neural network. The input layer takes as input KG entities as numbers and outputs them to the embedding layer. Subsequently, embeddings are normalized, and a dot product is calculated between them resulting in the output layer.

DLemb is trained on batches of correct and wrong links in the KG, which serve as positive and negative examples in what can be conceived of as a link-prediction task. Embeddings are then optimized at every epoch by minimizing RMSE with the Adam optimizer.
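The scoring and loss described above can be sketched without a deep learning framework. This covers only the forward pass. The label convention (1.0 for a true link, 0.0 for a corrupted one) and all function names are assumptions, and the Adam update is omitted.

```python
from math import sqrt

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def link_score(emb, head, tail):
    """Dot product of L2-normalized embeddings (i.e. cosine similarity)."""
    return sum(a * b for a, b in zip(normalize(emb[head]), normalize(emb[tail])))

def rmse_loss(emb, links):
    """links: (head, tail, label) triples; label 1.0 = true link, 0.0 = corrupted link."""
    errs = [(link_score(emb, h, t) - y) ** 2 for h, t, y in links]
    return sqrt(sum(errs) / len(errs))
```

With `emb = {0: [1, 0], 1: [2, 0], 2: [0, 3]}`, the link (0, 1) scores 1.0 and the link (0, 2) scores 0.0, so a batch labeling them positive and negative has zero loss.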





□ POP-GWAS: Valid inference for machine learning-assisted GWAS

>> https://www.medrxiv.org/content/10.1101/2024.01.03.24300779v1

POP-GWAS (Post-prediction GWAS) provides unbiased estimates and well-calibrated type-I error, is universally more powerful than conventional GWAS on the observed phenotype, and makes minimal assumptions about the variables used for imputation and the choice of prediction algorithm.

POP-GWAS imputes the phenotype in both labeled and unlabeled samples, and performs three GWAS: GWAS of the observed and imputed phenotype in labeled samples, and GWAS on the imputed phenotype in unlabeled samples.
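One natural way to combine the three sets of effect estimates is a control-variate-style correction: start from the GWAS on the observed phenotype and adjust it by the difference between the two GWAS on the imputed phenotype. The sketch below assumes that form. The function name, the weight `w` and its default are hypothetical, and standard-error calculation is left out.

```python
def combine_three_gwas(beta_y_lab, beta_yhat_lab, beta_yhat_unlab, w=1.0):
    """Per-SNP effect estimate from three GWAS (assumed combination rule).

    beta_y_lab      : observed phenotype, labeled samples
    beta_yhat_lab   : imputed phenotype, labeled samples
    beta_yhat_unlab : imputed phenotype, unlabeled samples
    w               : weight on the imputation-based correction (assumption)
    """
    return beta_y_lab + w * (beta_yhat_unlab - beta_yhat_lab)
```

Setting `w=0` recovers the conventional GWAS estimate on the labeled samples alone. The two imputed-phenotype terms estimate the same quantity, so their difference only contributes noise that is correlated with the labeled-sample estimate.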





□ GLDADec: marker-gene guided LDA modelling for bulk gene expression deconvolution

>> https://www.biorxiv.org/content/10.1101/2024.01.08.574749v1

GLDADec (Guided Latent Dirichlet Allocation Deconvolution) utilizes marker gene names as partial prior information to estimate cell type proportions, thereby overcoming the challenges of conventional reference-based and reference-free methods simultaneously.

GLDADec employs a semi-supervised learning algorithm that combines cell-type marker genes with additional factors that may influence gene expression profiles to achieve a robust estimation of cell type proportions. An ensemble strategy is used to aggregate the output.





□ scGOclust: leveraging gene ontology to compare cell types across distant species using scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574675v1

scGOclust constructs a functional profile of individual cells by multiplication of a gene expression count matrix of cells and a binary matrix with GO BP annotations of genes.

This GO BP feature matrix is treated similarly to a count matrix in classic single-cell RNA sequencing (scRNA-seq) analysis and is subjected to dimensionality reduction and clustering analyses.

scGOclust recapitulates the function spectrum of different cell types, characterises functional similarities between homologous cell types, and reveals functional convergence between unrelated cell types.
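The construction of the GO BP feature matrix is a plain matrix product. A dependency-free sketch follows; the function name is hypothetical, and the real package operates on sparse single-cell objects rather than nested lists.

```python
def go_profile(counts, go_membership):
    """counts: cells x genes expression counts.
    go_membership: genes x GO terms binary matrix (1 = gene annotated to term).
    Returns a cells x GO-terms matrix of summed expression per term."""
    n_terms = len(go_membership[0])
    return [[sum(row[g] * go_membership[g][t] for g in range(len(row)))
             for t in range(n_terms)]
            for row in counts]
```

With two cells, three genes and two terms, `go_profile([[1, 0, 2], [0, 3, 1]], [[1, 0], [1, 1], [0, 1]])` gives `[[1, 2], [3, 4]]`. That matrix can then be normalized, reduced and clustered like an ordinary count matrix.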





□ MATES: A Deep Learning-Based Model for Locus-specific Quantification of Transposable Elements in Single Cell

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574909v1

MATES (Multi-mapping Alignment for TE loci quantification in Single-cell) is a novel deep neural network-based method tailored for accurate locus-specific TE quantification in single-cell sequencing data across modalities.

MATES harnesses the distribution of uniquely mapped reads flanking TE loci to assign multi-mapping TE reads for locus-specific TE quantification.

MATES captures complex relationships between the context distribution of unique-mapping reads flanking TE loci and the probability of multi-mapping reads being assigned to those loci, and handles multi-mapping read assignments probabilistically based on the local context of the TE loci.





□ COFFEE: Consensus single cell-type specific inference for gene regulatory networks

>> https://www.biorxiv.org/content/10.1101/2024.01.05.574445v1

COFFEE (COnsensus single cell-type speciFic inFerence for gEnE regulatory networks) is a Borda voting-based consensus algorithm that integrates information from 10 established GRN inference methods.

COFFEE has improved performance across synthetic, curated and experimental datasets when compared to baseline methods.

COFFEE is stable across differing datasets; even with curated data, the consensus approach captures high-confidence edges when compared to the ground truth data.
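Borda voting over ranked edge lists can be sketched in a few lines. This is the textbook Borda count, not necessarily COFFEE's exact weighting. The function name and the tie-break rule are assumptions.

```python
from collections import defaultdict

def borda_consensus(rankings):
    """rankings: one list per inference method, edges ordered best-first.
    An edge at position p in a list of length m earns m - p points."""
    points = defaultdict(int)
    for ranked in rankings:
        m = len(ranked)
        for pos, edge in enumerate(ranked):
            points[edge] += m - pos
    # Highest total first; ties broken by edge name for determinism.
    return sorted(points, key=lambda e: (-points[e], e))
```

Given three methods that rank the edges A-B, A-C and B-C differently, the consensus list places first the edge that sits near the top most consistently, even if no single method is trusted outright.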





□ HAT: de novo variant calling for highly accurate short-read and long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad775/7510834

Hare And Tortoise (HAT) is an automated de novo variant (DNV) detection workflow for highly accurate short-read and long-read sequencing data.

HAT is a computational workflow that begins with aligned read data (i.e., CRAM or BAM) from a parent-child sequenced trio and outputs DNVs. The HAT workflow consists of three main steps: GVCF generation, family-level genotyping, and filtering of variants to get final DNVs.

HAT detects high-quality DNVs from Illumina short-read whole-exome sequencing, Illumina short-read whole-genome sequencing, and highly accurate PacBio HiFi long-read whole-genome sequencing data.
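The core logic of the final filtering step can be illustrated with a toy trio check: the child carries an alternate allele that neither parent has, and all three samples are adequately covered. This is not HAT's actual filter set. The function name, the genotype encoding and the `min_depth` threshold are assumptions.

```python
def is_candidate_dnv(child_gt, father_gt, mother_gt, depths, min_depth=10):
    """Genotypes are allele-index tuples such as (0, 1);
    depths is (child, father, mother) read depth at the site."""
    parents_hom_ref = father_gt == (0, 0) and mother_gt == (0, 0)
    child_has_alt = any(a != 0 for a in child_gt)
    well_covered = all(d >= min_depth for d in depths)  # threshold is illustrative
    return parents_hom_ref and child_has_alt and well_covered
```

A site where the child is `(0, 1)` and both parents are `(0, 0)` at 30x passes. The same site fails if either parent is heterozygous or if any sample falls below the depth cutoff.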





□ SVCR: The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes

>> https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1

SVCR (Scalable Variant Call Representation) scales to very large cohorts by adopting reference blocks from the Genomic Variant Call Format (GVCF) and employing local allele indices. SVCR is also lossless and mergeable, allowing for N+1 and N+K incremental joint-calling.
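A rough illustration of local allele indices: each sample stores only the alleles it actually carries, and its genotype indexes into that short local list instead of the site-wide allele list. The function names are hypothetical and this is not the Hail or VDS API.

```python
def to_local(gt):
    """gt: tuple of site-wide (global) allele indices, e.g. (0, 3).
    Returns (local_alleles, local_gt); the reference allele 0 is always listed first."""
    local_alleles = [0] + sorted({a for a in gt if a != 0})
    local_gt = tuple(local_alleles.index(a) for a in gt)
    return local_alleles, local_gt

def to_global(local_alleles, local_gt):
    """Invert to_local, so the encoding loses no information."""
    return tuple(local_alleles[i] for i in local_gt)
```

A sample with global genotype `(0, 3)` at a highly multiallelic site is stored as local alleles `[0, 3]` with local genotype `(0, 1)`. Per-allele fields then need entries only for those two alleles, however many alternates the cohort accumulates at the site.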

Two implementations are described: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Their experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files.

The authors also present VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs, and describe unique features of VDS that enable rapid data analysis.

PVCF defines the semantics of fields such as GT, AD, GP, PL, and, for list fields, the relationship between their length and the number of alternate alleles. VCF, as a format, describes, for example, how a number or a list is rendered in plaintext.

PVCF represents a collection of sequences as a dense matrix, with one column per sequenced sample and one row for every variant site. PVCF permits both a multiallelic representation (wherein each locus appears in at most one row) and a biallelic representation.





□ Poincaré and SimBio: a versatile and extensible Python ecosystem for modeling systems.

>> https://www.biorxiv.org/content/10.1101/2024.01.10.574883v1

Poincaré allows defining differential equation systems, while SimBio builds on it for defining reaction networks. They are focused on providing an ergonomic experience to end-users by integrating well with IDEs and static analysis tools through the use of standard modern Python syntax.

The models built using these packages can be introspected to create other representations, such as graphs connecting species and/or reactions, or tables with parameters or equations.





□ Secreted Particle Information Transfer (SPIT) - A Cellular Platform For In Vivo Genetic Engineering

>> https://www.biorxiv.org/content/10.1101/2024.01.11.575257v1

Compared to the limited packaging capacities of contemporary in vivo gene therapy delivery platforms, a human cell's nucleus contains approximately 6 billion base pairs of information. They hypothesized that human cells could be applied as vectors for in vivo gene therapy.

SPIT is modified to secrete a genetic engineering enzyme within a particle that transfers this enzyme into a recipient cell, where it manipulates genetic information.





□ Decoder-seq enhances mRNA capture efficiency in spatial RNA sequencing

>> https://www.nature.com/articles/s41587-023-02086-y

Decoder-seq (Dendrimeric DNA coordinate barcoding design for spatial RNA sequencing) combines dendrimeric nanosubstrates with microfluidic coordinate barcoding to generate spatial arrays with a DNA density approximately ten times higher than previously reported methods.

Decoder-seq improved the detection of lowly expressed olfactory receptor (Olfr) genes in mouse olfactory bulbs and contributed to the discovery of a unique layer enrichment pattern for two Olfr genes.





□ GVRP: Genome Variant Refinement Pipeline for variant analysis in non-human species using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.14.575595v1

GVRP employs a machine learning-based approach to refine variant calls in non-human species. Rather than training separate variant callers for each species, GVRP uses a machine learning model to identify variants accurately and filter out false positives from DeepVariant.

In GVRP, they omit certain DeepVariant preprocessing steps and leverage the ground-truth Genome In A Bottle (GIAB) variant calls to train the machine learning model for non-human species genome variant refinement.





□ BAMBI: Integrative biostatistical and artificial-intelligence method discover coding and non-coding RNA genes as biomarkers

>> https://www.biorxiv.org/content/10.1101/2024.01.12.575460v1

BAMBI (Biostatistics and Artificial-Intelligence integrated Method for Biomarker Identification) is a robust pipeline that identifies both coding and non-coding RNA biomarkers for disease diagnosis and prognosis.

BAMBI can process RNA-seq data and microarray data to pinpoint a minimal yet highly predictive set of RNA biomarkers, thus facilitating their clinical application.

BAMBI offers visualization of biomarker expression and interpretation of their functions using co-expression networks and literature mining, enhancing the interpretability of the results.





□ PoMoCNV: Inferring the selective history of CNVs using a maximum likelihood model

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575676v1

PoMoCNV (POlymorphism-aware phylogenetic MOdel for CNV datasets) infers the fitness parameters and transition rates associated with different copy numbers along branches in the phylogenetic tree, tracing back in time.

Given the phylogenetic tree of populations and the estimated copy numbers, PoMoCNV infers the evolutionary parameters governing CNV evolution along branches.

In PoMoCNV, the likelihood of this birth-death process is modeled per genomic segment, taking into account the copy number (allele) fitness and frequencies.





□ O-LGT: Online Hybrid Neural Network for Stock Price Prediction: A Case Study of High-Frequency Stock Trading in the Chinese Market

>> https://www.mdpi.com/2225-1146/11/2/13

O-LGT is an online hybrid recurrent neural network model tailored for analyzing limit order book (LOB) data and predicting stock price fluctuations in a high-frequency trading (HFT) environment.

O-LGT combines LSTM, GRU, and transformer layers, and features efficient storage management. When computing the stock forecast for the immediate future, O-LGT only uses the output calculated from the previous trading data together with the current trading data.





□ GYOSA: A Distributed Computing Solution for Privacy-Preserving Genome-Wide Association Studies

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575678v1

GYOSA is a secure and privacy-preserving distributed genomic analysis solution. Unlike previous work, GYOSA follows a distributed processing design that enables handling larger amounts of genomic data in a scalable and efficient fashion.

GYOSA provides transparent authenticated encryption, which protects sensitive data from being disclosed to unwanted parties and ensures anti-tampering properties for clients' data stored in untrusted infrastructures.





□ KaMRaT: a C++ toolkit for k-mer count matrix dimension reduction

>> https://www.biorxiv.org/content/10.1101/2024.01.15.575511v1

KaMRaT (k-mer Matrix Reduction Toolkit) is a program for processing large k-mer count tables extracted from high throughput sequencing data.

Major functions include scoring k-mers based on count statistics, merging overlapping k-mers into longer contigs and selecting k-mers based on their presence in certain samples.

KaMRaT merge builds on the concept of local k-mer extension ("unitigs") to improve extension precision by leveraging count data. KaMRaT enables the identification of condition-specific or differential sequences, irrespective of any gene or transcript annotation.
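The basic extension step behind merging can be sketched as follows: two k-mers are chained when the (k-1)-suffix of one matches the (k-1)-prefix of the other and that overlap is unambiguous. This toy version ignores reverse complements and the count data that KaMRaT uses to decide extensions. The function name is hypothetical.

```python
def merge_kmers(kmers):
    """Chain k-mers whose (k-1)-suffix uniquely matches another's (k-1)-prefix."""
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    by_prefix, by_suffix = {}, {}
    for km in kmers:
        by_prefix.setdefault(km[:k - 1], []).append(km)
        by_suffix.setdefault(km[1:], []).append(km)

    def unique_next(km):
        nxt = by_prefix.get(km[1:], [])
        # Extend only when the overlap is unambiguous in both directions.
        if len(nxt) == 1 and len(by_suffix[km[1:]]) == 1 and nxt[0] != km:
            return nxt[0]
        return None

    has_pred = {n for km in kmers if (n := unique_next(km)) is not None}
    contigs = []
    # Start only from k-mers nobody extends into (pure cycles are skipped here).
    for start in sorted(kmers - has_pred):
        contig, cur, seen = start, start, {start}
        while (n := unique_next(cur)) is not None and n not in seen:
            contig += n[-1]
            seen.add(n)
            cur = n
        contigs.append(contig)
    return contigs
```

For the input `["ACGT", "CGTA", "GTAC", "TTTT"]`, the first three k-mers chain into one contig of length 6 and `TTTT` stays on its own.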





□ EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow

>> https://www.biorxiv.org/content/10.1101/2024.01.17.575961v1

EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis.

EvoAug-TF is a TensorFlow implementation of EvoAug (a PyTorch package) that provides the ability to train genomic DNNs with evolution-inspired data augmentations. EvoAug-TF improves generalization and model interpretability with attribution methods.





□ SLEDGe: Inference of ancient whole genome duplications using machine learning

>> https://www.biorxiv.org/content/10.1101/2024.01.17.574559v1

SLEDGe (Supervised Learning Estimation of Duplicated Genomes) provides a novel means to repeatably and rapidly infer ancient WGD events from Ks plots derived from genomic or transcriptomic data.

SLEDGe can simulate ancient WGDs of multiple ages and across a range of gene birth and death rates. It provides the first model-based approach to infer WGDs in Ks plots and makes WGD interpretation more repeatable and consistent.




□ Peter Kolchinsky

>> https://rapport.bio/all-stories/semper-maior-spirits-rising-january-2024

Do you think of biotech as wasteful? How much of the biotech Universe's cash is locked away in companies that have lingered all year with a negative enterprise value? We looked.

Interested in the relevance of M&A to sector returns? How much of the returns from M&A accrue to companies held by at least one specialist? At least three? We looked.

What's it all mean for private companies looking to get public?

And overshadowing it all is a question: what can we do to protect the @biotech sector and biomedical innovation from the wrong stroke of a pen?


War without frontiers.

2024-01-16 04:48:40 | International / Politics

□ Fabian Hoffmann

>> https://x.com/frhoffmann1/status/1746589423251403236

In this thread, I will explain why we are much closer to war with 🇷🇺 than most people realize and why our time window for rearmament is shorter than many believe. In my opinion, we have at best 2-3 years to re-establish deterrence vis-à-vis 🇷🇺. Here's why 👇 1/20

One common mistake in analyzing the threat posed by Russia is falling into the trap of 'mirror-imaging'. This means assuming that Russia views a potential conflict with us in the same way we view a potential conflict with them. Nothing could be further from the truth.

🇷🇺 does not plan for the type of large-scale conventional war with NATO that we are currently seeing in Ukraine & for which we are primarily preparing. Already before taking substantial losses on the 🇺🇦 battlefield, 🇷🇺 knew that it would be inferior in such a scenario.

Russian thinking on a war with NATO revolves around the concept of escalation control and escalation management. Russia's primary objective in a war with NATO is to effectively manage escalation and bring the war to an early end on terms that are favorable to Russia.

Terminating hostilities early is necessary, given that 🇷🇺 must secure a victorious outcome before NATO's conventional superiority comes to bear, most notably that of the United States. Two key concepts play a crucial role: de-escalation strikes and aggressive sanctuarization.




□ South Africa challenges Israel at the International Court of Justice

>> https://news.liverpool.ac.uk/2024/01/15/south-africa-challenges-israel-at-the-international-court-of-justice/

South Africa’s charge of genocide did not come out of the blue. In mid-November, UN experts called on the international community “to act quickly to prevent genocide”. International legal experts also expected a state to apply to the ICJ under the Genocide Convention, a judicial route that could offer some protection.

The complete siege of Gaza – cutting off food, water, fuel – the evacuation orders and the bombardment combined with the language of political and military leaders prompted these fears of an unfolding genocide.

Genocide constitutes certain ‘acts committed with intent to destroy, in whole or in part, a national, ethnical, racial or religious group, as such’. South Africa accuses Israel of killing Palestinians in Gaza, causing them serious bodily and mental harm, inflicting on them conditions of life calculated to bring about their physical destruction and imposing measures intended to prevent Palestinian births.

South Africa contextualises these acts within the “broader context of Israel’s conduct towards Palestinians during its 75-year-long apartheid, its 56-year-long belligerent occupation of Palestinian territory and its 16-year-long blockade of Gaza”. But the focus of this application is Israel’s specific intent, during the current operation, “to destroy Palestinians in Gaza as a part of the broader Palestinian national, racial and ethnical group”.




Concrete Utopia.

2024-01-15 22:13:22 | Film

『Concrete Utopia』

>> https://www.concrete-utopia-film.com/

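A masterpiece. The formation of a dystopia by the survivors of a cataclysmic disaster, and the depiction of "ordinary people" slowly sliding into madness, feel real. There is an inevitability to the way thriller elements are worked into the flaws of the protocol that determines the pecking order. Especially noteworthy is the beautiful, grand epilogue. The stained glass of the collapsed cathedral will be etched into film history as an unforgettable scene.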







2023
Directed by Tae-hwa Eom
Based on “Pleasant Bullying” by Kim Sungnyung
Cinematography by Cho Hyoung-rae
Music by Kim Hae-won

Apple Music Classical

2024-01-09 23:33:53 | art music

□ Apple Music Classical

>> https://apps.apple.com/jp/app/apple-music-classical/


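The Japanese localization of Apple Music Classical has finally launched. It touts "foundational metadata built over more than seven years by a team of musicologists, and 50 million data points", but in the end tuning is everything, so each platform naturally develops its own character. Worth watching: the release of premium recordings through orchestra partnerships.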

Long is the time, but the true comes to pass.

2024-01-01 12:00:00 | Science News

(Created with Midjourney v6.0 ALPHA)




□ Stellarscope: A single-cell transposable element atlas of human cell identity

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573568v1

Stellarscope (Single cell Transposable Element Locus Level Analysis of scRNA Sequencing) is an scRNA-seq-based computational pipeline for characterizing cell identity. Stellarscope reassigns multi-mapped reads to specific genomic loci using an expectation-maximization algorithm.

Stellarscope provides a variety of reassignment strategies, including filtering based on a threshold, excluding fragments with multiple optimal alignments, and randomly selecting from multiple optimal alignments; these criteria result in different numbers of excluded alignments.

Stellarscope implements a generative model of single cell RNA-seq that rescales alignment probabilities for independently aligned reads based on the cumulative weights of all alignments, and uses the posterior probability matrix to reassign ambiguous fragments.
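The expectation-maximization idea can be sketched with a stripped-down mixture model: each fragment's alignment weights are rescaled by the current locus abundances (E-step), and abundances are re-estimated from the resulting posteriors (M-step). This omits the per-cell structure and the extra parameters of the actual model. The function name and input layout are assumptions.

```python
def em_reassign(alignments, n_iter=50):
    """alignments: one dict per fragment mapping locus -> alignment weight.
    Returns (per-locus abundance, per-fragment posterior over loci)."""
    loci = sorted({locus for frag in alignments for locus in frag})
    pi = {locus: 1.0 / len(loci) for locus in loci}
    post = []
    for _ in range(n_iter):
        # E-step: rescale each fragment's weights by current locus abundance.
        post = []
        for frag in alignments:
            z = {locus: pi[locus] * w for locus, w in frag.items()}
            total = sum(z.values())
            post.append({locus: v / total for locus, v in z.items()})
        # M-step: abundance is the average posterior mass assigned to each locus.
        pi = {locus: sum(p.get(locus, 0.0) for p in post) / len(post) for locus in loci}
    return pi, post
```

With two fragments mapping uniquely to `L1` and one fragment mapping equally well to `L1` and `L2`, the ambiguous fragment is pulled toward `L1`, because the unique evidence raises that locus's abundance. A reassignment strategy such as thresholding the posterior would then be applied on top.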





□ FinaleMe: Predicting DNA methylation by the fragmentation patterns of plasma cell-free DNA

>> https://www.biorxiv.org/content/10.1101/2024.01.02.573710v1

FinaleMe (FragmentatIoN AnaLysis of cEll-free DNA Methylation) predicts the DNA methylation status of each CpG on each cfDNA fragment and derives continuous DNA methylation levels at CpG sites; it is most accurate in CpG-rich regions.

FinaleMe is a non-homogeneous Hidden Markov Model. It incorporates the distance between CpG sites into the model and utilizes the following three features: fragment length, normalized coverage, and the distance of each CpG site to the center of the DNA fragment.





□ ECOLE: Learning to call copy number variants on whole exome sequencing data

>> https://www.nature.com/articles/s41467-023-44116-y

ECOLE (Exome-based COpy number variation calling LEarner) is based on a variant of the transformer model. ECOLE processes the read-depth signal over each exon. It learns which parts of the signal need to be focused on and in which context (i.e., chromosome) to call a CNV.

ECOLE uses the high-confidence calls obtained on the matched WGS samples as the semi-ground truth. ECOLE employs a multi-head attention mechanism: multiple attention heads are computed over the signal, and their outputs are concatenated and transformed into 192 x 1001 dimensions.





□ Probabilistic Modeling for Sequences of Sets in Continuous-Time

>> https://arxiv.org/abs/2312.15045

A general framework for modeling set-valued data in continuous-time, compatible with any intensity-based recurrent neural point process model, where event types are subsets of a discrete set.

Their simplest baseline uses a homogeneous Poisson model as the temporal component and a static Bernoulli model for the set distribution (where the Bernoulli probabilities correspond to the marginal probabilities in the dataset), referred to below as the StaticB-Poisson model.

This simple baseline provides useful context for evaluating the effectiveness of more complex models for set-valued data over time. For the temporal component they use the Neural Hawkes (NH) model as a specific instantiation of the recurrent MTPP component.

In the Bernoulli variants of this model, this is coupled with the Dynamic Bernoulli model for the set component, or with the marginal Bernoulli option as a baseline (the same model for sets as in the Poisson baseline); these are referred to as DynamicB-NH and StaticB-NH.
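The StaticB-Poisson baseline is simple enough to write out in full. The sketch below fits the Poisson rate and the marginal Bernoulli probabilities by maximum likelihood and evaluates the log-likelihood. Function names, the data layout and the `eps` clipping are assumptions.

```python
from math import log

def fit_staticb_poisson(events, horizon, items):
    """events: list of (time, set_of_items) observed on [0, horizon]."""
    rate = len(events) / horizon  # MLE of the homogeneous Poisson rate
    # Marginal probability that each item appears in an event's set.
    probs = {i: sum(i in s for _, s in events) / len(events) for i in items}
    return rate, probs

def log_likelihood(events, horizon, rate, probs, eps=1e-9):
    # Temporal part: n * log(rate) - rate * T for a homogeneous Poisson process.
    ll = len(events) * log(rate) - rate * horizon
    # Set part: independent Bernoulli draw per item, identical at every event.
    for _, s in events:
        for i, p in probs.items():
            p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
            ll += log(p) if i in s else log(1 - p)
    return ll
```

Because neither component depends on history, any gain a dynamic model shows over this log-likelihood can be attributed to modeling temporal or set dynamics.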





□ Gradient Flossing: Improving Gradient Descent through Dynamic Control of Jacobians

>> https://arxiv.org/abs/2312.17306

Gradient Flossing is based on a recently described link between the gradients of backpropagation through time and Lyapunov exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian.

Gradient flossing regularizes one or several Lyapunov exponents to keep them close to zero. This improves not only the error gradient norm but also the condition number of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.
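Lyapunov exponents of a sequence of Jacobians are usually estimated by repeated QR re-orthonormalization, and a flossing-style regularizer then penalizes their distance from zero. The dependency-free sketch below uses Gram-Schmidt for the QR step. The squared penalty and all function names are assumptions, and a real implementation would differentiate through this computation in an autodiff framework.

```python
from math import log, sqrt

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def qr_diag(m):
    """Gram-Schmidt on the columns of square matrix m; returns Q and diag(R)."""
    n = len(m)
    cols = [[m[i][j] for i in range(n)] for j in range(n)]
    q_cols, diag = [], []
    for v in cols:
        for q in q_cols:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = sqrt(sum(a * a for a in v))
        diag.append(norm)
        q_cols.append([a / norm for a in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, diag

def lyapunov_exponents(jacobians):
    """Time-averaged log growth rates along a re-orthonormalized basis."""
    n = len(jacobians[0])
    q = [[float(i == j) for j in range(n)] for i in range(n)]
    sums = [0.0] * n
    for jac in jacobians:
        q, diag = qr_diag(matmul(jac, q))  # propagate the basis, then re-orthonormalize
        sums = [s + log(d) for s, d in zip(sums, diag)]
    return [s / len(jacobians) for s in sums]

def flossing_penalty(exponents, k=1):
    """Sum of squares of the k largest exponents (assumed form of the regularizer)."""
    return sum(e * e for e in sorted(exponents, reverse=True)[:k])
```

For a constant Jacobian `[[2, 0], [0, 0.5]]`, the exponents come out as log 2 and -log 2, and the penalty on the leading exponent is (log 2)^2. Driving that penalty down pushes the leading direction toward neutral stability.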





□ UVAE: Integration of Heterogeneous Unpaired Data with Imbalanced Classes

>> https://www.biorxiv.org/content/10.1101/2023.12.18.572157v1

UVAE (Unbiasing Variational Autoencoder) is a VAE-based method capable of integrating and normalising unpaired, partially annotated data streams.

UVAE separates the confounding factor variability from the shared latent space, transforming heterogeneous datasets into a unified, homogeneous data stream, while performing simultaneous normalisation, merging, and class inference using stable non-adversarial learning objectives.






□ HyLight: Strain aware assembly of low coverage metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.12.22.572963v1

HyLight, a novel approach to push the limits of strain-aware metagenome assembly in a substantial manner. HyLight is based on de novo hybrid assembly, characterized by integrating both long / short, and next-generation sequencing reads during the assembly process.

HyLight is rooted in a "cross hybrid" strategy: it assembles long reads using short reads as an auxiliary source of data and, vice versa, assembles short reads assisted by long-read information. HyLight employs overlap graphs as the underlying data structure.

HyLight builds on the observation that the presence of long reads renders the use of de Bruijn graphs obsolete. This is well understood for long-read assemblies, where overlap graphs have regained a prominent role, but it may be somewhat surprising for short reads.

HyLight incorporates a filtering step that identifies mistaken (strain-unaware) overlaps and removes them from the graphs. The filtering step prevents the incorrect compression of strain-specific variation into contigs that mistakenly connect sequence from different strains.






□ BATH: Sensitive and error-tolerant annotation of protein-coding DNA

>> https://www.biorxiv.org/content/10.1101/2023.12.31.573773v1

BATH, a tool for highly sensitive annotation of protein-coding DNA based on direct alignment of that DNA to a database of protein sequences or profile hidden Markov models (pHMMs).

BATH is built on top of the HMMER3 code base, and its core functionality is to provide full HMMER3 sensitivity w/ automatic management of 6-frame codon translation. BATH introduces novel frameshift-aware algorithms to detect frameshift-inducing nucleotide insertions / deletions.
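
For readers unfamiliar with 6-frame translation, the following sketch translates a DNA string in all three forward and three reverse-complement frames using the standard genetic code. It only illustrates the bookkeeping that BATH automates and does not attempt its frameshift-aware alignment. The function names and the `+1`/`-1` frame labels are illustrative choices, not BATH's interface.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translation(dna):
    """Return translations for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_complement[offset:])
    return frames
```

A single-nucleotide insertion or deletion moves the true protein from one of these frames to another partway through the sequence, which is the situation that frameshift-aware alignment is designed to handle.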





□ GCNFORMER: graph convolutional network and transformer for predicting lncRNA-disease associations

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05625-1

GCNFORMER, a novel graph convolutional network and transformer-based LDA prediction model that constructs a graph adjacency matrix based on the intraclass and interclass relationships among lncRNAs, miRNAs and diseases.

In the GCNFORMER model, the graph convolutional network can effectively capture the topology and interactions in the lncRNA-disease association network, while the transformer can extract the contextual information within these complex relationships.
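
As a rough illustration of how a graph convolution mixes information along an adjacency matrix, the sketch below applies one symmetric-normalized propagation step in NumPy. The function name `gcn_layer`, the toy inputs and the choice of the widely used self-loop plus `D^-1/2 A D^-1/2` normalization are assumptions; the source does not specify which GCN variant GCNFORMER uses.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a = np.asarray(adjacency, dtype=float)
    a_hat = a + np.eye(a.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    # Scale rows and columns by the inverse square root of the degree.
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, a_norm @ np.asarray(features) @ np.asarray(weights))
```

In a heterogeneous setting the rows of `adjacency` would index lncRNA, miRNA and disease nodes together, so each node's new representation averages over its intraclass and interclass neighbours.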





□ scLANE: Interpretable trajectory inference with single-cell Linear Adaptive Negative-binomial Expression testing

>> https://www.biorxiv.org/content/10.1101/2023.12.19.572477v1

scLANE testing, a negative-binomial generalized linear model (GLM) framework for modeling nonlinear relationships while accounting for correlation structures inherent to multi-sample scRNA-seq experiments.

The scLANE framework is an extension of the Multivariate Adaptive Regression Splines (MARS) method, which builds nonlinear models out of piecewise linear components. scLANE can be used downstream of any pseudotemporal ordering or RNA velocity estimation method.

Truncated power basis splines are chosen empirically per-gene, per-lineage, providing results that are specific to each gene's dynamics across each biological subprocess - an improvement on methods that use a common number of equidistant knots for all genes.
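
A small sketch of what a linear truncated power basis looks like may help here. It builds the design matrix `[1, x, (x - t)+]` for a set of knots and fits it by ordinary least squares. The least-squares fit is a stand-in chosen for brevity, whereas scLANE fits a negative-binomial GLM and selects knots empirically; the function names and the fixed knot list are assumptions.

```python
import numpy as np

def truncated_power_basis(pseudotime, knots):
    """Design matrix with columns 1, x and max(0, x - t) for each knot t."""
    x = np.asarray(pseudotime, dtype=float)
    columns = [np.ones_like(x), x]
    columns += [np.maximum(0.0, x - t) for t in knots]
    return np.column_stack(columns)

def fit_piecewise_linear(pseudotime, expression, knots):
    """Least-squares coefficients (stand-in for the NB GLM fit)."""
    design = truncated_power_basis(pseudotime, knots)
    coef, *_ = np.linalg.lstsq(design, np.asarray(expression, dtype=float), rcond=None)
    return coef
```

Each hinge coefficient is the change in slope at its knot, which is what makes the fitted trend interpretable interval by interval along pseudotime.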

The coefficients generated by scLANE carry the same multiplicative interpretation as any GLM, providing a quantitative measure and significance test of the relationship of pseudotime with gene expression over empirically selected pseudotime intervals from each lineage.





□ GSDensity: Pathway centric analysis for single-cell RNA-seq and spatial transcriptomics data

>> https://www.nature.com/articles/s41467-023-44206-x

GSDensity uses multiple correspondence analysis (MCA) to co-embed cells and genes into a latent space and quantifies the overall variation of pathway activity levels across cells by estimating the density of the pathway genes in the latent space.

GSDensity calculates pathway activity for each cell using network propagation in a nearest-neighbor cell-gene graph, with pathway genes used as seeds for random walks.
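
The network propagation step can be sketched as a random walk with restart on a small graph. This is a generic version under stated assumptions: a dense adjacency matrix, a restart probability of 0.5 and the function name are all illustrative, and GSDensity's own graph construction and parameters are not reproduced here.

```python
import numpy as np

def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at the seeds."""
    a = np.asarray(adjacency, dtype=float)
    col_sums = a.sum(axis=0)
    col_sums[col_sums == 0] = 1.0           # avoid division by zero for isolated nodes
    transition = a / col_sums               # column-stochastic transition matrix
    p0 = np.zeros(a.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)      # uniform restart distribution over seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```

With pathway genes as seeds in a cell-gene graph, the scores that accumulate on cell nodes can be read as per-cell pathway activity.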






□ Hamiltonian truncation tensor networks for quantum field theories

>> https://scirate.com/arxiv/2312.12506

Hamiltonian truncation tensor networks use matrix product operator representations of interactions in momentum space, thus avoiding the issues of lattice discretisation and significantly reducing the computational cost of simulation compared to exact diagonalisation.

Hamiltonian truncation defines the Hilbert space basis and constructs the interacting part. For the mS model the free part is a massive boson model, which in momentum space reduces to an infinite set of independent harmonic oscillator modes.





□ Boolean TQFTs with accumulating defects, sofic systems, and automata for infinite words

>> https://arxiv.org/abs/2312.17033

They established a relationship between automata and one-dimensional Boolean Topological Quantum Field Theories (TQFTs), as well as the universal construction for Boolean topological theories in one dimension.

It is clear that it has a well-defined evaluation, independent of how the word is chopped into several intervals with finitely-many defects and one interval with infinitely-many defects, when presenting the floating interval as the composition of elementary morphisms.

To define a TQFT valued in the category of free B-modules, one needs suitable versions of automata and infinite words (ω-automata) to account for various types of boundary behaviour at inner endpoints of cobordisms.

A Z-invariant subset of Σ^Z is called an infinite language (a language of infinite words). An infinite language is called closed if the corresponding subset is closed in Σ^Z. Closed infinite languages are in a bijection with shift spaces.





□ Quantification of cell phenotype transition manifolds with information geometry

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573500v1

A novel approach to quantitatively analyze low-dimensional manifolds from single cell data. Transform each single cell's sequencing data into a multivariate Gaussian distribution, calculate the Fisher information of each cell and quantify the manifold of Cell Phenotype Transition.

A vector field learning method, trained with sparse vector data pairs, is used to learn a vector-valued function in a function Hilbert space.

We can define the Fisher metric on pre-defined variables such as eigengenes, using the reproducing kernel Hilbert space method (RKHS) or neural networks with backward propagation.

Just as RNA velocity reflects the direction of a single cell along the path of Cell Phenotype Transition (CPT) in gene expression space, the information velocity of a single cell represents the speed of information variation along the transition path.






□ Four-Dimensional-Spacetime Atomistic Artificial Intelligence Models

>> https://pubs.acs.org/doi/10.1021/acs.jpclett.3c01592

The 4D-spacetime GICnet model can, for given initial conditions (nuclear positions and velocities at time zero), predict nuclear positions and velocities as a continuous function of time up to the distant future.

Such models of molecules can be unrolled in the time dimension to yield longtime high-resolution molecular dynamics trajectories with high efficiency and accuracy.

4D-spacetime models can make predictions for different times in any order and do not need a stepwise evaluation of forces and integration of the equations of motion at discretized time steps, which is a major advance over traditional, cost-inefficient molecular dynamics.





□ Complexity And Ergodicity In Chaos Game Representation Of Genomic Sequences

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573653v1

The Chaos Game Representation (CGR) transforms a DNA sequence into a visual representation that exhibits personalized characteristics unique to that specific sequence.

An ergodic system explores all accessible states and, in the long run, provides a representative sample of its entire state space. In the analysis of biological sequences like DNA or protein sequences, ergodic theory facilitates the exploration of the distribution of elements.

A DNA sequence can be transformed into a sequence of Bernoulli trials, specifically a sequence composed of two symbols X1 and X2, where each nucleotide corresponds to an element of the transformed sequence.

CGR visually represents DNA sequences in a fractal-like pattern. In the chaos game representation of genomic sequences, each nucleotide is associated with a specific position in a coordinate system. The algorithm proceeds by iteratively plotting points based on the sequence.
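
The iterative plotting rule is short enough to write out. In this sketch each nucleotide owns a corner of the unit square and every new point lies halfway between the previous point and the corner of the current nucleotide. The corner assignment below is one common convention and the function name is illustrative; neither is taken from the paper.

```python
# Assumed corner layout: A bottom-left, C top-left, G top-right, T bottom-right.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def chaos_game_representation(sequence):
    """Return the list of CGR points for a DNA sequence."""
    x, y = 0.5, 0.5                      # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:          # skip ambiguous symbols such as N
            continue
        corner_x, corner_y = CORNERS[base]
        # Move halfway towards the corner of the current nucleotide.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points
```

Because each step halves the distance to a corner, the quadrant (and sub-quadrant) in which a point lands encodes the most recent nucleotides, which is what produces the fractal-like k-mer structure.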






□ A mathematical perspective on Transformers

>> https://arxiv.org/abs/2312.10794

Transformers are in fact flow maps on P(R^d), the space of probability measures over R^d. Transformers evolve a mean-field interacting particle system. Every particle follows the flow of a vector field which depends on the empirical measure of all particles.

The structure of these interacting particle systems allows one to draw concrete connections to established topics in mathematics, including nonlinear transport equations, Wasserstein gradient flows, collective behavior models, and optimal configurations of points on spheres.





□ Time Vectors: Time is Encoded in the Weights of Finetuned Language Models

>> https://arxiv.org/abs/2312.13401

Time vectors, a simple tool to customize language models to new time periods.
Time vectors are created by finetuning a language model on data from a single time, and then subtracting the weights of the original pretrained model.

Time vectors specify a direction in weight space that, as our experiments show, improves performance on text from that time period. Time vectors specialized to adjacent time periods appear to be positioned closer together in a manifold.
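
The weight arithmetic described above can be sketched with plain dictionaries of arrays standing in for model checkpoints. The function names, the `scale` parameter and the dictionary representation are assumptions for illustration; a real workflow would operate on the framework's own checkpoint format.

```python
import numpy as np

def time_vector(finetuned, pretrained):
    """Per-parameter difference: finetuned weights minus pretrained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def apply_time_vector(pretrained, vector, scale=1.0):
    """Move the pretrained weights along the time vector by a chosen amount."""
    return {name: pretrained[name] + scale * vector[name] for name in pretrained}
```

With `scale=1.0` this recovers the finetuned weights exactly, while other values move the model a different distance along the same direction in weight space.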





□ SECE: accurate identification of spatial domain by incorporating global spatial proximity and local expression proximity

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573377v1

SECE, an accurate spatial domain identification method for spatial transcriptomics (ST) data. In contrast to existing approaches, SECE incorporates both global spatial proximity and local expression proximity of the data to derive spatial domains.

The spatial embedding (SE) obtained by SECE enables downstream analysis including low-dimensional visualization and trajectory inference.

SECE utilizes Partition-based Graph Abstraction (PAGA) at the domain level and Monocle3 at the single-cell level. Moreover, when applied to ST data with single-cell resolution, SECE can accurately assign cell type labels by clustering cell type-related embeddings.





□ SOAPy: a Python package to dissect spatial architecture, dynamics and communication

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572725v1

SOAPy (Spatial Omics Analysis in Python) performs multiple tasks for dissecting spatial organization, incl. spatial domain, spatial expression tendency, spatiotemporal expression pattern, co-localization of paired cell types, multi-cellular niches, and cell-cell communication.

SOAPy employs tensor decomposition to extract components from the third-order expression tensor ("Time-Space-Gene"), revealing hidden patterns and reducing the complexity of data interpretation.





□ scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data

>> https://www.nature.com/articles/s42003-023-05634-z

scPML, utilizing well-labeled gene expression data, learns latent cell-type-specific patterns for annotating cells in test data. scPML initially employs various pathway datasets to model multiple cell-cell graphs, learning different kinds of relationships among cells in a training dataset.

Pathway datasets divide genes into various gene sets based on specific biological processes, which reflect cell heterogeneity on the level of biological functions and minimize the impact of dropout events as a gene has limited effect on the entire gene set.

Structural information is learned from cell-cell graphs using self-supervised convolutional neural networks in scPML to produce denoised low-dimensional representations for cells.

scPML attempts to find a common representation that can be reconstructed into the corresponding embeddings and is separable. After obtaining the common latent representations, scPML uses a classifier to assign labels.






□ Pair-EGRET: enhancing the prediction of protein-protein interaction sites through graph attention networks and protein language models

>> https://www.biorxiv.org/content/10.1101/2023.12.25.572648v1

Pair-EGRET, an edge-aggregated graph attention network that leverages the features extracted from pre-trained transformer-like models to accurately predict PPI sites.

Pair-EGRET works on a k-nearest neighbor graph, representing the three-dimensional structure of a protein, and utilizes the cross-attention mechanism for accurate identification of interfacial residues of a pair of proteins.





□ ChimericFragments: Computation, analysis, and visualization of global RNA networks

>> https://www.biorxiv.org/content/10.1101/2023.12.21.572723v1

ChimericFragments, a computational platform for the analysis and interpretation of RNA-RNA interaction datasets starting from raw sequencing files. ChimericFragments enables rapid computation of RNA-RNA pairs, RNA duplex prediction, and a graph-based, interactive visualization of the results.

ChimericFragments employs a new algorithm based on the complementarity of chimeric fragments around the ligation site, which boosts the identification of bona fide RNA duplexes.

ChimericFragments shows the aggregate of all detected ligation sites for each interacting transcript, allowing for the identification of preferred base-pairing sequences in regulatory RNAs and their targets.





□ GAPS: Geometric Attention-based Networks for Peptide Binding Sites Identification by the Transfer Learning Approach

>> https://www.biorxiv.org/content/10.1101/2023.12.26.573336v1

GAPS employs a transfer learning strategy, leveraging pre-trained information on protein-protein binding sites to enhance the training for recognizing protein-peptide binding sites, while considering the similarity between proteins and peptides.

The atom-based geometric information gives the GAPS model a finer granularity, increasing the likelihood of capturing inherent biological information among amino acid residues, and it also ensures the model's translation invariance and rotation equivariance.





□ Optimal distance metrics for single-cell RNA-seq populations

>> https://www.biorxiv.org/content/10.1101/2023.12.26.572833v1

A reusable framework for evaluating distance metrics for single-cell gene expression data. To mimic how distance metrics would be used in model evaluation or dataset analysis, they quantify their sensitivity and robustness when identifying differences between populations.

The control relative percentile (CRP) is defined as the percentage of perturbed conditions with a larger distance to the reference control set than the control sets to each other, averaged across five control sets.





□ COBRA: Higher-order correction of persistent batch effects in correlation networks

>> https://www.biorxiv.org/content/10.1101/2023.12.28.573533v1

COBRA (Co-expression Batch Reduction Adjustment), a method for computing a batch-corrected gene co-expression matrix based on estimating a conditional covariance matrix.

COBRA estimates a reduced set of parameters expressing the co-expression matrix as a function of the sample covariates, allowing control for continuous and categorical covariates.





□ Multidimensional Soliton Systems

>> https://arxiv.org/abs/2312.17096

A remarkable feature of multidimensional solitons is their ability to carry vorticity; however, 2D vortex rings and 3D vortex tori are subject to strong splitting instability.

Therefore, it is natural to categorize the basic results according to physically relevant settings which make it possible to maintain stability of fundamental (non-topological) and vortex solitons against the collapse and splitting, respectively.

The present review is focused on schemes that were recently elaborated in terms of Bose-Einstein condensates and similar photonic setups.

These are two-component systems with spin-orbit coupling, and ones stabilized by the beyond-mean-field Lee-Huang-Yang effect. The latter setting has been implemented experimentally, giving rise to stable self-trapped quasi-2D and 3D "quantum droplets".





□ Node Features of Chromosome Structure Network and Their Connections to Genome Annotation

>> https://www.biorxiv.org/content/10.1101/2023.12.29.573476v1

They constructed chromosome structure networks (CSNs) from bulk Hi-C data and calculated a set of site-resolved (node-based) network properties of these CSNs. These network properties are useful for characterizing chromosome structure features.

Semi-local network properties are more capable of characterizing genome annotations than diffusive or ultra-local node features.

For example, local square clustering coefficient can be a strong classifier of lamina-associated domains (LADs), whereas a path-based network property, closeness centrality, does not vary concordantly with LAD status.





□ RepeatOBserver: tandem repeat visualization and centromere detection

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573697v1

RepeatOBserver, a new tool for visualizing tandem repeats and clustered transposable elements and for identifying potential natural centromere locations, using a Fourier transform of DNA walks.

RepeatOBserver can identify a broad range of repeats (3-20,000bp long) in genome assemblies without any a priori knowledge of repeat sequences or the need for optimizing parameters.
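
To show why a Fourier transform of a DNA walk exposes tandem repeats, this sketch builds a one-dimensional purine/pyrimidine walk, removes its linear trend and reports the period of the strongest spectral peak. The step mapping, the detrending step and the function names are simplifying assumptions; RepeatOBserver's actual walks and spectra are more elaborate.

```python
import numpy as np

# Assumed mapping: purines step up, pyrimidines step down, anything else stays.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}

def dna_walk(sequence):
    """Cumulative sum of +1/-1 steps along the sequence."""
    steps = np.array([STEP.get(base, 0) for base in sequence.upper()], dtype=float)
    return np.cumsum(steps)

def dominant_repeat_period(sequence):
    """Period (in bp) of the strongest non-zero frequency in the detrended walk."""
    walk = dna_walk(sequence)
    position = np.arange(walk.size)
    # Remove the linear drift so it does not dominate the low frequencies.
    slope, intercept = np.polyfit(position, walk, 1)
    residual = walk - (slope * position + intercept)
    spectrum = np.abs(np.fft.rfft(residual))
    freqs = np.fft.rfftfreq(walk.size)
    peak = np.argmax(spectrum[1:]) + 1   # skip the zero-frequency bin
    return 1.0 / freqs[peak]
```

A perfect tandem array of a 6 bp unit yields a peak at a period of 6 bp without the repeat sequence being supplied in advance, which mirrors the "no a priori knowledge" property described above.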





□ AntiNoise: Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data

>> https://www.biorxiv.org/content/10.1101/2023.12.30.573742v1

The synthetic approach performs nucleotide shuffling, which abolishes the enrichment of any motifs. This procedure completely destroys the enrichment of k-mers of any length in the foreground sequences.

These k-mers represent either specific or non-specific motifs; they compete with each other in the subsequent step of de novo motif search.

NA is the maximal number of attempts to find matching background sequences in the genome. If the last NA attempts to find at least one more background sequence are all unsuccessful, the algorithm terminates.