# lens, align.

## Long is the time, but the true comes to pass.

### Mitus Lumen.

- emit language syntax. -

□ Fluctuation theorems with retrodiction rather than reverse processes

>> https://avs.scitation.org/doi/10.1116/5.0060893

The everyday meaning of (ir)reversibility in nature is captured by the perceived “arrow of time”: if the video of the evolution played backward makes sense, the process is reversible; if it does not make sense, it is irreversible.

The reverse process is generically not the video played backward: to cite an extreme example, nobody conceives bombs that fly upward to their airplanes while cities are being built from rubble.

In the case of controlled protocols in the presence of an unchanging environment, the reverse process is implemented by reversing the protocol. If the environment changes, the connection between the physical process and the associated reverse one becomes thinner.

The retrodiction channel of an erasure channel is the erasure channel that returns the reference prior—a result that can be easily extended to any alphabet dimension.
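A minimal sketch of Bayesian retrodiction for a discrete channel, to make the erasure statement concrete. The function name `retrodict` and the reading of "erasure channel" as a channel whose output distribution ignores its input are assumptions made for illustration, not definitions taken from the paper.

```python
def retrodict(channel, prior):
    """Bayes-reverse a channel: reverse[y][x] is proportional to channel[x][y] * prior[x].

    channel[x][y] is the probability of output y given input x.
    prior[x] is the reference prior on inputs.
    """
    n_in, n_out = len(channel), len(channel[0])
    reverse = []
    for y in range(n_out):
        joint = [channel[x][y] * prior[x] for x in range(n_in)]
        total = sum(joint)
        # An output that never occurs has no defined posterior; fall back to the prior.
        reverse.append([j / total for j in joint] if total > 0 else list(prior))
    return reverse


# Assumed erasing channel: every input yields the same output distribution.
erase = [[0.2, 0.8], [0.2, 0.8], [0.2, 0.8]]
prior = [0.5, 0.3, 0.2]

for row in retrodict(erase, prior):
    print(row)  # each row is (up to rounding) the reference prior
```

Because the output carries no information about the input, every posterior row collapses to the prior, which is the sense in which the retrodiction "returns the reference prior".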

PROCESSES VERSUS INFERENCES: fluctuation relations are intimately related to statistical distances (“divergences”), and Bayesian retrodiction arises from the requirement that the fluctuating variable can be computed locally.

□ The Metric Dimension of the Zero-Divisor Graph of a Matrix Semiring

>> https://arxiv.org/pdf/2111.07717v1.pdf

The metric dimensions of graphs corresponding to various algebraic structures have been studied, including the metric dimension of a zero-divisor graph of a commutative ring, a total graph of a finite commutative ring, an annihilating-ideal graph of a finite ring, and a commuting graph of a dihedral group.

Antinegative semirings are also called antirings. The simplest example of an antinegative semiring is the binary Boolean semiring B, the set {0,1} in which addition and multiplication are the same as in Z except that 1 + 1 = 1.
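The Boolean semiring is small enough to write out directly. The sketch below, with hypothetical helper names, implements its arithmetic and exhibits two nonzero 2×2 matrices over B whose product is zero, i.e. a pair of zero divisors in the matrix semiring.

```python
def b_add(a, b):
    """Addition in B = {0, 1}: like integer addition, except 1 + 1 = 1."""
    return a | b


def b_mul(a, b):
    """Multiplication in B is ordinary multiplication on {0, 1}."""
    return a & b


def mat_mul(A, B):
    """Product of two square matrices over the Boolean semiring."""
    n = len(A)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc = b_add(acc, b_mul(A[i][k], B[k][j]))
            out[i][j] = acc
    return out


def is_zero(A):
    return all(v == 0 for row in A for v in row)


A = [[1, 0], [0, 0]]
B = [[0, 0], [0, 1]]
print(is_zero(A), is_zero(B), is_zero(mat_mul(A, B)))
```

Antinegativity is visible in `b_add`: a sum is 0 only when both summands are 0.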

For infinite entire antirings S, the metric dimension of Γ(Mn(S)) is infinite. Therefore, the authors limit themselves to studying finite semirings. For every Λ ⊆ Nn × Nn, at most one zero-divisor matrix with its pattern of zero and non-zero entries prescribed by Λ is not in W.

□ CONTEXT, JUDGEMENT, DEDUCTION

>> https://arxiv.org/pdf/2111.09438v1.pdf

The paper provides an abstract definition of type constructor featuring the usual formation, introduction, elimination and computation rules. In proof theory, it offers a deep analysis of structural rules, demystifying some of their properties and putting them into context.

It also discusses the internal logic of a topos, a predicative topos, an elementary 2-topos et similia, and shows how these can be organized in judgemental theories.

□ Scasa: Isoform-level Quantification for Single-Cell RNA Sequencing

Scasa, an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs.

Scasa compares well in simulations against competing approaches including Alevin, Cellranger, Kallisto, Salmon, Terminus and STARsolo at both isoform- and gene-level expression.

Scasa takes advantage of the efficient preprocessing provided by existing pseudoaligners such as Kallisto-bustools or Alevin to produce a read-count equivalence-class matrix. Scasa splits the equivalence class output by cell and applies the AEM algorithm to multiple cells.

□ corral: Single-cell RNA-seq dimension reduction, batch integration, and visualization with correspondence analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469874v1.full.pdf

Correspondence Analysis (CA) for dimension reduction of scRNAseq data, which is a performant alternative to PCA. Designed for use with counts, CA is based on decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts.
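As a rough illustration of the matrix that correspondence analysis decomposes, the sketch below computes standard (Pearson) chi-squared residuals from a small count table using only the standard library. It is not the corral implementation; the function names are hypothetical, the table is assumed to have no all-zero rows or columns, and the SVD step that follows in CA is left out.

```python
import math


def chi2_residuals(counts):
    """Standardized residuals (P - r c^T) / sqrt(r c^T) of a count table."""
    total = sum(sum(row) for row in counts)
    P = [[v / total for v in row] for row in counts]  # proportions
    r = [sum(row) for row in P]                       # row masses
    c = [sum(col) for col in zip(*P)]                 # column masses
    return [
        [(P[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(len(c))]
        for i in range(len(r))
    ]


def total_inertia(residuals):
    """Sum of squared residuals; equals the chi-squared statistic divided by N."""
    return sum(v * v for row in residuals for v in row)


counts = [[10, 0], [0, 10]]  # toy genes x cells table
S = chi2_residuals(counts)
print(S, total_inertia(S))
```

A table whose rows are proportional to each other gives residuals of zero, so there is nothing left for CA to decompose.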

CA using the Freeman-Tukey chi-squared residual was most performant overall in scRNAseq data. Variance stabilizing transformations applied in conjunction with standard CA and the use of “power deflation” smoothing both improve performance in downstream clustering tasks.

corralm, a CA-based method for multi-table batch integration of scRNAseq data in shared latent space. The adaptation of correspondence analysis to the integration of multiple tables is similar to the method for single tables with additional matrix concatenation operations.

corralm employs indexed residuals, dividing the standardized residuals by the square root of the expected proportion to reduce the influence of columns with larger masses (library depth). It also applies CA-style processing to continuous data with the Hellinger distance adaptation.

□ Fuzzy set intersection based paired-end short-read alignment

>> https://www.biorxiv.org/content/10.1101/2021.11.23.469039v1.full.pdf

a new algorithm for aligning both reads in a pair simultaneously by fuzzily intersecting the sets of candidate alignment locations for each read. SNAP with the fuzzy set intersection algorithm dominates BWA and Bowtie, having both better performance and better concordance.

Fuzzy set intersection avoids doing expensive evaluations of many candidate alignments that would eventually be dismissed because they are too far from any plausible alignments for the other end of the pair.
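The idea can be sketched with two sorted lists of candidate locations, one per read end: only pairs that lie within a maximum spacing of each other survive, so far-apart candidates are never scored. This is a simplified two-pointer illustration with hypothetical names, not SNAP's actual data structures or scoring.

```python
def fuzzy_intersect(locs_a, locs_b, max_gap):
    """Return (a, b) pairs with |a - b| <= max_gap.

    Both inputs must be sorted ascending. Candidates with no partner
    within max_gap are skipped without any further evaluation.
    """
    pairs = []
    start = 0
    for a in locs_a:
        # Locations too far to the left of a are also too far left of later a's.
        while start < len(locs_b) and locs_b[start] < a - max_gap:
            start += 1
        j = start
        while j < len(locs_b) and locs_b[j] <= a + max_gap:
            pairs.append((a, locs_b[j]))
            j += 1
    return pairs


print(fuzzy_intersect([100, 5000, 9000], [350, 5200, 20000], max_gap=500))
# → [(100, 350), (5000, 5200)]
```

Only the surviving pairs would then go on to the expensive alignment scoring step.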

□ ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion

>> https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-021-08101-3

scLRTC imputes the dropout entries of a given scRNA-seq expression. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs the tensor decomposition to denoise the data.

scLRTC reconstructs the cell expression by adopting the low-rank tensor completion algorithm, which can restore the gene-to-gene and cell-to-cell correlations. scLRTC is also shown to be effective in cell visualization and in inferring cell lineage trajectories.

□ FDJD: RNA-Seq Based Fusion Transcript Detection Using Jaccard Distance

>> https://www.biorxiv.org/content/10.1101/2021.11.17.469019v1.full.pdf

Converting the RNA categorical space into a compact binary array called binary fingerprints, which enables us to reduce the memory usage and increase efficiency. The search and detection of fusion candidates are done using the Jaccard distance.
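A small sketch of the Jaccard distance on binary fingerprints, with each fingerprint stored as a Python integer used as a bitset. How FDJD actually builds its fingerprints is not described here, so the `encode` helper and its category-to-bit index are purely hypothetical.

```python
def encode(items, index):
    """Pack a set of categorical items into a bitset using a category -> bit map."""
    bits = 0
    for item in items:
        bits |= 1 << index[item]
    return bits


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two bitset fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    inter = bin(a & b).count("1")
    return 1.0 - inter / union


index = {"k1": 0, "k2": 1, "k3": 2, "k4": 3}  # hypothetical feature index
x = encode(["k2", "k3", "k4"], index)
y = encode(["k1", "k2", "k3"], index)
print(jaccard_distance(x, y))  # → 0.5
```

Bitwise AND and OR keep both memory use and comparison cost low, which is the efficiency argument made for the fingerprint representation.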

FDJD (Fusion Detection using the Jaccard Distance) exhibits superior accuracy compared to popular alternative fusion detection methods. FDJD generates fusion candidates using both split reads and discordantly aligned pairs which are produced by the STAR alignment step.

□ Inspector: Accurate long-read de novo assembly evaluation

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02527-4

Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions.

Inspector generates read-to-contig alignment and performs downstream assembly evaluation. Inspector can report the precise locations and sizes for structural and small-scale assembly errors and distinguish true assembly errors from genetic variants.

□ Characterizing Protein Conformational Spaces using Dimensionality Reduction and Algebraic Topology

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468545v1.full.pdf

Linear dimensionality reduction like PCA and its variants may not capture the complex, non-linear nature of protein conformational landscape. Dimensionality reduction techniques are broadly classified based on the solution space they generate, as convex and non-convex.

Even after the conformational space is sampled, it should be filtered and clustered to extract meaningful information.

The structures represented by these conformations are then analyzed by studying their high-dimensional topological properties to identify truly distinct conformations and holes in the conformational space that may represent high energy barriers.

□ scCODE: an R package for personalized differentially expressed gene detection on single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469072v1.full.pdf

DE methods together with gene filtering have profound impact on DE gene identification, and different datasets will benefit from personalized DE gene detection strategies.

scCODE (single cell Consensus Optimization of Differentially Expressed gene detection) produces consensus DE gene results.

scCODE summarizes the top (default: all) DE genes from each of the selected strategies. The principle of consensus optimization is that DE genes observed more frequently across different analysis strategies are more reliable.
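The frequency principle can be illustrated in a few lines. scCODE itself is an R package; this Python sketch only mirrors the voting idea, and the function name, the strategy labels and the `min_votes` threshold are hypothetical.

```python
from collections import Counter


def consensus_genes(results, min_votes):
    """Rank genes by how many strategies reported them as DE.

    results maps a strategy name to its list of DE genes.
    Genes seen by fewer than min_votes strategies are dropped.
    """
    votes = Counter(g for genes in results.values() for g in set(genes))
    kept = [g for g, v in votes.items() if v >= min_votes]
    # Most frequently observed first; ties broken alphabetically.
    return sorted(kept, key=lambda g: (-votes[g], g))


results = {
    "strategy_1": ["A", "B", "C"],
    "strategy_2": ["B", "C"],
    "strategy_3": ["C", "D"],
}
print(consensus_genes(results, min_votes=2))  # → ['C', 'B']
```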

□ HDMC: a novel deep learning based framework for removing batch effects in single-cell RNA-seq data

This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a Maximum Mean Discrepancy based loss.
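For reference, a Maximum Mean Discrepancy between two samples can be estimated as below. This is the generic biased estimator of squared MMD with an RBF kernel, written with the standard library; the kernel choice, the `gamma` bandwidth and how HDMC applies the loss to local distributions are assumptions, not details from the paper.

```python
import math


def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel between two points."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X and Y."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy


batch_1 = [(0.0, 0.0), (1.0, 0.0)]
batch_2 = [(5.0, 5.0), (6.0, 5.0)]
print(mmd2(batch_1, batch_1), mmd2(batch_1, batch_2))
```

The value is zero when the two samples coincide and grows as the batches drift apart, which is what makes it usable as a distribution-matching loss.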

HDMC divides cells in each batch into clusters and uses a contrastive learning method to simultaneously align similar cluster pairs and keep noisy pairs apart from each other. This yields clusters in which all cells are of the same type and avoids clusters that mix cells of different types.

□ COBREXA.jl: constraint-based reconstruction and exascale analysis

COBREXA.jl provides a ‘batteries-included’ solution for scaling analyses to make efficient use of high-performance computing (HPC) facilities, which allows them to be realistically applied to pre-exascale-sized models.

COBREXA formulates optimization problems and is compatible with JuMP solvers. The building blocks are designed so that a constructed workflow can explore flux variability in many variants, execute in a distributed fashion, and collect the many results in a multi-dimensional array.

□ Built on sand: the shaky foundations of simulating single-cell RNA sequencing data

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468676v1.full.pdf

Most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration, and potentially unreliable ranking of clustering methods; and, it is generally unknown.

By definition, simulations generate synthetic data. On the one hand, conclusions drawn from simulation studies are frequently criticized, because simulations cannot completely mimic (real) experimental data.

□ DiagAF: A More Accurate and Efficient Pre-Alignment Filter for Sequence Alignment

>> https://ieeexplore.ieee.org/document/9614999/

DiagAF uses a new lower bound of edit distance based on shift Hamming masks. The new lower bound makes use of fewer shift Hamming masks compared with state-of-the-art algorithms such as SHD and MAGNET.
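To show how shifted comparisons can bound edit distance from below, here is a deliberately simple filter in the spirit of shifted Hamming masks. It is not DiagAF's bound, nor SHD or MAGNET; the names are hypothetical and it assumes a global alignment in which read and reference start at the same position. A read character that matches no reference character within `e` positions of its own index cannot be part of a match in any alignment with at most `e` edits, so counting such characters never overestimates the edit distance.

```python
def unmatched_lower_bound(read, ref, e):
    """Count read positions with no equal reference character within e shifts."""
    count = 0
    for i, ch in enumerate(read):
        lo = max(0, i - e)
        hi = min(len(ref) - 1, i + e)
        if not any(ref[j] == ch for j in range(lo, hi + 1)):
            count += 1  # this character must be covered by an edit
    return count


def passes_filter(read, ref, e):
    """Keep the pair for full alignment only if the bound does not exceed e."""
    return unmatched_lower_bound(read, ref, e) <= e


print(passes_filter("ACGTTCGT", "ACGTACGT", 1))  # one substitution: kept
print(passes_filter("AAAAAAAA", "CCCCCCCC", 2))  # hopeless pair: rejected
```

A filter of this kind can let through pairs that later fail full alignment (false positives), but it never rejects a pair whose true edit distance is within the threshold.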

DiagAF has the following features: it is faster; it has a lower false positive rate and a zero false negative rate; it can deal with alignments of unequal lengths; and it can pre-align a string to multiple candidates in a single run. DiagAF can align sequences with early termination for true alignments.

□ Explainability methods for differential gene analysis of single cell RNA-seq clustering models

>> https://www.biorxiv.org/content/10.1101/2021.11.15.468416v1.full.pdf

The absence of “ground truth” information about the DE genes makes evaluation on real-world datasets a complex task, usually requiring additional biological experiments for validation.

A comprehensive study comparing the performance of dedicated DE methods with that of explainability methods typically used in machine learning, both model-agnostic (SHAP, permutation importance) and model-specific (NN gradient-based methods).

The gradient method achieved the highest accuracy on the scziDesk and scDeepCluster while on contrastive-sc the results are comparable to the other top performing methods.

contrastive-sc employs high levels of NN dropout as data augmentation and thus learns a sparse representation of the input data, penalizing by design the capacity to learn all relevant features.

□ MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences

MAGUS is fairly robust to fragmentary sequences under many conditions. Accuracy improves further with a two-stage approach in which MAGUS is used to align selected “backbone sequences” and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models.

MAGUS+eHMMs matches or improves on both MAGUS and UPP, particularly when aligning datasets that evolved under high rates of evolution and that have large fractions of fragmentary sequences.

□ FastQTLmapping: an ultra-fast package for mQTL-like analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468610v1.full.pdf

FastQTLmapping is a computationally efficient, exact, and generic solver for exhaustive multiple regression analysis involving extraordinarily large numbers of dependent and explanatory variables with covariates.

FastQTLmapping can handle omics data containing tens of thousands of individuals and billions of molecular loci.

FastQTLmapping accepts input files in text format and in Plink binary format. The output file is in text format and contains all test statistics for all regressions, with the ability to control the volume of the output at preset significance thresholds.

□ ZARP: An automated workflow for processing of RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2021.11.18.469017v1.full.pdf

ZARP (Zavolan-Lab Automated RNA-seq Pipeline) can run locally or in a cluster environment, generating extensive reports not only of the data but also of the options utilized.

ZARP requires two distinct input files: A tab-delimited file with sample-specific information, such as paths to the sequencing data (FASTQ), transcriptome annotation (GTF) and experiment protocol- and library-preparation specifications like adapter sequences or fragment size.

To provide a high-level topographical/functional annotation of which gene segments (e.g., CDS, 3’UTR, intergenic) and biotypes (e.g., protein coding genes, rRNA) are represented by the reads in a given sample, ZARP includes ALFA.

□ VIVID: a web application for variant interpretation and visualisation in multidimensional analyses

>> https://www.biorxiv.org/content/10.1101/2021.11.16.468904v1.full.pdf

VIVID, a novel interactive and user-friendly platform that automates mapping of genotypic information and population genetic analysis from VCF files in 2D and 3D protein structural space.

VIVID is a unique ensemble user interface that enables users to explore and interpret the impact of genotypic variation on the phenotypes of secondary and tertiary protein structures.

□ Spliceator: multi-species splice site prediction using convolutional neural networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04471-3

Spliceator is based on the Convolutional Neural Networks technology and more importantly, is trained on an original high quality dataset containing genomic sequences from organisms ranging from human to protists.

Spliceator achieves overall high accuracy compared to other state-of-the-art programs, including the neural network-based NNSplice, MaxEntScan that models SS using the maximum entropy distribution, and two CNN-based methods: DSSP and SpliceFinder.

□ GSA: an independent development algorithm for calling copy number and detecting homologous recombination deficiency (HRD) from target capture sequencing

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04487-9

Genomic Scar Analysis (GSA) could effectively and accurately calculate the purity and ploidy of tumor samples through NGS data, and then reflect the degree of genomic instability and large-scale copy number variations of tumor samples.

An evaluation of the segmentation and genotype identification produced by the GSA algorithm, compared with two other algorithms, PureCN and ASCAT, found that the segmentation result of the GSA algorithm was more logical.

□ A computationally efficient clustering linear combination approach to jointly analyze multiple phenotypes for GWAS

>> https://www.biorxiv.org/content/10.1101/2021.11.22.469509v1.full.pdf

The Clustering Linear Combination (CLC) method works particularly well with phenotypes that have natural groupings, but because the number of clusters for a given dataset is unknown, the final test statistic of the CLC method is the minimum p-value among all p-values of the CLC test statistics obtained from each possible number of clusters.

Computationally Efficient CLC (ceCLC) to test the association between multiple phenotypes and a genetic variant. ceCLC uses the Cauchy combination test to combine all p-values of the CLC test statistics obtained from each possible number of clusters.
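The Cauchy combination test has a closed form, so the combined p-value needs no permutation step. A minimal sketch with equal weights by default follows; the function name is hypothetical, and p-values of exactly 0 or 1 are assumed not to occur.

```python
import math


def cauchy_combine(pvalues, weights=None):
    """Combine p-values with the Cauchy combination test.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights summing to 1.
    Under the null, T is approximately standard Cauchy, so
    p = 0.5 - arctan(T) / pi.
    """
    n = len(pvalues)
    if weights is None:
        weights = [1.0 / n] * n
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi


# One p-value per candidate number of clusters (made-up values).
print(cauchy_combine([0.01, 0.02, 0.5]))
```

A single small p-value dominates the statistic through the heavy tail of the tangent transform, so the combined result behaves much like a minimum p-value while keeping an analytic null distribution.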

□ Figbird: A probabilistic method for filling gaps in genome assemblies

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469861v1.full.pdf

Figbird, a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes of read pairs and sequencing errors.

Figbird uses an iterative approach based on the expectation-maximization (EM) algorithm. The method is based on a generative model for sequencing proposed in CGAL and subsequently used to develop a scaffolding tool SWALO.

□ TSEBRA: transcript selector for BRAKER

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04482-0

TSEBRA uses a set of arbitrarily many gene prediction files in GTF format together with a set of files of heterogeneous extrinsic evidence to produce a combined output.

TSEBRA uses extrinsic evidence in the form of intron regions or start/stop codon positions to evaluate and filter transcripts from gene predictions.

□ VG-Pedigree: A Complete Pedigree-Based Graph Workflow for Rare Candidate Variant Analysis

>> https://www.biorxiv.org/content/10.1101/2021.11.24.469912v1.full.pdf

VG-Pedigree, a pedigree-aware workflow based on the pangenome-mapping tool Giraffe and the variant-calling tool DeepTrio using a specially-trained model for Giraffe-based alignments.

VG-Pedigree improves mapping and variant calling in both SNVs and INDEL variants over those produced by alignments created using BWA-MEM to a linear-reference and Giraffe mapping to a pangenome graph containing data from the 1000 Genomes Project.

□ Detecting fabrication in large-scale molecular omics data

>> https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0260395

Just as has been previously shown in the financial sector, digit frequencies are a powerful data representation when used in combination with machine learning to predict the authenticity of data. Fraud detection methods must be updated for sophisticated computational fraud.

The study examines fabrication detection methods in biomedical research and shows that machine learning can be used to detect fraud in large-scale omics experiments. The Benford-like digit frequency method can be generalized to any tabular numeric data.
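A sketch of the digit-frequency representation: extract the leading digit of each value, tabulate frequencies for digits 1 to 9, and compare against Benford's law. The helper names are hypothetical, and the downstream machine-learning classifier from the paper is not reproduced here; the frequencies would simply serve as its input features.

```python
import math
from collections import Counter

# Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """Leading significant digit of a nonzero number."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation starts with that digit


def digit_frequencies(values):
    """Relative frequency of leading digits 1..9, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    if not digits:
        return [0.0] * 9
    counts = Counter(digits)
    return [counts.get(d, 0) / len(digits) for d in range(1, 10)]


values = [0.00321, 4500, 12.7, 1.9, 0, 130]
print(digit_frequencies(values))
print(BENFORD)
```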

□ monaLisa: an R/Bioconductor package for identifying regulatory motifs

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470570v1.full.pdf

monaLisa (MOtif aNAlysis with Lisa), an R/Bioconductor package that implements approaches to identify relevant transcription factors from experimental data.

monaLisa uses randomized lasso stability selection. monaLisa further provides helpful functions for motif analyses, including functions to predict motif matches and calculate similarity between motifs.

□ BreakNet: detecting deletions using long reads and a deep learning approach

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04499-5

BreakNet first extracts feature matrices from long-read alignments. Second, it uses a time-distributed CNN to integrate and map the feature matrices to feature vectors.

BreakNet employs a BLSTM model to analyse the produced set of continuous feature vectors in both the forward and backward directions. A classification module then determines whether a region refers to a deletion.

□ Variance in Variants: Propagating Genome Sequence Uncertainty into Phylogenetic Lineage Assignment

>> https://www.biorxiv.org/content/10.1101/2021.11.30.470642v1.full.pdf

A probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty, which naturally leads to resampling and replication as a framework for uncertainty propagation.

With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis.

This framework involves converting the uncertainty scores into a matrix of probabilities, and repeatedly sampling from this matrix and using the resultant samples in downstream analysis.
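The conversion and resampling steps might look like the following sketch. It assumes Phred-scaled qualities (error probability 10^(-Q/10)) and, as a simplifying assumption not taken from the paper, spreads the error mass evenly over the three other bases; the function names are hypothetical.

```python
import random

BASES = "ACGT"


def prob_matrix(seq, quals):
    """One row per position: probability of each base in BASES order."""
    rows = []
    for base, q in zip(seq, quals):
        err = 10 ** (-q / 10)  # Phred quality -> error probability
        # Assumption: the error mass is split evenly over the other three bases.
        rows.append([1 - err if b == base else err / 3 for b in BASES])
    return rows


def resample(matrix, rng):
    """Draw one plausible sequence from the per-position probabilities."""
    return "".join(rng.choices(BASES, weights=row)[0] for row in matrix)


matrix = prob_matrix("ACGT", [30, 30, 5, 30])
rng = random.Random(0)
replicates = [resample(matrix, rng) for _ in range(5)]
print(replicates)
```

Each replicate would then be passed through the downstream analysis, for example lineage assignment, and the spread of the results reflects the sequencing uncertainty.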

□ Macarons: Uncovering complementary sets of variants for predicting quantitative phenotypes

Macarons, a fast and simple algorithm, to select a small, complementary subset of variants by avoiding redundant pairs that are likely to be in linkage disequilibrium.

Macarons features two simple, interpretable parameters to control the time/performance trade-off: the number of SNPs to be selected (k), and maximum intra-chromosomal distance (D, in base pairs) to reduce the search space for redundant SNPs.

□ Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network

The graph-regularized CNN models the expressions of a gene over spatial locations as an image of a gene activity map, and naturally utilizes the spatial localization information by performing convolution operation to capture the nearby tissue textures.

The model further exploits prior knowledge of gene relationships encoded in a PPI network, using the graph Laplacian of the network as a regularizer to enhance biological interpretation of the detected gene clusters.
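A graph Laplacian regularizer penalizes connected nodes whose representations differ. The sketch below computes the generic penalty, the sum over edges of w_ij * ||z_i - z_j||², which equals tr(Zᵀ L Z) for L = D - A when each undirected edge is listed once. Which network outputs the paper regularizes is not stated here, so `Z` is simply a hypothetical per-gene embedding.

```python
def laplacian_penalty(edges, Z):
    """Sum of w * ||Z[i] - Z[j]||^2 over weighted undirected edges (i, j, w)."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
        for i, j, w in edges
    )


ppi_edges = [(0, 1, 1.0), (1, 2, 0.5)]        # toy PPI network
Z = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # toy gene embeddings
print(laplacian_penalty(ppi_edges, Z))  # → 1.0
```

Genes 0 and 1 interact and share an embedding, so they add nothing; genes 1 and 2 interact but differ, so they contribute 0.5 × 2.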

□ deepMNN: Deep Learning-Based Single-Cell RNA Sequencing Data Batch Correction Using Mutual Nearest Neighbors

>> https://www.frontiersin.org/articles/10.3389/fgene.2021.708981/full

deepMNN identifies mutual nearest neighbor (MNN) pairs across different batches in a PCA subspace. A residual-based batch correction network was then constructed and employed to remove batch effects based on these MNN pairs.

The overall loss of deepMNN was designed as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was to make the output of the network similar to the input.
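The MNN search and the shape of the loss can be sketched without any deep learning framework. The brute-force neighbor search, the function names and the `reg_weight` parameter are illustrative assumptions; the residual network itself is omitted.

```python
def k_nearest(point, pool, k):
    """Indices of the k points in pool closest to point (squared Euclidean)."""
    def dist(q):
        return sum((a - b) ** 2 for a, b in zip(point, q))
    order = sorted(range(len(pool)), key=lambda i: dist(pool[i]))
    return set(order[:k])


def mnn_pairs(batch_a, batch_b, k=1):
    """Pairs (i, j) where a_i and b_j are in each other's k nearest neighbors."""
    a_to_b = [k_nearest(p, batch_b, k) for p in batch_a]
    b_to_a = [k_nearest(q, batch_a, k) for q in batch_b]
    return [
        (i, j)
        for i, nbrs in enumerate(a_to_b)
        for j in sorted(nbrs)
        if i in b_to_a[j]
    ]


def overall_loss(batch_loss, reg_loss, reg_weight):
    """Sum of the batch loss and the weighted regularization loss."""
    return batch_loss + reg_weight * reg_loss


# Toy cells in a 2-D "PCA subspace".
batch_a = [(0, 0), (10, 10)]
batch_b = [(0, 1), (10, 11), (50, 50)]
print(mnn_pairs(batch_a, batch_b, k=1))  # → [(0, 0), (1, 1)]
```

The cell at (50, 50) has a nearest neighbor in the other batch, but the relation is not mutual, so it forms no pair and does not drive the correction.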