lens, align.

Long is the time, but the true comes to pass.

In the beginning was the Word.

2023-03-13 03:13:13 | Science News

(Art by joeryba.eth)

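The problems we face fall into two kinds: "the limits of the self" and "the cage of the other". Woven through with the process that every subject "repeats", the two problems always stand back to back. A problem one has solved oneself goes on imprisoning others, and, as in a mirror, the reverse also holds. Beyond the cage lies another cage, and they cycle like nested boxes.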




□ Φ-SO: Deep symbolic regression for physics guided by units constraints: toward the automated discovery of physical laws

>> https://arxiv.org/abs/2303.03192

Φ-SO, a Physical Symbolic Optimization framework for recovering analytical symbolic expressions from physics data using deep reinforcement learning techniques by learning units constraints.

Φ-SO restricts the freedom of the equation generator, and balanced units are proposed by construction, thus greatly reducing the search space. It enables the algorithm to zero-out the probability of forbidden symbols that would result in expressions that violate units rules.
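The masking step can be illustrated with a minimal sketch. It assumes units are stored as exponent tuples over (length, time, mass) and that the only rule enforced is that both operands of an addition share units. The token names, the `mask_by_units` helper, and the probabilities are hypothetical and not taken from Φ-SO's code.

```python
# Units as exponent tuples over (length, time, mass); hypothetical tokens.
TOKEN_UNITS = {
    "x": (1, 0, 0),   # position: m
    "v": (1, -1, 0),  # velocity: m/s
    "t": (0, 1, 0),   # time: s
    "x0": (1, 0, 0),  # reference position: m
}


def mask_by_units(probs, required_units, token_units=TOKEN_UNITS):
    """Zero the probability of tokens whose units differ from required_units,
    then renormalize the remaining probabilities so they sum to 1."""
    masked = {
        tok: (p if token_units[tok] == required_units else 0.0)
        for tok, p in probs.items()
    }
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no token satisfies the units constraint")
    return {tok: p / total for tok, p in masked.items()}


# The generator has emitted "x +", so the right operand must be a length.
raw = {"x": 0.1, "v": 0.4, "t": 0.3, "x0": 0.2}
allowed = mask_by_units(raw, required_units=TOKEN_UNITS["x"])
```

After masking, only `x` and `x0` keep non-zero probability, so an expression such as `x + v` can never be sampled.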





□ scPheno: Extraction of biological signals by factorization enables the reliable analysis of single-cell transcriptomics

>> https://www.biorxiv.org/content/10.1101/2023.03.04.531126v1

scPheno, a deep auto-regressive factor model that is used to extract the biological signals embedded in the transcriptome, identify gene expression variations associated with each of the phenotypes, and re-build the accumulative effect of multiple phenotypes on cell states.

scPheno will factorize gene expression pertaining to a phenotypic factor and project cells onto a latent variable space, where the latent variable specifies a hidden cell state and cells of the same hidden states will cluster together.

The deep factor model will infer the factorized latent variable spaces. The factorization neural networks and the reconstruction neural network can be coupled to predict gene expression in relation to any factor combination.





□ INSnet: a method for detecting insertions based on deep learning network

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05216-0

INSnet divides the reference genome into continuous sub-regions and takes five features for each locus through alignments between long reads and the reference genome. Next, INSnet uses a depthwise separable convolutional network.

INSnet uses two attention mechanisms, the convolutional block attention module (CBAM) and efficient channel attention (ECA) to extract key alignment features in each sub-region. INSnet uses a gated recurrent unit (GRU) network to further extract more important SV signatures.





□ LEMUR: Analysis of multi-condition single-cell data with latent embedding multivariate regression

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531268v1

A new statistical model for differential expression analysis (or ANOVA) of multi-condition single-cell data that combines the ideas of linear models and principal component analysis (PCA).

Latent embedding multivariate regression (LEMUR) is based on a parametric mapping of latent space representations into each other and uses a design matrix to encode categorical and continuous covariates.





□ The Network Zoo: a multilingual package for the inference and analysis of gene regulatory networks

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02877-1

The Network Zoo, a platform that harmonizes the codebase for these methods, in line with recent similar efforts, and provides implementations in R, Python, MATLAB, and C. The netZoo codebase has helped develop an ecosystem of online resources for GRN inference and analysis.

netZoo integrates PANDA, LIONESS, and MONSTER to infer TF-gene targeting to explore how regulatory changes affect disease phenotype, and used DRAGON to integrate nine types of genomic information and find multi-omic markers that are associated with drug sensitivity.





□ RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05184-5

Regulatory Genomics Toolbox (RGT) was programmed in an object-oriented fashion and its core classes provide functionalities to handle typical regulatory genomics data: regions and signals.

RGT built distinct regulatory genomics tools, i.e., HINT for footprinting analysis, TDF for finding DNA–RNA triplex, THOR for ChIP-seq differential peak calling, motif analysis for TFBS matching and enrichment, and RGT-viz for regions association tests and data visualization.

THOR is a Hidden Markov Model-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. Triplex Domain Finder (TDF) characterizes the triplex-forming potential between RNA and DNA regions.





□ phytools 2.0: An updated R ecosystem for phylogenetic comparative methods (and other things)

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531791v1

The phytools library has now grown to be very large – consisting of hundreds of functions, a documentation manual that’s over 200 pages in length, and tens of thousands of lines of computer code.

For the Mk model-fitter (which here will be the phytools function fitMk), and for the other discrete character methods of the phytools R package, the input phenotypic trait data will typically take the form of a character or factor vector.





□ NextDenovo: An efficient error correction and accurate assembly tool for noisy long reads

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531669v1

NextDenovo, a highly efficient error correction and CTA-based assembly tool for noisy long reads. NextDenovo can rapidly correct reads; these corrected reads contain fewer errors than those from other comparable tools and are characterized by fewer chimeric alignments.

NextDenovo uses the BOG algorithm to remove edges for non-repeat nodes. The graph usually contained some linear paths connecting some complex subgraphs. All paths were broken at the node connecting with multi-paths, and contigs were outputted from these broken linear paths.





□ vcfdist: Accurately benchmarking phased small variant calls in human genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532078v1

vcfdist, an alignment-based small variant calling evaluator that standardizes query and truth VCF variants to a consistent representation, requires local phasing of both input VCFs, and gives partial credit to variant calls which are mostly (but not exactly) correct.

A novel variant clustering algorithm reduces downstream computation while discovering long-range variant dependencies. Novel alignment-distance-based metrics are independent of variant representation and measure the distance between the final diploid truth and query sequences.





□ scEvoNet: a gradient boosting-based method for prediction of cell state evolution

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05213-3

scEvoNet, a method that builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm overcoming different domain effects (different species/different datasets) and dropouts that are inherent for the scRNA-seq data.

scEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly related datasets.





□ NGenomeSyn: an easy-to-use and flexible tool for publication-ready visualization of syntenic relationships across multiple genomes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad121/7072460

NGenomeSyn, an easy-to-use and flexible tool, for publication-quality visualization of syntenic relationships (user-defined or generated by our custom script) and genomic features (e.g. repeats, structural variations, genes) on tens of genomes with high customization.

NGenomeSyn allows its user to adjust default options for genome and link styles defined in the configuration file and simply adjusts options of moving, scaling, and rotation of target genomes, yielding a rich layout and publication-ready figure.





□ containX: Coverage-preserving sparsification of overlap graphs for long-read assembly

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad124/7074174

ContainX heuristics are promising in terms of improving assembly quality by avoiding coverage gaps. The string graph model filters out contained reads during graph construction.

containX is a prototype implementation of an algorithm that decides which contained reads can be dropped during overlap graph sparsification. Reads which are substrings of longer reads are typically referred to as contained reads.
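The definition of a contained read can be sketched as follows. This is only an illustration under a strong assumption: reads are error-free, so containment can be tested with exact substring matching. The `split_contained` helper is hypothetical, and it only identifies contained reads; deciding which of them can be dropped without creating coverage gaps is the part containX addresses.

```python
def split_contained(reads):
    """Partition reads into (maximal, contained) lists.

    A read counts as contained when it is an exact substring of a different,
    strictly longer read. Real overlap graphs work from alignments with errors,
    so exact matching is a simplifying assumption.
    """
    maximal, contained = [], []
    for i, read in enumerate(reads):
        is_contained = any(
            i != j and len(other) > len(read) and read in other
            for j, other in enumerate(reads)
        )
        (contained if is_contained else maximal).append(read)
    return maximal, contained


reads = ["ACGTACGTTG", "GTACG", "CGTTGAAC", "TTGA"]
maximal, contained = split_contained(reads)
```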

Hifiasm retained fewer contained reads than ContainX but it failed to resolve a majority of coverage gaps. The unitig graph of Hifiasm has the least number of junction reads because it does additional graph pruning which is necessary for computing longer unitigs.





□ LoMA: Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes

>> https://pubmed.ncbi.nlm.nih.gov/36895025/

LoMA constructs a CS spanning a target region. This process is initiated by finding overlaps of raw reads using pairwise all-to-all alignment of minimap2, followed by a layout of overlapped reads. It divides the layout into multiple blocks to make partial consensus sequences.

LoMA captures haplotype structures based on SVs and produces haplotype-resolved CSs. LoMA predicts heterozygous loci in the region based on the extent of deviation from the binomial distribution, and the reads derived from each estimated haplotype are gathered.





□ HiFiCNV: Copy number variant caller and depth visualization utility for PacBio HiFi reads

>> https://www.pacb.com/blog/hificnv/

HiFiCNV can generate several CNV related track files which can be loaded into IGV for visualization and assessment of its variant calls. HiFiCNV detected all large CNVs from this dataset, and 90% of those calls had high overlap accuracy when compared to the reported CNV.

Segmentation is performed by a Viterbi parse of the depth bins assuming the bin depth represents a Poisson sampling from a mean depth based on haploid depth. The haploid depth is computed from the zero-excluded mean depth of this chromosome set.
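A Viterbi parse over Poisson-distributed depth bins can be sketched as below. This is a toy model rather than HiFiCNV's implementation: the copy-number states 0 to 4, the `stay` probability, the uniform switch probabilities, and the small mean used for copy number 0 are all assumptions made for illustration.

```python
import math


def poisson_logpmf(k, mu):
    """Log of the Poisson probability of observing count k with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def viterbi_copy_number(depths, haploid_depth, max_cn=4, stay=0.99):
    """Most likely copy-number state per bin under a Poisson-emission HMM."""
    states = list(range(max_cn + 1))
    # Copy number 0 gets a small positive mean so stray reads stay possible.
    means = [max(cn, 0.01) * haploid_depth for cn in states]
    log_stay = math.log(stay)
    log_switch = math.log((1.0 - stay) / max_cn)

    # Uniform prior over states for the first bin.
    score = [poisson_logpmf(depths[0], m) - math.log(len(states)) for m in means]
    backptrs = []
    for depth in depths[1:]:
        new_score, ptr = [], []
        for s in states:
            best_prev = max(
                states,
                key=lambda p: score[p] + (log_stay if p == s else log_switch),
            )
            trans = log_stay if best_prev == s else log_switch
            emit = poisson_logpmf(depth, means[s])
            new_score.append(score[best_prev] + trans + emit)
            ptr.append(best_prev)
        score = new_score
        backptrs.append(ptr)

    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


depths = [30, 31, 29, 30, 15, 14, 16, 15, 30, 29, 31, 30]
calls = viterbi_copy_number(depths, haploid_depth=15)
```

With a haploid depth of 15, the four bins near depth 15 are parsed as a single-copy segment flanked by two-copy segments. The high `stay` probability keeps isolated noisy bins from being called as separate events.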





□ ReCo: automated NGS read-counting of single and combinatorial CRISPR gRNAs.

>> https://www.biorxiv.org/content/10.1101/2023.03.09.530923v1

ReCo! finds gRNA read counts (ReCo) in fastq files and runs as a standalone script or a python package. It can be used for single and combinatorial CRISPR-Cas libraries that have been sequenced with single-end or paired-end sequencing strategies.

ReCo works with conventionally cloned CRISPR-Cas libraries and 3Cs/3Cs-MPX libraries. ReCo can process multiple samples in a single run. It automatically determines the constant regions flanking the gRNAs, and utilizes Cutadapt to trim the fastq files.





□ StonPy: a tool to parse and query collections of SBGN maps in a graph database

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad100/7075543

The StonPy library allows users to store SBGN-ML maps into a running Neo4j database, and conversely retrieve them into SBGN-ML. StonPy includes a completion module that allows users to build valid SBGN maps from query results representing parts of maps automatically.

SBGN arcs are optionally modelled using additional Neo4j relationships that mimic the structure of the SBGN map. StonPy brings new capabilities for storing and analyzing large collections of CellDesigner and SBGN maps using Neo4j and Cypher.





□ SLEMM: million-scale genomic predictions with window-based SNP weighting

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad127/7075542

SLEMM (Stochastic-Lanczos-Expedited Mixed Models) uses the Stochastic Lanczos REML and SNP effects for large datasets. SLEMM is fast enough for million-scale genomic predictions.

SLEMM with SNP weighting had overall the best predictive ability among a variety of genomic prediction methods including GCTA’s empirical BLUP, BayesR, KAML, and LDAK’s BOLT and BayesR models.





□ scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531861v1

scDeepInsight can directly annotate the query dataset based on the model trained on the reference dataset. scDeepInsight does preprocessing of scRNA-seq data, including quality control and integration through batch normalization.

scDeepInsight is a single-cell labeling model based on supervised learning, so a reference dataset is also required. DeepInsight is utilized to convert the processed non-image data into images.





□ A general minimal perfect hash function for canonical k-mers on arbitrary alphabets with an application to DNA sequences

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531845v1

A minimal perfect hash function of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation.

The encoding is based on the observation that there are fewer canonical k-mers than there are k-mers in general. A mapping is only required if k-mer x is canonical, i.e., x is lexicographically smaller than or equal to x^−1.
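The canonicality test can be made concrete with a brute-force sketch. It enumerates canonical k-mers for small k and is not the paper's hash function; the helper names are hypothetical, and the inverse x^−1 is taken to be either plain reversal or reverse complementation.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse(kmer):
    return kmer[::-1]


def reverse_complement(kmer):
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def is_canonical(kmer, inverse=reverse):
    """A k-mer x is canonical if x <= inverse(x) in lexicographic order."""
    return kmer <= inverse(kmer)


def canonical_kmers(k, alphabet="ACGT", inverse=reverse):
    """Enumerate all canonical k-mers by brute force (small k only)."""
    return [
        "".join(chars)
        for chars in product(alphabet, repeat=k)
        if is_canonical("".join(chars), inverse)
    ]


rc_3mers = canonical_kmers(3, inverse=reverse_complement)
rev_3mers = canonical_kmers(3, inverse=reverse)
```

For odd k under reverse complementation no DNA k-mer equals its own inverse, so exactly half of the 4^k k-mers are canonical (32 of 64 for k = 3). Under plain reversal, palindromes are their own inverse, which is why `rev_3mers` is longer (40 entries).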





□ scBubbletree: quantitative visualization of single cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531263v1

scBubbletree, a new scalable method for visualization of scRNA-seq data. The method identifies clusters of cells of similar transcriptomes and visualizes such clusters as “bubbles” at the tips of dendrograms, corresponding to quantitative summaries of cluster properties.

scBubbletree stacks bubble trees with further cluster-associated information. scBubbletree relies on the gap statistic method. scBubbletree can cluster scRNA-seq data in two ways, namely by graph-based community detection (GCD) algorithms such as Louvain or Leiden, and by k-means.





□ Panpipes: a pipeline for multiomic single-cell data analysis.

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532085v1

Panpipes, a set of workflows designed to automate the analysis of multimodal single-cell datasets by incorporating widely used Python-based tools to efficiently perform QC, preprocessing, integration, clustering, and reference mapping at scale in the multiomic setting.

Panpipes generates a cluster matching metric, the Adjusted Rand Index, for global concordance evaluation. Panpipes can aid building unimodal or multimodal references and enables the user to query multiple references simultaneously using scArches.
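For reference, the Adjusted Rand Index can be computed from the contingency counts of two labelings. The sketch below is a plain-Python version of the standard formula, not Panpipes code, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two flat clusterings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_pairs - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to how clusters are named, so `same` is 1.0 even though the labels are permuted, while `crossed` is negative because the two clusterings agree less than expected by chance.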





□ plasma: Partial LeAst Squares for Multiomics Analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.10.532096v1

plasma, a novel two-step algorithm to find models that can predict time-to-event outcomes on samples from multiomics data sets even in the presence of incomplete data. These components will be automatically associated with the outcome.

plasma uses partial least squares (PLS) for both steps, using Cox regression to learn the single omics models and linear regression. The plasma components are learned in a way that maximizes the covariance in the predictors and the response.





□ eOmics: an R package for improved omics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.11.532240v1

eOmics combines an ensemble framework with limma, improving its performance on imbalanced data. It couples a mediation model with WGCNA, so the causal relationship among WGCNA modules, module features, and phenotypes can be found.

eOmics has some novel functional enrichment methods, capturing the influence of topological structure on gene set functions. It contains multi-omics clustering and classification functions to facilitate ML tasks. Some basic functions, such as ANOVA analysis, are also available.





□ Biomappings: Prediction and Curation of Missing Biomedical Identifier Mappings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad130/7077133

Biomappings, a framework for semi-automatically creating and maintaining mappings in a public, version-controlled repository.

Biomappings combines multiple contributions: (i) a "curation cycle" workflow for creating mappings, (ii) an extensible pipeline for automatically predicting missing mappings between resources, and automatically detecting inconsistencies.

Biomappings currently makes available 9,274 curated mappings and 40,691 predicted ones, providing previously missing mappings between widely used identifier resources covering small molecules, cell lines, diseases, and other concepts.





□ fraguracy: overlapping bases in read-pairs from a fragment indicate accuracy.

>> https://github.com/brentp/fraguracy

Many factors can be predictive of the likelihood of an error. The dimensionality is a consideration because if the data is too sparse, prediction is less reliable. For each combination, while iterating over the bam, it stores the number of errors and the number of total bases in each bin.

fraguracy calculates real error rates using overlapping paired-end reads in a fragment. This avoids some bias. It is, however, limited to the (potentially) small percentage of bases that overlap, and it will sample less at the beginning of read 1 and the end of read 2.





□ Genes2Genes: Gene-level alignment of single cell trajectories informs the progression of in vitro T cell differentiation

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531713v1

Genes2Genes overcomes current limitations and is able to capture sequential matches and mismatches between a reference and a query at single gene resolution, highlighting distinct clusters of genes with varying patterns of gene expression dynamics.

Genes2Genes utilizes a Bayesian information-theoretic Dynamic Programming alignment algorithm that accounts for matches, warps and indels by combining the classical Gotoh’s biological sequence alignment algorithm and Dynamic Time Warping.





□ GenoPipe: identifying the genotype of origin within (epi)genomic datasets

>> https://www.biorxiv.org/content/10.1101/2023.03.14.532660v1

The three core modules of GenoPipe: EpitopeID, DeletionID, and StrainID were developed to identify major genotypical determinants of cellular identity. GenoPipe can detect genotype perturbations at realistic and practical sequencing depths as defined by ENCODE.

The DeletionID module models the background of a genomic experiment to identify depleted regions of the genome to predict genomic deletions. The StrainID module uses existing SNP or variant call databases of common cell lines to match a cell’s genetic identity inherent to each dataset.

The EpitopeID module identifies the presence and approximate location of specific DNA sequences within the genome. The algorithm functions by first aligning the raw sequencing data (i.e., FASTQ) against a curated DNA sequence database (tagDB) of common protein epitopes.





□ BioConvert: a comprehensive format converter for life sciences

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532455v1

BioConvert aggregates existing software within a single framework and complements it with original code when needed. It provides a common interface that streamlines the user experience, so users do not have to learn tens of separate tools.

BioConvert supports about 50 formats and 100 direct conversions in areas such as alignment, sequencing, phylogeny, and variant calling. BioConvert can also be utilized by developers as a universal benchmarking framework for evaluating and comparing numerous conversion tools.





□ Fast Approximate IsoRank for Scalable Global Alignment of Biological Networks

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532445v1

A new IsoRank approximation, which exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks.

A computationally cheaper refinement is proposed to this initial approximation so that the updated result is even closer to the original IsoRank formulation.

In synthetic experiments, they create random graphs using the Erdős-Rényi and Barabási-Albert models, and ask IsoRank to recover the graph isomorphism between the graphs and a random node permutation.





□ IntLIM 2.0: identifying multi-omic relationships dependent on discrete or continuous phenotypic measurements

>> https://academic.oup.com/bioinformaticsadvances/article-abstract/3/1/vbad009/7022005

IntLIM 2.0 uncovers phenotype-dependent linear associations between two types of analytes. IntLIM 2.0 extends IntLIM 1.0 to support generalized analyte measurement data types, continuous phenotypic measurement, covariate correction, model validation and unit testing.

IntLIM 2.0 supports model validation using cross-validation and random permutation models.





□ NanoSquiggleVar: A method for direct analysis of targeted variants based on nanopore sequencing signals

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532860v1

NanoSquiggleVar can directly identify targeted variants from the nanopore sequencing electrical signal without the requirement of base calling, sequence alignment, or variant detection with downstream analysis.

In each sequencing iteration, the signal is sliced into fragments by a moving window of 1-unit step size. Dynamic time warping is used to compare the signal squiggles to the detected variants. NanoSquiggleVar can only determine the existence of a mutation and not its frequency.





□ HiDecon: Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532820v1

HiDecon, a penalized approach with constraints from both “parent” and “children” cell types to make full use of a hierarchical tree structure. The hierarchical tree is readily available from well-studied cell lineages or can be learned from hierarchical clustering of scRNA-seq.

The basic intuition of HiDecon is that there exists a summation relationship between the estimation results of adjacent layers. HiDecon implements the sum constraint penalties from the upper and lower layers to aggregate estimates across layers for more accurate cellular fractions.






□ Implementing Dynamic Time Warping (DTW) with Neural Networks and analyzing single-cell RNA data involves creating a custom model architecture with GPT-4.




Yubais RT

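In the old view of AI, the image was that you would first build something like "intelligence itself" inside the computer and then separately build an interface for humans to talk with it. Looking at the current state of things, though, I wonder whether something intelligence-like was contained all along in language, which was supposed to be the interface.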



Oblivion.

2023-03-13 03:12:03 | Science News




□ inClust+: the multimodal version of inClust for multimodal data integration, imputation, and cross modal generation

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532376v1

inClust+ extends the inClust by adding two new modules, namely, the input-mask module in front of encoder and the output-mask module behind decoder. It could integrate multimodal data profiled from different cells in similar populations or from a single cell.

inClust+ encodes the scRNA and MERFISH data into latent space respectively. After covariates (modalities) are removed by vector subtraction, the samples from different modalities are mixed together and clustered according to their cell types.





□ RNA-MSM: Multiple sequence-alignment-based RNA language model and its application to structural inference

>> https://www.biorxiv.org/content/10.1101/2023.03.15.532863v1

While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved.

The RNA MSA-transformer language model (RNA-MSM) takes the multiple aligned sequences as an input, and outputs corresponding embeddings and attention maps. These embeddings and attention maps can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities.






□ Quantum computing algorithms: getting closer to critical problems in computational biology

>> https://academic.oup.com/bib/article/23/6/bbac437/6758194

QiBAM basically extends Grover’s search algorithm to allow for errors in the alignment between reads and the reference sequence stored in a quantum memory. The qubit complexity is equal to O(M · log₂ A + log₂ N − M).

The longest diagonal patterns in the matrix, possibly not perfectly shaped owing to mismatches and short insertions/deletions, highlight the regions of highest similarity and can be detected with quantum pattern recognition. The overall time complexity of the method is O(log₂(NM)).

Quantum solutions for the de novo assembly problems are based on strategies for efficiently solving the Hamiltonian path in OLC graphs.

The iterative application of the time evolution operators relative to the cost and mixing Hamiltonian approximates the adiabatic transition between the ground state of the mixing Hamiltonian and the ground state of the cost Hamiltonian that represents the optimal solution.





□ On quantum computing and geometry optimization

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532929v1

This work attempts to explore a few ways in which classical data, relating to the Cartesian space representation of biomolecules, can be encoded for interaction with empirical quantum circuits not demonstrating quantum advantage.

Using the quantum circuit for random state generation in a variational arrangement together with a classical optimizer, this work deals with the optimization of spatial geometries with potential application to molecular assemblies.

Dihedral data is used with a quantum support vector classifier to introduce machine learning capabilities. Additionally, empirical rotamer sampling is demonstrated using quantum Monte Carlo simulations for side-chain conformation sampling.





□ DTWax: GPU-accelerated Dynamic Time Warping for Selective Nanopore Sequencing

>> https://www.biorxiv.org/content/10.1101/2023.03.05.531225v1

Subsequence Dynamic Time Warping (sDTW) is a two-dimensional dynamic programming algorithm tasked with finding the best map of the whole of the input query squiggle in the longer target reference.

DTWax, a GPU-accelerated sDTW software for nanopore Read Until to save time and cost of nanopore sequencing and compute. DTWax uses floating-point operations and Fused-Multiply-Add operations. DTWax achieves ∼1.92X sequencing speedup and ∼3.64X compute speedup.
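The sDTW recurrence itself is short. The following is a plain-Python CPU sketch, not DTWax's GPU kernel; it assumes absolute difference as the per-sample cost, and the function name is hypothetical. The whole query must be matched, while the alignment may start and end anywhere in the reference.

```python
def subsequence_dtw(query, reference):
    """Align the whole query to the best-matching stretch of the reference.

    Returns (cost, end), where end is the 0-based reference index at which
    the best alignment finishes. Cost is the summed absolute difference.
    """
    inf = float("inf")
    # prev[j]: best cost of aligning the query so far, ending at reference[j].
    # The first query sample may start anywhere in the reference at no penalty.
    prev = [abs(query[0] - r) for r in reference]
    for q in query[1:]:
        curr = [inf] * len(reference)
        curr[0] = prev[0] + abs(q - reference[0])
        for j in range(1, len(reference)):
            best = min(prev[j], prev[j - 1], curr[j - 1])
            curr[j] = best + abs(q - reference[j])
        prev = curr
    end = min(range(len(reference)), key=prev.__getitem__)
    return prev[end], end


cost, end = subsequence_dtw([1, 2, 3], [5, 5, 1, 2, 3, 5])
```

Only two rows of the dynamic programming matrix are kept, so memory grows with the reference length rather than with the product of both lengths.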





□ Quantum algorithm for position weight matrix matching

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531403v1

The PWM matching is applied to a long genome DNA sequence of millions of bases such that every segment i in the DNA sequence is assigned a score W_M(u_i … u_{i+m−1}), and they search for P_sol, the segments with scores higher than the threshold w_th.

The PWM matching quantum algorithm is based on the naive iteration method. For any sequence with length n and any K PWMs for sequence motifs with length m, given the oracles to get the specified entry, it can find n matches with high probability by making queries to the oracles.
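As a point of comparison, the classical version of this search is a sliding-window scan. The sketch below is that classical baseline, not the quantum algorithm; the `pwm_matches` helper and the toy weights are hypothetical.

```python
def pwm_matches(sequence, pwm, threshold):
    """Return (start, score) for every length-m segment scoring above threshold.

    pwm is a list of m dicts, one per motif position, mapping a base to its
    weight; the segment score is the sum of the per-position weights.
    """
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):
        score = sum(pwm[j][sequence[i + j]] for j in range(m))
        if score > threshold:
            hits.append((i, score))
    return hits


# Hypothetical 3-position PWM that favours the motif "TAT".
toy_pwm = [
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
]
hits = pwm_matches("GGTATCCTATA", toy_pwm, threshold=5)
```

This scan evaluates every one of the n − m + 1 segments, which is the per-PWM cost that the quantum approach aims to reduce in terms of oracle queries.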





□ scMCs: a framework for single cell multi-omics data integration and multiple clusterings

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad133/7079796

scMCs uses the omics-independent deep autoencoders to learn the low-dimensional representation of each omics. scMCs utilizes the contrastive learning strategy, and fuses the individuality and commonality features into a compact co-embedding representation for data imputation.

scMCs applies multi-head attention mechanism on the co-embedding representation to generate multiple salient subspaces, and reduce the redundancy between subspaces. scMCs optimizes a Kullback Leibler (KL) divergence based clustering loss in each salient subspace.





□ CLASSIC: Ultra-high throughput mapping of genetic design space

>> https://www.biorxiv.org/content/10.1101/2023.03.16.532704v1

CLASSIC (combining long- and short- range sequencing to investigate genetic complexity), a generalizable genetic screening platform that combines long- and short-read NGS modalities to quantitatively assess pooled libraries of DNA constructs of arbitrary length.

Due to the random assignment of barcodes to assembled constructs, each variant in a CLASSIC library is associated with multiple unique barcodes that generate independent phenotypic measurements, leading to greater accuracy than a one-to-one construct-to-barcode library.





□ EnsembleTR: A deep population reference panel of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.03.09.531600v1

EnsembleTR, which takes TR genotypes output by existing tools (currently ExpansionHunter, adVNTR, HipSTR, and GangSTR) as input, and outputs a consensus TR callset by converting TR genotypes to a consistent internal representation and using a voting-based scheme.

They apply EnsembleTR to genotype 1.7 million TRs based on the hg38 reference genome across deep PCR-free WGS for 3,202 individuals from the 1000GP2 and PCR+ WGS data for 348 individuals from H3Africa Project.

EnsembleTR then identifies overlapping TR regions genotyped by two or more tools, infers a mapping between alternate allele sets reported by each method, and outputs a consensus genotype and quality score for each call.





□ Direct Estimation of Parameters in ODE Models Using WENDy: Weak-form Estimation of Nonlinear Dynamics

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10002818/

WENDy is a highly robust and efficient method for parameter inference in differential equations. Without relying on any numerical differential equation solvers, WENDy computes accurate estimates and is robust to large (biologically relevant) levels of measurement noise.

WENDy is competitive with conventional forward solver-based nonlinear least squares methods in terms of speed and accuracy. For both higher dimensional systems and stiff systems, WENDy is typically both faster and more accurate than forward solver-based approaches.





□ miloDE: Sensitive cluster-free differential expression testing.

>> https://www.biorxiv.org/content/10.1101/2023.03.08.531744v1

miloDE exploits the notion of overlapping neighborhoods of homogeneous cells, constructed from graph-representation of scRNA-seq data, and performs testing within each neighborhood. Multiple testing correction is performed either across neighborhoods or across genes.

As input, the algorithm takes a set of samples with given labels (case or control) alongside a joint latent embedding. Next, miloDE generates a graph recapitulating the distances between cells and defines neighbourhoods using the 2nd-order kNN graph.





□ GPMeta: a GPU-accelerated method for ultrarapid pathogen identification from metagenomic sequences

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad092/7077155

GPMeta can rapidly and accurately remove host contamination, isolate microbial reads, and identify potential disease-causing pathogens. GPMeta is much faster than existing CPU-based tools, being 5-40x faster than Kraken2 and Centrifuge and 25-68x faster than Bwa and Bowtie2.

GPMeta offers GPMetaC clustering algorithm, a statistical model for clustering and rescoring ambiguous alignments to improve the discrimination of highly homologous sequences.





□ SpaSRL: Spatially aware self-representation learning for tissue structure characterization and spatial functional genes identification

>> https://www.biorxiv.org/content/10.1101/2023.03.13.532390v1

Spatially aware self-representation learning (SpaSRL), a novel method that achieves spatial domain detection and dimension reduction in a unified framework while flexibly incorporating spatial information.

SpaSRL enhances and decodes the shared expression between spots to simultaneously optimize the low-dimensional spatial components (i.e., spatial meta genes) and spot-spot relations through a joint learning model in which the two tasks transfer spatial information constraints to each other.

SpaSRL can improve the performance of each task and fill the gap between the identification of spatial domains and functional (meta) genes accounting for biological and spatial coherence on tissue.





□ compare_genomes: a comparative genomics workflow to streamline the analysis of evolutionary divergence across genomes

>> https://www.biorxiv.org/content/10.1101/2023.03.16.533049v1

compare_genomes, a transferable and extendible comparative genomics workflow built using the Nextflow framework and Conda package management system.

compare_genomes provides a wieldy pipeline to test for non-random evolutionary patterns which can be mapped to evolutionary processes to help identify the molecular basis of specific features or remarkable biological properties of the species analysed.





□ LBConA: a medical entity disambiguation model based on Bio-LinkBERT and context-aware mechanism

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05209-z

LBConA first uses Bio-LinkBERT, which is capable of learning cross-document dependencies, to obtain embedding representations of mentions and candidate entities. Then, cross-attention is used to capture the interaction information of mention-to-entity and entity-to-mention.

LBConA also encodes the context of mentions using ELMo, which captures lexical information, and computes the context score using a self-attention mechanism to obtain contextual cues for disambiguation.





□ nPoRe: n-polymer realigner for improved pileup-based variant calling

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05193-4

nPoRe defines copy number INDELs as n-polymers (3+ exact copies of the same repeat unit) whose number of copies differs from the expected reference. For example, AAAA→AAAAA and ATATAT→ATAT meet this definition, but ATAT→ATATAT, AATAATAAAT→AATAAT, and ATATAT→ATATA do not.
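
The definition above can be checked mechanically. The sketch below is a toy string-level test, not nPoRe's code: it assumes the reference and alternate alleles are given as bare repeat tracts without flanking sequence, and `is_npolymer_copy_indel` is a hypothetical name.

```python
def is_npolymer_copy_indel(ref: str, alt: str, min_copies: int = 3) -> bool:
    """True if ref is >= min_copies exact copies of one repeat unit and alt is
    a different whole number of copies of that same unit."""
    if ref == alt:
        return False
    # Try every unit length that could fit min_copies times into ref.
    for unit_len in range(1, len(ref) // min_copies + 1):
        if len(ref) % unit_len or len(alt) % unit_len:
            continue  # one of the alleles is not a whole number of units
        unit = ref[:unit_len]
        if unit * (len(ref) // unit_len) != ref:
            continue  # ref is not an exact tandem repeat of this unit
        if unit * (len(alt) // unit_len) == alt:
            return True
    return False


# The five examples from the text, in the same order.
examples = [
    ("AAAA", "AAAAA"),
    ("ATATAT", "ATAT"),
    ("ATAT", "ATATAT"),
    ("AATAATAAAT", "AATAAT"),
    ("ATATAT", "ATATA"),
]
results = [is_npolymer_copy_indel(ref, alt) for ref, alt in examples]
```

The three negative cases fail for different reasons: the reference has only two copies, the reference copies are not exact, and the alternate allele ends in a partial copy.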

nPoRe’s algorithm is directly designed to reduce alignment penalties for n-polymer copy number INDELs and improve alignment in low-complexity regions. It extends Needleman-Wunsch affine gap alignment with new gap penalties to align repeated n-polymer sequences more accurately.





□ PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533059v1

PhyloSophos, a high-throughput scientific name processor designed to provide connections between scientific name inputs and a specific taxonomic system. PhyloSophos is conceptually a mapper that returns the corresponding taxon identifier from a reference of choice.

PhyloSophos can refer to multiple available references to search for synonyms and recursively map them into a chosen reference. It also corrects common Latin variants and vernacular names, and then returns the proper scientific names and their corresponding taxon identifiers.
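
The recursive synonym mapping described above can be sketched as follows. This is a minimal illustration with invented toy data, not PhyloSophos itself: the identifier `REF:0001`, the synonym links, and the function name `map_name` are all assumptions made for the example.

```python
from typing import Dict, Optional, Set

# Toy data for illustration only: the identifier is made up, and the synonym
# links are arranged to form a two-step chain.
CHOSEN_REFERENCE: Dict[str, str] = {"Felis catus": "REF:0001"}
OTHER_REFERENCES: Dict[str, str] = {
    # name -> name that some other reference lists as its synonym
    "Felis silvestris catus": "Felis catus",
    "Felis domesticus": "Felis silvestris catus",
}


def map_name(name: str, seen: Optional[Set[str]] = None) -> Optional[str]:
    """Return the chosen reference's identifier for name, following synonyms."""
    seen = set() if seen is None else seen
    if name in CHOSEN_REFERENCE:
        return CHOSEN_REFERENCE[name]
    if name in seen or name not in OTHER_REFERENCES:
        return None  # synonym cycle, or the name is unknown everywhere
    seen.add(name)
    return map_name(OTHER_REFERENCES[name], seen)
```

Here `map_name("Felis domesticus")` follows two synonym links before it reaches a name that the chosen reference knows, and the `seen` set guards against circular synonym chains.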





Singular Genomics RT

>> https://singulargenomics.com/g4/reagents/

We’ve designed a selection of kits for the G4 with multiple configurations depending on read length and size requirements for maximum system flexibility and cost efficiency.

Explore the capabilities of the F2, F3, and Max Read Kits for your application





□ Robust classification using average correlations as features (ACF)

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05224-0

In contrast to the KNN classifier, ACF intrinsically considers all cross-correlations between classes, without limiting itself to certain elements of CTrain. DBC incorporates cross-correlations but relies on a fixed claiming-scheme and weighted Kullback–Leibler decision rules.

For ACF, the baseline classifier may instead be chosen depending on the data and can be further adapted, e.g. by increasing the depth of decision trees. The modularity of ACF allows deep-learning-based methods, such as a Multi-Layer Perceptron, to be integrated as the baseline classifier.





□ aenmd: Annotating escape from nonsense-mediated decay for transcripts with protein-truncating variants

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533185v1

aenmd predicts escape from NMD for combinations of transcripts and PTC-generating variants by applying a set of NMD-escape rules, which are based on where the PTC is situated within the mutant transcript.

Variant-transcript pairs with a PTC conforming to any of these rules will be annotated to escape NMD, but results for all rules are reported individually by aenmd; this allows users to focus on subsets of rules.





□ seqspec: A machine-readable specification for genomics assays

>> https://www.biorxiv.org/content/10.1101/2023.03.17.533215v1

seqspec, a machine-readable specification for libraries produced by genomics assays that facilitates standardization of preprocessing and enables tracking and comparison of genomics assays.

seqspec defines a machine-readable file format, based on YAML. Reads are annotated by Regions which can be nested and appended to create a seqspec. Regions are annotated with a variety of properties that simplify the downstream identification of sequenced elements.
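
The idea of nested Regions can be illustrated with a small data structure. The sketch below is not the seqspec schema (which is a YAML format): the field names, the fixed lengths, and the helper `leaf_offsets` are hypothetical and only show how nesting lets the positions of sequenced elements be derived.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Region:
    """A hypothetical, simplified region; field names are illustrative."""
    region_id: str
    length: int = 0  # used only for leaf regions
    regions: List["Region"] = field(default_factory=list)  # nested children


def leaf_offsets(region: Region, start: int = 0) -> List[Tuple[str, int, int]]:
    """Return (region_id, start, end) for every leaf, in read order."""
    if not region.regions:
        return [(region.region_id, start, start + region.length)]
    out: List[Tuple[str, int, int]] = []
    for child in region.regions:
        spans = leaf_offsets(child, start)
        out.extend(spans)
        start = spans[-1][2]  # the next child begins where this one ended
    return out


# An illustrative library layout: barcode, then UMI, then cDNA.
library = Region("library", regions=[
    Region("technical", regions=[Region("barcode", 16), Region("umi", 12)]),
    Region("cdna", 90),
])
offsets = leaf_offsets(library)
```

Once the layout is declared this way, a preprocessing tool can look up where each element sits instead of hard-coding positions per assay.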





□ C.Origami: Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening

>> https://www.nature.com/articles/s41587-022-01612-8

C.Origami, a multimodal deep neural network that performs de novo prediction of cell-type-specific chromatin organization using DNA sequence and two cell-type-specific genomic features—CTCF binding and chromatin accessibility.

C.Origami enables in silico experiments to examine the impact of genetic changes on chromatin interactions. The accuracy of C.Origami allows systematic identification of cell-type-specific mechanisms of genomic folding through in silico genetic screening (ISGS).





□ Seqpac: A framework for sRNA-seq analysis in R using sequence-based counts

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad144/7082956

Seqpac is designed to preserve sequence integrity by avoiding a feature-based alignment strategy that normally disregards sequences that fail to align to a target genome.

Using an innovative targeting system, Seqpac processes, analyzes and visualizes sample or sequence group differences using the PAC object. Seqpac uses a strategy for sRNA-seq analysis that preserves the integrity of the raw sequence, making the data lineage fully traceable.





□ The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad139/7082519

When performing power estimation or replication sample size calculation for a continuous trait through linear regression, covariate effects are implicitly accounted for through residual variance.

When analyzing a binary trait through logistic regression, covariate effects must be explicitly specified and included in power and sample size computation, in addition to the genetic effect of interest.

SPCompute is used for accurate and efficient power and sample size computation for a binary trait that takes into account different types of non-genetic covariates E, and allows for different types of G-E relationship.





□ OutSingle: A Novel Method of Detecting and Injecting Outliers in RNA-seq Count Data Using the Optimal Hard Threshold for Singular Values

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad142/7083276

OutSingle (Outlier detection using Singular Value Decomposition), an almost instantaneous way of detecting outliers in RNA-Seq gene expression (GE) data. It uses a simple log-normal approach for count modeling.

OutSingle uses the Optimal Hard Threshold (OHT) method for noise detection, which itself is based on Singular Value Decomposition (SVD). Due to its SVD/OHT utilization, OutSingle’s model is straightforward to understand and interpret.
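
The thresholding step can be sketched as follows. This is not OutSingle's code: it assumes the singular values have already been computed elsewhere, and it uses the median-based variant of the optimal hard threshold of Gavish and Donoho for an unknown noise level, with their published polynomial approximation of the coefficient. Whether OutSingle uses this exact variant is an assumption; `omega` and `oht_rank` are hypothetical names.

```python
import statistics
from typing import List


def omega(beta: float) -> float:
    """Approximate OHT coefficient for an unknown noise level."""
    return 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43


def oht_rank(singular_values: List[float], n_rows: int, n_cols: int) -> int:
    """Number of singular values kept by the hard threshold omega * median."""
    beta = min(n_rows, n_cols) / max(n_rows, n_cols)  # aspect ratio in (0, 1]
    tau = omega(beta) * statistics.median(singular_values)
    return sum(1 for s in singular_values if s > tau)


# Synthetic spectrum for a 100 x 200 matrix: three large "signal" values
# followed by a slowly rising bulk standing in for noise.
bulk = [1 + 0.2 * i for i in range(97)]
spectrum = [700.0, 650.0, 600.0] + bulk
kept = oht_rank(spectrum, n_rows=100, n_cols=200)
```

Singular values above the threshold are treated as signal and the rest as noise, so the result depends only on the matrix shape and the median singular value, which is part of what makes the approach easy to interpret.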





□ ReConPlot – an R package for the visualization and interpretation of genomic rearrangements

>> https://www.biorxiv.org/content/10.1101/2023.02.24.529890v2

ReConPlot (REarrangement and COpy Number PLOT), an R package that provides functionalities for the joint visualization of SCNAs and SVs across one or multiple chromosomes.

ReConPlot is based on the popular ggplot2 package, thus allowing customization of plots and the generation of publication-quality figures with minimal effort. ReConPlot facilitates the exploration, interpretation, and reporting of complex genome rearrangement patterns.





□ MetaLLM: Residue-wise Metal ion Prediction Using Deep Transformer Model

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533488v1

MetaLLM, a metal binding site prediction technique that leverages recent progress in self-supervised attention-based (e.g. Transformer) large language models (LLMs) and a considerable number of protein sequences.

MetaLLM uses a transformer pre-trained on an extensive database of protein sequences and later fine-tuned on metal-binding proteins for multi-label metal ions prediction. A 10-fold cross-validation shows more than 90% precision for the most prevalent metal ions.





□ escheR: Unified multi-dimensional visualizations with Gestalt principles

>> https://www.biorxiv.org/content/10.1101/2023.03.18.533302v1

Existing visualization methods create cognitive gaps on how to associate the disparate information or how to interpret the biological findings of this multi-dimensional information regarding their (micro-)environment or colocalization.

escheR leverages Gestalt principles to improve the design and interpretability of multi-dimensional data in 2D data visualizations, layering aesthetics to display multiple variables.





□ RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci

>> https://www.biorxiv.org/content/10.1101/2023.03.22.533484v1

RExPRT is designed to distinguish pathogenic from benign TR expansions. It is an ensemble approach comprising an SVM and an extreme gradient boosted decision tree (XGB), evaluated with leave-one-out cross-validation.

RExPRT uses GridSearchCV to fine-tune the SVM and XGB models. RExPRT incorporates information on the genetic architecture of a TR locus, such as its proximity to regulatory regions, TAD boundaries, and evolutionary constraints.





□ Cue: a deep-learning framework for structural variant discovery and genotyping

>> https://www.nature.com/articles/s41592-023-01799-x

Cue, a novel generalizable framework for SV calling and genotyping, which can effectively leverage deep learning to automatically discover the underlying salient features of different SV types and sizes.

Cue calls and genotypes SVs by learning complex SV abstractions directly from the data. Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image.





□ FLONE: fully Lorentz network embedding for inferring novel drug targets

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533432v1

FLONE, a novel hyperbolic Lorentz space embedding-based method to capture the hierarchical structural information in the DDT network. FLONE generates more accurate candidate target predictions given the drug and disease than the Euclidean translation-based counterparts.

FLONE enables a hyperbolic similarity calculation based on FuLLiT (fully Lorentz linear transformation), which essentially calculates the Lorentzian distance (i.e., similarity) between the hyperbolic embeddings of candidate targets and the hyperbolic representation.
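
For readers unfamiliar with the Lorentz model, the basic quantities behind such a similarity can be sketched as follows. This is a generic illustration on the unit hyperboloid (curvature -1), not FLONE's FuLLiT transformation; which distance variant FLONE uses is not stated here, so both the geodesic distance and the squared Lorentzian distance are shown, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def lift(v: Sequence[float]) -> List[float]:
    """Place a Euclidean vector on the unit hyperboloid (curvature -1)."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0, *v]


def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def hyperbolic_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    # Clamp to 1.0 so rounding error cannot push acosh out of its domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


def squared_lorentz_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """<x - y, x - y>_L, which equals -2 - 2<x, y>_L for points with <p, p>_L = -1."""
    return -2.0 - 2.0 * lorentz_inner(x, y)


origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])  # lies at geodesic distance 1 from origin
```

Smaller distances correspond to higher similarity, so candidate targets could be ranked by their distance to a query embedding.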





□ Flexible parsing and preprocessing of technical sequences with splitcode

>> https://www.biorxiv.org/content/10.1101/2023.03.20.533521v1

splitcode can simultaneously trim adapter sequences, parse combinatorial barcodes that are variable in length and inconsistent in location within a read, and extract UMIs that are defined in location with respect to other technical sequences rather than at a set position within a read.

splitcode can seamlessly interface with other command-line tools, including other sequencing read preprocessors as well as read mappers, by streaming the pre-processed reads into those tools.
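
The parsing tasks described above can be illustrated with a toy Python function. This is not splitcode's interface or algorithm: it uses exact string matching only, and the barcode sequences, the anchor sequence, and the name `parse_read` are invented for the example.

```python
from typing import Dict, Optional


def parse_read(read: str, barcodes: Dict[str, str], anchor: str,
               umi_len: int, adapter: str) -> Optional[Dict[str, str]]:
    """Locate a barcode, take the UMI right after the anchor, trim the adapter."""
    # 1. Barcodes may differ in length and start position: keep the earliest hit.
    hits = [(read.find(seq), seq) for seq in barcodes if seq in read]
    if not hits:
        return None
    bc_pos, bc_seq = min(hits)
    # 2. The UMI is defined relative to the anchor, not by a fixed read position.
    anchor_pos = read.find(anchor, bc_pos + len(bc_seq))
    if anchor_pos == -1:
        return None
    umi_start = anchor_pos + len(anchor)
    umi = read[umi_start:umi_start + umi_len]
    if len(umi) < umi_len:
        return None
    # 3. Drop the 3' adapter (and anything after it) from the remaining insert.
    insert = read[umi_start + umi_len:]
    cut = insert.find(adapter)
    if cut != -1:
        insert = insert[:cut]
    return {"barcode": barcodes[bc_seq], "umi": umi, "insert": insert}


# Invented example sequences: two barcodes of different lengths and offsets.
BARCODES = {"ACGTAC": "bc1", "TTGCA": "bc2"}
ANCHOR = "GTCAGT"
ADAPTER = "AGATCGGAAGAGC"
read_a = "G" + "ACGTAC" + ANCHOR + "CCATGG" + "TTTTGGGGCCCCAAAA" + ADAPTER
read_b = "CCC" + "TTGCA" + ANCHOR + "AAGGTT" + "CATCATCAT"
parsed_a = parse_read(read_a, BARCODES, ANCHOR, 6, ADAPTER)
parsed_b = parse_read(read_b, BARCODES, ANCHOR, 6, ADAPTER)
```

Because the UMI is located relative to the anchor, both reads are parsed correctly even though their barcodes differ in length and starting offset.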





□ Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF)

>> https://www.biorxiv.org/content/10.1101/2023.03.21.533680v1

SCHAF discovers the common latent space from both modalities across different samples. SCHAF then leverages this latent space to construct an inference engine mapping a histology image to its corresponding (model-generated) single-cell profiles.





Oxford Nanopore RT

>> https://newstimes18.com/how-ai-is-transforming-genomics/

Analysing sequencing data requires accelerated compute & #datascience to read and understand the genome. Read why #AI, #deeplearning, #RNN- and CNN-based models are essential for #genomics.





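□ My current role has shifted from my earlier analysis-and-initiatives position to one closer to development. GPT-4 brings its greatest benefit to the core of strategy; in existing integrated environments where requirement definitions are layered on top of one another, the efficiency gained from generating substitute code is limited. The options are either to have it design the environment under specific cost conditions, or to build a diagnostic function between interfaces.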