lens, align.

Long is the time, but what is true comes to pass.

Intangible.

2023-02-22 02:22:22 | Science News

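Intelligence is a vector quantity that acts reciprocally, depending on one's compatibility with the people involved and on the environment. Even when it is caught in a temporary equilibrium, that may be a stage on the way to self-elimination.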



□ The omnitig framework can improve genome assembly contiguity in practice

>> https://www.biorxiv.org/content/10.1101/2023.01.30.526175v1

Simple omnitigs are walks having a non-branching core, such that all nodes to the right of the core have out-degree one, and all nodes to the left of the core have in-degree one. They significantly improve length and contiguity over unitigs, while almost reaching that of omnitigs.

Simple omnitigs remain safe even when there are multiple linear chromosomes, as long as no chromosome starts or ends inside them. The authors give a linear output-sensitive time algorithm for finding all simple omnitigs.





□ SHARE-Topic: Bayesian Interpretable Modelling of Single-Cell Multi-Omic Data

>> https://www.biorxiv.org/content/10.1101/2023.02.02.526696v1

SHARE-Topic is a Bayesian generative model of multi-omic single cell data. SHARE-Topic identifies common patterns of co-variation between different ‘omic layers, providing interpretable explanations for the complexity of the data.

SHARE-Topic extends the cisTopic model of single-cell chromatin accessibility by coupling the epigenomic state with gene expression through latent variables. SHARE-Topic provides a low-dimensional representation of multi-omic data by embedding cells in a topic space.





□ Verkko: Telomere-to-telomere assembly of diploid chromosomes

>> https://www.nature.com/articles/s41587-023-01662-6

To resolve the most complex repeats, the Telomere-to-Telomere consortium's first complete human genome assembly relied on manual integration of ultra-long Oxford Nanopore sequencing reads with a high-resolution assembly graph built from long, accurate PacBio HiFi reads.

Verkko begins with a multiplex de Bruijn graph built from long, accurate reads and simplifies this graph by integrating ultra-long reads and haplotype-paths. The result is a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere.


Genome Gov

>> https://www.genome.gov/news/news-release/nih-software-assembles-complete-genome-sequences-on-demand

.@Genome_gov researchers have developed and released an innovative software tool called Verkko for assembling truly complete genome sequences from a variety of species! Verkko makes assembling complete genome sequences more affordable and accessible.







□ scBGEDA: Deep Single-cell Clustering Analysis via a Dual Denoising Autoencoder with Bipartite Graph Ensemble Clustering

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad075/7025496

scBGEDA first preprocesses the high-dimensional sparse scRNA-seq data into compressed low-dimensional data. Its second module is a single-cell denoising autoencoder based on a dual reconstruction loss that characterizes the scRNA-seq data by learning robust feature representations.

scBGEDA comprises a bipartite graph ensemble clustering method used on the learned latent space to obtain the optimal clustering result. The scBGEDA algorithm encodes the scRNA-seq data in a discriminative representation, on which two decoders reconstruct the scRNA-seq data.





□ stRainy: assembly-based metagenomic strain phasing using long reads

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526521v1

stRainy, an algorithm for phasing and assembly of closely-related strains. stRainy takes a sequence graph as input, identifies graph regions that represent collapsed strains, phases them and represents the results in an expanded and simplified assembly graph.

stRainy works with either a linear reference or a de novo assembly graph as input, and supports long reads. Because the strain variants are often unevenly distributed, regions of high and low heterozygosity may interleave in the assembly graph, which leads to tangles.





□ SATURN: Towards Universal Cell Embeddings: Integrating Single-cell RNA-seq Datasets across Species

>> https://www.biorxiv.org/content/10.1101/2023.02.03.526939v1

SATURN (Species Alignment Through Unification of Rna and proteiNs), a deep learning approach that integrates cross-species scRNA-seq datasets by coupling gene expression with protein embeddings generated by large protein language models.

SATURN introduces the concept of macrogenes, defined as groups of functionally related genes. The strength of the association of each gene to a macrogene is learnt to reflect the similarity of the corresponding protein embeddings.





□ PLANET: A Multi-Objective Graph Neural Network Model for Protein-Ligand Binding Affinity Prediction

>> https://www.biorxiv.org/content/10.1101/2023.02.01.526585v1

PLANET (Protein-Ligand Affinity prediction NETwork) was trained through a multi-objective process as multi-objective training has been proven useful for improving the performance and generalization of binding affinity prediction models.

PLANET is essentially a GNN model that captures protein–ligand interactions from the input structures, while deriving the intra-ligand distance matrix helps PLANET to capture 3D features from the 2D structural graph of the ligand.





□ Protein Sequence Design by Entropy-based Iterative Refinement

>> https://www.biorxiv.org/content/10.1101/2023.02.04.527099v1

An iterative sequence refinement pipeline that can refine the sequences generated by existing sequence design methods. It retains reliable predictions based on the model’s confidence in predicted distributions, and decodes the residue type based on a partially visible environment.

The pipeline computes the entropy of the predicted distribution at each position and selects the positions with low entropy, on the assumption that models are more confident in low-entropy predictions.

This method can remove a large portion of noise in the input residue environment, which improves both the generated sequences and the convergence speed. The final prediction is the average of the predictions from every iteration, weighted by their entropy.
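The entropy-based selection step can be illustrated with a small sketch. The function names, the `keep_fraction` parameter, and the toy four-letter alphabet are assumptions made for illustration; the summary does not say how the actual pipeline chooses its threshold.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of one predicted distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def select_confident_positions(pred, keep_fraction=0.5):
    """Return the indices of the lowest-entropy positions, in ascending order.

    keep_fraction is a hypothetical knob, not a parameter from the paper.
    """
    n_keep = max(1, int(len(pred) * keep_fraction))
    ranked = sorted(range(len(pred)), key=lambda i: entropy(pred[i]))
    return sorted(ranked[:n_keep])

# Toy predictions over a 4-letter alphabet for 4 sequence positions.
pred = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],  # moderately confident
    [0.01, 0.01, 0.97, 0.01],  # confident
]
kept = select_confident_positions(pred, keep_fraction=0.5)
print(kept)
```

Positions that are kept would stay visible in the next refinement round, while the remaining positions are predicted again.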





□ Metaphor - A workflow for streamlined assembly and binning of metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.02.09.527784v1

Metaphor is a fully-automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data, and by combining multiple binning algorithms with a bin refinement step to achieve high quality genome bins.

Metaphor processes multiple datasets in a single execution, performing assembly and binning in separate batches for each dataset, and avoiding the need for repeated executions with different input datasets.





□ LEA: Latent Eigenvalue Analysis in application to high-throughput phenotypic profiling

>> https://www.biorxiv.org/content/10.1101/2023.02.10.528026v1

By quantifying the multi-dimensional eigenvalue difference, sorted eigenvalues can provide informative measurements along principal axes and facilitate a more complete analysis of data heterogeneity.

LEA learns robust latent representations with a residual-based encoder for reconstructing these single-cell images. LEA can refine the high-throughput cell-based drug analysis to single-cell and single-organelle granularity.





□ WMDS.net: a network control framework for identifying key players in transcriptome programs

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad071/7023921

The weight of WMDS.net (the weighted minimum dominating set network) integrates the degree of nodes in the network and the significance of gene co-expression difference between two physiological states into the measurement of node controllability of the transcriptional network.





□ NIAPU: Network-Informed Adaptive Positive-Unlabeled learning for disease gene identification

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac848/7023926

NIAPU introduces a set of network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy. NIAPU combines the computation of the NeDBIT (Network diffusion and biology-informed topological) features with the use of APU (Adaptive Positive-Unlabelled label propagation).

The NIAPU classification is almost perfect because the NeDBIT features allow the classes to be properly separated: they capture the topological aspects of the set of seed genes as a whole, assigning progressively lower weights to genes that lie farther away.





□ cvlr: finding heterogeneously methylated genomic regions using ONT reads

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbac101/6998217

cvlr is command-line software that clusters reads from Nanopore sequencing output based on methylation patterns. Internally, the algorithm treats the data as a binary matrix, with n rows representing reads and d columns corresponding to genomic positions.

Reads are clustered into k clusters via a mixture of multivariate Bernoulli distributions, fitted with an EM algorithm. cvlr can detect subpopulations of reads regardless of whether they are due to an allelic effect, and it does not need a preliminary phasing step.
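A minimal sketch of EM for a mixture of multivariate Bernoulli distributions on a binary read-by-position matrix follows. It is not cvlr's code: the deterministic initialisation, the clamping constant `eps`, and the toy matrix are assumptions chosen to keep the example small and reproducible.

```python
import math

def bernoulli_mixture_em(X, k, n_iter=50, eps=1e-6):
    """Cluster binary rows of X into k groups; returns (labels, pi, mu)."""
    n, d = len(X), len(X[0])
    # Assumed initialisation: seed each cluster from an evenly spaced read,
    # shrunk towards 0.5 so that no probability starts at exactly 0 or 1.
    mu = [[0.25 + 0.5 * X[(c * n) // k][j] for j in range(d)] for c in range(k)]
    pi = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each cluster for each read.
        resp = []
        for x in X:
            logp = [
                math.log(pi[c])
                + sum(math.log(mu[c][j]) if x[j] else math.log(1.0 - mu[c][j])
                      for j in range(d))
                for c in range(k)
            ]
            m = max(logp)
            w = [math.exp(v - m) for v in logp]
            s = sum(w)
            resp.append([v / s for v in w])
        # M-step: update mixing weights and per-position methylation rates.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            pi[c] = max(nc / n, eps)
            for j in range(d):
                val = sum(r[c] * x[j] for r, x in zip(resp, X)) / max(nc, eps)
                mu[c][j] = min(max(val, eps), 1.0 - eps)
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, pi, mu

# Toy matrix: rows are reads, columns are CpG positions (1 = methylated).
reads = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
labels, pi, mu = bernoulli_mixture_em(reads, k=2)
print(labels)
```

The three mostly methylated reads and the three mostly unmethylated reads end up in separate clusters, without any phasing information being supplied.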





□ A cloud-based pipeline for analysis of FHIR and long-read data

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbac095/6994207

A full pipeline for working with both PacBio sequencing data and clinical FHIR data, from initial data to tertiary analysis. It performs variant calling on long-read PacBio HiFi data using Cromwell on Azure.

Both data formats are parsed, processed and merged in a single scalable pipeline which securely performs tertiary analyses using cloud-based Jupyter notebooks.





□ HiMAP2: Identifying phylogenetically informative genetic markers from diverse genomic resources

>> https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13762

HiMAP2 is a tool designed to identify informative loci from diverse genomic and transcriptomic resources in a phylogenomic framework. HiMAP2 identifies informative loci for phylogenetic studies, but it can also be used more widely for comparative genomic tasks.

HiMAP2 facilitates exploration of the final filtered exons by incorporating phylogenetic inference of individual exon trees with RAxML-NG as well as the estimation of a species tree using ASTRAL.





□ ClusterSeg: A crowd cluster pinpointed nucleus segmentation framework with cross-modality datasets

>> https://www.sciencedirect.com/science/article/abs/pii/S1361841523000191

ClusterSeg tackles nuclei clusters with a convolutional-transformer hybrid encoder and a 2.5-path decoder for precise predictions of nuclei instance masks, contours, and clustered edges.

Instance-level segmentation performance is measured with the prevalent Aggregated Jaccard Index (AJI), which evaluates connected components instead of pixels and penalizes over-segmentation, under-segmentation, as well as mis-segmentation.





□ Mowgli: Paired single-cell multi-omics data integration

>> https://www.biorxiv.org/content/10.1101/2023.02.02.526825v1

Multi-Omics Wasserstein inteGrative anaLysIs (Mowgli), a novel method for the integration of paired multi-omics data with any type and number of omics. Of note, Mowgli combines integrative Nonnegative Matrix Factorization (NMF) and Optimal Transport.

Mowgli employs integrative NMF, popular in computational biology for its intuitive representation by parts, and further enhances its interpretability. Mowgli uses the entropic regularization of Optimal Transport as a reconstruction loss.
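To make the entropic Optimal Transport ingredient concrete, here is a small Sinkhorn sketch in pure Python. It only shows how an entropy-regularised transport plan between two histograms is computed; it is not Mowgli's loss or its NMF, and the cost matrix, `reg` value, and iteration count are assumptions.

```python
import math

def sinkhorn(a, b, cost, reg=0.5, n_iter=500):
    """Entropy-regularised transport plan between histograms a and b."""
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * len(a)
    v = [1.0] * len(b)
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b)))
             for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a)))
             for j in range(len(b))]
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))]
            for i in range(len(a))]

def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for prow, crow in zip(plan, cost)
               for p, c in zip(prow, crow))

# Three bins on a line; the ground cost is the distance between bins.
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]

plan_ab = sinkhorn(a, b, cost)
plan_aa = sinkhorn(a, a, cost)
print(transport_cost(plan_ab, cost), transport_cost(plan_aa, cost))
```

Unlike an element-wise loss, the transport cost grows with how far mass has to move between bins, which is the property that makes it attractive as a reconstruction loss.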





□ CATE: A fast and scalable CUDA implementation to conduct highly parallelized evolutionary tests on large scale genomic data.

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526501v1

CATE (CUDA Accelerated Testing of Evolution) is capable of conducting evolutionary tests such as Tajima’s D, Fu and Li's, and Fay and Wu’s test statistics, McDonald–Kreitman Neutrality Index, Fixation Index, and Extended Haplotype Homozygosity.

CATE attempts to solve the problem of latency in conducting evolutionary tests through two key innovations: a unique file hierarchy together with a novel search algorithm (CIS) and GPU level parallelisation with the Prometheus mode.

The Prometheus architecture focuses mainly on batch processing of multiple query regions at the same time, whereas in normal mode CATE will process only a single query region at a time.
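As a CPU reference for one of the statistics listed above, the following sketch computes Tajima's D from a small alignment using the standard formula. It is not CATE's CUDA implementation, and the function name and toy alignments are made up for illustration.

```python
import math
from itertools import combinations

def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (NaN if no variation)."""
    n = len(seqs)
    sites = [j for j in range(len(seqs[0])) if len({s[j] for s in seqs}) > 1]
    S = len(sites)  # number of segregating sites
    if S == 0:
        return float("nan")
    pairs = list(combinations(seqs, 2))
    # pi: average number of pairwise differences.
    pi = sum(sum(a[j] != b[j] for j in sites) for a, b in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Excess of rare variants (every variant is a singleton): D is negative.
d_rare = tajimas_d(["TAAA", "ATAA", "AATA", "AAAT"])
# Variants at intermediate frequency (2 vs 2 split): D is positive.
d_balanced = tajimas_d(["AAA", "AAA", "TTT", "TTT"])
print(d_rare, d_balanced)
```

Each query region is an independent computation of this kind, which is what makes batch processing of many regions on a GPU a natural fit.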





□ KmerCamel🐫: Masked superstrings as a unified framework for textual 𝑘-mer set representations

>> https://www.biorxiv.org/content/10.1101/2023.02.01.526717v1

Masked superstrings combine the idea of representing 𝑘-mer sets via a string that contains the 𝑘-mers as substrings with masking out the positions of the newly emerged “false positive” 𝑘-mers. This removes the limitation of using (𝑘 − 1)-long overlaps only.

KmerCamel🐫 first reads a user-provided FASTA file with genomic sequences, computes the corresponding 𝑘-mer set, computes a masked superstring using a user-specified heuristic and core data structure, and prints it in the enc2 encoding.
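The masking idea can be shown in a few lines. In this sketch the mask is an explicit list of 0/1 flags, one per superstring position; that representation, the function names, and the toy example are assumptions, and canonical (reverse-complement) 𝑘-mers are ignored.

```python
def mask_for(superstring, kmer_set, k):
    """Mark every position whose k-mer belongs to the represented set."""
    mask = [0] * len(superstring)
    for i in range(len(superstring) - k + 1):
        if superstring[i:i + k] in kmer_set:
            mask[i] = 1
    return mask

def kmers_from_masked_superstring(superstring, mask, k):
    """Decode the k-mer set: keep only the k-mers at unmasked positions."""
    return {superstring[i:i + k]
            for i in range(len(superstring) - k + 1) if mask[i]}

k = 3
kmer_set = {"ACG", "CGT", "TTA"}
# "ACGTTA" contains all three k-mers but also the false positive "GTT".
superstring = "ACGTTA"
mask = mask_for(superstring, kmer_set, k)
decoded = kmers_from_masked_superstring(superstring, mask, k)
print(mask, sorted(decoded))
```

Because the false positive is masked out rather than forbidden, "CGT" and "TTA" can share a single-character overlap instead of requiring a (𝑘 − 1)-long one.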





□ The local topology of dynamical network models for biology

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526544v1

Network motifs/anti-motifs are local structures that appear unusually often/rarely in a network. Their likelihood is quantified based on their average occurrence in randomizations of the network that preserve the degree of each node.

The literature differs slightly on the thresholds and the randomizations involved in the quantitative definition of a motif. This work only considers fully connected triads, i.e. fully connected subsets of three nodes.
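A rough sketch of the motif test for fully connected triads is given below. It assumes an undirected graph for simplicity (the networks in the paper may be directed), uses edge swaps as the degree-preserving randomization, and reports a z-score; the function names, swap count, and toy graph are all illustrative choices.

```python
import random
import statistics
from itertools import combinations

def count_triangles(edges):
    """Number of fully connected triads in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def degree_preserving_shuffle(edges, n_swaps, rng):
    """Randomize by swapping edge endpoints; every node keeps its degree."""
    edges = [tuple(sorted(e)) for e in edges]
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Skip swaps that would create self-loops or duplicate edges.
        if len({a, b, c, d}) < 4 or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges

def motif_z_score(edges, n_random=200, seed=0):
    """Compare the observed triangle count with the randomized ensemble."""
    rng = random.Random(seed)
    observed = count_triangles(edges)
    null = [count_triangles(degree_preserving_shuffle(edges, 10 * len(edges), rng))
            for _ in range(n_random)]
    mean, sd = statistics.mean(null), statistics.pstdev(null)
    z = (observed - mean) / sd if sd > 0 else float("nan")
    return observed, mean, z

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
observed, null_mean, z = motif_z_score(edges)
print(observed, null_mean, z)
```

A large positive z-score would mark the triad as a motif and a large negative one as an anti-motif; the cut-off is exactly the kind of threshold on which the literature differs.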





□ SNEEP: A statistical approach to identify regulatory DNA variations

>> https://www.biorxiv.org/content/10.1101/2023.01.31.526404v1

SNEEP is a fast method to identify regulatory non-coding SNPs (rSNPs) that modify the binding sites of Transcription Factors (TFs) for large collections of SNPs provided by the user.

A modified Laplace distribution can adequately approximate the empirical distributions. This makes it possible to derive a p-value for the maximal differential TF binding score in constant time.





□ Helixer: de novo Prediction of Primary Eukaryotic Gene Models Combining Deep Learning and a Hidden Markov Model.

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527280v1

Helixer takes DNA sequence as input, makes base-wise predictions for genic class and phase with pre-trained Deep Neural Networks, and processes these predictions with a Hidden Markov Model into primary gene models.

The optimal scoring path through this Markov Model for a given underlying sequence and set of base-wise predictions is determined with the Viterbi algorithm. The system penalizes discrepancies where the state of the Markov Model differs from the base-wise predictions by Helixer.
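The decoding step can be sketched with a generic log-space Viterbi implementation. The two-state model, the probabilities, and the "I"/"G" prediction labels below are toy assumptions and not Helixer's actual Markov Model; the emission table plays the role of the penalty for disagreeing with the base-wise prediction.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state path for the observation sequence."""
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            cur[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ["intergenic", "genic"]
log_start = {s: math.log(0.5) for s in states}
# Staying in a state is likely; switching is penalized.
log_trans = {p: {s: math.log(0.9 if p == s else 0.1) for s in states}
             for p in states}
# Disagreeing with the base-wise prediction ("I" or "G") is penalized.
log_emit = {
    "intergenic": {"I": math.log(0.9), "G": math.log(0.1)},
    "genic": {"I": math.log(0.1), "G": math.log(0.9)},
}

base_wise = list("IIGIIIGGGG")  # one isolated, probably spurious "G" call
path = viterbi(base_wise, states, log_start, log_trans, log_emit)
print(path)
```

The isolated "G" at the third position is smoothed over because two state switches cost more than one disagreement, while the run of "G" calls at the end is long enough to justify a switch.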





□ DEFND-seq: Scalable co-sequencing of RNA and DNA from individual nuclei

>> https://www.biorxiv.org/content/10.1101/2023.02.09.527940v1

DNA and Expression Following Nucleosome Depletion sequencing (DEFND-seq) is a scalable method for co-sequencing RNA and DNA from single nuclei that uses commercial droplet microfluidics to achieve high throughput.

DEFND-seq treats nuclei with lithium diiodosalicylate to disrupt the chromatin and expose genomic DNA. Tagmented nuclei are loaded into a microfluidic generator, which co-encapsulates nuclei, beads containing genomic barcodes, and reverse transcription reagents into single droplets.





□ SeqScreen-Nano: a computational platform for rapid, in-field characterization of previously unseen pathogens.

>> https://www.biorxiv.org/content/10.1101/2023.02.10.528096v1

The SeqScreen-Nano pipeline is based on the SeqScreen pipeline, with substantial additions to deal with the complexity of long-read sequences. Briefly, it is built upon five stages: Initialize, SeqMapper, Protein / Taxonomic Identification, Functional Annotation, and Report generation.

SeqScreen-Nano can identify Open Reading Frames (ORFs) across the length of raw ONT reads and then use the predicted ORFs for accurate functional characterization and taxonomic classification.





□ Olivar: fully automated and variant aware primer design for multiplex tiled amplicon sequencing of pathogen genomes

>> https://www.biorxiv.org/content/10.1101/2023.02.11.528155v1

Olivar, an end-to-end pipeline for rapid and automatic design of primers for PCR tiling. Olivar accomplishes this by introducing the concept of the risk of primer design at the single nucleotide level, enabling fast evaluation of thousands of potential tiled amplicon sets.

Olivar looks for designs that avoid regions with high-risk scores based on SNPs, non-specificity, GC contents, and sequence complexity. Olivar also implements the SADDLE algorithm to optimize primer dimers in parallel and provides a separate validation module.





□ Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

>> https://peerj.com/articles/14779/

Complet+ is a novel method to increase the completeness of clusters obtained using large-scale biological sequence clustering methods. Complet+ addresses a key problem with large-scale clustering methods, such as MMseqs2 clustering and CD-HIT.

Complet+ utilizes the fast search capabilities of MMseqs2 to identify reciprocal hits between the representative sequences. These hits may be used to reform clusters, reducing the number of singletons and small clusters and creating larger clusters.





□ MDLCN: A multimodal deep learning model to infer cell-type-specific functional gene networks

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05146-x

MDLCN, a multimodal deep learning model, for predicting cell-type-specific FGNs by leveraging single-cell gene expression data with a global protein interaction network.

Gene expression signatures of a gene pair were first transformed to a co-expression matrix that captures the joint density of co-expression patterns of the gene pair across the cells in a particular cell type.

The co-expression matrix and the vector of proximity features were exploited as two modalities in the model, including a co-expression-processor modality to extract representations from the co-expression matrix and a proximity-processor modality to extract representations from the vector of proximity features.





□ ChromDL: A Next-Generation Regulatory DNA Classifier

>> https://www.biorxiv.org/content/10.1101/2023.01.27.525971v1

ChromDL is a neural network architecture combining bidirectional gated recurrent units (BiGRU), CNNs, and BiLSTM, which significantly improves upon a range of prediction metrics compared to its predecessors in TFBS, histone modification, and DNase-I hypersensitive site (DHS) detection.

ChromDL contains eleven layers. In total, the model contains 10,414,957 parameters, with 512 non-trainable parameters. ChromDL detects a significantly higher proportion of weak TFBS ChIP-seq peaks and demonstrates the potential to more accurately predict TF binding affinities.





□ wpLogicNet: logic gate and structure inference in gene regulatory networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad072/7039679

wpLogicNet proposes a framework to infer the logic gates among any number of regulators, with a low time-complexity. This distinguishes wpLogicNet from the existing logic-based models that are limited to inferring the gate between two genes or TFs.

wpLogicNet applies a Bayesian mixture model to estimate the likelihood of the target gene profile and to infer the logic gate a posteriori. In structure-aware mode, wpLogicNet reconstructs the logic gates in TF-gene or gene-gene interaction networks with known structures.





□ kakapo: Easy extraction and annotation of genes from raw RNA-seq reads

>> https://www.biorxiv.org/content/10.1101/2023.02.13.528395v1

kakapo (kākāpō) is a python-based pipeline that allows users to extract and assemble one or more specified genes or gene families. It flexibly uses original RNA-seq read or GenBank SRA accession inputs without performing assembly of entire transcriptomes.

kakapo determines the genetic code for each sample, based on the sample origin (NCBI TaxID) and the genomic source. kakapo can be employed to extract arbitrary loci, such as those commonly used for phylogenetic inference in systematics or candidate genes and gene families.





□ TRcaller: Precise and ultrafast tandem repeat variant detection in massively parallel sequencing reads

>> https://www.biorxiv.org/content/10.1101/2023.02.15.528687v1

TRcaller implements a novel algorithm for calling TR allele sequences from both short- and long-read sequences, generated from either whole-genome or targeted sequencing, and achieves greater accuracy and sensitivity than existing tools.

TRcaller uses an alignment strategy to define the boundaries of TRs. TRcaller takes an aligned sequence in indexed BAM format (with a BAI index) and a target TR loci file in BED format as input, and outputs the TR allele length/size, allele sequences, and supported read counts.





□ Five-letter seq: Simultaneous sequencing of genetic and epigenetic bases in DNA

>> https://www.nature.com/articles/s41587-022-01652-0

A whole-genome sequencing methodology capable of sequencing the four genetic letters in addition to 5mC and 5hmC to provide an accurate six-letter digital readout in a single workflow.

The processing of the DNA sample is entirely enzymatic and avoids the DNA degradation and genome coverage biases of bisulfite treatment. The five-letter seq workflow unambiguously resolves the four genetic bases and the epigenetic modifications, 5mC or 5hmC, collectively termed modC.

Six-letter seq calls unmodC, 5mC and 5hmC when the true state is unmodC, 5mC and 5hmC. A critical requirement is to disambiguate 5mC from 5hmC without compromising genetic base calling within the same sample fragment.





□ baseLess: lightweight detection of sequences in raw MinION data

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad017/7036850

BaseLess reduces the MinION sequencing device to a simple species detector. In exchange, it runs on inexpensive computational hardware such as single-board computers.

BaseLess deduces the presence of a target sequence by detecting squiggle segments corresponding to salient short sequences, k-mers, using an array of convolutional neural networks. baseLess can determine whether a read can be mapped to a given sequence or not.





□ REPAC: analysis of alternative polyadenylation from RNA-sequencing data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02865-5

REPAC, a novel framework to detect differential alternative polyadenylation (APA) using regression of polyadenylation compositions which can appropriately handle the compositional nature of this type of data while allowing for complex designs.





LEX.

2023-02-22 02:21:12 | Science News




□ siVAE: interpretable deep generative models for single-cell transcriptomes

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02850-y

siVAE is a deep neural network consisting of two pairs of encoder-decoder structures, one for cells and the other for features. The strategy siVAE uses to achieve interpretation is best understood by briefly reviewing why probabilistic PCA and factor analysis are interpretable.

siVAE is a variant of VAEs that infers a feature embedding space for the genomic features that is used to interpret the cell-embedding space. siVAE achieves interpretability without introducing linear constraints, making it strictly more expressive than LDVAE, scETM, and VEGA.





□ BiWFA: Optimal gap-affine alignment in O(s) space

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad074/7030690

BiWFA is the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining the WFA's time complexity of O(ns). BiWFA performs the WFA algorithm simultaneously in both directions on the strings: from start to end, and from end to start.

Each direction will only retain max{x,o+e} wavefronts in memory. This is insufficient to perform a full traceback. However, when they "meet" in the middle, we can infer a breakpoint in the alignment that divides the optimal score roughly in half.

Then, we can apply the same procedure on the two sides of the breakpoint recursively. BiWFA execution times are very similar to, or even better than, those of the original WFA, despite BiWFA requiring 2954× / 607× less memory when aligning ultra long MinION and PromethION sequences.





□ DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad055/7041952

DeepBIO is the first platform that supports not only sequence-level function prediction for any biological sequence data, but also allows nine base-level functional annotation tasks using deep-learning architectures, covering DNA / RNA methylation and protein binding specificity.

DeepBIO integrates over 40 deep-learning algorithms, including convolutional neural networks, advanced natural language processing models, and graph neural networks, enabling users to train, compare and evaluate different architectures on any biological sequence data.





□ NanoSpring: Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach

>> https://www.nature.com/articles/s41598-023-29267-8

NanoSpring uses an approximate assembly approach. NanoSpring indexes the reads using MinHash which enables efficient lookup of reads overlapping a given sequence, effectively handling substitution, insertion, and deletion errors.

NanoSpring attempts to build contigs consisting of overlapping reads. The contig is built by greedily searching the MinHash index for reads that overlap with the current consensus sequence of the graph, and adding the candidate reads to the graph using minimap2 alignment.
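The MinHash lookup idea can be sketched as follows. This is a generic MinHash over 𝑘-mers with salted BLAKE2b hashes from the standard library, not NanoSpring's index; the 𝑘-mer size, number of hash functions, and simulated reads are assumptions.

```python
import hashlib
import random

def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=8, n_hashes=64):
    """For each salted hash function, keep the minimum hash over all k-mers."""
    kms = kmers(seq, k)
    sketch = []
    for seed in range(n_hashes):
        salt = seed.to_bytes(4, "little")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for km in kms))
    return sketch

def sketch_similarity(s1, s2):
    """Fraction of matching minima; estimates the Jaccard index of the k-mer sets."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)

# Simulated data: two reads overlapping by 200 bp, plus an unrelated read.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(400))
read_a = genome[:300]
read_b = genome[100:]
unrelated = "".join(rng.choice("ACGT") for _ in range(300))

sk_a, sk_b, sk_u = (minhash_sketch(r) for r in (read_a, read_b, unrelated))
sim_overlap = sketch_similarity(sk_a, sk_b)
sim_unrelated = sketch_similarity(sk_a, sk_u)
print(sim_overlap, sim_unrelated)
```

Because a single sequencing error only changes the few 𝑘-mers that span it, most minima still match between overlapping reads, which is why this kind of index tolerates substitutions, insertions, and deletions.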





□ Heuristics for the De Bruijn Graph Sequence Mapping Problem

>> https://www.biorxiv.org/content/10.1101/2023.02.05.527069v1

1. GSMP: algorithm that returns the sequence mapped in the graph in time O(m|V|log(m·|V|)+m·|E|);
2. GSMPac: algorithm that returns only the cost of the mapping in time O(|V|+m·|E|).

De Bruijn sequence Mapping Tool (BMT) converts a De Bruijn graph to a sequence graph and runs GSMP’s algorithm. They use the idea of anchors (k-mers that are present in s and Gk) and then fill with BMT all gaps between two sequential anchors.
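The anchor idea can be sketched in a few lines: find the 𝑘-mers of s that also occur in the graph, then report the stretches of s left uncovered between consecutive anchors. The function names are hypothetical, the graph is represented only by its 𝑘-mer set, and the gap-filling step itself (BMT) is not implemented.

```python
def find_anchors(s, graph_kmers, k):
    """Start positions in s of k-mers that are also present in the graph."""
    return [i for i in range(len(s) - k + 1) if s[i:i + k] in graph_kmers]

def gaps_between_anchors(anchors, k):
    """Half-open intervals of s not covered by two consecutive anchors."""
    gaps = []
    for left, right in zip(anchors, anchors[1:]):
        if right > left + k:
            gaps.append((left + k, right))
    return gaps

k = 3
reference = "ACGTACGGA"
# k-mer set of a De Bruijn graph built from the reference.
graph_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}

s = "ACGTTTCGGA"
anchors = find_anchors(s, graph_kmers, k)
gaps = gaps_between_anchors(anchors, k)
print(anchors, gaps, [s[a:b] for a, b in gaps])
```

Only the gap segments would then need to be mapped through the graph, which restricts the expensive mapping step to the parts of s that do not match exactly.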





□ NanoSTR: A method for detection of target short tandem repeats based on nanopore sequencing data

>> https://www.frontiersin.org/articles/10.3389/fmolb.2023.1093519/full

NanoSTR detects the target STR loci based on the length-number-rank (LNR) information of reads. NanoSTR can be used for genotyping based on long-read data with improved accuracy and efficiency compared with other existing methods, such as Tandem-Genotypes and TRiCoLOR.

NanoSTR largely circumvents the errors or failure of genotyping associated with nanopore sequencing data characteristics. Moreover, there is no need to establish a genomic background database or align the sequencing data against the human reference genome.





□ DeepPheWAS: an R package for phenotype generation and association analysis for phenome-wide association studies

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad073/7028485

DeepPheWAS creates clinically-curated composite phenotypes, and integrates quantitative phenotypes from primary care data, longitudinal trajectories of quantitative measures, disease progression, and drug response phenotypes.

DeepPheWAS can be applied to quantitative phenotypes derived from numerous data sources, including primary care data, and supports the inclusion of complex variants, such as copy number variants with a wide range of copy numbers (multiallelic CNVs).





□ Bayesian multivariant fine mapping using the Laplace prior

>> https://onlinelibrary.wiley.com/doi/10.1002/gepi.22517

The Laplace prior can lead to higher posterior inclusion probabilities (PIPs) than either the Gaussian prior or FINEMAP, particularly for moderately sized fine-mapping studies.

Calculating the marginal likelihood with a Laplace prior requires either numerical integration or a Monte Carlo approach, which will make it slower than implementing the Gaussian prior.
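The computational contrast can be illustrated on a deliberately simplified single-variant model, assuming an effect estimate that is normally distributed around the true effect. With a Gaussian prior the marginal likelihood has a closed form, while with a Laplace prior it has to be integrated numerically or estimated by Monte Carlo. The model, the parameter values, and the function names are assumptions, not the paper's multivariant method.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def laplace_pdf(x, scale):
    return math.exp(-abs(x) / scale) / (2.0 * scale)

def marginal_numeric(beta_hat, se, prior_pdf, lo=-3.0, hi=3.0, steps=6000):
    """Trapezoid-rule integral of likelihood times prior over the effect size."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        beta = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * normal_pdf(beta_hat, beta, se) * prior_pdf(beta)
    return total * h

def marginal_monte_carlo(beta_hat, se, scale, n_samples=50000, seed=0):
    """Average the likelihood over effect sizes drawn from the Laplace prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Laplace draw: exponential magnitude with a random sign.
        beta = rng.expovariate(1.0 / scale) * rng.choice((-1.0, 1.0))
        total += normal_pdf(beta_hat, beta, se)
    return total / n_samples

beta_hat, se = 0.3, 0.1   # toy effect estimate and standard error
tau, scale = 0.2, 0.2     # toy Gaussian prior sd and Laplace prior scale

# Gaussian prior: closed form, beta_hat ~ N(0, se^2 + tau^2).
gauss_exact = normal_pdf(beta_hat, 0.0, math.sqrt(se ** 2 + tau ** 2))
gauss_numeric = marginal_numeric(beta_hat, se, lambda b: normal_pdf(b, 0.0, tau))

# Laplace prior: no closed form is used here, only numeric or Monte Carlo.
laplace_numeric = marginal_numeric(beta_hat, se, lambda b: laplace_pdf(b, scale))
laplace_mc = marginal_monte_carlo(beta_hat, se, scale)
print(gauss_exact, gauss_numeric, laplace_numeric, laplace_mc)
```

The Gaussian case needs one density evaluation, whereas the Laplace case needs thousands of evaluations per configuration, which is the source of the slowdown described above.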





□ scMINER: a mutual information-based framework for identifying hidden drivers from single-cell omics data

>> https://www.researchsquare.com/article/rs-2476875/v1

scMINER (single-cell Mutual Information-based Network Engineering Ranger) is a mutual information (MI)-based integrative computational framework. scMINER performs unsupervised clustering and reverse engineering of cell-type specific TF and SIG networks.

scMINER transforms the single-cell gene expression matrix into single-cell activity profiles and identifies cluster-specific TFs and SIGs, including hidden ones that show changes at the activity but not the expression level. scMINER uncovers the regulon rewiring of drivers among cell types.





□ Petagraph: A large-scale unifying knowledge graph framework for integrating biomolecular and biomedical data

>> https://www.biorxiv.org/content/10.1101/2023.02.11.528088v1

Petagraph is a large-scale unified biomedical knowledge graph (UBKG) that integrates biomolecular data into a schema incorporating the Unified Medical Language System (UMLS). Petagraph integrates biomedical data types into a UBKG environment of 200 cross-referenced ontologies.

Semantic Types are Petagraph nodes specified to assign types to different entities that are presented as (Concept-Code-Term) triplets to the graph. Petagraph was conceived as a knowledge graph for rapid feature selection to explore candidates for gene variant epistasis.





□ ConDecon: Clustering-independent estimation of cell abundances in bulk tissues using single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527318v1

ConDecon, a deconvolution method for inferring cell abundances from gene expression data of bulk tissues without relying on cluster labels or cell-type specific gene expression signatures at any step.

ConDecon uses the gene expression count matrix and latent space. The goal of ConDecon is thus to learn a map h: X → Y between the space X of possible rank correlation distributions and the space Y of possible probability distributions on the single-cell gene expression latent space.





□ Buttery-eel: Accelerated nanopore basecalling with SLOW5 data format

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527365v1

SLOW5 is designed to resolve the inherent limitations in FAST5. In its compressed binary form (BLOW5), the new format is ~20-80% smaller than FAST5 and permits efficient parallel access by multiple CPU threads.

Buttery-eel is an open-source wrapper that enables SLOW5 data access by Guppy. The work articulates a new advantage of SLOW5, namely its capacity for rapid sequential data access (as opposed to random access, explored previously), which can be exploited to accelerate basecalling.





□ IS-Seq: a bioinformatics pipeline for integration sites analysis with comprehensive abundance quantification methods

>> https://www.biorxiv.org/content/10.1101/2023.02.06.527381v1

IS-Seq can process data from paired-end sequencing of both restriction site-based IS collection methods and sonication-based IS retrieval systems while allowing the selection of different abundance estimation methods, including read-based, fragment-based and UMI-based systems.

The IS-Seq pipeline is designed to convert raw Illumina sequencing BCL files into a final table containing information of the genomic localization of integration sites (including annotation of the nearest gene) and their relative abundance per sample.







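Coding with ChatGPT relies on an algorithm that outputs highly readable results abstracted from the requirements and the training data; given that characteristic, there is no point in placing an evaluation point on the time axis. Its output should be piloted at all times. In essence it is not a substitute for existing means, and its value lies in operation that scales with the cost of computational resources.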


□ MIT researchers found that massive neural nets (e.g. large language models) are capable of storing and simulating other neural networks inside their hidden layers, which enables an LLM to adapt to a new task without external training:

>> https://news.mit.edu/2023/large-language-models-in-context-learning-0207

□ WHAT LEARNING ALGORITHM IS IN-CONTEXT LEARNING? INVESTIGATIONS WITH LINEAR MODELS

>> https://arxiv.org/pdf/2211.15661.pdf




□ Katie Link

>> https://twitter.com/katieelink/status/1622635429202898944

BioGPT-Large was just released by Microsoft 🤩

Trained from scratch on biomedical text, it's the current leader on the PubMedQA benchmark at 81% accuracy (human performance = 78%).

It's also freely available on the @huggingface hub to try out (and fine-tune)!





□ ARAX: a graph-based modular reasoning tool for translational biomedicine

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad082/7031241

ARAX enables users to encode translational biomedical questions and to integrate knowledge across sources to answer the user’s query and facilitate exploration of results.

ARAX can access around 40 knowledge providers (which themselves access over 100 underlying knowledge sources) from a single reasoning tool, using a standardized interface and semantic layer.

ARAX combines answers returned from the Knowledge Providers into a single answer knowledge graph that is “canonicalized,” meaning that it does not contain semantically redundant nodes.





□ aweMAGs: a fully automated workflow for quality assessment and annotation of eukaryotic genomes from metagenomes

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527609v1

Metashot/aweMAGs is written using Nextflow, a framework for building scalable scientific workflows using containers that allows implicit parallelism (i.e. the capability to automatically execute tasks in parallel) on a wide range of computing platforms.

Metashot/aweMAGs takes a series of genomes/metagenomic bins in FASTA format and returns: a TSV file including the quality information (“Assembly quality stats”) for each bin; two directories, one containing the bins filtered according to the completeness and contamination thresholds.





□ simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.02.13.528281v1

simCAS is an embedding-based method for simulating single-cell chromatin accessibility sequencing (scCAS) data. simCAS is a comprehensive and flexible simulator which provides three simulation modes: pseudo-cell-type mode, discrete mode and continuous mode.

For the pseudo-cell-type mode, the input of simCAS is the real scCAS data represented by a peak-by-cell matrix, and matched cell type information represented by a vector.

For the discrete or continuous mode, simCAS only requires the peak-by-cell matrix as the input data, followed by automatically obtaining the variation from multiple cell states. The output of simCAS is a synthetic peak-by-cell matrix with a vector of user-defined ground truths.





□ CausNet: generational orderings based search for optimal Bayesian networks via dynamic programming with parent set constraints

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05159-6

The main novel contribution in addition to providing software is the revision of the Silander algorithm 3 to incorporate possible parent sets, and use of ‘generational orderings’ for a much more efficient way to explore the search space.

CausNet offers two scoring functions as options: BIC (Bayesian information criterion) and BGe (Bayesian Gaussian equivalent). The BGe score is the posterior probability of the model hypothesis that the true distribution of the set of variables is faithful to the DAG model.





□ Generalizations of the Genomic Rank Distance to Indels

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad087/7039678

The rank-indel distance only uses insertions and deletions of entire chromosomes. The rank distance, on average, outperforms the DCJ-Indel distance in the Quartet metric, even though the rank distance exhibits greater variability for this metric.

As for the normalized RF metric, the similarity of the resulting trees with the ground-truth remains stable between 60% and 70% under the DCJ-Indel distance, on average, whereas the rank distance shows comparable results only for higher rates of indel events.





□ SynEcoSys: a multifunctional platform of large-scale single-cell omics data analysis

>> https://www.biorxiv.org/content/10.1101/2023.02.14.528566v1

SynEcoSys by Singleron Biotechnologies currently provides a massive collection of publicly available single-cell sequencing datasets, involving 46,326,175 cells from 731 datasets across multiple platforms and species.

The canonical cell type-specific marker genes from the SynEcoSys knowledgebase for the recommended cell types are used to verify the cell type results. The DB uses the BRENDA Tissue Ontology, Disease Ontology and Cell Ontology as references for the standardized terminologies.





□ LDmat: Efficiently Queryable Compression of Linkage Disequilibrium Matrices

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad092/7043094

Linkage disequilibrium (LD) matrices can reach large sizes when they are derived from millions of individuals; hence moving, sharing, and extracting granular information from this large amount of data can be very cumbersome.

LDmat is a standalone tool to compress large LD matrices in an HDF5 file format and query these compressed matrices. It can extract submatrices corresponding to a sub-region of the genome, a list of select loci, and loci within a minor allele frequency range.
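
The kind of query described above can be pictured with a small in-memory sketch. It uses plain Python lists rather than HDF5, and `query_ld` and its parameters are hypothetical names chosen for illustration, not LDmat's command-line interface.

```python
def query_ld(ld, positions, mafs, start, end, maf_min=0.0, maf_max=0.5):
    """Return (kept_positions, submatrix) for loci that lie inside the
    genomic interval [start, end] and inside the MAF range."""
    keep = [
        i
        for i, (pos, maf) in enumerate(zip(positions, mafs))
        if start <= pos <= end and maf_min <= maf <= maf_max
    ]
    # Slice both rows and columns so the result stays square and symmetric.
    submatrix = [[ld[i][j] for j in keep] for i in keep]
    return [positions[i] for i in keep], submatrix


# Toy LD matrix (assumed values) for four loci.
positions = [100, 200, 300, 400]
mafs = [0.01, 0.20, 0.30, 0.45]
ld = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.2],
    [0.1, 0.5, 1.0, 0.6],
    [0.0, 0.2, 0.6, 1.0],
]
kept, sub = query_ld(ld, positions, mafs, start=150, end=400,
                     maf_min=0.05, maf_max=0.40)
```

The first locus falls outside the region and the last one outside the MAF range, so only the two middle loci survive the filter.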





□ ConanVarvar: a versatile tool for the detection of large syndromic copy number variation from whole-genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05154-x

ConanVarvar, a software for joint calling of large, syndromic CNVs in batches of WGS samples using read depth. ConanVarvar annotates identified CNVs with information about associated syndromic conditions and generates plots showing the position of each variant on the chromosome.

ConanVarvar approximates read depth along chromosomes by splitting them into bins of fixed size with subsequent corrections for GC content and mappability. ConanVarvar performs segmentation of binned genomic intervals and assigns each segment an averaged copy number value.

ConanVarvar transforms the mean copy number of each segment to a different scale, so that potential deletions and duplications are further away from other segments; a K-means clustering algorithm then groups all transformed segments into “normal” and “CNV” categories.
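
A minimal sketch of the last step, under stated assumptions: the digest does not specify the transformation, so the absolute deviation from a diploid copy number of 2 is used here as a stand-in, and the clustering is a plain two-cluster K-means in one dimension. The names `two_means_1d` and `label_segments` are hypothetical.

```python
def two_means_1d(values, iterations=50):
    """Two-cluster K-means on scalars; centres start at the min and max.
    Returns 0 for the low-centre cluster and 1 for the high-centre cluster."""
    low, high = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        new_low = sum(group0) / len(group0) if group0 else low
        new_high = sum(group1) / len(group1) if group1 else high
        if new_low == low and new_high == high:
            break  # centres stopped moving
        low, high = new_low, new_high
    return labels


def label_segments(copy_numbers, ploidy=2.0):
    """Assumed transform: distance from the expected ploidy, so deletions
    and duplications both move away from copy-neutral segments."""
    transformed = [abs(cn - ploidy) for cn in copy_numbers]
    return ["CNV" if lab == 1 else "normal" for lab in two_means_1d(transformed)]


segment_means = [2.0, 2.1, 1.9, 1.0, 3.0, 2.05]
calls = label_segments(segment_means)
```

With this toy input, the segments averaging 1.0 and 3.0 copies end up in the "CNV" group and the rest in "normal".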





□ Design and performance of a long-read sequencing panel for pharmacogenomics

>> https://www.biorxiv.org/content/10.1101/2022.10.25.513646v1

Not all genes could be fully phased. The main reasons for haploblocks to break are a lack of coverage and a lack of heterozygous variants. With probe optimization it might be possible to improve the phasing for the regions where haploblock breakage is due to a lack of coverage.

With PacBio HiFi sequencing, more than 6.5kbp can be sequenced in one read when using a capture-based approach. A total of 27 samples were sequenced and panel accuracy was determined using benchmarking variant calls for 3 GIAB samples and GeT-RM star(*)-allele calls.

GeT-RM star(*)-alleles are only based on a limited set of variants that are used in their variant to star (*)-allele translations. A CN neutral region is required, and this should be taken into account when designing a sequencing panel.





□ FixItFelix: improving genomic analysis by fixing reference errors

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02863-7

FixItFelix, an efficient remapping approach, together with a modified version of the GRCh38 reference genome that improves the subsequent analysis across these genes within minutes for an existing alignment file while maintaining the same coordinates.

FixItFelix has different modules for short-read, long-read DNA and RNA sequencing reads. FixItFelix extracts only the mappings of the regions of interest from the existing whole genome mapping BAM/CRAM and extracts sequences for those regions and finally realigns the sequences.





□ ecmtool: fast and memory efficient enumeration of elementary conversion modes

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad095/7049479

Integrating mplrs, a scalable parallel vertex enumeration method, into ecmtool. This speeds up computation, drastically reduces memory requirements, and enables ecmtool’s use in standard and high-performance computing environments.

It replaces polco with a parallelized Lexicographic Reverse Search (LRS). LRS reverses the simplex method. It finds a vertex/ray on the polyhedron, moves along the edges of the polyhedron, and traces back all starting points that return that initial vertex in linear optimization.





□ Merizo: A rapid and accurate domain segmentation method using invariant point attention

>> https://www.biorxiv.org/content/10.1101/2023.02.19.529114v1

Merizo, a deep neural network-based method that conducts bottom-up domain segmentation in a proposal-free manner by using a 2-dimensional domain map directly as a learning objective.

Merizo makes use of the Invariant Point Attention (IPA) module introduced by AlphaFold2 [20], leveraging its ability to mix together sequence, pairwise and backbone information to directly encode a protein structure into a latent representation.

Merizo uses a small encoder-decoder network (approximately 20 million parameters). The IPA encoder in Merizo is composed of 4 non-weight-shared blocks, each with 16 attention heads and takes four inputs - three primary inputs and one additional input for positional encoding.





□ LISTER: Semi-automatic metadata extraction from annotated experiment documentation in eLabFTW

>> https://www.biorxiv.org/content/10.1101/2023.02.20.529231v1

LISTER (Life Science Experiments Metadata Parser), a methodological and algorithmic solution to disentangle the creation of metadata from ontology alignment and extract metadata from annotated template-based experiment documentation using minimum effort.

LISTER consists of three components: customized eLabFTW entries using specific hierarchies, templates, and tags; a ‘container’ concept in eLabFTW, making metadata of a particular container content extractable along with its underlying, related containers.





□ S1000: A better taxonomic name corpus for biomedical information extraction

>> https://www.biorxiv.org/content/10.1101/2023.02.20.528934v1

S1000, a re-annotated and expanded high-quality corpus for species, strain and genera names. S1000 uses a corpus for species NER, which builds upon S800. S800 was chosen as a starting point, since it already fulfills the criteria of species name diversity and representation.

The S1000 corpus contains more than seven times as many unique names as the LINNAEUS corpus. The high diversity of names was one of the key motivators for choosing S800 as a starting point, and increasing it even further has paid off, as is clear from the corpus statistics.





□ LAVAA: Lightweight Association Viewer Across Ailments

>> https://geneviz.aalto.fi/LAVAA/

The LAVAA volcano plot tool allows researchers to view not only the significance of PheWAS results of a variant, but also enables one to quickly see different directions and magnitudes of effect across phenotypes.





□ GFA-dead-end-counter: a tool for counting dead ends in GFA assembly graphs

>> https://github.com/rrwick/GFA-dead-end-counter





□ ADPG: Biomedical entity recognition based on Automatic Dependency Parsing Graph

>> https://www.sciencedirect.com/science/article/abs/pii/S1532046423000382

ADPG, a novel automatic dependency parsing approach to fuse syntactic structure information in an end-to-end way to recognize biomedical entities.

ADPG is based on a multilayer Tree-Transformer structure to automatically extract the semantic representation and syntactic structure in long-dependent sentences, and then combines a multilayer graph attention neural network (GAT) to extract the dependency paths.





□ NETCORE: An efficiency-driven, correlation-based feature elimination strategy for small datasets

>> https://aip.scitation.org/doi/full/10.1063/5.0118207

The NETCORE (the network-based, correlation-driven redundancy elimination) algorithm is model-independent, does not require an output label, and is applicable to all kinds of correlation topographies within a dataset.

NETCORE translates the dataset into a correlation network, which is analyzed by conducting an iterative decision. NETCORE selects a subset of features that represent the full feature space on the basis of a correlation threshold while taking into account the multi-connectivity.
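
One way to picture the iterative decision on a correlation network is the greedy sketch below. It is an assumption about the general idea (keep the most connected feature, drop its correlated neighbours, repeat), not NETCORE's published procedure, and `select_features` is a hypothetical name.

```python
def select_features(corr, threshold):
    """Greedy redundancy elimination on a correlation network.

    corr      : square matrix (list of lists) of pairwise correlations
    threshold : features with |correlation| >= threshold are linked
    Returns the sorted indices of the features that are kept.
    """
    n = len(corr)
    neighbours = {
        i: {j for j in range(n) if j != i and abs(corr[i][j]) >= threshold}
        for i in range(n)
    }
    remaining = set(range(n))
    kept = []
    while remaining:
        # Most connected remaining feature; ties go to the lowest index.
        best = max(sorted(remaining),
                   key=lambda i: len(neighbours[i] & remaining))
        kept.append(best)
        # The kept feature stands in for everything it is linked to.
        remaining -= neighbours[best] | {best}
    return sorted(kept)


# Toy correlation matrix (assumed): feature 0 is strongly correlated with
# features 1 and 2, while feature 3 is unrelated to the others.
corr = [
    [1.00, 0.90, 0.85, 0.05],
    [0.90, 1.00, 0.30, 0.10],
    [0.85, 0.30, 1.00, 0.00],
    [0.05, 0.10, 0.00, 1.00],
]
selected = select_features(corr, threshold=0.8)
```

No output label is needed at any point, which matches the label-free property described above.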





□ Pacybara: Accurate long-read sequencing for barcoded mutagenized allelic libraries

>> https://www.biorxiv.org/content/10.1101/2023.02.22.529427v1

Pacybara handles these issues by clustering long reads based on the similarities of (error-prone) barcodes while detecting the association of a single barcode with multiple genotypes. Pacybara also detects recombinant (chimeric) clones and reduces false positive indel calls.



Legion.

2023-02-22 02:21:10 | Science News




□ CellOracle: Dissecting cell identity via network inference and in silico gene perturbation

>> https://www.nature.com/articles/s41586-022-05688-9

CellOracle integrates multimodal data to build custom GRN models that are specifically designed to simulate shifts in cell identity following TF perturbation, providing a systematic and intuitive interpretation of context-dependent TF function in regulating cell identity.

CellOracle calculates the pseudotime gradient vector field and the inner-product score to generate a perturbation score. These simulated values are converted into a vector map, which enables simulated changes in cell identity to be intuitively visualized within a low-dimensional space.
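
The inner-product idea can be sketched in a few lines. This is only an illustration with made-up 2-D vectors; `perturbation_scores` is a hypothetical helper, not part of the CellOracle package.

```python
def perturbation_scores(gradient_field, simulated_shift):
    """Inner product, per grid point, between the pseudotime gradient vector
    and the simulated shift vector after the in silico perturbation."""
    return [
        sum(g * s for g, s in zip(grad_vec, shift_vec))
        for grad_vec, shift_vec in zip(gradient_field, simulated_shift)
    ]


# Two grid points in a 2-D embedding (toy values, assumed).
gradient_field = [(1.0, 0.0), (0.0, 1.0)]
simulated_shift = [(0.5, 0.0), (0.0, -2.0)]
scores = perturbation_scores(gradient_field, simulated_shift)
```

Under this reading, a positive score means the simulated shift points along the direction of differentiation and a negative score means it points against it.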





□ GENIUS: GEnome traNsformatIon and spatial representation of mUltiomicS data

>> https://www.biorxiv.org/content/10.1101/2023.02.09.525144v1

The GENIUS framework is able to transform multi-omics data into images with genes displayed as spatially connected pixels and successfully extract relevant information with respect to the desired output.

All models were trained with Adagrad optimizer. The motivation behind the implemented network structure is to use an encoder in order to learn how to compact genomic information into a small vector, L, forcing the network to extract relevant information.

GENIUS is similar to an autoencoder; however, the reconstruction of the genome image is not penalized. GENIUS produces a latent representation of multi-omics data in the form of a vector of size 128 (L), which is later concatenated in a model when making final predictions.





□ GeneClust: A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis

>> https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbad042/7031680

GeneClust can group cofunctional genes in biological process and pathway into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset.

GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. GeneClust can work as a plug-in tool for feature selection with any existing cell clustering method.





□ On triangle inequalities of correlation-based distances for gene expression profiles

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05161-y

Variants of the absolute correlation distance are not the only distance measures that violate the triangle inequality. The function regards positive / negative correlation equally, giving a value close to zero to highly correlated profiles, and a value of one to uncorrelated.

The robustness of dr-based clustering is also supported by evaluation based on the number of times that a class “dissolved”. That makes dr a good option when measuring correlation-based distances, as it has comparable accuracy and higher robustness.
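
A small numerical check makes the triangle-inequality issue concrete. The excerpt does not spell out the formulas, so two forms are assumed here: the absolute correlation distance 1 − |r|, and a square-root form sqrt(1 − |r|) standing in for dr. The three profiles are constructed so that x and z are uncorrelated while y sits halfway between them.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def d_abs(a, b):
    """Absolute correlation distance (assumed form): 1 - |r|."""
    return 1.0 - abs(pearson(a, b))


def d_sqrt(a, b):
    """Square-root variant (assumed form for dr): sqrt(1 - |r|)."""
    return math.sqrt(max(0.0, 1.0 - abs(pearson(a, b))))


# x and z are centred and orthogonal, so r(x, z) = 0.
x = [1.0, -1.0, 0.0]
z = [1.0, 1.0, -2.0]
# y is the sum of the unit-length versions of x and z, so it is
# equally correlated with both (r = 1 / sqrt(2)).
y = [xi / math.sqrt(2) + zi / math.sqrt(6) for xi, zi in zip(x, z)]

violates = d_abs(x, z) > d_abs(x, y) + d_abs(y, z)
holds = d_sqrt(x, z) <= d_sqrt(x, y) + d_sqrt(y, z)
```

For these profiles the direct distance under 1 − |r| exceeds the detour through y, while the square-root form does not show that problem. A single example cannot prove the inequality in general; it only illustrates the difference.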





□ SPADAN: A Novel Strategy for Dynamic Modelling of Genome-Scale Interaction Networks

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad079/7056637

SPADAN constructs genome-scale dynamic models, filling the gap between large-scale static and small-scale dynamic modeling strategies. SPADAN allows for holistic quantitative predictions which are critical for the simulation of therapeutic interventions in precision medicine.

SPADAN determines the consequence of interactions in terms of activation or inhibition of the target protein. The ODE systems that SPADAN operates on are mostly nonlinear.





□ PanGenome Research Tool Kit (PGR-TK): Multiscale Analysis of Pangenome Enables Improved Representation of Genomic Diversity For Repetitive And Clinically Relevant Genes

>> https://www.biorxiv.org/content/10.1101/2022.08.05.502980v2

PGR-TK uses minimizer anchors to generate pangenome graphs at different scales without more computationally intensive sequence-to-sequence alignment or explicitly calling variants with respect to a reference. PGR-TK uses an algorithm to decompose tangled pangenome graphs.

PGR-TK projects the linear genomics sequence onto the principal bundles. Pangenome-level decomposition provides utilities similar to the A-de Bruijn graph approach for identifying repeats and conserved segmental duplications, but for the whole human pangenome collection at once.
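
For readers unfamiliar with minimizers, here is a minimal sketch of (w, k)-minimizer selection. It orders k-mers lexicographically for simplicity (real tools typically use hashes), and it is not PGR-TK's implementation; `minimizers` is a hypothetical function name.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: for every window of w
    consecutive k-mers, keep the smallest k-mer (leftmost on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


anchors = minimizers("ACGTACGT", k=3, w=2)
```

Because neighbouring windows often agree on the same smallest k-mer, the anchors form a sparse sample of the sequence, and two sequences that share a long substring also share the anchors inside it. That is what allows anchor-based comparison without base-level alignment.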





□ BOSS-RUNS: Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design

>> https://www.nature.com/articles/s41587-022-01580-z

BOSS-RUNS (Benefit-Optimising Short-term Strategy for Read Until Nanopore Sequencing), an algorithmic framework and software to generate dynamically updated decision strategies. They quantify uncertainty at each genome position with real-time updates from data already observed.

BOSS-RUNS leads to an increase in the sequencing yield of on-target regions, specifically at positions of highest uncertainty, and can effectively mitigate abundance bias or other sources of non-uniform coverage—for example, from enrichment library preparation procedures.
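
As a loose illustration of "uncertainty at each genome position", the sketch below scores positions by the Shannon entropy of a smoothed base distribution built from observed counts. This is an assumption made for illustration; BOSS-RUNS defines its own Bayesian benefit score, and `position_entropy` and `rank_positions` are hypothetical names.

```python
import math


def position_entropy(counts, pseudocount=1.0):
    """Entropy (bits) of the smoothed base distribution at one position.
    counts = observed counts for A, C, G, T."""
    total = sum(counts) + pseudocount * len(counts)
    probs = [(c + pseudocount) / total for c in counts]
    return -sum(p * math.log2(p) for p in probs)


def rank_positions(count_table):
    """Indices of positions, most uncertain first."""
    entropies = [position_entropy(c) for c in count_table]
    return sorted(range(len(count_table)), key=lambda i: -entropies[i])


# Toy coverage (assumed): well covered, uncovered, and ambiguous positions.
count_table = [[30, 0, 0, 0], [0, 0, 0, 0], [5, 5, 0, 0]]
priority = rank_positions(count_table)
```

Recomputing the ranking as new reads arrive is the part that makes a strategy dynamic: positions drop down the list once the data have resolved them.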





□ BioNAR: An Integrated Biological Network Analysis Package in Bioconductor

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527636v1

BioNAR supports step-by-step analysis of biological/biomedical networks with the aim of quantifying and ranking each of the network’s vertices based on network topology and clustering.

BioNAR directly supports calculation of the following network vertex centrality measures: degree (DEG), betweenness (BET), clustering coefficient (CC), semilocal centrality (SL), mean shortest path (mnSP), page rank (PR) and standard deviation of the shortest path (sdSP).

BioNAR supports the Modularity-Maximisation, incl. 'Fast-Greedy' algorithm, process driven agglomerative random walk algorithm 'Walktrap', and coupled Potts/Simulated Annealing algorithm 'SpinGlass', the 'Leading-Eigenvector' and Spectral algorithms, and the 'Louvain' algorithm.





□ CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

>> https://www.biorxiv.org/content/10.1101/2023.02.08.527634v1

By leveraging remote BLAST databases, which always provide up-to-date results, CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query.

The service is extensible and interoperable and implements the cblaster and clinker pipelines to perform homology search, filtering, gene neighbourhood estimation, and dynamic visualisation of resulting variant BGCs.





□ scMAGS: Marker gene selection from scRNA-seq data for spatial transcriptomics studies

>> https://www.sciencedirect.com/science/article/abs/pii/S0010482523000999

scMAGS uses a filtering step in which the candidate genes are extracted before the marker gene selection step. For the selection of marker genes, cluster validity indices, the Silhouette index or the Calinski-Harabasz index (for large datasets) are utilized.

scMAGS selects marker genes that are exclusive to each cell type such that the corresponding marker genes are highly expressed in a specific cell type while being lowly expressed (or having zero expression) in other cell types.
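
The exclusivity criterion can be sketched with two fixed cut-offs. The thresholds `high` and `low` and the function `exclusive_markers` are assumptions for illustration; scMAGS itself relies on cluster validity indices rather than hard thresholds.

```python
def exclusive_markers(mean_expr, high=1.0, low=0.1):
    """mean_expr maps gene -> {cell_type: mean expression}.
    A gene is a marker for a cell type when it is highly expressed there
    and (near-)silent in every other cell type."""
    markers = {}
    for gene, by_type in mean_expr.items():
        for cell_type, value in by_type.items():
            others = [v for t, v in by_type.items() if t != cell_type]
            if value >= high and all(v <= low for v in others):
                markers.setdefault(cell_type, []).append(gene)
    return markers


# Toy per-cell-type means (assumed values).
mean_expr = {
    "CD3E": {"T": 2.5, "B": 0.0, "Mono": 0.05},
    "MS4A1": {"T": 0.0, "B": 3.1, "Mono": 0.0},
    "ACTB": {"T": 5.0, "B": 5.0, "Mono": 5.0},
}
markers = exclusive_markers(mean_expr)
```

A gene such as ACTB that is high everywhere is rejected, which is the behaviour wanted when markers must discriminate cell types in a spatial assay.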





□ BRIDGEcereal: Streamline unsupervised machine learning to survey and graph indel-based haplotypes from pan-genomes

>> https://www.biorxiv.org/content/10.1101/2023.02.11.527743v1

BRIDGEcereal, a webapp for surveying and graphing indel-based haplotypes for genes of interest from publicly accessible pan-genomes through streamlining two unsupervised machine learning algorithms.

BRIDGEcereal uses Clustering HSPs for Ortholog Identification via Coordinates and Equivalence (CHOICE) algorithm that identifies and extracts the segment harboring the ortholog from each assembly.

The second algorithm, Clustering via Large-Indel Permuted Slopes (CLIPS) groups assemblies sharing the same set of indels to graph a concise haplotype plot to visualize potential large indels, their impacts on the gene, and relationships among haplotypes.





□ SCAMPP+FastTree: improving scalability for likelihood-based phylogenetic placement

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad008/7009227

The first step in pplacer is to estimate the numerical model parameters on the backbone tree, such as branch lengths defining expected numbers of substitutions and the substitution rate matrix for the Generalized Time Reversible model.

The replacement of RAxML by FastTree for numeric parameter estimation consistently enables pplacer to scale to larger backbone trees (though not quite matching the scalability of APPLES-2 or pplacer-SCAMPP-RAxML), and that pplacer-FastTree is similar in accuracy to pplacer-RAxML.

pplacer-SCAMPP-FastTree has the same scalability as APPLES-2, improves on the scalability of pplacer-FastTree and achieves better accuracy than the comparably scalable methods.





□ ASGARD: A Single-cell Guided Pipeline to Aid Repurposing of Drugs

>> https://www.nature.com/articles/s41467-023-36637-3

ASGARD defines a drug score to predict drugs for multiple diseased cell clusters within each patient. The benchmarking results show that the performance of ASGARD on single drugs is more accurate and robust than other pipelines handling bulk and single-cell RNA-Seq data.

ASGARD repurposes drugs for disease by fully accounting for the cellular heterogeneity. In ASGARD, every cell cluster in the diseased sample is paired to that in the normal sample, according to “anchor” genes that are consistently expressed between diseased and normal cells.






□ Shiny.gosling Examples and How to Run Them: Genomics Visualizations in R Shiny

>> https://appsilon.com/shiny-gosling-examples-genomics-in-r/




□ Dorado: A LibTorch Basecaller for Oxford Nanopore Reads

>> https://github.com/nanoporetech/dorado





□ On the Effectiveness of Compact Biomedical Transformers

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad103/7056640

Introducing six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT which are obtained either by knowledge distillation from a biomedical teacher or continual learning on the PubMed dataset.

MobileBERT uses a 128-dimensional embedding layer followed by 1D convolutions to up-project its output to the desired hidden dimension expected by the transformer blocks. MobileBERT reduces the hidden size and the computational cost of multi-head attention / feed-forward blocks.





□ MeganServer: facilitating interactive access to metagenomic data on a server

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad105/7056641





□ Latent dirichlet allocation for double clustering (LDA-DC): discovering patients phenotypes and cell populations within a single Bayesian framework

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05177-4

A novel approach to stratify observations and huge-dimensional features within a single probabilistic framework, i.e., to identify patient phenotypes and cell types simultaneously.

LDA-DC unifies clustering methods within one Bayesian framework to group cells into different cellular phenotypes from quantitative data, and stratify patients based on the clustered cells.





□ SpliceVault predicts the precise nature of variant-associated mis-splicing

>> https://www.nature.com/articles/s41588-022-01293-8

SpliceVault, a web portal to access 300K-RNA (and 40K-RNA in hg19), which quantifies natural variation in splicing and potently predicts the nature of variant-associated mis-splicing.

Default settings display the 300K-RNA Top-4 output according to the optimized parameters, with options to return all events, customize the number of events returned, the distance scanned for cryptic splice sites and the maximum number of exons skipped, or list tissue-specific mis-splicing events.





□ LLaMA: Open and Efficient Foundation Language Models

>> https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/

LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. LLaMA is trained on trillions of tokens. It is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. LLaMA tokenizes the data with the byte-pair encoding (BPE) algorithm.
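
For reference, the core of byte-pair encoding is a loop that repeatedly merges the most frequent adjacent symbol pair. The sketch below is the textbook character-level version on a toy word list, not LLaMA's tokenizer; `learn_bpe` is a hypothetical name.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically smallest.
        best = max(sorted(pairs), key=lambda pair: pairs[pair])
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

On this toy corpus the first two merges build the shared stem "low" out of single characters, which shows how frequent substrings become single tokens.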





□ Brane actions for coherent ∞-operads

>> https://arxiv.org/pdf/2302.12206.pdf

Proving that Mann–Robalo’s construction of the brane action [MR18] extends to general coherent ∞-operads, with possibly multiple colors and non-contractible spaces of unary operations. Lurie’s and Mann–Robalo’s models for such spaces are equivalent.

The space of extensions in the sense of Lurie is not in general equivalent to the homotopy fiber of the associated forgetful morphism, but rather to its homotopy quotient by the ∞-groupoid of unary operations.

In many applications, it is useful to "invert" the wrong-way morphisms appearing in the spans to obtain an algebra structure in a more tractable ∞-category, such as that of chain complexes.





□ Haptools: a toolkit for admixture and haplotype analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad104/7058928

Haptools is a collection of tools for simulating and analyzing genotypes and phenotypes while taking into account haplotype information. Haptools supports fast simulation of admixed genomes (with simgenotype), visualization of admixture tracks (with karyogram).

Simulating haplotype- and local ancestry-specific phenotype effects (with transform and simphenotype), and computing a variety of common file operations and statistics in a haplotype-aware manner.





□ Centrifuge+: improving metagenomic analysis upon Centrifuge

>> https://www.biorxiv.org/content/10.1101/2023.02.27.530134v1

Centrifuge is especially applied for ONT shotgun sequencing analysis and is now included as a step in WIMP, which is a quantitative analysis tool for real-time species identification based on the MinION released by ONT.

Centrifuge+ modifies the statistical model of Centrifuge and improves metagenomic analysis. In the modified statistical model, the influence of similarities among species in the reference database is described by the unique mapping rate when analyzing the ambiguous reads.





□ SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data

>> https://academic.oup.com/bfg/advance-article-abstract/doi/10.1093/bfgp/elad004/7058188

SCMcluster integrates two cell marker databases (CellMarker database and PanglaoDB database) with scRNA-seq data for feature extraction, and constructs an ensemble clustering model (including SNN-Cliq and SOM) based on the consensus matrix.





□ CeDAR: incorporating cell type hierarchy improves cell type-specific differential analyses in bulk omics data

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02857-5

CeDAR incorporates the cell type hierarchy in cell type-specific differential analysis in bulk data. For each feature, CeDAR defines binary random variables to represent its underlying DE/DM states in all cell types, each with a prior probability.

CeDAR is robust to the specification of cell type hierarchy, for example, when the true structure is not bifurcating or just has a single layer.





□ SALON ontology for the formal description of sequence alignments

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05190-7

The Sequence Alignment Ontology (SALON) is an OWL 2 ontology that supports automated reasoning for alignments validation and retrieving complementary information from public databases under the Open Linked Data approach.

SALON defines a full range of controlled terminology in the domain of sequence alignments. SALON can be further exploited by defining SWRL rules, which automatically determine if a sequence alignment is plausible based on its global assigned score.




□ RPTRF: A rapid perfect tandem repeat finder tool for DNA sequences

>> https://www.sciencedirect.com/science/article/abs/pii/S0303264723000448

The Rapid Perfect Tandem Repeat Finder (RPTRF), minimizes the need for excess character comparison processing by indexing the input file and significantly helps to accelerate and prepare the output without artifacts by using an interval tree in the filtering section.






□ Interpretable Meta-learning of Multi-omics Data for Survival Analysis and Pathway Enrichment

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad113/7067742

A meta-learning approach that uses multi-omics datasets to train a hazard predictive model for cancer survival analysis. Applying an advanced variable importance analysis method, DeepLIFT, and comparing pathway enrichment for transcriptomics and multi-omics data.

After running the pre-trained meta-learning model from survival analysis on each target cancer type data, they sorted the genes by DeepLIFT scores and set the first gene from each enrichment set as the anchor gene.

In this process, a standard for how near they look around the anchor gene becomes necessary, which they refer to as the window size. If a gene is within ± the window size from the anchor gene, they consider the two genes’ DeepLIFT scores to be similar.





□ linemodels: clustering effects based on linear relationships

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad115/7067743

linemodels estimates the membership probabilities of the variables in the given models, by taking into account the uncertainty in the effect estimates and the possible correlation of the two effect estimators.

The linemodels package further allows for optimisation of any set of model parameters using an EM-algorithm and estimation of the proportion parameters of the underlying mixture model using a Gibbs sampler.





□ upSPLAT: a method for cost-effective, large-scale pooled sequencing library preparation applicable to diverse sample types

>> https://www.scilifelab.se/wp-content/uploads/2023/02/upSPLAT-a-method-for-cost-effective-large-scale-pooled-sequencing-library-preparation-applicable-to-diverse-sample-types.pdf

Ultra-pooled SPLAT (upSPLAT) is a flexible, low-cost library preparation workflow for pooled sequencing of large numbers of barcoded samples. The method is an adaptation of the ‘Splinted Ligation Adapter Tagging’ library prep technique, which was developed in-house.





□ Feature selection followed by a residuals-based normalization simplifies and improves single-cell gene expression analysis

>> https://www.biorxiv.org/content/10.1101/2023.03.02.530891v1

A simple feature selection method that relies on a regression-based approach to estimate dispersion coefficients for the genes based on the observed counts.

The variation in the counts of the stable genes is expected to reflect the biases introduced by the unwanted sources, so it can be used to arrive at more reliable estimates of the cell-specific size factors.

A residuals-based normalization method that reduces the impact of sampling depth differences between the cells and simultaneously ensures variance stabilization by relying on a monotonic non-linear transformation.





□ CustOmics: A versatile deep-learning based strategy for multi-omics integration

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010921

CustOmics is a hierarchical mixed-integration strategy: an autoencoder for each source creates a sub-representation, which is then fed to a central variational autoencoder.

CustOmics benefits from two training phases. The first phase will act as a normalization process: each source will train separately to learn a more compact representation that synthesizes its information with less noise.

This helps the integration by removing imbalance issues between the sources, and it avoids losing focus when a source has a lower dimensionality or a weaker signal than the others.

The second phase will constitute a simple joint integration between the learned sub-representations, while still training all the encoders to fine-tune those representations as some signals are enhanced in the presence of other sources.





□ NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad118/7070501

NDEx IQuery provides functionality that complements or extends existing resources. It combines novel sources of pathways/networks, and its integration with NDEx provides the capability to store and share analysis results.

The NDEx IQuery web application performs four separate gene set analyses based on a diverse range of pathways/networks from NDEx and presents the results in four dedicated tabs: Curated Pathways, Pathway Figures, INDRA GO, and Interactomes.





□ Genomepy: genes and genomes at your fingertips

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad119/7070503

Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and inspect available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults.

Genomepy builds on packages including pyfaidx, pandas and MyGene.info to work rapidly with gene and genome sequences and metadata. Similarly, genomepy has been incorporated into other packages, such as pybedtools and CellOracle.





□ QuaC: A Pipeline Implementing Quality Control Best Practices for Genome Sequencing and Exome Sequencing Data

>> https://www.biorxiv.org/content/10.1101/2023.03.06.531383v1

QuaC integrates and standardizes QC best practices at the authors' center. It performs three major steps: (1) runs several QC tools using data produced by the read alignment (BAM) and small variant calling (VCF) as input and optionally accepts QC output for raw sequencing reads (FASTQ).





Janu.

2023-02-22 00:52:28 | Hotels


Luxury international hotels scheduled to open in 2023.

Janu Tokyo
The Unbound Collection by Hyatt Hotel in Tokyo
Bulgari Hotel Tokyo
The Tokyo Edition Ginza

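I'm especially looking forward to Janu and the Unbound Collection! I'd love to stay right away, but if they open in autumn or later, I'm afraid that will fall just outside the holiday season. I'm going to swim serious laps in a pool with a night view…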










Beyond forest.

2023-02-21 21:21:21 | Hotels


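Beyond the forest, hidden by snow, lay a paradise waiting to unfold. Tonight's lodging is 『Hoshinoya Resort Oirase Keiryu Hotel』, and the room comes with its own open-air bath. The "Hyobaku no Yu" (Icefall Bath) was also amazing, with walls made of icicles, but sadly photography is prohibited. Tomorrow it's snow trekking and a river descent from the morning 🥶 I'm worried about whether I'll be able to sleep.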




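The large open-air bath here at 『Hoshinoya Resort Oirase Keiryu Hotel』, 『Hyobaku no Yu』, is seriously amazing (official image, since photography is strictly prohibited). The walls of ice, which make you feel as if you are inside an ice cave, are breathtaking. It's also nice to lose yourself in thought while watching the light shimmer on the water 😌❄️ By the way, there are a lot of guests from overseas 🌏💫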









Víkingur Ólafsson – Bartók: 3 Hungarian Folksongs from the Csìk, Sz. 35a

2023-02-21 21:21:12 | art music

□ Víkingur Ólafsson – Bartók: 3 Hungarian Folksongs from the Csìk, Sz. 35a

>> https://dg.lnk.to/FromAfarTW

Celebrated for his innovative programming and award-winning recordings, Icelandic pianist Víkingur Ólafsson is offering a window into his musical life story with his new album, From Afar. The highly personal double album reflects Ólafsson’s musical DNA, from childhood memories growing up in Iceland to his international career and contemporary inspirations. Recorded on both upright and grand pianos, the album captures two distinct sound worlds with works by Bach, Mozart, Schumann, Brahms and Bartók, alongside Icelandic and Hungarian folk songs, a world premiere by Thomas Adès, transcriptions by Ólafsson himself, and interconnecting pieces composed by his hero, 96 year-old Hungarian composer and pianist György Kurtág. Exploring such evocative themes as home, childhood and family, the album features Hungarian and Icelandic folk songs, nature-inspired works, interwoven homages and three previously unreleased transcriptions by Ólafsson.





Elevated.

2023-02-20 02:20:02 | Photos



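My area of expertise at work is applying statistical inference and arithmetic methods to practical business. I was producing results long before AI became a boom, and one of my schemes was even adopted as a nationwide standard. However, I have truly struggled to gain the understanding of my superiors, mainly the middle-aged and older ones, and I still feel deep-rooted barriers today.

What pushed me past my limit this time was moving to a new environment last April. At the core of planning and manufacturing in one company, I was in charge of analyzing the data behind decision-making and devising measures, but I clashed completely with the incumbent person in charge over the direction. The methods on site are a good 30 years behind, yet there was no room to accept a new methodology.

Before coming to my current workplace, I had been asked to "change the status quo with your own approach". Once I got down to the actual work, however, what awaited me was outdated practices being forced on me and harassment that seemed aimed at pushing me out. Every day I was tormented by baseless rumors and one-sided accusations. That was when my sleep disorder began.

Fortunately, everyone around me was kind and always looked out for me. The person in question was a repeat offender with several prior incidents related to power harassment, and the company had long recognized this as a problem. Perhaps thanks to that, my temporary medical leave was approved smoothly, but honestly, it is hard to say where this matter should be settled.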





The Banshees of Inisherin.

2023-02-19 02:34:56 | Film


□ 『The Banshees of Inisherin』

>> https://www.searchlightpictures.com/the-banshees-of-inisherin/

Directed by Martin McDonagh
Music by Carter Burwell
Cinematography by Ben Davis

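External battles and internal strife. A chain of discord brought on by isolation, and the price of inaction. Intelligence is a vector quantity that acts reciprocally, depending on one's compatibility with the people involved and on the environment. Even when it is caught in a temporary equilibrium, that may be a process heading toward self-elimination. Nature and faith feign ignorance and simply let it pass, coldly.


The more someone wishes to reject another, and the more someone knows the pain of being rejected, the more they will enjoy these subtleties of human feeling. Such people are nearly everywhere, and they know what is wrong without being lectured about virtue or compliance. What someone else said becomes, before you know it, your own pretext, and takes the place of your own thinking. Even if it is the whisper of the grim reaper.

Dragging someone into your own arena uninvited, and then judging them by rules that happen to suit you, is a bad law of strife that is universal in every age. Yet few people seem able to stay aware that a pretext used only to bind someone they want to control also debases their own beliefs and values.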










Quietude.

2023-02-19 02:22:22 | art music


□ 『Andante Piano presents: Quietude (Volume One)』

>> https://www.instagram.com/andantepianomusic/

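A collection of piano pieces from a neoclassical label based in Amsterdam. A fresh definition of "healing" by up-and-coming pianists. The serene mood of the pieces and a tone as clear as drops of water are soothing, and listening in bed brings on a pleasant drowsiness… 🛌😴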


□ Daniel Bror Palm - Echoes





HomePod2.

2023-02-19 01:02:03 | Digital / Internet


□ 『HomePod (2nd generation) MQJ73J/A Midnight』

>> https://www.apple.com/homepod-2nd-generation

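The sound resolution is strikingly clear, and in Dolby Atmos the highs are well separated and seem to leap into your ears. However, compared side by side with Atmos on the DENON DHT-S517, my impression is that DENON's design philosophy wins on the thickness and cohesion of horns and strings. I'll keep using the S517 for home theater. I have high hopes for a stereo pair.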




Richter: Spring 1 (Levitation Mix)
One of the tracks I used this time to benchmark Dolby Atmos on the HomePod.