Blog posts from May 2023 - lens, align.

『EO』

>> https://www.imdb.com/title/tt19652910/

(Poland / 2022)
Directed by Jerzy Skolimowski
Written by Ewa Piaskowska / Jerzy Skolimowski
Music by Pawel Mykietyn
Cinematography by Michal Dymek

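『EO』 (Poland): visual beauty so vivid it cuts. The path of a donkey who steps out into a vast, shelterless world simply to find the master it loves is almost too much to bear. Do those mirror-like, innocent eyes reflect the image of the one it loves, or the absurdity of the human world? Even goodwill and pity become shackles, and the film confronts us with the question of who is really telling this story.

Skolimowski's bold, avant-garde visual aesthetic runs through the whole film. In staging the contrast between EO and the water, meadows, forests, and man-made structures he stands among, the film bends light, turns objects into metaphors, and even reverses time. The cold, tragic score also colors EO's dark journey eloquently.

Besides the scene where EO stops in the middle of the bridge over a dam and time reverses (the water flows backward), visual metaphors are used in many scenes, such as the wind turbines as an allegory of fate. Merciless entropy. Outside the fence lies a "clockwork" world, where the gears that carry those who have lost their guide toward their fate creak and writhe. #EO

I can see why some critics read the circular structure, from the opening resuscitation scene through to the ending, as a Christian metaphor, but I feel it actually points to something more universal: the recurrence of fable. Opening with a resuscitation is also a declaration that the film revives the themes of 『Au hasard Balthazar』, the work it is based on.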

□ Pawel Mykietyn - The Beginning | EO (Original Motion Picture Soundtrack)

□ Pawel Mykietyn - Skier | EO (Original Motion Picture Soundtrack)

Exokind.

2023-05-22 19:20:21 | Music20

□ Burial - Streetlands

It's some of the enigmatic producer's most affecting ambient work.

“Streetlands” brings a coherent arc of tension, light and dark, traces of horror and the essential, indispensable portion of hauntology into harmony.

□ Burial - Exokind [HDB150]

The best films of the First Half of 2023.

2023-05-22 19:08:09 | Film

The best films of the First Half of 2023.

1.EO
2.Return to Dust (小さき麦の花)
3.Benedetta
4.რას ვხედავთ როდესაც ცას ვუყურებთ? (What Do We See When We Look at the Sky?)
5.GotG3

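Top 5 films of the first half of 2023 (as of May)

Unexpectedly, 『EO』 shot straight to the top. It floored me, and I still can't shake the afterglow. The donkey in 『Return to Dust』 also gives a great performance. Between GotG3 and Pig, animals have been making me cry nonstop since last year.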

Petrichor - Dark Female Vocals

2023-05-21 20:21:22 | Music20

□ Petrichor - Dark Female Vocals

>> https://youtube.com/playlist?list=PLp9W_3yKBh5LBFyylBB7DlInVdLHH4t2H
>> https://music.apple.com/playlist/petrichor-dark-female-vocals/pl.u-LpRjFbeRqV

01. Valravn / “Koder på snor” (Faun Remix)
02. CLANN / “Closer”
03. Dabin / “Lilith” (feat. Apashe & Madi)
04. Aurora / “The Seed”
05. Ólafur Arnalds / “Back To The Sky”
06. Apparat / “Goodbye”
07. KALANDRA / “Borders”
08. Floes / “Last Night”
09. Delerium / “Ray” (feat. Kristy Thirsk)
10. Thomas Bergersen / “In Orbit” (feat. Cinda M.)
11. ODESZA / “This Version Of You” (feat. Julianna Barwick)
12. PRAANA / “Insight”

The season of relentless drizzle is coming, so I put together a playlist for the rainy season: dark, mystical female-vocal tracks with elements of folk music.

Aboard the "Dragon" spacecraft

2023-05-21 20:08:08 | Science News

Just a few steps and we launch with our ambition #نحو_الفضاء (toward space).. Astronauts @Astro_Rayyanah and @AstroAli11 will head to the International Space Station aboard the "Dragon" spacecraft shortly.. 🇸🇦 pic.twitter.com/twM2wa01jw
— الهيئة السعودية للفضاء (@saudispace) May 21, 2023

This image from space is the most beautiful expression of the brotherhood between #عُمان (Oman) and #الإمارات (the UAE) 🇴🇲 🇦🇪 Geography unites us .. customs unite us .. and mutual affection unites us. Oman is part of us and we are part of them ❤️ pic.twitter.com/U6WaPyCEbG
— Sultan AlNeyadi (@Astro_Alneyadi) May 24, 2023

Excited to welcome my colleagues from the Ax-2 crew, Peggy Whitson, John Shoffner, Ali AlQarni and Rayyanah Barnawi.. In a few hours we meet here.. in a few hours the Emirati and Saudi flags will come together side by side in space 🇸🇦🇦🇪 Safe travels and best of luck, God willing .. pic.twitter.com/DDiZ1fiwGT
— Sultan AlNeyadi (@Astro_Alneyadi) May 22, 2023

G7.

2023-05-21 19:07:07 | International / Politics

□ Ukrainian Air Force

□ Defence of Ukraine

Coming this fall!
The greatest air force blockbuster of all time!
F-16s in Ukraine's skies!
We shall defend our skies!

We are adding strength to Ukraine. On the eve of the G7 Summit, I held meetings with @GiorgiaMeloni 🇮🇹, @RishiSunak 🇬🇧, @narendramodi 🇮🇳, @CharlesMichel 🇪🇺, @EmmanuelMacron 🇫🇷, @OlafScholz 🇩🇪. Peace Formula. Protection of people. Joint strengthening of international law.

🇺🇦🇺🇦🇺🇦… pic.twitter.com/ZcNU96uhbu
— Володимир Зеленський (@ZelenskyyUa) May 20, 2023

The ongoing war in Ukraine has been a major concern worldwide. But for me, it's not a matter of politics or economy, but of humanity and human values.

- PM Shri @narendramodi Ji #G7HiroshimaSummit pic.twitter.com/qogKDe3aK9
— Sambit Patra (@sambitswaraj) May 20, 2023

With the Prime Minister of the Cook Islands H.E @MarkBrownPM representing the Pacific Islands Forum at the #G7Summit. Discussing the blue economy, ratification of the @wto Fisheries Subsidies Agreement and diversification of Pacific Island economies. @ForumSEC pic.twitter.com/6XALoddqEa
— Ngozi Okonjo-Iweala (@NOIweala) May 20, 2023

Talk rubbish.

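2023-05-21 07:07:07 | Diary / Essay / Column

My career so far has leaned toward management and operations, so this is my first chance to be so deeply involved in hands-on engineering work (a technical career-track role). Communication here places more weight on rationality than in my previous environments, which makes it smooth and comfortable; the hurdle is being expected to reach transient, highly abstract convergence through improvised discussion.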

Stalemate.

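2023-05-21 06:06:06 | Diary / Essay / Column

When the merits of an existing institution or structure are debated, criticism tends to target vested interests, whereas the defense often falls back on the claim that rationality is built into how the existing, customary system came about, and so ends up purely self-referential. As a result, the key questions of the two sides never meet, and the debate can heat up endlessly.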

Orpheus.

2023-05-15 05:15:05 | Science News

(Art by ekaitza)

□ ORFeus: A Computational Method to Detect Programmed Ribosomal Frameshifts and Other Non-Canonical Translation Events

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538127v1

ORFeus uses a hidden Markov model to infer translation patterns from ribo-seq data that is inherently noisy and sparse. The model identifies changes in reading frame and additional upstream or downstream reading frames, making it suitable for detection of many alternative translation events.

ORFeus can identify novel or extended ORFs (including uORFs and dORFs) with either canonical or non-canonical start codons, as well as programmed ribosomal frameshifts and stop codon readthrough events. For each transcript, ORFeus returns the most probable state path.
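
The "most probable state path" of a hidden Markov model is usually recovered with the Viterbi algorithm. Below is a minimal sketch on a toy two-state model; the states, probabilities, and "low"/"high" observations are illustrative assumptions and not ORFeus's actual model or parameters.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path of a discrete HMM (log-space)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model: footprint density is mostly "low" outside an ORF
# and mostly "high" inside one.
STATES = ("untranslated", "orf")
START = {"untranslated": 0.9, "orf": 0.1}
TRANS = {
    "untranslated": {"untranslated": 0.8, "orf": 0.2},
    "orf": {"untranslated": 0.1, "orf": 0.9},
}
EMIT = {
    "untranslated": {"low": 0.9, "high": 0.1},
    "orf": {"low": 0.2, "high": 0.8},
}

path = viterbi(["low", "low", "high", "high", "high"], STATES, START, TRANS, EMIT)
```

A real frameshift detector would need many more states (one per reading frame and codon position, for instance), but the decoding step has the same shape.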

□ scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI

>> https://www.biorxiv.org/content/10.1101/2023.04.30.538439v1

scGPT is a single-cell foundation model built by generative pre-training on over 10 million cells. scGPT uses an in-memory data structure to store hundreds of datasets, allowing fast access. The learned gene embedding maps decode known pathways by grouping together genes that are functionally related.

With zero-shot learning, the pre-trained model is able to reveal meaningful cell clusters on unseen datasets. With finetuning in a few-shot learning setting, the model achieves state-of-the-art performance on a wide range of downstream tasks.

scGPT employs a generative self-supervised objective to iteratively predict gene expression values of unknown tokens from known tokens in an auto-regressive manner. scGPT's embedding architecture can easily extend to multiple sequencing modalities, batches, and perturbation states.

□ REVNANO: Reverse Engineering DNA Origami Nanostructure Designs from Raw Scaffold and Staple Sequence Lists

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539261v1

REVNANO, a constraint programming solver that recovers the (approximate) staple-scaffold contact map from origami sequences. REVNANO uses graph layout techniques to convert the topological contact map into an approximate geometric origami schematic.

REVNANO leverages the unique physical features of origami nanostructures as heuristics. DNA, RNA or hybrid scaffolded origami are all supported. The quality of the REVNANO solution is quantified by the base-level Hamming distance between the recovered contact map and the ground truth contact map.

□ UnitedNet: Explainable multi-task learning for multi-modality biological data analysis

>> https://www.nature.com/articles/s41467-023-37477-x

UnitedNet has an encoder-decoder-discriminator structure and is trained by joint group identification / cross-modal prediction. Its structure does not presume that the data distributions are known; instead, it implicitly approximates the statistical characteristics of each modality.

UnitedNet uses the SHapley Additive exPlanations (SHAP) algorithm to indicate the relevance relationship between gene expression and DNA accessibility with cell-type specificity. UnitedNet fuses the modality-specific codes into shared latent codes using an adaptive weighting scheme.

□ AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

>> https://www.biorxiv.org/content/10.1101/2021.02.16.431517v2

AirLift, a methodology and tool for quickly, comprehensively, and accurately remapping a read data set that had previously been mapped to an older reference genome to a newer reference genome.

AirLift provides BAM-to-BAM remapping results on which downstream analysis can be immediately performed. AirLift Index exploits the similarity between two references to quickly identify candidate locations that a read should be remapped to based on its original mapping.

□ DELVE: Feature selection for preserving biological trajectories in single-cell data

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540043v1

DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories.

DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes.

□ Designing molecular RNA switches with Restricted Boltzmann machines

>> https://www.biorxiv.org/content/10.1101/2023.05.10.540155v1

Restricted Boltzmann machines (RBM), a simple two-layer machine learning model, capture intricate sequence dependencies induced by secondary and tertiary structure, as well as the switching mechanism, resulting in a model that can be used for the design of allosteric RNA.

The hidden units of the RBM must extract features shared by the data sequences and thus likely to be important for their biological function. Conservation of probability mass implies that regions of sequence space not populated by data sequences must be penalized.

The RBM is able to model complex interactions. After marginalizing over the hidden units configurations, effective interactions arise between the visible units. RBM can represent schematically a three-body interaction, arising from the three connections of the summed hidden unit.

□ metapaths: similarity search in heterogeneous knowledge graphs via meta paths

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad297/7152274

Once informative meta paths for a given KG have been defined, these meta paths define the semantics of the relationships between nodes in the KG, thereby enabling heterogeneous graph convolutional and graph attention networks for downstream machine learning analyses.

The primitives of the metapaths package identify the neighbors of a specified node with a given type by querying either an edge list or, for efficiency, an adjacency list precomputed from the edge list.

The meta path traversal function accepts an origin node, a destination node, and a specified meta path; then, via the neighbor identification functions, it starts at the origin node and recursively expands the sequence of node types until the destination node is reached.
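
The traversal described above can be sketched in a few lines. This is an illustration of the idea only, not the metapaths package's API: the function names, the toy graph, and the choice of an undirected adjacency list are all assumptions.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Precompute an undirected adjacency list from an edge list."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

def neighbors_of_type(node, node_type, adjacency, node_types):
    """Neighbors of `node` whose type equals `node_type`, in sorted order."""
    return sorted(n for n in adjacency[node] if node_types[n] == node_type)

def traverse(origin, destination, meta_path, adjacency, node_types):
    """Return every path from origin to destination whose node types follow meta_path."""
    if node_types[origin] != meta_path[0]:
        return []
    if len(meta_path) == 1:
        return [[origin]] if origin == destination else []
    paths = []
    # Expand one node type at a time, recursing on the rest of the meta path.
    for nxt in neighbors_of_type(origin, meta_path[1], adjacency, node_types):
        for tail in traverse(nxt, destination, meta_path[1:], adjacency, node_types):
            paths.append([origin] + tail)
    return paths

# Hypothetical toy knowledge graph.
NODE_TYPES = {"aspirin": "drug", "ibuprofen": "drug", "PTGS1": "gene", "PTGS2": "gene"}
EDGES = [("aspirin", "PTGS1"), ("aspirin", "PTGS2"), ("ibuprofen", "PTGS2")]
ADJ = build_adjacency(EDGES)

paths = traverse("aspirin", "ibuprofen", ["drug", "gene", "drug"], ADJ, NODE_TYPES)
```

Counting the paths returned for a pair of nodes is one simple way to turn a meta path into a similarity score.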

□ EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02941-w

Random transformation of DNA sequences can potentially alter their function in unknown ways. EvoAug pretrains sequence-based deep learning models for regulatory genomics data with evolution-inspired augmentations, followed by finetuning on the original, unperturbed sequence data.

EvoAug data augmentations introduce a modeling bias to learn invariances of the (un)natural symmetries generated by the augmentations.

Random insertions and deletions assume that the distance between motifs is not critical, whereas random inversions and translocations promote invariances to motif strand orientation and the order of motifs.
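
To make the four augmentations concrete, here is a minimal sketch on plain strings. It is not the EvoAug API; the function names and the use of a circular shift for translocation are assumptions made for illustration.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def random_insertion(seq, length, rng):
    """Insert `length` random bases at a random position."""
    pos = rng.randrange(len(seq) + 1)
    insert = "".join(rng.choice("ACGT") for _ in range(length))
    return seq[:pos] + insert + seq[pos:]

def random_deletion(seq, length, rng):
    """Delete a random contiguous stretch of `length` bases."""
    pos = rng.randrange(len(seq) - length + 1)
    return seq[:pos] + seq[pos + length:]

def random_inversion(seq, length, rng):
    """Replace a random stretch of `length` bases with its reverse complement."""
    pos = rng.randrange(len(seq) - length + 1)
    segment = seq[pos:pos + length].translate(COMPLEMENT)[::-1]
    return seq[:pos] + segment + seq[pos + length:]

def random_translocation(seq, rng):
    """Circularly shift the sequence, swapping the order of its two parts."""
    shift = rng.randrange(1, len(seq))
    return seq[shift:] + seq[:shift]

rng = random.Random(0)  # fixed seed so the example is reproducible
sequence = "ACGTACGTAC"
augmented = random_translocation(random_inversion(sequence, 4, rng), rng)
```

Insertions and deletions change the spacing between motifs, while inversions and translocations change strand orientation and motif order, which matches the invariances each augmentation is meant to encourage.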

□ ProteinSGM: Score-based generative modeling for de novo protein design

>> https://www.nature.com/articles/s43588-023-00440-3

ProteinSGM, a continuous-time score-based generative model that generates high-quality de novo proteins. ProteinSGM learns to generate four matrices that fully describe a protein's backbone, which are used as smoothed harmonic constraints in the Rosetta minimization protocol.

ProteinSGM generates variable-length structures with a mean < -3.9 REU per residue, indicative of native-like structures. It provides an alternative approach that uses MinMover for backbone minimization, and ProteinMPNN and OmegaFold for sequence design and structure prediction.

□ CEBRA: Learnable latent embeddings for joint behavioural and neural analysis

>> https://www.nature.com/articles/s41586-023-06031-6

CEBRA is a nonlinear dimensionality reduction method newly developed to explicitly leverage auxiliary (behaviour) labels and/or time to discover latent features in time series data—in this case, latent neural embeddings.

CEBRA can be used for supervised and self-supervised analysis, thereby directly facilitating hypothesis- and discovery-driven science. It both produces consistent embeddings across subjects and can find the dimensionality of neural spaces that are topologically robust.

□ The categorical basis of dynamical entropy

>> https://arxiv.org/abs/2301.09205

The focus of topological dynamical systems theory is to derive properties of the system. The objects usually under consideration are invariant behavior such as attractors, invariant sets and omega-limit sets, and asymptotic properties such as invariant measures and entropy.

A category-theoretic view of topological dynamical entropy, which reveals that the common limit is a consequence of the structural assumptions on these notions. One of the key tools developed is that of a qualifying pair of functors, which ensure a limit preserving property.

The diameter and Lebesgue number of open covers of a compact space form a qualifying pair of functors. The various notions of complexity are expressed as functors, and natural transformations between these functors lead to their joint convergence to the common limit.

□ A draft human pangenome reference

>> https://www.nature.com/articles/s41586-023-05896-x

Flagger detects different types of misassemblies within a phased diploid assembly. The pipeline works by mapping the HiFi reads to the combined maternal and paternal assembly in a haplotype-aware manner.

Flagger identifies coverage inconsistencies within these read mappings. Coverage is calculated across the genome and a mixture model is fit to account for reliably assembled haploid sequence and various classes of unreliably assembled sequence.

□ Squigulator: simulation of nanopore sequencing signal data with tunable noise parameters

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539953v1

Squigulator generates simulated nanopore signal data based on an input reference genome or transcriptome sequence, or directly from a set of basecalled reads.

Squigulator uses an idealised 'pore model' that specifies the predicted current signal reading associated with every possible DNA or RNA k-mer, as appropriate to the specific nanopore protocol being emulated.

Squigulator generates sequential signal values corresponding to sequential k-mers in the provided reference sequence. Squigulator transforms the data using Gaussian noise functions in both the time and amplitude domains to produce realistic, rather than ideal, signal reads.
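
The general recipe (look up an ideal current level per k-mer, then add Gaussian noise in time and amplitude) can be sketched as follows. The pore model values, the parameter names, and the noise defaults here are invented for illustration and are not Squigulator's actual model or options.

```python
import random
from itertools import product

def toy_pore_model(k=3):
    """Hypothetical pore model: map every DNA k-mer to a mean current level (pA)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: 60.0 + 1.0 * i for i, kmer in enumerate(kmers)}

def simulate_signal(sequence, pore_model, k=3, dwell_mean=10.0, dwell_sd=2.0,
                    amp_sd=1.5, seed=0):
    """Simulate a raw signal for `sequence`, one block of samples per k-mer."""
    rng = random.Random(seed)
    signal = []
    for i in range(len(sequence) - k + 1):
        level = pore_model[sequence[i:i + k]]
        # Time-domain noise: the number of samples per k-mer varies.
        dwell = max(1, round(rng.gauss(dwell_mean, dwell_sd)))
        # Amplitude-domain noise: each sample deviates from the ideal level.
        signal.extend(rng.gauss(level, amp_sd) for _ in range(dwell))
    return signal

model = toy_pore_model()
noisy = simulate_signal("ACGTACGT", model)
ideal = simulate_signal("ACGTACGT", model, dwell_sd=0.0, amp_sd=0.0)
```

Setting both standard deviations to zero gives the "ideal" signal, which makes the effect of each noise parameter easy to inspect in isolation.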

□ Ariadne: Synthetic Long Read Deconvolution Using Assembly Graphs

>> https://www.biorxiv.org/content/10.1101/2021.05.09.443255v3

Ariadne, a novel assembly graph-based SLR deconvolution algorithm, that can be used to extract single-species read-clouds from SLR datasets to improve the taxonomic classification and de novo assembly of complex populations, such as metagenomes.

Ariadne leverages the linkage information encoded in the full de Bruijn-based assembly graph generated by a de novo assembly tool such as cloudSPAdes to generate up to 37.5-fold more read clouds containing only reads from a single fragment.

□ Merizo: a rapid and accurate domain segmentation method using invariant point attention

>> https://www.biorxiv.org/content/10.1101/2023.02.19.529114v2

Network inputs to the IPA encoder are the single and pairwise representations and backbone frames in the style of AlphaFold2. The IPA encoder comprises six weight-shared blocks, each containing a single IPA block with RoPE positional encoding, and a bi-GRU transition block.

In the masked transformer decoder, learnable domain mask embeddings are concatenated to the single representation and passed through a 10-layer MHA stack with ALiBi positional encoding.

The predicted domain mask tensor is split according to the predicted domain and is passed through a two-layer biGRU, followed by projection into one dimension to produce a single pIoU value for each domain. n_dom represents the number of predicted domains.

□ Evolutionary graph theory on rugged fitness landscapes

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539435v1

A unifying theory of how heterogeneous structure shapes evolutionary dynamics. Even a simple extension to a two-mutational landscape can exhibit evolutionary dynamics not observed in deme-based models and that cannot be predicted using single-mutation results.

This model can be applied to understand the evolutionary trajectory of cellular systems with complex architectures. Heterogeneous structure can affect fitness landscape crossing by allowing intermediate mutants to persist for longer, until the final beneficial mutation occurs.

□ The Compositional Structure of Bayesian Inference

>> https://arxiv.org/abs/2305.06112

A compositional Bayesian inversion of Markov kernels in isolation, using a suitable axiomatisation of a category of Markov kernels. It builds categories whose morphisms are pairs of a Markov kernel and an associated 'Bayesian inverter', which is itself built compositionally.

Symmetric monoidal categories with compatible families of copy and delete morphisms have been identified as an expressive language for synthetically representing concepts from probability theory.

A categorical translation of Bayes allows for a general definition of a Bayesian inverse to a morphism in a Markov category. The category of Bayesian lenses is constructed as a fibred category that is closely related to the families fibration, in the semantics of dependent types.

□ CoCoNat: a novel method based on deep-learning for coiled-coil prediction

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539816v1

CoCoNat encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement.

CoCoNat makes use of residue embeddings obtained with large-scale protein language models (pLMs) to represent proteins in training and testing sets. CoCoNat takes as input a 15-residue sliding window, in which each residue is represented by a 2304-feature vector.

□ snATAK: Assessing the multimodal tradeoff

>> https://www.biorxiv.org/content/10.1101/2021.12.08.471788v2

snATAK incorporates kallisto and other tools in a workflow that facilitates the preprocessing of snATAC-seq data from numerous technologies in minimal computing environments. snATAK can be used for allele-specific analysis of multimodal data, even in the absence of genotype data.

snATAK first maps reads to a reference genome using Minimap2. snATAK identifies putative open chromatin regions with Genrich. A kallisto pseudoalignment index is made and reads are remapped using kallisto. The snATAK output is compatible with Signac and ArchR.

□ GenPhys: From Physical Processes to Generative Models

>> https://arxiv.org/abs/2304.02637

GenPhys (Generative Models from Physical Processes), a framework that can convert physical partial differential equations (PDEs) to generative models. Diffusion models and Poisson flow generative models leverage the diffusion equation and the Poisson equation.

There exist non-s-generative models which can also provide useful generative modeling, as in quantum machine learning with dynamics based on the Schrödinger equation and quantum circuits.

□ Learning Decision Trees with Gradient Descent

>> https://arxiv.org/abs/2305.03515

Gradient-based decision trees (GDTs), a novel approach for learning hard, axis-aligned Decision Trees (DTs) with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation to jointly optimize all tree parameters.

GDTs are less prone to overfitting. They are optimized with common stochastic gradient descent techniques, including mini-batch calculation and momentum, using the Adam optimizer with weight averaging.

□ LatentDiff: A Latent Diffusion Model for Protein Structure Generation

>> https://arxiv.org/abs/2305.04120

LatentDiff generates a novel protein backbone structure. It first samples multivariate Gaussian noise and uses the learned latent diffusion model to generate 3D positions and node embeddings in the latent space.

LatentDiff uses a pre-trained equivariant 3D autoencoder to transform protein backbones into a more compact latent space, and models the latent distribution with an equivariant latent diffusion model.

□ Sequence UNET: High-throughput deep learning variant effect prediction

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02948-3

Sequence UNET is trained to directly predict variant frequency or to classify low frequency variants, as a proxy for deleteriousness, and then fine-tuned for pathogenicity prediction.

Sequence UNET uses a fully convolutional architecture. Convolutional kernels also naturally integrate information from nearby amino acids. The model outputs a matrix of per position features and can therefore be trained to predict various positional properties.

□ aaHash: recursive amino acid sequence hashing

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539909v1

aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.

aaHash builds on ntHash, a rolling hash algorithm for DNA/RNA sequences, and adapts it for amino acid sequences. aaHash also supports using different levels of hashes together to create a multi-level pattern, mimicking the functionality of spaced seeds.
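
The speed-up for adjacent k-mers comes from the rolling update: the next hash is derived from the previous one in constant time instead of rehashing the whole k-mer. Below is a minimal sketch of a cyclic-polynomial rolling hash in the style of ntHash; the per-residue seeds are derived from SHA-256 purely for illustration and are not aaHash's real seed tables, and the multi-level scheme is not modelled.

```python
import hashlib

MASK = (1 << 64) - 1

def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK

def seed(residue):
    """Hypothetical per-residue 64-bit seed (stand-in for a real seed table)."""
    return int.from_bytes(hashlib.sha256(residue.encode()).digest()[:8], "big")

def hash_kmer(kmer):
    """Hash one k-mer from scratch: XOR of seeds rotated by position."""
    k = len(kmer)
    h = 0
    for i, residue in enumerate(kmer):
        h ^= rol(seed(residue), k - 1 - i)
    return h

def rolling_hashes(sequence, k):
    """Yield the hash of every k-mer, updating in O(1) per step."""
    if len(sequence) < k:
        return
    h = hash_kmer(sequence[:k])
    yield h
    for i in range(k, len(sequence)):
        outgoing, incoming = sequence[i - k], sequence[i]
        # Rotate the old hash, cancel the outgoing residue, add the incoming one.
        h = rol(h, 1) ^ rol(seed(outgoing), k) ^ seed(incoming)
        yield h

protein = "MKTAYIAKQRQISFVK"
hashes = list(rolling_hashes(protein, 5))
```

Because rotation distributes over XOR, each rolled value equals the hash computed from scratch for the same k-mer, which is the property checked when validating such a scheme.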

□ BGWAS: Bayesian variable selection in linear mixed models with nonlocal priors for genome-wide association studies

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05316-x

BGWAS uses a novel nonlocal prior for linear mixed models (LMMs). The screening step fits as many LMMs as the number of SNPs using a mixture of a Dirac delta at zero and a nonlocal prior, and estimates the probability of the Dirac delta component.

BGWAS uses a pMOM nonlocal prior for LMMs that uses the full Fisher information matrix. BGWAS either uses complete enumeration or searches the model space with a genetic algorithm.

□ AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad310/7160912

AIONER, a new NER tagger that takes full advantage of various existing datasets for recognizing multiple entities simultaneously, despite their inherent differences in scope and quality, through a novel all-in-one (AIO) scheme.

The AIO scheme utilizes a small dataset recently annotated with multiple entity types as a bridge to integrate multiple datasets annotated with a subset of entity types, thereby recognizing multiple entities at once, resulting in improved accuracy and robustness.

□ NanoPack2: Population scale evaluation of long-read sequencing data

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad311/7160911

The cramino, chopper, kyber, and phasius tools are written in Rust and available as executable binaries without requiring installation or managing dependencies. Binaries built on musl are available for broad compatibility.

Phasius is developed to visualize the results of read phasing, which shows in a dynamic genome browser style the length and interruptions between contiguously phased blocks from a large number of individuals together with genome annotation, for example, segmental duplications.

□ copMEM2: Robust and scalable maximum exact match finding

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad313/7160910

copMEM2, a multi-threaded MEM finding tool that targets execution speed and reduced memory use, and incorporates an improvement that speeds up processing by orders of magnitude when the pair of genomes is highly similar.

copMEM2 can compute all MEMs of minimum length 50 between the human and mouse genomes in 59s, using 10.40 GB of RAM and 12 threads, being at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the results are 324s and 16.57 GB.

□ Integration of a multi-omics stem cell differentiation dataset using a dynamical model

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010744

A hierarchical dynamical model that allowed us to integrate all data sets. This model was able to explain mRNA-protein discordance for most genes and identified instances of potential microRNA-mediated regulation.

Overexpression or depletion of microRNAs identified by the model, followed by RNA sequencing and protein quantification, were used to follow up on the predictions of the model.

□ Improving variant calling using population data and deep learning

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05294-0

Population-aware DeepVariant models with a new channel encoding allele frequencies. These models reduce variant calling errors, improving both precision and recall in single samples, and reduce rare homozygous and pathogenic ClinVar calls cohort-wide.

The relative advantage of the population-aware models increases at lower coverage, suggesting that population information is most valuable in difficult examples, where read-level information alone may not be sufficient for confident calling.

□ DeSide: A unified deep learning approach for cellular decomposition of bulk tumors based on limited scRNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540466v1

The DeSide architecture considers only non-cancerous cells during the training process, indirectly calculating the proportion of cancerous cells.

DeSide avoids directly handling the often more variable heterogeneity of cancerous cells, and instead leverages scRNA-seq data from three different cancer types to empower the DNN model with a robust generalization capability across diverse cancers.

□ A Superior Thumb Drive: Optimizing DNA Stability for DNA Data Storage

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540302v1

While methods to achieve DNA stability for hundreds of years or even millennia are possible, they call for completely enclosing DNA inside a silica matrix.

For instance, for an Archival Storage system whose DNA is enclosed in silica, the probability of strand loss or breakage is much lower, thereby enabling the use of longer DNA strands and higher information densities.

Conversely, for Working or Short-Term Storage systems, shorter strand lengths and lower information density requirements would be more appropriate due to the higher likelihood of strand loss.

Tranquility.

2023-05-15 05:13:05 | Science News

(Art by ekaitza)

□ scSpace: Reconstruction of the cell pseudo-space from single-cell RNA sequencing data

>> https://www.nature.com/articles/s41467-023-38121-4

scSpace (single-cell spatial position associated co-embeddings), an integrative method that uses ST data as a spatial reference to reconstruct the pseudo-space. A space-informed clustering is conducted to identify spatially variable cell subpopulations within the scRNA-seq data.

scSpace uses transfer component analysis (TCA), which eliminates the batch effect between single-cell and ST data and extracts the shared latent features. TCA projects the scRNA-seq and spatial transcriptomics data into a Reproducing Kernel Hilbert Space.

□ DEGAP: Dynamic Elongation of a Genome Assembly Path

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538224v1

DEGAP (Dynamic Elongation of a Genome Assembly Path), a novel gap-filling software that can resolve gap regions in genomes. DEGAP optimizes HiFi reads by identifying the differences between reads and provides ‘GapFiller’ or ‘CtgLinker’ modes to eliminate or shorten gaps in genomes.

DEGAP elongates all contigs with the supplied HiFi data and assesses potentially neighboring contigs. DEGAP adopts a cyclic elongation strategy that automatically and dynamically adjusts parameters according to the complexity of the sequences and selects the optimal extension path.

□ scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538975v1

scDisInFact (single cell disentangled Integration preserving condition-specific Factors) learns latent factors that disentangle condition effects from batch effects, enabling it to simultaneously perform: batch effect removal, CKG detection, and perturbation prediction.

The disentangled latent space allows scDisInFact to perform the CKG detection and perturbation prediction, and to overcome the limitation of existing methods for each task. scDisInFact can remove batch effect while keeping the condition effect in gene expression data.

□ scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics

>> https://www.nature.com/articles/s41587-023-01772-1

The scDesign3 model can flexibly incorporate cell covariates (such as cell type, pseudotime, and spatial coordinates) via generalized additive models, so that it fits various single-cell and spatial omics data well, a property confirmed by scDesign3's realistic simulation.

scDesign3 has a model alteration functionality enabled by its transparent probabilistic modeling: given the scDesign3 model parameters estimated on real data, users can alter the model parameters to reflect a hypothesis and generate the corresponding synthetic data that bear real data characteristics.

□ CellTypist v2.0: Automatic cell type harmonization and integration across Human Cell Atlas datasets

>> https://www.biorxiv.org/content/10.1101/2023.05.01.538994v1

CellTypist v2.0 accurately quantifies cell-cell transcriptomic similarities and enables robust and efficient cross-dataset meta-analyses. Cell types are placed into a relationship graph that hierarchically defines shared and novel cell subtypes.

CellTypist uses PCT, a multi-target regression tree algorithm. CellTypist defines semantic relationships among cell types / captures their underlying hierarchies, which are further leveraged to guide the downstream data integration at different levels of annotation granularities.

□ GATE: Moving Fast With Broken Data

>> https://arxiv.org/pdf/2303.06094.pdf

GATE, the Partition Summarization (PS) approach to data validation. The method creates a vector of statistics for each time step and performs a k-nearest neighbor algorithm against historical vectors to label the current time step's vector as anomalous or acceptable.

GATE significantly outperforms other methods in terms of mitigating false positives when ML pipelines have many correlated features because of GATE's clustering component, which only triggers an alert when an entire group of correlated features is anomalous.
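
The partition-summary plus nearest-neighbor step can be sketched as below. This is a simplified illustration, not the GATE implementation: the choice of statistics (mean and standard deviation per column), the distance-based score, the threshold, and all names are assumptions, and the clustering of correlated features is left out.

```python
import math
import statistics

def summarize(partition):
    """Summary vector for one time step: mean and std of each column."""
    vector = []
    for column in zip(*partition):
        vector.append(statistics.fmean(column))
        vector.append(statistics.pstdev(column))
    return vector

def knn_score(vector, history, k=3):
    """Mean Euclidean distance from `vector` to its k nearest historical vectors."""
    distances = sorted(math.dist(vector, past) for past in history)
    return statistics.fmean(distances[:k])

def is_anomalous(partition, history, k=3, threshold=1.0):
    """Label a partition anomalous when it is far from its nearest past summaries."""
    return knn_score(summarize(partition), history, k) > threshold

# Hypothetical history: four past partitions with two numeric columns each.
past_partitions = [
    [[1.0 + d, 10.0 + d], [2.0 + d, 11.0 + d], [3.0 + d, 12.0 + d]]
    for d in (0.0, 0.1, 0.2, 0.3)
]
history = [summarize(p) for p in past_partitions]

normal = [[1.05, 10.05], [2.05, 11.05], [3.05, 12.05]]
broken = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # second column silently zeroed
```

In this sketch `is_anomalous(broken, history)` flags the zeroed column while `is_anomalous(normal, history)` does not; in practice the threshold would be tuned on historical data rather than fixed.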

□ ATOMRefine: Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph transformer

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad298/7152976

ATOMRefine is a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses an SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph.

ATOMRefine enables the network to leverage sequence-based and spatial information from the entire protein structures to update node and edge features and catch the global and local structural variation from the initial model to the native structure iteratively.

□ Restrander: rapid orientation and QC of long-read cDNA data

>> https://www.biorxiv.org/content/10.1101/2023.05.02.539165v1

Restrander was faster than Oxford Nanopore Technologies’ existing tool Pychopper, and correctly restranded more reads due to its strategy of searching for polyA/T tails in addition to primer sequences from the reverse transcription and template-switch steps.

Each read from the reverse strand is replaced with its reverse complement, ensuring all reads in the output have the same orientation as the original transcripts. Restrander classifies artefactual reads for QC and ensures that only high-quality reads are taken for downstream processing.
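
A toy version of the restranding idea, for intuition only. It assumes that a trailing poly-A run marks a forward read and a leading poly-T run marks a reverse-strand read; the function names and the `min_tail` parameter are hypothetical, and the primer search that Restrander also performs is omitted.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def restrand(read, min_tail=8):
    """Return (oriented_read, label) using only poly-A/T evidence."""
    if read.endswith("A" * min_tail):
        return read, "forward"
    if read.startswith("T" * min_tail):
        # Reverse-strand read: flip it so it matches the transcript orientation.
        return reverse_complement(read), "reverse"
    return read, "unknown"

flipped, flipped_label = restrand("TTTTTTTTGCAT")
kept, kept_label = restrand("ATGCAAAAAAAA")
print(flipped, flipped_label)
print(kept, kept_label)
```

Reads labelled "unknown" here are the kind a real tool would route to its QC categories instead of passing downstream.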

□ ROptimus: a parallel general-purpose adaptive optimisation engine

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad292/7152277

ROptimus is a general-purpose optimisation engine in R that can be plugged into any modelling initiative, simple or complex, through a few lucid interfacing functions, to perform seamless optimisation with rigorous parameter sampling.

ROptimus features simulated annealing and replica exchange implementations equipped with adaptive thermoregulation to drive the Monte Carlo optimisation process in a flexible manner, through a constrained acceptance frequency but unconstrained adaptive pseudo-temperature regimens.
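
To make the idea of a constrained acceptance frequency with a free pseudo-temperature concrete, here is a small Python sketch. It is not ROptimus code (ROptimus is an R package): the linear target-acceptance schedule, the multiplicative temperature adjustment, and every name below are assumptions chosen for illustration.

```python
import math
import random

def adaptive_anneal(energy, propose, x0, steps=5000, window=50,
                    accept_start=0.9, accept_end=0.05, seed=0):
    """Simulated annealing where the temperature adapts to hit a target acceptance rate."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    temperature = 1.0
    accepted = 0
    for step in range(1, steps + 1):
        candidate = propose(x, rng)
        candidate_e = energy(candidate)
        delta = candidate_e - e
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, e = candidate, candidate_e
            accepted += 1
            if e < best_e:
                best_x, best_e = x, e
        if step % window == 0:
            # The acceptance target is constrained (here: linearly decreasing);
            # the temperature is free to move up or down to follow it.
            target = accept_start + (accept_end - accept_start) * step / steps
            observed = accepted / window
            temperature *= 1.5 if observed < target else 1 / 1.5
            temperature = max(temperature, 1e-9)
            accepted = 0
    return best_x, best_e

best_x, best_e = adaptive_anneal(
    energy=lambda x: (x - 3.0) ** 2,
    propose=lambda x, rng: x + rng.uniform(-0.5, 0.5),
    x0=0.0,
)
print(best_x, best_e)
```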

□ Unifilar Machines and the Adjoint Structure of Bayesian Models

>> https://arxiv.org/abs/2305.02826

There is an adjunction between ‘dynamical’ and ‘epistemic’ models of a hidden Markov process. Concepts such as Bayesian filtering and conjugate priors arise as natural consequences of this adjunction.

Strongly representable Markov categories include BorelStoch (whose objects are standard Borel spaces and whose morphisms are Markov kernels) and the Kleisli category of the (real-valued) distribution monad, which is called Dist.

Unifilar machines are machines whose outputs are stochastic but whose state updates are deterministic. The state space of such a machine consists of probability distributions over the hidden states of the system, and its dynamics are given by Bayesian updating.
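
For a finite hidden Markov model, the deterministic state update is the familiar Bayesian filtering step, sketched below. The predict-then-condition ordering, the function name, and the toy matrices are assumptions for illustration; the paper works at the level of Markov categories, not this concrete code.

```python
def bayes_filter_step(belief, observation, transition, emission):
    """Update a belief over hidden states after seeing one observation.

    belief[i]         : current probability of hidden state i
    transition[i][j]  : probability of moving from state i to state j
    emission[j][o]    : probability of observing o in state j
    """
    n = len(belief)
    # Predict: push the belief through the transition kernel.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Condition: weight by the likelihood of the observation and renormalize.
    unnormalized = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]

transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
posterior = bayes_filter_step([0.5, 0.5], 1, transition, emission)
print(posterior)
```

Given the same belief and the same observation, the next belief is always the same; only the observations themselves are random.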

□ StarCoder: A State-of-the-Art LLM for Code

>> https://huggingface.co/blog/starcoder

15B LLM with 8k context
Trained on permissively-licensed code
Acts as tech assistant
80+ programming languages
Open source and data
Online demos
VSCode plugin
1 trillion tokens

□ A Bayesian Noisy Logic Model for Inference of Transcription Factor Activity from Single Cell and Bulk Transcriptomic Data

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539308v1

NLBayes: A noisy Boolean logic Bayesian model for TF activity inference from differential gene expression data and causal graphs. This approach provides a flexible framework to incorporate biologically motivated TF-gene regulation logic models.

NLBayes incorporates the prior information on causal regulatory interactions and makes posterior adjustments to further account for noise and determine the context-specific posterior network structure and active regulators through a Gibbs sampling procedure.

□ Dawnn: single-cell differential abundance with neural networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539427v1

Dawnn uses a deep neural network model that has been trained to estimate the relative abundance of cells from each sample or condition in a cell’s neighbourhood. Dawnn predicts the probability w/ which each cell was drawn from a given sample or condition using simulated datasets.

Dawnn controls the false discovery rate (FDR), the proportion of cells incorrectly classified as belonging to regions exhibiting DA, using the Benjamini-Yekutieli procedure, a variant of the Benjamini-Hochberg procedure that does not assume independence between hypotheses.
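
As a reminder of how the Benjamini-Yekutieli procedure differs from Benjamini-Hochberg, here is a small standalone sketch. It is not taken from Dawnn; the function name and the example p-values are made up. The only change from Benjamini-Hochberg is the harmonic-sum factor c(m) in the threshold.

```python
def benjamini_yekutieli(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    if m == 0:
        return []
    # Correction factor c(m) = 1 + 1/2 + ... + 1/m handles arbitrary dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / (m * c_m):
            max_rank = rank  # step-up: remember the largest rank that passes
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

decisions = benjamini_yekutieli([0.001, 0.8, 0.004, 0.03, 0.5])
print(decisions)
```

With five tests, c(m) is about 2.28, so each threshold is less than half of its Benjamini-Hochberg counterpart.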

□ Ribotin: rDNA consensus sequence builder

>> https://github.com/maickrau/ribotin

Ribotin takes HiFi or duplex reads as input, and optionally ultralong ONT reads. It extracts rDNA-specific reads based on k-mer matches to a reference rDNA sequence or based on a verkko assembly.

Ribotin builds a DBG out of them, extracts the most covered path as a consensus and bubbles as variants. Optionally assembles highly abundant rDNA morphs using the ultralong ONT reads.

□ Aggregating network inferences: towards useful networks

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539529v1

The authors suggest combining edge frequencies directly to reconstruct the network. This approach ensures that only robust and reproducible edges are included in the consensus network.

The first consensus step relies on selecting edges w/ high inclusion frequency in the networks reconstructed from resampled data. The 2nd aggregation step is the inference of a consensus network that considers each method's advantages and counterbalances each estimation's shortcomings.
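
The first step, keeping edges with high inclusion frequency across networks inferred from resampled data, can be sketched as below. The function name, the 0.6 cutoff, and the toy networks are hypothetical, and edges are treated as undirected for simplicity.

```python
from collections import Counter

def consensus_edges(networks, min_frequency=0.8):
    """Keep undirected edges present in at least `min_frequency` of the networks."""
    counts = Counter()
    for edges in networks:
        # A set per network so a repeated edge is only counted once per resample.
        counts.update({frozenset(edge) for edge in edges})
    n = len(networks)
    kept = [tuple(sorted(edge)) for edge, count in counts.items()
            if count / n >= min_frequency]
    return sorted(kept)

resampled_networks = [
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "D")],
    [("A", "B"), ("B", "C")],
]
stable = consensus_edges(resampled_networks, min_frequency=0.6)
print(stable)
```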

□ Foldseek: Fast and accurate protein structure search

>> https://www.nature.com/articles/s41587-023-01773-0

Foldseek discretizes the query structures into sequences over the 3Di alphabet and uses a pre-trained 3Di substitution matrix to search through the 3Di sequences of the target structures using the double-diagonal k-mer-based prefilter and gapless alignment prefilter modules.

Foldseek uses vectorized Smith–Waterman local alignment combining 3Di and amino acid substitution scores. Alternatively, a global alignment is computed with a 1.7-times accelerated TM-align.

□ ProteinGenerator: Joint Generation of Protein Sequence and Structure with RoseTTAFold Sequence Space Diffusion

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539766v1

Beginning from random amino acid sequences, ProteinGenerator generates sequence and structure pairs by iterative denoising, guided by any desired sequence and structural protein attributes.

ProteinGenerator readily generates sequence-structure pairs satisfying the input conditioning criteria, and experimental validation showed that the designs were monomeric by size exclusion chromatography and had the desired secondary structure content by circular dichroism.

□ Improving de novo protein binder design with deep learning

>> https://www.nature.com/articles/s41467-023-38328-5

The physically based Rosetta approach frames both the folding and binding problems in energetic terms; for the approach to succeed, the designed sequence must have as its lowest energy state in isolation the designed monomer structure.

ProteinMPNN is used in a novel deep learning-augmented de novo protein binder design protocol. The paper shows retrospectively and prospectively that this improved protocol has a nearly 10-fold higher success rate than the original energy-based method.

□ HMMerge: an ensemble method for multiple sequence alignment

>> https://academic.oup.com/bioinformaticsadvances/article/3/1/vbad052/7126611

HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble.

HMMerge builds a new ‘merged’ HMM from the ensemble, and then uses that merged HMM to align the query sequences. The authors show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments.

□ Correcting gradient-based interpretations of deep neural networks for genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02956-3

Even though DNNs can learn a function everywhere in Euclidean space, one-hot encoded DNA is a categorical variable that lives on a lower-dimensional simplex.

Random off-simplex function behavior can introduce a random gradient component orthogonal to the simplex, which manifests as spurious noise in the input gradients.

The proposed gradient correction, subtracting from each original gradient component the mean gradient across components at that position, is general for all data with categorical inputs, including DNA, RNA, and protein sequences.
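
In code, the correction amounts to centering the gradient at each sequence position. The sketch below uses plain nested lists and a made-up function name; it illustrates the arithmetic only and is not the authors' implementation.

```python
def correct_gradients(gradients):
    """Subtract the per-position mean across categories (e.g. A, C, G, T).

    gradients: list of positions, each a list with one gradient per category.
    """
    corrected = []
    for position in gradients:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected

raw = [[1.0, 2.0, 3.0, 6.0], [0.5, 0.5, 0.5, 0.5]]
fixed = correct_gradients(raw)
print(fixed)
```

After correction, the components at each position sum to zero, i.e. the part of the gradient pointing off the simplex has been removed.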

□ GKLOMLI: a link prediction model for inferring miRNA–lncRNA interactions by using Gaussian kernel-based method on network profile and linear optimization algorithm

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05309-w

GKLOMLI is a novel link prediction model based on a Gaussian kernel-based method and a linear optimization algorithm for inferring miRNA–lncRNA interactions. The Gaussian kernel-based method was employed to output two similarity matrices of miRNAs and lncRNAs.

Based on the integrated matrix, which combines the similarity matrices and the observed interaction network, a linear optimization-based link prediction model was trained for inferring miRNA–lncRNA interactions.

□ Estimating the mean in the space of ranked phylogenetic trees

>> https://www.biorxiv.org/content/10.1101/2023.05.08.539790v1

A simulation study validates the method and compares it to other tree summary approaches such as the Maximum Clade Credibility (MCC) method. The authors assess the suitability of a treespace for statistical analyses, e.g. its "smoothness" w/ respect to probability distributions over trees.

The RNNI space is a treespace of ranked phylogenetic trees, which are rooted binary trees where internal nodes are ordered according to times of the corresponding evolutionary events, assuming no co-occurrence.

The RNNI space is then defined as a graph where vertices are ranked trees and edges are representing either a rank or an NNI move that transforms one tree into another.

The CENTROID algorithm minimizes the sum of squared (SoS) distances b/n a summary tree and a given tree sample and stops when it finds a locally optimal tree, approximating a centroid tree. The algorithm proceeds iteratively by computing the SoS values for all neighbors.

□ Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05304-1

A Negative Binomial NMF with a patient specific dispersion parameter to capture the variation across patients and derive the corresponding update rules for parameter estimation.

A novel model selection procedure inspired by cross-validation to determine the number of signatures. It uses the Kullback–Leibler divergence which would favor the Poisson model. This means that a direct comparison b/n the cost values for Po-NMF / NBN-NMF is not feasible.

□ STAGEs: A web-based tool that integrates data visualization and pathway enrichment analysis for gene expression studies

>> https://www.nature.com/articles/s41598-023-34163-2

STAGEs (Static and Temporal Analysis of Gene Expression studies) is a web-based and high-throughput analysis pipeline with an intuitive user interface that allows systematic characterisation of static and temporal transcriptomic data.

STAGEs converts the ratio values to log2-transformed fold change values at the backend, and the correlation matrix is generated by performing pairwise correlations of the log2-transformed fold changes between the different experimental conditions.
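
A small sketch of that computation, under the assumption that the pairwise correlation is Pearson's (the summary does not say which coefficient STAGEs uses). The condition names and ratio values are invented for the example.

```python
import math

def log2_fold_changes(ratios):
    """Convert expression ratios to log2 fold changes."""
    return [math.log2(r) for r in ratios]

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(ratio_table):
    """Pairwise correlations between conditions, computed on log2 fold changes."""
    log_fc = {name: log2_fold_changes(values) for name, values in ratio_table.items()}
    names = list(log_fc)
    return {a: {b: pearson(log_fc[a], log_fc[b]) for b in names} for a in names}

# One ratio per gene, per comparison (hypothetical values).
ratios = {
    "drug_vs_ctrl": [2.0, 4.0, 0.5, 1.0],
    "ko_vs_wt": [4.0, 16.0, 0.25, 1.0],
    "stress_vs_ctrl": [0.5, 0.25, 2.0, 1.0],
}
matrix = correlation_matrix(ratios)
print(matrix["drug_vs_ctrl"]["ko_vs_wt"])
```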

□ Insights from a genome-wide truth set of tandem repeat variation

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539588v1

By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample.

This approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation.

The Synthetic Diploid (SynDip) Benchmark provides genotypes for 5,182,765 SNV, insertion and deletion variants, as well as a set of high-confidence regions spanning 2.71 gigabases where genotypes are highly accurate.

□ Butt-seq: a new method for facile profiling of transcription

>> https://genesdev.cshlp.org/content/early/2023/05/10/gad.350434.123.abstract

Butt-seq (bulk analysis of nascent transcript termini sequencing) can produce libraries from purified nascent RNA in 6 h and from as few as 10,000 cells, an improvement of at least 10-fold over existing techniques.

Butt-seq shows that inhibition of the superelongation complex (SEC) causes promoter-proximal pausing to move upstream in a fashion correlated with subnucleosomal fragments.

□ NGBO: Introducing -omics metadata to biobanking ontology

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539725v1

NGBO is based on available genomics standards (e.g., Minimum information about a microarray experiment (MIAME)), the College of American Pathologists (CAP) laboratory accreditation requirements, and the Open Biological and Biomedical Ontologies Foundry principles.

NGBO fills the need for semantically enabling the discovery and integration of omics datasets and realization of FAIR data representation, which will impact the efficiency of finding, integrating, and re-using biobanking data of interest.

□ Robust discovery of causal gene networks via measurement error estimation and correction

>> https://www.biorxiv.org/content/10.1101/2023.05.09.540002v1

A new framework for causal discovery that is robust against measurement noise by extending an established statistical approach CIT (Causal Inference Test).

RCD (Robust Causal Discovery) estimates measurement error from gene expression data and then incorporates it to get consistent parameter estimates that can be used with appropriately extended versions of the statistical tests of correlation or mediation done in the original CIT.

□ Simple Tidy GeneCoEx: A gene co-expression analysis workflow powered by tidyverse and graph-based clustering in R

>> https://acsess.onlinelibrary.wiley.com/doi/10.1002/tpg2.20323

Simple Tidy GeneCoEx detects co-expression modules enriched in specific cell types, which were used to discover candidate genes in a biosynthetic pathway for complex plant natural products.

Simple Tidy GeneCoEx detects modules that are, on average, equivalently tight or tighter than those detected by WGCNA. The differences in module tightness may be due to the module detection methods.

By default, WGCNA uses hierarchical clustering followed by tree cutting to detect modules. Simple Tidy GeneCoEx uses the Leiden algorithm to detect modules, which returns modules that are highly interconnected.

□ Fulgor: A fast and compact k-mer index for large-scale matching and color queries

>> https://www.biorxiv.org/content/10.1101/2023.05.09.539895v1

Fulgor is a colored compacted de Bruijn graph index for large-scale matching and color queries, powered by SSHash. Fulgor has a generic intersection algorithm that can work over any compressed color sets, provided that an iterator over each color supports two primitives - Next and NextGEQ(x).
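
The two primitives are enough to intersect any number of sorted color sets. Below is a rough Python sketch with an uncompressed stand-in iterator; the class, the method names `next` and `next_geq`, and the use of `bisect` are assumptions for illustration and do not reflect Fulgor's actual data structures.

```python
import bisect

END = float("inf")  # sentinel returned once an iterator is exhausted

class SortedColorSet:
    """Uncompressed stand-in for an iterator over a sorted set of color ids."""

    def __init__(self, colors):
        self._colors = sorted(set(colors))
        self._pos = 0

    def value(self):
        return self._colors[self._pos] if self._pos < len(self._colors) else END

    def next(self):
        """Advance to the following color."""
        self._pos += 1

    def next_geq(self, x):
        """Advance to the first color >= x (never moves backwards)."""
        self._pos = bisect.bisect_left(self._colors, x, lo=self._pos)

def intersect(color_sets):
    """Intersect sorted color sets using only value/next/next_geq."""
    if not color_sets:
        return []
    iterators = [SortedColorSet(colors) for colors in color_sets]
    result = []
    candidate = iterators[0].value()
    while candidate != END:
        agreed = True
        for it in iterators:
            it.next_geq(candidate)
            if it.value() != candidate:
                # This set skips the candidate: restart from its next color.
                candidate = it.value()
                agreed = False
                break
        if agreed:
            result.append(candidate)
            iterators[0].next()
            candidate = iterators[0].value()
    return result

shared = intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9, 11]])
print(shared)
```

Because the loop only ever asks for the next value at or above a candidate, the same routine would work over any compressed representation that can answer those two queries.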

Themisto is an index for alignment-free matching that substantially outperforms prior methods in the context of indexing and mapping against large collections of genomes. Compared to Bifrost, Themisto uses practically the same space, but is faster to build and query.

Compared to the fastest variant of Metagraph, Themisto offers similar query performance, but is much more space-efficient; on the other hand, Themisto is much faster to query than Metagraph-BRWT, the most space-efficient variant of Metagraph.

□ RaPID-Query for Fast Identity by Descent Search and Genealogical Analysis

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad312/7160137

A new method, random projection-based identical-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes.

By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites.

□ CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses

>> https://www.nature.com/articles/s41588-023-01392-0

CARMA, a Bayesian model for fine-mapping that includes flexible specification of the prior distribution of effect sizes, joint modeling of summary statistics and functional annotations and accounting for discrepancies b/n summary statistics and external linkage disequilibrium in meta-analyses.

CARMA has higher power and lower false discovery rate (FDR) when including functional annotations, and higher power, lower FDR and higher coverage for credible sets in meta-analyses.

□ DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540424v1

DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method which directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space.

DeCOIL can be used to generate a designed library for screening based on computational predictors (ZS scores or ML models) at many possible points along the route to engineering a protein. DeCOIL enables protein engineering using ftMLDE with comparable outcomes.

□ moscot: Mapping cells through time and space

>> https://www.biorxiv.org/content/10.1101/2023.05.11.540374v1

moscot supports multimodal data throughout the framework by exploiting joint cellular representations. moscot improves scalability by adapting and demonstrating the applicability of recent methodological innovations to atlas-scale datasets.

moscot unifies previous single-cell applications of OT in the temporal and spatial domain and introduces a novel spatiotemporal application. All of this is achieved with a robust and intuitive API that interacts with the broader scverse ecosystem.

Equanimity.

2023-05-15 05:10:05 | Science News

(Art by ekaitza)

Announcing the Haven-1 and Vast-1 missions to low-Earth orbit. Launched by @SpaceX, Haven-1 is scheduled to be the world’s first commercial space station and will be visited by a crew of four aboard a Dragon spacecraft during Vast-1 → https://t.co/ToxFSiyQJj
— VΛST (@vast) May 10, 2023

□ Mark

>> https://www.vastspace.com/roadmap

Very exciting timeline from Haven-1 in 2025 on F9 to 2030 Starship class space station/modules to 100m spinning station in the 2040’s.

Excellent plan and realistic timeline.

□ NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538352v1

NextPolish2 can fix base errors in “highly accurate” draft assemblies without introducing overcorrections, even in regions with highly repetitive elements. Through the built-in phasing module, it can not only correct the error bases, but also maintain the original haplotype consistency.

NextPolish2 follows the Kmer Score Chain (KSC) algorithm of its previous version to perform an initial rough correction, and detect low-quality positions (LQPs) where the chosen alleles account for ≤ 0.95 of the total during a traceback procedure.
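
The low-quality-position criterion can be pictured with a simplified sketch. Here each position is reduced to a dictionary of per-allele support counts, which is an assumption for illustration; NextPolish2 itself derives this during the traceback of its k-mer scoring, and the function and parameter names below are hypothetical.

```python
def low_quality_positions(allele_counts, chosen, max_fraction=0.95):
    """Return positions where the chosen allele has <= max_fraction of the support.

    allele_counts: one dict per position mapping allele -> supporting count
    chosen:        the allele selected at each position
    """
    flagged = []
    for pos, (counts, allele) in enumerate(zip(allele_counts, chosen)):
        total = sum(counts.values())
        if total == 0:
            continue  # no evidence at this position; nothing to evaluate
        if counts.get(allele, 0) / total <= max_fraction:
            flagged.append(pos)
    return flagged

counts = [{"A": 20}, {"C": 18, "T": 2}, {"G": 49, "A": 1}]
lqps = low_quality_positions(counts, chosen=["A", "C", "G"])
print(lqps)
```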

NextPolish2 repeats the above procedure until all conflict communities are resolved (the number of iterations can be adjusted according to user settings) and then uses the KSC algorithm to generate a draft consensus sequence.

□ CODEC: Single duplex DNA sequencing with CODEC detects mutations with high sensitivity

>> https://www.nature.com/articles/s41588-023-01376-0

CODEC (Concatenating Original Duplex for Error Correction), a hybrid method that combines the massively parallel nature of NGS and the resolution of single-molecule sequencing by reading both strands of each DNA duplex with single NGS read pairs.

The CODEC structure can be built by replacing a typical adapter duplex with the CODEC adapter quadruplex, containing all elements required for NGS.

CODEC physically concatenates the Watson strand with the reverse complement of the Crick strand into a single strand without forming a prohibitive hairpin or inverted repeat structure from two complementary sequences.

□ TRASH: Tandem Repeat Annotation and Structural Hierarchy

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad308/7159186

TRASH (Tandem Repeat Annotation and Structural Hierarchy) is a tool that identifies and maps tandem repeats in nucleotide sequence, without prior knowledge of repeat composition.

TRASH analyses a fasta assembly file, identifies regions occupied by repeats and then precisely maps them and their higher order structures.

TRASH searches for continuous, highly similar, tandemly arranged DNA repeats of a similar unit size. This excludes transposable elements and interspersed repeats from analysis and allows for precise definition of tandemly arranged repeats.

□ GraNA: Supervised biological network alignment with graph neural networks

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538184v1

GraNA, a deep learning framework for the supervised NA paradigm for the pairwise network alignment problem. GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence.

GraNA integrates sequence similarity edges as additional anchor links to guide the alignment and pre-computed network embeddings as node features to better encode the topological roles of network nodes.

□ Riboformer: A Deep Learning Framework for Predicting Context-Dependent Translation Dynamics

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538053v1

Riboformer uses a transformer architecture that detects long-range dependencies in the regulation of elongation. Riboformer models the context-dependent changes in ribosome dynamics at codon resolution.

The transformer block consists of self-attention layers that gather the impact of distant codons based on their sequence representations, in contrast to a convolutional neural network, which relies on convolution operators to detect local sequence motifs.

Riboformer can be combined with in silico mutagenesis analysis to identify sequence motifs that contribute to ribosome stalling. It also utilizes a reference input to prevent the learning of noninformative signals due to the experimental bias.

□ CellANOVA: Signal recovery in single cell batch integration

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539614v1

CellANOVA utilizes a “pool-of-controls”, applicable across diverse settings, to separate unwanted variation from biological variation. CellANOVA allows the recovery of subtle biological signals and corrects, to a large extent, the data distortion introduced by integration.

A control-pool is a set of samples whereby variation beyond what is preserved by the existing integration is not of interest to the study. The control-pool samples are utilized to estimate a latent linear space that captures cell- and gene-specific unwanted batch variations.

CellANOVA produces a batch-corrected GE matrix which can be used for gene- and pathway-level downstream analyses. By using the control pool in the estimation of the batch variation space, CellANOVA recovers any variation in the non-control samples that lies outside this space.

□ ProteiNN: a Transformer-based model for end-to-end single-sequence protein structure prediction

>> https://www.biorxiv.org/content/10.1101/2023.04.26.538026v1

ProteiNN predicts protein secondary and tertiary structures directly from integer-encoded amino acid sequences. The model was trained and evaluated using the SideChainNet dataset, which provides the basis for complete model training.

The input to the module is a sequence of feature vectors mapped to these component spaces via linear transformations. The multi-head mechanism enables the model to learn relationships between amino acids in parallel.

ProteiNN uses a gating mechanism that modulates the information flow between the input and output, allowing the model to emphasize specific relationships and discard irrelevant information selectively.

□ DeepUMQA3: a web server for model quality assessment of protein complexes

>> https://www.biorxiv.org/content/10.1101/2023.04.24.538194v1

Building on DeepUMQA and DeepUMQA2, new features were designed for complex structures, and the lDDT of each residue and the accuracy of interface residues were predicted using an improved deep neural network.

At the whole-complex level, the complex is regarded as a large monomer structure. DeepUMQA3 provides fast and accurate interface residue accuracy prediction and per-residue lDDT prediction services for protein complexes.

□ ecpc: an R-package for generic co-data models for high-dimensional prediction

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05289-x

ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data were handled by adaptive discretisation, potentially inefficiently modelling and losing information.

An extension to the method for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation.

□ MaxKAT: A maximum kernel-based association test to detect the pleiotropic genetic effects on multiple phenotypes

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad291/7146028

MaxKAT reduces computational intensity greatly while maintaining high accuracy. Extensive simulations demonstrate that MaxKAT can properly control type I error rates and obtain remarkably higher power than KAT under most of the considered scenarios.

A generalized extreme value distribution is employed to calculate the statistical significance of MaxKAT under the null hypothesis. In addition, the proposed test can accommodate high-dimensional data and yield high power against various alternative hypotheses.

□ SeqImprove: Machine Learning Assisted Creation of Machine Readable Sequence Information

>> https://www.biorxiv.org/content/10.1101/2023.04.25.538300v1

SeqImprove is designed to aid authors in creating machine readable sequence data with complete metadata. It consists of a user-interface that was built using modular code. It can be reused by others to work as the front-end for their curation software.

As input, SeqImprove takes in a sequence file in the Synthetic Biology Open Language (SBOL) format or a link to a sequence stored in SynBioHub. It makes the information machine readable by using existing ontologies to structure the metadata.

□ CNV-ClinViewer: Enhancing the clinical interpretation of large copy-number variants online

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad290/7146044

CNV-ClinViewer enables real-time interactive exploration of large CNV datasets in a user-friendly designed interface and facilitates semi-automated clinical CNV interpretation following the ACMG guidelines by integrating the ClassifCNV tool.

The CNV-ClinViewer allows analysis of single or multiple CNVs, regardless of the method used to identify them. The minimal required information for each CNV, including whole-chromosome trisomies and monosomies, is the chromosome, start, end and CNV type.

□ OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

>> https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkad313/7146343

OrthoVenn3 provides gene family contraction and expansion analysis to help researchers better understand the evolutionary history of gene families, as well as collinearity analysis to detect conserved and variable genomic structures.

OrthoVenn3 offers multiple outputs, including the UpSet table, occurrence table, phylogenetic tree, and collinearity graph, which provide users with various perspectives on their data.

□ ELVAR: Cell-attribute aware community detection improves differential abundance testing from single-cell RNA-Seq data

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538653v1

ELVAR uses cell attribute aware clustering when inferring differentially enriched communities within the single-cell manifold. ELVAR can detect disease relevant DA-shifts in other cell-types and biological conditions.

The improved sensitivity to detect DA-shifts, as displayed by ELVAR, was also seen when benchmarked against an analogous clustering-based DA-method that uses Louvain in place of EVA.

□ xQTLbiolinks: a comprehensive and scalable tool for integrative analysis of molecular QTLs

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538654v1

xQTLbiolinks is an end-to-end bioinformatic tool for efficiently mining and analyzing public and user-customized xQTL data for the discovery of disease susceptibility genes.

xQTLbiolinks allows users to conveniently retrieve xQTL data and metainformation for further analysis through gene names/IDs, tissue names, or genomic regions of interest.

□ Combining LIANA and Tensor-cell2cell to decipher cell-cell communication across multiple samples

>> https://www.biorxiv.org/content/10.1101/2023.04.28.538731v1

Integrating LIANA and Tensor-cell2cell, which combined can deploy multiple existing methods and resources, enables the robust and flexible identification of cell-cell communication programs across multiple samples.

In this protocol, the integration of the tools facilitates the choice of method to infer cell-cell communication and subsequently perform an unsupervised deconvolution to obtain and summarize biological insights.

□ Signed distance correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis

>> https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btad210/7151065

SiDCo is a GUI-platform for calculation of distance correlation in omics data, measuring linear and non-linear dependences between variables, as well as correlation between vectors of different lengths, e.g., different sample sizes.

Distance correlations can be selected as one-to-one / one-to-all correlations, showing relationships b/n each / all other features one at a time. SiDCo uses partial distance correlation, calculated using the Gaussian Graphical model approach adapted to distance covariance.

□ ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05305-0

ERStruct enables the inference of population structure using whole-genome sequencing data. By leveraging parallel computing and GPU acceleration, ERStruct achieves significant improvements in the speed of matrix operations for large-scale data.

In GOE.py, a Monte Carlo method is used to obtain the null distribution of the ERStruct test statistic; it starts by generating multiple replications of high-dimensional Gaussian Orthogonal Ensemble matrices.

□ PascalX: a python library for GWAS gene and pathway enrichment tests

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad296/7151067

PascalX allows for scoring genes and annotated gene sets for enrichment signals based on data from both single GWAS and pairs of GWAS. The gene scores take into account the correlation pattern between SNPs.

They are based on the cumulative distribution function of a linear combination of χ²-distributed random variables, which can be calculated either approximately or exactly to high precision.
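For intuition about that distribution, the sketch below estimates the CDF of a weighted sum of independent χ² variables (one degree of freedom each) by plain Monte Carlo. This is a swapped-in technique for illustration, not what PascalX does internally, and the weights and function name are hypothetical placeholders.

```python
import random

def weighted_chi2_cdf_mc(weights, x, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(sum_i w_i * Z_i^2 <= x) with Z_i i.i.d. standard normal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        total = sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        if total <= x:
            hits += 1
    return hits / n_draws

# With a single unit weight this reduces to the chi-squared CDF with 1 degree of freedom.
print(weighted_chi2_cdf_mc([1.0], 3.841))
# Unequal (made-up) weights give a distribution with no simple closed form.
print(weighted_chi2_cdf_mc([2.5, 1.0, 0.3], 6.0))
```

Monte Carlo resolves only down to about 1/n_draws, which is far too coarse for the very small tail probabilities typical of GWAS; that is presumably why high-precision numerical evaluation matters in this setting.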

□ CZ CELLxGENE Discover Census

>> https://chanzuckerberg.github.io/cellxgene-census/

The Census provides efficient computational tooling to access, query, and analyze all single-cell RNA data from CZ CELLxGENE Discover.

Using a new access paradigm of cell-based slicing and querying, you can interact with the data through TileDB-SOMA, or get slices in AnnData or Seurat objects, thus accelerating your research by significantly minimizing data harmonization.

□ kimma: flexible linear mixed effects modeling with kinship covariance for RNA-seq data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad279/7152273

kimma supports DEG analyses including covariance random effects. kimma is an open-source R package that provides flexible linear mixed effects modeling for bulk RNA-seq data, including univariate, multivariate, random, and covariance random effects as well as gene-level weights.

kimma utilizes a single function, kmFit, for modeling, ensuring consistent syntax, inputs, and outputs. Moreover, kimma provides post-hoc pairwise tests, model fit metrics like AIC, and fit warnings on a per gene basis.

□ CAGECAT: The CompArative GEne Cluster Analysis Toolbox for rapid search and visualisation of homologous gene clusters

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05311-2

CAGECAT has been designed to provide rapid interoperability between these functions, where homologous clusters of interest can be selected to be used in subsequent analysis.

CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query. The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model searches.

□ cellsnake: a user-friendly tool for single cell RNA sequencing analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539204v1

cellsnake allows parallelization and readily utilizes high performance computing (HPC) platforms. It also provides metagenome analysis capabilities if unmapped reads are available.

cellsnake can utilize different scRNA-seq algorithms to simplify tasks such as automatic mitochondrial gene trimming, selection of optimal clustering resolution, doublet filtering, visualization of marker genes, enrichment analysis and pathway analysis.

□ Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall

>> https://www.biorxiv.org/content/10.1101/2023.05.04.539448v1

Read-based methodologies are defined as those that require aligning individual sequencing reads to a reference genome and then applying specific read-based variant-calling algorithms to these alignments to identify variants.

Assembly-based methods first generate ab initio a whole-genome assembly from LRS reads without guidance from a particular reference genome, and then proceed analogously by aligning this assembly to a reference genome to call variants using assembly-based calling algorithms.
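Whichever route produces the calls, the two metrics in the title come down to comparing a call set against a truth set. A minimal sketch, assuming variants are represented as (chrom, pos, ref, alt) tuples and matched exactly; the function name and toy variants are made up, and real benchmarks typically normalize variant representation before comparing.

```python
def precision_recall(called, truth):
    """Precision and recall of a variant call set against a truth set (exact matching)."""
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)
    precision = true_pos / len(called) if called else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall

truth_set = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "GA")}
call_set = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "GA"), ("chr2", 900, "T", "C")}
print(precision_recall(call_set, truth_set))  # 2 of 3 calls are true; 2 of 3 true variants are found
```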

□ HiPhase: Jointly phasing small and structural variants from HiFi sequencing

>> https://www.biorxiv.org/content/10.1101/2023.05.03.539241v1

HiPhase jointly phases SNVs, indels, and structural variants called from PacBio HiFi sequencing on diploid organisms. HiPhase uses two novel approaches to solve the phasing problem: dual mode allele assignment and a phasing algorithm based on the A* search algorithm.

HiPhase offers additional benefits: no down-sampling, multi-allelic variation, logic to span coverage gaps with supplementary alignments, innate multi-threading, built-in statistics gathering, and assigning aligned reads to a haplotype (“haplotagging”) while phasing.

□ scMayoMap: an easy-to-use tool for cell type annotation in single-cell RNA-sequencing data analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.03.538463v1

scMayoMap takes the standard cluster marker gene list as input and returns the cell type prediction results in a plot and the mapped gene list. scMayoMap allows assignment of multiple cell types to the same cluster if their evidence is similar.

scMayoMap can predict PBMC cell types with small errors, suggesting that a marker-based approach is still promising if applied properly.
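The general idea of marker-based annotation, including the "multiple cell types when evidence is similar" behaviour, can be sketched as below. This is a toy overlap score, not scMayoMap's actual scoring scheme; the function name, the tolerance parameter, and the tiny reference marker table are all assumptions for illustration.

```python
def annotate_cluster(cluster_markers, reference, tolerance=0.1):
    """Score each cell type by the fraction of its reference markers found in the cluster.

    Returns (cell types within `tolerance` of the best score, all scores).
    """
    markers = set(cluster_markers)
    scores = {
        cell_type: len(markers & set(genes)) / len(set(genes))
        for cell_type, genes in reference.items()
        if genes
    }
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return [], scores  # no evidence for any cell type
    hits = sorted(ct for ct, s in scores.items() if s > 0.0 and best - s <= tolerance)
    return hits, scores

# Hypothetical, heavily simplified reference marker table.
reference = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
    "NK cell": ["NKG7", "GNLY", "CD2"],
}
print(annotate_cluster(["CD3D", "CD3E", "CD2", "IL7R"], reference)[0])
print(annotate_cluster(["CD3D", "CD3E", "NKG7", "GNLY"], reference)[0])
```

The second call is the ambiguous case: two cell types tie, so both are reported rather than forcing a single label.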

□ DeepGNN: Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05303-2

DeepGNN marks a paradigm shift toward semi-supervised learning: it exploits not only labeled sequences (e.g. human genome with ChIP-seq experiments) but also unlabeled sequences, which are available in much larger amounts.

In parallel, the model takes as a secondary input the graph matrix connecting homologous sequences between species. An improvement would be to infer the homology matrix from the sequence embedding itself during training.

□ Challenges and considerations for reproducibility of STARR-seq assays

>> https://genome.cshlp.org/content/early/2023/05/02/gr.277204.122.long

A strong advantage of STARR-seq is its ability to screen random fragments of DNA from any source for enhancer activity. To this effect, DNA can be sourced from commercially available DNA repositories or from specific populations carrying non-coding mutations or SNPs to be assayed.

Cloning strategies such as In-Fusion HD, Gibson assembly, and NEBuilder HiFi DNA Assembly allow for fast and one-step reactions that use complementary overhang sequences on the inserts and the vector.

The paper highlights the different challenges in performing STARR-seq, a particularly long and difficult assay with huge potential to identify detailed enhancer landscapes and validate enhancer function.

□ STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad302/7156836

STEMSIM (short-term evolutionary mutations simulator) can generate mutations, including SNVs and indels, with various frequency distributions within strains in raw metagenomic sequencing data under a specified nucleotide substitution model.

STEMSIM directly takes the output of CAMISIM as input data. Next, the raw sequencing reads are mapped to the original reference genomes to obtain the alignment files (sam/bam) by Bowtie2.

Then, the details of mutations are generated according to the specified parameters, such as the number of nucleotide substitutions, and the distribution and trajectory of allele frequency.

□ scDist: Robust identification of perturbed cell types in single-cell RNA-seq data

>> https://www.biorxiv.org/content/10.1101/2023.05.06.539326v1

scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. scDist can recover biologically relevant between-group differences while also controlling for sample-level variability.

scDist is based on a linear mixed-effects model of single-cell gene expression counts. scDist uses an approximation for the between-group differences, based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur.
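A crude way to picture "distance between condition means with sample-level variability accounted for" is to average cells per sample first, then average samples per condition, and measure the Euclidean distance in the embedding. The sketch below does exactly that and nothing more: it is a stand-in, not scDist's mixed-effects estimator, and the function names and input layout (condition, sample id, embedded coordinates) are assumptions.

```python
import math
from collections import defaultdict

def _mean(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def condition_distance(cells):
    """Distance between two condition means in an embedding, averaging within samples first.

    cells: iterable of (condition, sample_id, coords) for one cell type.
    """
    by_sample = defaultdict(list)
    for condition, sample_id, coords in cells:
        by_sample[(condition, sample_id)].append(coords)
    # One mean per sample, so samples with many cells do not dominate their condition.
    sample_means = defaultdict(list)
    for (condition, _), rows in by_sample.items():
        sample_means[condition].append(_mean(rows))
    if len(sample_means) != 2:
        raise ValueError("expected exactly two conditions")
    mean_a, mean_b = (_mean(means) for means in sample_means.values())
    return math.dist(mean_a, mean_b)

cells = [
    ("control", "s1", (0.0, 0.0)), ("control", "s1", (0.0, 2.0)),
    ("control", "s2", (0.0, -1.0)),
    ("treated", "s3", (3.0, 4.0)), ("treated", "s4", (3.0, 4.0)),
]
print(condition_distance(cells))
```

Running this per cell type and ranking the distances gives a rough sense of which cell types shift most between conditions, without any of the uncertainty quantification a proper model provides.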

□ crosshap: Local haplotype visualization for trait association analysis

>> https://www.biorxiv.org/content/10.1101/2023.05.07.539781v1

crosshap performs density-based clustering of variants based on their linkage profiles to capture haplotype structures in local genomic regions. Tightly linked variants are clustered into marker groups (MGs), and individuals are grouped into local haplotypes by shared allelic combinations.

Visualization tools are provided by crosshap for choosing optimal clustering parameters and producing intuitive crosshap figures that present information on the complex relationships between linked variants, haplotype combinations, and phenotypic/metadata traits of individuals.

□ SpatialData: an open and universal data framework for spatial omics

>> https://www.biorxiv.org/content/10.1101/2023.05.05.539647v1

SpatialData is a framework that establishes a unified and extensible multi-platform file format, lazy representation of larger-than-memory data, transformations, and alignment to common coordinate systems.

SpatialData facilitates spatial annotations and cross-modal aggregation and analysis, the utility of which is illustrated via multiple vignettes including integrative analysis on a multi-modal Xenium and Visium breast cancer study.

