lens, align.

Long is the time, but what is true comes to pass.

Farout.

2018-12-27 23:33:16 | Science News





□ Science condemns to obsolescence the very instruments that allow it to progress. This heritage, often under threat, is protected by enthusiasts and institutions.

>> https://www.lemonde.fr/sciences/article/2018/12/18/patrimoine-scientifique-ces-instruments-sauves-de-l-oubli_5399475_1650684.html

A small sample of forgotten machines whose form and function still intrigue us today.




□ SBOL-OWL: An ontological approach for formal and semantic representation of synthetic genetic circuits:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/499970.full.pdf

SBOL-OWL is an ontology providing a machine-understandable definition of SBOL. It acts as a semantic layer over genetic circuit designs, so computational tools can understand the meaning of design entities in addition to parsing structured SBOL data. Semantic reasoning has huge potential for verifying genetic circuit structures: constraints between any two DNA parts can be captured using SBOL-OWL and easily integrated with new sets of terms to validate genetic circuits based on the order of their DNA components.
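A minimal sketch of the kind of semantic check this enables, using Python's rdflib over a toy design graph; the namespace, class, and predicate names (ex:OrderingConstraint, ex:precedes, ex:subject) are illustrative placeholders, not the actual SBOL-OWL vocabulary:

    # Sketch: query a toy design graph for ordering constraints between DNA
    # parts. All terms under the ex: namespace are made-up placeholders.
    from rdflib import Graph

    TTL = """
    @prefix ex: <http://example.org/sbolowl#> .
    ex:promoter1 a ex:DNAComponent .
    ex:cds1      a ex:DNAComponent .
    ex:c1 a ex:OrderingConstraint ;
          ex:subject  ex:promoter1 ;
          ex:precedes ex:cds1 .
    """
    g = Graph().parse(data=TTL, format="turtle")

    query = """
    PREFIX ex: <http://example.org/sbolowl#>
    SELECT ?upstream ?downstream WHERE {
        ?c a ex:OrderingConstraint ;
           ex:subject  ?upstream ;
           ex:precedes ?downstream .
    }
    """
    for upstream, downstream in g.query(query):
        print(upstream, "must come before", downstream)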




□ SCeQTL: an R package for identifying eQTL from single-cell parallel sequencing data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/499863.full.pdf

They find that the non-zero part of scRNA-seq data fits the negative binomial distribution well, similar to bulk RNA-seq data, but that a gene can have a high probability of dropping out in single-cell data. SCeQTL therefore uses zero-inflated negative binomial regression for eQTL analysis on single-cell data. It can distinguish two types of gene-expression differences among genotype groups, and it can also be used to find gene-expression variation associated with other grouping factors such as cell lineages.
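SCeQTL itself is an R package; as a rough illustration of the underlying model, here is a zero-inflated negative binomial regression of one gene's counts on one SNP's genotype using Python's statsmodels, on made-up data:

    # Sketch: test one gene-SNP pair with a zero-inflated negative binomial
    # regression, mirroring the kind of model SCeQTL fits. Illustrative only.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

    rng = np.random.default_rng(0)
    n_cells = 500
    genotype = rng.integers(0, 3, size=n_cells)            # 0/1/2 allele dosage
    counts = rng.negative_binomial(2, 0.3, size=n_cells)   # toy UMI counts for one gene
    counts[rng.random(n_cells) < 0.4] = 0                  # extra dropout zeros

    exog = sm.add_constant(genotype.astype(float))         # intercept + genotype effect
    model = ZeroInflatedNegativeBinomialP(counts, exog, exog_infl=np.ones((n_cells, 1)))
    result = model.fit(disp=False, maxiter=200)

    # The genotype coefficient of the count component is the eQTL test of interest.
    print(result.summary())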




□ gencore: an efficient tool to generate consensus reads for error suppressing and duplicate removing of NGS data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/501502.full.pdf

This tool clusters the mapped sequencing reads and merges each cluster into one consensus read. If the data carry unique molecular identifiers (UMIs), gencore uses them to identify reads derived from the same original DNA fragment. This error-suppressing feature makes gencore well suited to detecting ultra-low-frequency mutations in deep sequencing data.
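A toy sketch of the core idea (cluster reads by mapping position plus UMI, then take a majority vote at each base); the real tool works on BAM input and additionally handles base qualities, mismatches, and read pairs:

    # Sketch: group reads by (mapping position, UMI) and build a majority-vote
    # consensus sequence per group. The input tuples are made-up toy data.
    from collections import Counter, defaultdict

    reads = [
        # (mapping_position, umi, sequence)
        (1000, "ACGT", "TTGCA"),
        (1000, "ACGT", "TTGCA"),
        (1000, "ACGT", "TTGCT"),   # one sequencing error at the last base
        (1000, "GGTA", "TTACA"),   # different molecule, same position
    ]

    clusters = defaultdict(list)
    for pos, umi, seq in reads:
        clusters[(pos, umi)].append(seq)

    def consensus(seqs):
        # Majority vote per column; ties resolved arbitrarily by Counter order.
        return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

    for (pos, umi), seqs in clusters.items():
        print(pos, umi, consensus(seqs), f"({len(seqs)} reads)")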






□ Dashing: Fast and Accurate Genomic Distances with HyperLogLog:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/20/501726.full.pdf

Dashing uses the HyperLogLog sketch together with cardinality estimation methods that specialize in set unions and intersections. Dashing sketches genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in under 6 minutes.

Dashing also uses Single Instruction Multiple Data (SIMD or “Vector”) instructions on modern general-purpose computer processors to exploit the finer-grained parallelism inherent in calculating the HyperLogLog estimate.
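A minimal, uncorrected HyperLogLog sketch in Python showing the essentials (per-register maxima, union by element-wise max, Jaccard via inclusion-exclusion); Dashing's actual estimators, bias corrections, and SIMD code paths are far more involved:

    # Sketch: toy HyperLogLog over k-mers, plus a Jaccard estimate between two
    # sequences via the union sketch. No small-range corrections; illustration only.
    import hashlib
    import random

    P = 10                       # 2^10 = 1024 registers
    M = 1 << P

    def h64(kmer):
        return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

    def sketch(seq, k=21):
        regs = [0] * M
        for i in range(len(seq) - k + 1):
            x = h64(seq[i:i + k])
            idx, rest = x >> (64 - P), x & ((1 << (64 - P)) - 1)
            rank = (64 - P) - rest.bit_length() + 1    # leading-zero run + 1
            regs[idx] = max(regs[idx], rank)
        return regs

    def estimate(regs):
        alpha = 0.7213 / (1 + 1.079 / M)               # standard HLL constant
        return alpha * M * M / sum(2.0 ** -r for r in regs)

    def jaccard(a, b):
        union = [max(x, y) for x, y in zip(a, b)]
        cu = estimate(union)
        return max(0.0, (estimate(a) + estimate(b) - cu) / cu)

    random.seed(0)
    seq_a = "".join(random.choice("ACGT") for _ in range(20000))
    seq_b = seq_a[:10000] + "".join(random.choice("ACGT") for _ in range(10000))
    print(round(jaccard(sketch(seq_a), sketch(seq_b)), 3))   # roughly 1/3 for half-shared k-mers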






□ Cloud-BS: A MapReduce-based bisulfite sequencing aligner on cloud:

>> https://www.worldscientific.com/doi/abs/10.1142/S0219720018400280

Cloud-BS is an efficient bisulfite sequencing aligner designed for parallel execution in a distributed environment. Using the Apache Hadoop framework, Cloud-BS splits sequencing reads into multiple blocks and transfers them to distributed nodes. By designing each alignment step as separate map and reduce tasks, with the internal key-value structure optimized for the MapReduce programming model, the algorithm significantly improves alignment performance without sacrificing mapping accuracy.
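An in-memory toy of the MapReduce decomposition described above (map emits key-value pairs per read block, a shuffle groups them by key, reduce merges); the real Cloud-BS runs these stages as Hadoop jobs and performs actual bisulfite alignment in the map step, which align_stub below only stands in for:

    # Sketch: simulate the map -> shuffle -> reduce flow over read blocks.
    from collections import defaultdict
    from itertools import chain

    read_blocks = [
        [("read1", "TTCGTT"), ("read2", "TTGGTT")],    # block sent to node 1
        [("read1", "TTCGTT"), ("read3", "TTTTTT")],    # block sent to node 2
    ]

    def align_stub(seq):
        # Placeholder: real bisulfite alignment would C->T convert and map to a genome.
        return ("chr1", hash(seq) % 1000, seq.count("C"))

    def mapper(block):
        # Emit (read_id, candidate alignment) key-value pairs.
        return [(rid, align_stub(seq)) for rid, seq in block]

    def reducer(read_id, candidates):
        # Merge candidates for one read; here simply keep the first one.
        return read_id, candidates[0]

    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(map(mapper, read_blocks)):
        shuffled[key].append(value)

    print([reducer(k, v) for k, v in shuffled.items()])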






□ BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data:

>> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6288881/

BiSpark is a highly parallelized bisulfite-treated read aligner that uses a distributed environment to significantly improve alignment performance and scalability. BiSpark is built on the Apache Spark distributed framework and shows highly efficient scalability. A highly optimized load-balancing algorithm implemented in BiSpark redistributes data almost evenly across the cluster nodes, achieving better scalability on large-scale clusters.
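A small PySpark sketch of the load-balancing idea (repartition reads evenly across executors before the expensive per-read work); bisulfite_align here is a placeholder, not BiSpark's implementation:

    # Sketch: rebalance records across partitions before per-read alignment work.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bispark-sketch").getOrCreate()
    sc = spark.sparkContext

    reads = sc.parallelize([f"READ{i}:ACGTTGCA" for i in range(100_000)])

    def bisulfite_align(record):
        # Placeholder for C->T conversion + alignment of one read.
        rid, seq = record.split(":")
        return rid, seq.replace("C", "T")

    aligned = (
        reads
        .repartition(sc.defaultParallelism * 4)    # spread records evenly across executors
        .map(bisulfite_align)
    )
    print(aligned.take(3))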






□ Coherent chaos in a recurrent neural network with structured connectivity:

>> https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006309

Applying a perturbative approach to solve the dynamic mean-field equations shows that in this regime coherent fluctuations are driven passively by the chaos of local residual fluctuations. In this regime the dynamics depend qualitatively on the particular realization of the connectivity matrix: a complex leading eigenvalue can yield coherent oscillatory chaos, while a real leading eigenvalue can yield chaos with broken symmetry. The level of coherence grows with increasing strength of structured connectivity until the dynamics are almost entirely constrained to a single spatial mode.
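A rough numpy sketch of this class of model (a random chaotic connectivity plus a rank-one structured term, integrated with Euler steps); the parameter values and the exact form of the structured connectivity are illustrative rather than taken from the paper:

    # Sketch: Euler-integrate a rate network x' = -x + J*tanh(x), where
    # J = random chaotic part + rank-one structured part. Toy parameters.
    import numpy as np

    rng = np.random.default_rng(1)
    N, g, s = 500, 1.5, 1.0             # network size, chaos gain, structure strength
    dt, steps = 0.05, 2000

    chi = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))   # random connectivity
    xi = rng.choice([-1.0, 1.0], size=N)                    # structured mode
    J = g * chi + (s / N) * np.outer(xi, xi)                # full connectivity

    x = rng.normal(size=N)
    coherence = []
    for _ in range(steps):
        x += dt * (-x + J @ np.tanh(x))
        coherence.append(np.dot(xi, np.tanh(x)) / N)        # projection onto the mode

    print("mean |coherent mode activity|:", np.abs(np.array(coherence)).mean())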






□ A Deep Learning Genome-Mining Strategy Improves Biosynthetic Gene Cluster Prediction:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/18/500694.full.pdf

DeepBGC is a novel application of deep learning and natural language processing (NLP) that employs a Bidirectional Long Short-Term Memory (BiLSTM) RNN and a word2vec-like skip-gram word embedding the authors call pfam2vec. It addresses the algorithmic limitations of HMM-based approaches by combining an RNN with vector representations of Pfam domains, which together, unlike HMMs, can intrinsically capture short- and long-range dependencies between adjacent and distant genomic entities.
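A schematic Keras version of such an architecture (an embedding over Pfam-domain tokens feeding a BiLSTM that emits a per-domain BGC score); vocabulary size, embedding dimension, and layer widths are placeholders, not the published DeepBGC model:

    # Sketch: BiLSTM over Pfam-domain tokens with a per-domain BGC probability.
    # Sizes below are illustrative choices, not the paper's hyperparameters.
    import tensorflow as tf
    from tensorflow.keras import layers

    n_pfam_domains = 17000        # placeholder vocabulary size
    embedding_dim = 100           # pfam2vec-like vector size

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None,), dtype="int32"),         # variable-length domain sequence
        layers.Embedding(n_pfam_domains, embedding_dim),      # stands in for pfam2vec vectors
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),  # BGC score per domain
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()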




□ CRIP: Predicting circRNA-RBP interaction sites using a codon-based encoding and hybrid deep neural networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/18/499012.full.pdf

To fully exploit the sequence information, they propose a stacked codon-based encoding scheme and a hybrid deep learning architecture in which a convolutional NN learns high-level abstract features and a recurrent NN learns long-range dependencies in the sequences. The CNN and BiLSTM hybrid components learn high-level abstract features and contextual information from the encoding vectors, respectively.
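A small sketch of one way to realize a codon-style encoding (each overlapping 3-mer one-hot encoded into a 64-dimensional vector); the paper's exact stacking scheme may differ in details such as frame handling:

    # Sketch: encode an RNA sequence as a matrix of one-hot vectors over
    # overlapping 3-mers (4^3 = 64 channels per position). Illustrative only.
    import numpy as np
    from itertools import product

    ALPHABET = "ACGU"
    CODON_INDEX = {"".join(c): i for i, c in enumerate(product(ALPHABET, repeat=3))}

    def codon_encode(seq):
        positions = len(seq) - 2
        mat = np.zeros((positions, len(CODON_INDEX)), dtype=np.float32)
        for i in range(positions):
            mat[i, CODON_INDEX[seq[i:i + 3]]] = 1.0
        return mat

    x = codon_encode("AUGGCUAAGCGAU")
    print(x.shape)    # (11, 64): one 64-dim one-hot vector per overlapping 3-mer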




□ FORGe: prioritizing variants for graph genomes:

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1595-x

FORGe works in cooperation with a variant-aware read aligner (graph aligner) such as HISAT2. FORGe uses a mathematical model to score each variant according to its expected positive and negative impacts on alignment accuracy and computational overhead, considering factors such as the variant’s frequency in a population, its proximity to other variants, and how its inclusion affects the repetitiveness of the graph genome.
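As a toy illustration of combining such factors into a per-variant score for ranking (the weights and penalty terms below are made up, not FORGe's actual model):

    # Sketch: a made-up variant prioritization score mixing population frequency,
    # local variant density, and a repetitiveness penalty. Not FORGe's model.
    def variant_score(frequency, nearby_variants, repeat_kmer_gain,
                      w_freq=1.0, w_density=0.3, w_repeat=0.5):
        # Higher frequency helps more reads align; dense clusters and added
        # repetitive k-mers blow up the graph and hurt alignment specificity.
        return w_freq * frequency - w_density * nearby_variants - w_repeat * repeat_kmer_gain

    variants = [
        ("rs_a", 0.45, 2, 0.1),
        ("rs_b", 0.02, 9, 0.8),
        ("rs_c", 0.30, 0, 0.0),
    ]
    ranked = sorted(variants, key=lambda v: variant_score(*v[1:]), reverse=True)
    print([name for name, *_ in ranked])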






□ HiDRA: High-resolution genome-wide functional dissection of transcriptional regulatory regions and nucleotides in human:

>> https://www.nature.com/articles/s41467-018-07746-1

HiDRA overcomes the construct-length and region-count limitations of synthesis-based technologies at substantially lower cost, and the ATAC-based selection of open chromatin regions concentrates the signal on likely regulatory regions and enables high-resolution inferences. The HiDRA selection approach yields highly overlapping fragments (~32,000 regions covered by 10+ unique fragments, ~12,500 by 20+ fragments), enabling the authors to pinpoint “driver” regulatory nucleotides that are critical for transcriptional enhancer activity.




□ Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics

>> https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1603-1

By sequencing these tags alongside the cellular transcriptome, we can assign each cell to its original sample, robustly identify cross-sample multiplets, and “super-load” commercial droplet-based systems for significant cost reduction.
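A toy version of the demultiplexing step (assign each cell to the hashtag with the highest count, call a doublet when two hashtags are both high); the published workflow normalizes the counts and fits per-hashtag background distributions rather than using a fixed threshold:

    # Sketch: naive hashtag-count demultiplexing with a fixed threshold.
    import numpy as np

    hto_counts = np.array([          # rows = cells, columns = hashtag oligos (samples)
        [350,   5,   8],             # clean sample-0 cell
        [  4, 410,   6],             # clean sample-1 cell
        [290, 305,   9],             # cross-sample doublet
        [  6,   7,   5],             # background / empty droplet
    ])
    THRESHOLD = 100

    for counts in hto_counts:
        hits = counts > THRESHOLD
        if hits.sum() == 0:
            label = "negative"
        elif hits.sum() == 1:
            label = f"sample_{int(np.argmax(counts))}"
        else:
            label = "doublet"
        print(counts, "->", label)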




□ Fast and accurate differential transcript usage by testing equivalence class counts

>> https://www.biorxiv.org/content/early/2018/12/19/501106

Equivalence class (EC) counts have similar sensitivity and false discovery rates to exon-level counts but can be generated in a fraction of the time through the use of pseudo-aligners. Estimated transcript abundances can also serve as an alternative starting measure for DTU testing and perform well in detecting differential transcript usage, and pseudo-alignment is significantly faster than methods that map to a genome. Count-based DTU testing procedures such as DEXSeq can be applied directly to alignments generated by fast lightweight aligners such as Salmon and Kallisto.
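As a toy stand-in for the statistical idea (test whether a gene's reads distribute differently across its equivalence classes between conditions); the paper feeds EC counts into DEXSeq rather than a plain chi-square test:

    # Sketch: chi-square test of EC-count proportions for one gene between two
    # conditions. A simplification of the DEXSeq-based testing used in the paper.
    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows = equivalence classes of one gene, columns = condition A / condition B
    # (toy counts summed over replicates).
    ec_counts = np.array([
        [500, 180],
        [120, 410],
        [ 60,  55],
    ])
    chi2, pvalue, dof, _ = chi2_contingency(ec_counts)
    print(f"chi2={chi2:.1f}, dof={dof}, p={pvalue:.2e}")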




□ De-Novo-Designed Translational Repressors for Multi-Input Cellular Logic:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/501783.full.pdf

Automated in silico optimization of thermodynamic parameters yields improved toehold repressors with up to 300-fold repression, while in-cell SHAPE-Seq measurements of 3WJ repressors confirm their designed switching mechanism in living cells. The modularity, wide dynamic range, and low crosstalk of the repressors enable their direct integration into ribocomputing devices that provide universal NAND and NOR logic capabilities and can perform multi-input RNA-based logic.




□ G-Dash: A Genome Dashboard Integrating Modeling and Informatics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/501874.full.pdf

G-Dash unites the Interactive Chromatin Modeling (ICM) tools with the Biodalliance genome browser and the JSMol molecular viewer to rapidly fold any DNA sequence into atomic or coarse-grained models of DNA, nucleosomes or chromatin. G-Dash demonstrates that such an inventory of Masks can be maintained and converted to 3D structures, from single base pairs to entire chromosomes, in real time. In this manner, genome dashboards enable users to both define and navigate chromatin folding energy landscapes.






□ OncodriveCLUSTL: a sequence-based clustering method to identify cancer drivers:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/500132.full.pdf

OncodriveCLUSTL is a new linear clustering algorithm to detect genomic regions and elements with significant clustering signals, based on a local background model derived from a cohort’s observed tri- or penta-nucleotide substitution frequencies. It is an unsupervised clustering algorithm that analyzes somatic mutations observed in genomic elements (GEs) across a cohort of samples.
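A small sketch of building the trinucleotide part of such a background model (tallying how often each reference trinucleotide context is mutated in the cohort); coordinates, sequences, and normalization are heavily simplified:

    # Sketch: tally trinucleotide substitution contexts from a cohort's mutations
    # as a local background model. Simplified; no strand collapsing, no pentamers.
    from collections import Counter

    reference = "TTACGGATCCGTACGGATTTACG"                    # toy reference sequence
    mutations = [(3, "T"), (10, "A"), (16, "T"), (3, "T")]   # (0-based position, alt base)

    context_counts = Counter()
    for pos, alt in mutations:
        ref_tri = reference[pos - 1:pos + 2]                 # mutated base with its neighbors
        context_counts[(ref_tri, alt)] += 1

    total = sum(context_counts.values())
    background = {ctx: n / total for ctx, n in context_counts.items()}
    print(background)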




□ Improved Representation of Sequence Bloom Trees:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/501452.full.pdf

Building on the Sequence Bloom Tree (SBT) framework, the authors construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce construction and query time as well as the size of the index. They prove theoretical bounds on the performance of HowDe-SBT and also demonstrate its performance advantages on real data by comparing it to previous SBT methods and to mantis, a representative of the second category of indexing methods.
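A toy sketch of how an SBT-style query descends the tree (each node holds a Bloom filter over k-mers; a subtree is pruned as soon as too few query k-mers can be present); HowDe-SBT's bit-vector partitioning and compression are not modeled here:

    # Sketch: query a toy Sequence Bloom Tree. Each node has a Bloom filter over
    # k-mers; leaves are datasets. Pruning happens when too few k-mers match.
    import hashlib

    class Bloom:
        def __init__(self, m=4096, k=3):
            self.m, self.k, self.bits = m, k, 0
        def _idx(self, item):
            for i in range(self.k):
                h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
                yield int.from_bytes(h, "big") % self.m
        def add(self, item):
            for j in self._idx(item):
                self.bits |= 1 << j
        def __contains__(self, item):
            return all(self.bits >> j & 1 for j in self._idx(item))

    class Node:
        def __init__(self, bloom, children=(), name=None):
            self.bloom, self.children, self.name = bloom, children, name

    def query(node, kmers, theta=0.8):
        hits = sum(k in node.bloom for k in kmers)
        if hits < theta * len(kmers):
            return []                                  # prune this subtree
        if not node.children:
            return [node.name]
        return [leaf for c in node.children for leaf in query(c, kmers, theta)]

    kmers_a, kmers_b = ["ACGTT", "GGTCA", "TTACG"], ["ACGTT", "CCCGG", "TGTGT"]
    leaf_a, leaf_b, root = Bloom(), Bloom(), Bloom()
    for km in kmers_a: leaf_a.add(km)
    for km in kmers_b: leaf_b.add(km)
    for km in kmers_a + kmers_b: root.add(km)
    tree = Node(root, children=(Node(leaf_a, name="dataset_A"), Node(leaf_b, name="dataset_B")))
    print(query(tree, ["ACGTT", "GGTCA", "TTACG"]))    # expect ['dataset_A']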




□ Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/19/501130.full.pdf

Nucleotide Archival Format (NAF) is a new file format for lossless, reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to the best DNA compressors while providing 30 to 80 times faster decompression.
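A toy illustration of the general approach (pack bases into a compact code, then run a general-purpose compressor over the packed stream); NAF itself is built around zstd compression, whereas this sketch uses simple 2-bit packing and Python's built-in zlib as stand-ins:

    # Sketch: 2-bit-pack an ACGT-only sequence and compress the packed bytes.
    import zlib

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack(seq):
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | CODE[base]
            out.append(byte)
        return bytes(out)

    seq = "ACGT" * 10000
    packed = pack(seq)
    compressed = zlib.compress(packed, 9)
    print(len(seq), len(packed), len(compressed))    # raw vs packed vs compressed bytes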




□ Expression reflects population structure

>> https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007841

Lior Pachter:
Expression reflects population structure: while PCA does not reveal population structure in RNAseq (e.g. @tuuliel et al.'s GEUVADIS), it is revealed via another projection. Interesting implications for eQTL discovery.

The method is able to determine the significance of the variance in the canonical correlation projection explained by each gene. They identify 3,571 significant genes, only 837 of which had been previously reported to have an associated eQTL in the GEUVADIS results.
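A small sketch of the kind of cross-decomposition involved, using scikit-learn's CCA between a genotype matrix and an expression matrix on simulated data (the paper's method additionally assesses the per-gene significance of the canonical projection):

    # Sketch: canonical correlation between genotypes and expression reveals a
    # shared (population-structure-like) axis that expression PCA can miss.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    n, n_snps, n_genes = 300, 100, 150
    ancestry = rng.normal(size=n)                           # hidden structure axis
    G = rng.binomial(2, 0.3, size=(n, n_snps)) + 0.2 * ancestry[:, None]
    E = rng.normal(size=(n, n_genes)) + 0.2 * ancestry[:, None]

    cca = CCA(n_components=2)
    Xc, Yc = cca.fit_transform(G, E)
    corr = np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1]
    print(f"first canonical correlation: {corr:.2f}")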




□ RAISS: Robust and Accurate imputation from Summary Statistics:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/502880.full.pdf

RAISS is a Python package enabling the imputation of SNP summary statistics from neighboring SNPs by taking advantage of linkage disequilibrium. Neighboring SNPs are highly correlated variables, so inverting their correlation matrix is prone to numerical instabilities; RAISS therefore uses the Moore-Penrose pseudoinverse. To ensure numerical stability, eigenvalues below a given threshold are set to zero when computing the pseudoinverse.
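The core computation is essentially a conditional expectation under a multivariate normal; a numpy sketch with made-up LD values (RAISS adds careful LD estimation, windowing, and quality filters on top):

    # Sketch: impute an untyped SNP's z-score from typed neighbors,
    # z_untyped ~= Sigma_ut * pinv(Sigma_tt) * z_typed, dropping small
    # eigenvalues in the pseudoinverse for numerical stability.
    import numpy as np

    z_typed = np.array([2.1, 1.8, 2.4])          # z-scores at typed neighboring SNPs
    sigma_tt = np.array([                        # LD (correlation) among typed SNPs
        [1.00, 0.85, 0.70],
        [0.85, 1.00, 0.80],
        [0.70, 0.80, 1.00],
    ])
    sigma_ut = np.array([0.75, 0.90, 0.78])      # LD between untyped SNP and typed ones

    sigma_tt_pinv = np.linalg.pinv(sigma_tt, rcond=1e-2)   # small eigenvalues set to zero
    z_imputed = sigma_ut @ sigma_tt_pinv @ z_typed
    r2 = sigma_ut @ sigma_tt_pinv @ sigma_ut               # imputation quality proxy
    print(f"imputed z = {z_imputed:.2f}, expected r2 = {r2:.2f}")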




□ doepipeline: a systematic approach for optimizing multi-level and multi-step data processing workflows:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/504050.full.pdf

A DoE-based (design of experiments) strategy for systematically optimizing multi-level and multi-step data processing workflows, exemplified by applying doepipeline to de novo assembly, scaffolding of contiguous sequences, and k-mer classification of long noisy reads generated by MinION. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, and are subsequently refined in a following phase using response surface designs and OLS modeling.
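A tiny numpy illustration of the response-surface step (fit a quadratic OLS model to scored parameter settings and read off the predicted optimum); the package itself generates the multi-factor designs and iterates automatically:

    # Sketch: fit a 1-parameter quadratic response surface by OLS and pick the
    # predicted optimum. doepipeline does this over multi-factor designs.
    import numpy as np

    k_values = np.array([11.0, 15.0, 19.0, 23.0, 27.0])   # e.g. a k-mer size parameter
    scores   = np.array([0.62, 0.74, 0.81, 0.78, 0.65])   # pipeline quality metric

    X = np.column_stack([np.ones_like(k_values), k_values, k_values ** 2])
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)     # OLS fit of score ~ k + k^2
    k_opt = -beta[1] / (2 * beta[2])                      # vertex of the fitted parabola
    print(f"predicted optimal k ~ {k_opt:.1f}")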




□ qgg: an R package for large-scale quantitative genetic analyses:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/503631.full.pdf

qgg handles large-scale data by taking advantage of:

multi-core processing using OpenMP
multithreaded matrix operations implemented in BLAS libraries (OpenBLAS, ATLAS or MKL)
fast and memory-efficient batch processing of genotype data stored in binary files (PLINK bedfiles)






□ NCUA: A novel structure-based control method for analyzing nonlinear dynamics in biological networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/503565.full.pdf

NCUA is a novel and general graph-theoretic algorithm that works from the perspective of the feedback vertex set to discover possible minimum sets of input nodes for controlling the network state. NCUA is based on the assumption that the edges of an undirected network can be modeled as bidirectional edges. It determines a minimum dominating set (MDS) of top-side nodes covering the bottom-side nodes of a bipartite graph using integer linear programming (ILP), and uses random Markov chain sampling to obtain different input node sets.
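The covering step can be written as a small integer program; a sketch with PuLP on a toy bipartite graph (NCUA builds this graph from the network's control structure and adds the Markov chain sampling on top):

    # Sketch: minimum set of top-side nodes covering all bottom-side nodes,
    # formulated as an ILP over a made-up bipartite graph.
    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

    # bottom node -> set of top nodes that can cover it
    covers = {
        "b1": {"t1", "t2"},
        "b2": {"t2"},
        "b3": {"t2", "t3"},
        "b4": {"t3"},
    }
    tops = sorted(set().union(*covers.values()))

    x = {t: LpVariable(f"x_{t}", cat=LpBinary) for t in tops}
    prob = LpProblem("minimum_cover", LpMinimize)
    prob += lpSum(x.values())                              # minimize selected top nodes
    for b, ts in covers.items():
        prob += lpSum(x[t] for t in ts) >= 1               # every bottom node covered

    prob.solve()
    print(sorted(t for t in tops if x[t].value() > 0.5))   # e.g. ['t2', 't3']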




□ The Epistasis Boundary: Linear vs. Nonlinear Genotype-Phenotype Relationships:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/503466.full.pdf

Using separability theory, they determine conditions, corresponding to three biological criteria (Directional Consistency, Environmental Compensability, and Pathway Redundancy), that together make up an Epistatic Boundary between systems suitable and unsuitable for linear modeling, along with a classification of types of nonlinearity from a systems perspective.




□ Predicting complex genetic phenotypes using error propagation in weighted networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/21/487348.full.pdf

They investigate whether biological networks can be approximated as overlapping feed-forward networks in which the nodes have non-linear activation functions. Mathematical formalization of this model, followed by numerical simulations based on genomic data, allowed them to accurately predict the statistics of gene essentiality.
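A rough numpy sketch of propagating a node knockout through one weighted feed-forward layer with a nonlinear activation and scoring the downstream effect; the paper's formalization and essentiality statistics are of course more elaborate:

    # Sketch: propagate a node knockout through a weighted feed-forward layer
    # with a sigmoid activation and measure the downstream change. Toy model.
    import numpy as np

    rng = np.random.default_rng(2)
    n_in, n_out = 20, 10
    W = rng.normal(scale=0.5, size=(n_out, n_in))          # edge weights

    def layer(x):
        return 1.0 / (1.0 + np.exp(-W @ x))                # sigmoid activation

    baseline_input = np.ones(n_in)
    baseline_output = layer(baseline_input)

    effects = []
    for gene in range(n_in):
        knocked = baseline_input.copy()
        knocked[gene] = 0.0                                # "delete" one upstream node
        effects.append(np.abs(layer(knocked) - baseline_output).sum())

    print("most essential upstream node:", int(np.argmax(effects)))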




□ SeqCrispr: Identifying Context-specific Network Features for CRISPR-Cas9 Targeting Efficiency Using Accurate and Interpretable Deep Neural Network:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/24/505602.full.pdf

SeqCrispr includes a sequence feature-engineering layer. It uses unsupervised representation learning to obtain vector representations of 3-mers instead of a one-hot encoding. This hybrid model combines the advantages of RNNs and CNNs for feature engineering of the sgRNA and makes the model more resistant to data noise. A word2vec embedding with Hilbert-curve filling may have an advantage over vertical stacking.
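A small gensim sketch of learning such 3-mer vectors from guide-like sequences with skip-gram word2vec (the corpus, window, and vector size are toy choices, not the paper's):

    # Sketch: learn skip-gram vectors for overlapping 3-mers of guide sequences,
    # as an alternative to one-hot encoding. Corpus and parameters are toy.
    import random
    from gensim.models import Word2Vec

    random.seed(0)
    guides = ["".join(random.choice("ACGT") for _ in range(20)) for _ in range(2000)]
    sentences = [[g[i:i + 3] for i in range(len(g) - 2)] for g in guides]   # 3-mer "words"

    model = Word2Vec(sentences, vector_size=16, window=5, min_count=1, sg=1, epochs=5)
    print(model.wv["ACG"].shape)    # (16,): a dense vector replacing a one-hot code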