lens, align.

Long is the time, but the true comes to pass.

Asterism.

2018-12-13 23:23:23 | Science News





□ High-resolution genetic mapping of putative causal interactions between regions of open chromatin:

>> https://www.nature.com/articles/s41588-018-0278-6

The authors developed a Bayesian hierarchical approach that uses two-stage least squares and applied it to an ATAC-seq (assay for transposase-accessible chromatin using sequencing) data set from 100 individuals, identifying over 15,000 high-confidence causal interactions. Assigning the direction of effect between different peaks allowed them to identify smaller sets of plausible candidate variants by identifying “master regulatory” regions, and also revealed the genomic architecture of causal interactions between regulatory elements.
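
To make the two-stage least squares idea concrete, here is a minimal sketch on simulated data. It is plain 2SLS with a single instrument, not the paper's Bayesian hierarchical model; the variable names (genotype, peak_a, peak_b), the effect sizes, and the simulation itself are assumptions chosen for illustration.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x (with intercept)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def two_stage_least_squares(z, x, y):
    """Effect of x on y, using z as an instrument for x."""
    # Stage 1: predict the exposure x from the instrument z.
    design_z = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(design_z, x, rcond=None)
    x_hat = design_z @ coef
    # Stage 2: regress the outcome y on the predicted exposure.
    return ols_slope(x_hat, y)


# Hypothetical simulation: a variant affects peak A, peak A affects peak B,
# and an unobserved confounder affects both peaks.
rng = np.random.default_rng(0)
n = 20000
genotype = rng.integers(0, 3, size=n).astype(float)
confounder = rng.normal(size=n)
peak_a = 0.8 * genotype + confounder + rng.normal(size=n)
peak_b = 0.5 * peak_a + confounder + rng.normal(size=n)  # true effect: 0.5

effect_ols = ols_slope(peak_a, peak_b)                     # biased by the confounder
effect_2sls = two_stage_least_squares(genotype, peak_a, peak_b)
print(f"OLS: {effect_ols:.2f}  2SLS: {effect_2sls:.2f}")
```

In this setup the naive regression overestimates the effect because the confounder pushes both peaks in the same direction, while the instrumented estimate stays close to the simulated value.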






□ Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies:

>> https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2485-7

Purge Haplotigs was developed specifically for third-gen sequencing-based assemblies to automate the reassignment of allelic contigs, and to assist in the manual curation of genome assemblies. The pipeline uses a draft haplotype-fused assembly or a diploid assembly, Minimap2 read alignments, and repeat annotations to identify allelic variants in the primary assembly. Purge Haplotigs will run on either a haploid assembly (i.e. Canu, FALCON or FALCON-Unzip primary contigs) or on a phased-diploid assembly (i.e. FALCON-Unzip primary contigs + haplotigs).






□ Fast and accurate large multiple sequence alignments using root-to-leave regressive computation:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/07/490235.full.pdf

The authors developed and validated on protein sequences a regressive algorithm that works the other way around from progressive alignment, aligning the most dissimilar sequences first. This algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time. In the case of Clustal Omega (ClustalO) using mBed trees, the regressive combination was about twice as fast as the progressive alignment and appeared to have linear complexity.




□ isONclust: De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm:

>> https://www.biorxiv.org/content/biorxiv/early/2018/11/06/463463.full.pdf

isONclust is a tool for clustering either PacBio Iso-Seq reads or Oxford Nanopore reads, where each cluster represents all reads that came from a gene. The output is a TSV file with each read assigned to a cluster ID. isONclust was evaluated on 3 simulated and 5 biological datasets, across a breadth of organisms, technologies, and read depths. The results demonstrate that isONclust is a substantial improvement over previous approaches in terms of overall accuracy and/or scalability to large datasets.






□ Hidden patterns of codon usage bias across kingdoms:

>> https://www.biorxiv.org/content/early/2018/11/24/478016

The authors derive from first principles a mathematical model describing the statistics of codon usage bias and apply it to extensive genomic data. A new model-based measure of codon usage bias, which extends existing measures by taking into account both codon frequency and codon distribution, reveals distinct, amino-acid-specific patterns of selection in distinct branches of the tree of life.






□ Naught all zeros in sequence count data are the same:

>> https://www.biorxiv.org/content/biorxiv/early/2018/11/26/477794.full.pdf

A systematic description of the different processes that can give rise to zero values, as well as the types of methods for addressing zeros in sequence count studies. The results demonstrate that zero-inflated models can have substantial biases in both simulated and real data settings: they tend to inflate parameter estimates due to inherent identifiability issues. This parameter inflation can be so severe as to dominate the results of a DE analysis on a previously published single-cell RNA-seq study. Additionally, the authors find that zeros due to biological absences can, for many applications, be approximated as originating from undersampling.
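
A small sketch of why zeros alone cannot separate the two explanations. It uses the textbook Poisson and zero-inflated Poisson zero probabilities, which is an assumption on my part (the paper considers a wider range of models), and the parameter values are invented for illustration.

```python
import math


def poisson_zero_prob(lam):
    """P(count == 0) under a Poisson model with mean lam (pure undersampling)."""
    return math.exp(-lam)


def zip_zero_prob(pi, lam):
    """P(count == 0) under a zero-inflated Poisson: an extra point mass pi
    at zero, mixed with a Poisson of mean lam."""
    return pi + (1.0 - pi) * math.exp(-lam)


# Two different explanations of the same fraction of zeros:
p_plain = poisson_zero_prob(math.log(2))       # low abundance, no inflation
p_inflated = zip_zero_prob(0.25, math.log(3))  # higher abundance plus inflation
print(p_plain, p_inflated)
```

Both settings give a zero fraction of one half, so a model fitted mostly on zeros can trade a larger inflation term for a larger mean, which is one way to picture the parameter inflation described above.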




□ Devil in details: Beware the Jaccard: the choice of metric is important and non-trivial in genomic colocalisation analysis.

>> https://www.biorxiv.org/content/biorxiv/early/2018/11/27/479253.full.pdf




□ Efficient computation of spaced seed hashing with block indexing:

>> https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-018-2415-8

The FISH algorithm computes the hashing values of spaced seeds with a speedup over the traditional approach of between 1.9x and 6.03x, depending on the structure of the spaced seeds. It can also be exploited to improve the speedup relative to computing the Q-gram hashing of each spaced seed separately. Going from "contiguous k-mers" to "spaced k-mers" usually brings an overhead (e.g., 1.5-2.5x in Seed-Kraken).
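
For reference, this is roughly what the "traditional approach" looks like: every window is hashed from scratch, reading only the positions where the seed has a 1. The sketch below is that naive baseline, not FISH's block indexing; the 2-bit encoding and the function name are assumptions for illustration.

```python
# 2-bit encoding of nucleotides (an illustrative choice).
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def spaced_seed_hashes(sequence, seed):
    """Hash every window of `sequence` under the spaced seed `seed`.

    `seed` is a string of '1' (position is used) and '0' (position is
    ignored). Each window is hashed independently, which is the repeated
    work that faster methods try to avoid.
    """
    care_positions = [i for i, c in enumerate(seed) if c == "1"]
    hashes = []
    for start in range(len(sequence) - len(seed) + 1):
        h = 0
        for offset in care_positions:
            h = (h << 2) | ENCODING[sequence[start + offset]]
        hashes.append(h)
    return hashes


print(spaced_seed_hashes("ACGTAC", "1101"))  # [7, 24, 45]
```

With a contiguous k-mer the hash of the next window can be updated in constant time by shifting one base in and one out; the gaps in a spaced seed break that trick, which is where the overhead mentioned above comes from.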






□ RISC: robust integration of single-cell RNA-seq datasets with different extents of cell cluster overlap:

>> https://www.biorxiv.org/content/biorxiv/early/2018/11/29/483297.full.pdf

In RISC, instead of estimating the lambda, the PCR (principal component regression) model selects the PCs based on dimension reduction; this process regularizes the matrices and generates the unique singular vectors at the first step of scRNA-seq data analysis. Because of the natural compatibility of eigenvectors between the PCR model and dimension reduction, RISC can accurately integrate scRNA-seq datasets and avoid over-integration.






□ DeeReCT-PolyA: a robust and generic deep learning method for PAS identification.

>> https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty991/5221014

DeeReCT-PolyA is a robust, PAS-motif-agnostic, highly interpretable and transferable deep learning model for accurate polyadenylation signal (PAS) recognition, which requires no prior knowledge or human-designed features.




□ OCHROdb: a comprehensive, quality checked database of open chromatin regions from sequencing data:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/03/484840.full.pdf

Multiple large-scale consortia-based projects, including ENCODE, REMC, Blueprint and GGR, have generated thousands of sequencing data samples that capture DNase-I hypersensitive sites (DHS) across the whole genome in hundreds of cell types. OCHROdb comes from an analysis pipeline that takes hundreds of pre-processed DHS data sets as input, aligns regions of open chromatin across samples, checks the quality of each region using a replication-based test, and outputs a well-curated database of open chromatin accessibility across the whole genome.




□ Ozymandias: A biodiversity knowledge graph:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/04/485854.full.pdf

Ozymandias is a biodiversity knowledge graph in which entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. It is worth noting that the biodiversity informatics community has been aware of knowledge graphs and semantic web technologies for a decade or more, and several taxonomic databases have been serving data in RDF since the mid-2000s.




□ ClinGen Receives Recognition Through New @US_FDA Human Variant Database Program. ClinGen expert curated variants are available for unrestricted use in the community via @NCBI_Clinical ClinVar

>> http://bit.ly/2ScRzRi




□ Ultra-deep, long-read nanopore sequencing of mock microbial community standards

>> http://biorxiv.org/cgi/content/short/487033v1




□ CellTagging: Single-cell mapping of lineage and identity in direct reprogramming

>> https://www.nature.com/articles/s41586-018-0744-4

CellTagging is a combinatorial cell-indexing methodology that enables parallel capture of clonal history and cell identity, in which sequential rounds of cell labelling enable the construction of multi-level lineage trees. The results demonstrate the utility of this lineage-tracing method for revealing the dynamics of direct reprogramming.






□ Cell growth is an omniphenotype:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/05/487157.full.pdf

The authors provide evidence that cell growth is a generalizable phenotype because it is an aggregation of phenotypes. To the extent that it might be an aggregation of all possible phenotypes (an omniphenotype), this suggests its potential as a pan-disease model for biological discovery and drug development.




□ New methods to calculate concordance factors for phylogenomic datasets:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/05/487801.full.pdf

For each branch of a reference tree, the gene concordance factor (gCF) is defined as the percentage of “decisive” gene trees containing that branch. The authors describe a package that calculates it while accounting for variable taxon coverage among gene trees. The site concordance factor (sCF) is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites.
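
A minimal sketch of the gCF definition for a single reference branch. The data layout is hypothetical, and "decisive" is taken here to mean that the gene tree contains at least one taxon from each of the four clades around the branch; treat that as my simplifying assumption rather than the package's exact rule.

```python
def gene_concordance_factor(branch, gene_trees):
    """Percentage of decisive gene trees that contain a reference branch.

    branch: four taxon sets (A, B, C, D) around the reference branch AB|CD.
    gene_trees: list of (taxa, splits); `taxa` is the set of taxa in the gene
        tree and `splits` is a set of frozensets, each holding the taxa on
        one side of an internal branch of that gene tree.
    """
    clade_a, clade_b, _, _ = branch
    left = clade_a | clade_b
    decisive = 0
    concordant = 0
    for taxa, splits in gene_trees:
        # Skip trees that cannot say anything about this branch.
        if not all(clade & taxa for clade in branch):
            continue
        decisive += 1
        # Restrict the reference split to the taxa present in this gene tree.
        side = frozenset(left & taxa)
        other = frozenset(taxa - side)
        if side in splits or other in splits:
            concordant += 1
    return 100.0 * concordant / decisive if decisive else float("nan")


branch = ({"a"}, {"b"}, {"c"}, {"d", "e"})
gene_trees = [
    ({"a", "b", "c", "d", "e"}, {frozenset({"a", "b"}), frozenset({"d", "e"})}),
    ({"a", "b", "c", "d"}, {frozenset({"a", "c"})}),  # decisive, discordant
    ({"a", "b", "d", "e"}, {frozenset({"a", "b"})}),  # lacks clade C: not decisive
    ({"a", "b", "c", "e"}, {frozenset({"c", "e"})}),  # concordant despite a missing taxon
]
gcf = gene_concordance_factor(branch, gene_trees)
print(round(gcf, 1))  # 66.7
```

Restricting the split to the taxa each gene tree actually contains is what lets trees with partial taxon coverage still count for or against the branch.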




□ Accelerated Bayesian inference of gene expression models from snapshots of single-cell transcripts:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/07/489880.full.pdf

The time-dependent mRNA distributions of discrete-state models of gene expression are dynamic Poisson mixtures, whose mixing kernels are characterized by a piece-wise deterministic Markov process. The authors combined this analytical result with a kinetic Monte Carlo algorithm to create a hybrid numerical method that accelerates the calculation of time-dependent mRNA distributions by 1000-fold compared to current methods. They then integrated the hybrid algorithm into an existing Monte Carlo sampler to estimate the Bayesian posterior distribution of many different, competing models in a reasonable amount of time.




□ Galaxy-Kubernetes integration: scaling bioinformatics workflows in the cloud:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/07/488643.full.pdf

Kubernetes acts as an abstraction layer between Galaxy and the different cloud providers, allowing Galaxy to run on every cloud provider that supports Kubernetes (>10 cloud providers currently).




□ Sparse Dynamic Programming on DAGs with Small Width just accepted to ACM TALG. (ACM Transactions on Algorithms.)

>> https://link.springer.com/chapter/10.1007%2F978-3-319-89929-9_7

"Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-linear Chaining Extended." The paper presents an algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, where k is the number of paths in the cover (the width of the DAG), improving all known time bounds when k is small and the DAG is not too dense. It also gives a general technique for extending dynamic programming (DP) algorithms from sequences to DAGs. This is enabled by the minimum path cover algorithm, and works by mimicking the DP algorithm for sequences on each path of the minimum path cover.

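To get a feel for what a path cover is, here is a sketch of the classical textbook reduction from vertex-disjoint minimum path cover to bipartite matching. It is not the paper's O(k|E| log |V|) algorithm, and it covers only the vertex-disjoint variant (paths may not share nodes); the function name and the example graphs are mine.

```python
def min_vertex_disjoint_path_cover(num_nodes, edges):
    """Size of a minimum vertex-disjoint path cover of a DAG.

    Each edge u -> v becomes a bipartite edge (u_out, v_in). Every matched
    edge glues two paths together, so the cover size is
    num_nodes - (size of a maximum matching).
    """
    successors = [[] for _ in range(num_nodes)]
    for u, v in edges:
        successors[u].append(v)

    # predecessor[v] = node currently chosen to precede v on its path.
    predecessor = [-1] * num_nodes

    def try_augment(u, visited):
        # Kuhn's augmenting-path search starting from node u.
        for v in successors[u]:
            if v in visited:
                continue
            visited.add(v)
            if predecessor[v] == -1 or try_augment(predecessor[v], visited):
                predecessor[v] = u
                return True
        return False

    matching = sum(1 for u in range(num_nodes) if try_augment(u, set()))
    return num_nodes - matching


# A chain 0 -> 1 -> 2 with a side branch 0 -> 3 needs two paths.
print(min_vertex_disjoint_path_cover(4, [(0, 1), (1, 2), (0, 3)]))  # 2
```

The recursive search is fine for small graphs; a large DAG would need an iterative matching routine, and the paper's setting additionally allows paths to share nodes.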





□ Statistical Dynamics of Spatial-Order Formation by Communicating Cells:

>> https://www.cell.com/iscience/fulltext/S2589-0042(18)30022-1?sf203801563=1

A theoretical framework, based on cellular automata and mimicking approaches of statistical mechanics, for understanding how secrete-and-sense cells with bistable gene expression, from disordered beginnings, can become spatially ordered by communicating through rapidly diffusing molecules.

Classifying lattices of cells by two “macrostate” variables, a “spatial index” measuring the degree of order and the average gene-expression level, reveals that a group of cells behaves as a single particle, in an abstract space, that rolls down an adhesive “pseudo-energy landscape” whose shape is determined by cell-cell communication and an intracellular gene-regulatory circuit.

The gradient of the pseudo-energy and a “trapping probability,” which quantifies the adhesiveness of the pseudo-energy landscape, together determine the particle's trajectories in the phase space: the particle rolls down along the negative of the gradient of the pseudo-energy.




□ ReMIX: Genome-wide recombination map construction from single individuals using linked-read sequencing:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/08/489989.full.pdf

ReMIX makes use of linked-read sequencing technology developed by 10X Genomics to acquire long-range haplotype information from gametes of a single individual. Using the recombinant molecules, crossover locations are defined as genomic intervals based on the location of the last variant of the first haplotype and first variant of the second. The linked-read information is exploited by ReMIX during three steps: identifying high-quality heterozygous variants, reconstructing molecules, and haplotype phasing each molecule.
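
The interval definition can be sketched in a few lines. The input layout (a position-sorted list of phased variants per reconstructed molecule) and the rule of reporting only molecules with exactly one haplotype switch are assumptions for illustration, not ReMIX's actual data model or filters.

```python
def crossover_interval(phased_variants):
    """Return the crossover interval for one molecule, or None.

    phased_variants: list of (position, haplotype) tuples sorted by position,
        with haplotype 0 or 1. The interval runs from the last variant of the
        first haplotype to the first variant of the second haplotype.
    """
    switch_points = [
        i for i in range(1, len(phased_variants))
        if phased_variants[i][1] != phased_variants[i - 1][1]
    ]
    if len(switch_points) != 1:
        # No switch (non-recombinant) or several switches (ambiguous).
        return None
    i = switch_points[0]
    return (phased_variants[i - 1][0], phased_variants[i][0])


molecule = [(100, 0), (250, 0), (900, 1), (1200, 1)]
print(crossover_interval(molecule))  # (250, 900)
```

The resolution of each crossover is therefore limited by the spacing of heterozygous variants around the switch.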




□ Robust and Structural Ergodicity Analysis and Antithetic Integral Control of a Class of Stochastic Reaction Networks:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/08/481051.full.pdf

The paper addresses the problem of verifying ergodicity conditions for large sets of reaction networks with time-invariant topologies, either from a robust or a structural viewpoint, using three different approaches. By exploiting the Metzler structure of the matrix involved, the authors obtain simplified conditions for the robust and structural ergodicity of stochastic reaction networks with uncertain reaction rates.




□ Smart computational exploration of stochastic gene regulatory network models using human-in-the-loop semi-supervised learning:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/08/490623.full.pdf

Information about a modeler’s preferences can be used to train classifiers, which then guide the sampling of the parameter space so that the exploration of “interesting” regions is accelerated. This way of training classifiers based on modeler input can be seen as a way to engineer objective functions that can be used in systematic downstream sampling algorithms that require prior information.






□ GeTallele: a mathematical model and a toolbox for integrative analysis and visualization of DNA and RNA allele frequencies:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/09/491209.full.pdf

Based on the results, the variant probability vPR can serve as a dependable indicator to assess gene and chromosomal allele asymmetries and to aid calls of genomic events. GeTallele allows users to visualize the observed patterns, with the ability to magnify regions of interest to the desired resolution, including chromosome, gene, or custom genome region, along with statistical measures for all the modes in the examined segment.




□ Searching and mapping genomic subsequences in nanopore raw signals through novel dynamic time warping algorithms:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/10/491456.full.pdf

The authors propose the Direct Subsequence Dynamic Time Warping for nanopore raw signal search (DSDTWnano) and the continuous wavelet Subsequence Dynamic Time Warping for nanopore raw signal search (cwSDTWnano) to enable direct subsequence searching and exact mapping in nanopore raw signals. DSDTWnano ensures highly accurate query results, and cwSDTWnano is an accelerated version of DSDTWnano that uses seeding and multi-scale coarsening of signals based on the continuous wavelet transform (CWT).
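
For readers unfamiliar with the underlying technique, here is a sketch of plain subsequence dynamic time warping: the query must be matched in full, but it may start and end anywhere in the longer reference. This is the generic quadratic algorithm, not DSDTWnano or cwSDTWnano; the absolute-difference cost and the toy signals are assumptions.

```python
import numpy as np


def subsequence_dtw(query, reference):
    """Find the stretch of `reference` that best matches `query` under DTW.

    Returns (start, end, distance) so that reference[start:end] is the
    matched segment.
    """
    q = np.asarray(query, dtype=float)
    r = np.asarray(reference, dtype=float)
    n, m = len(q), len(r)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, :] = 0.0  # the match may begin at any reference position for free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] - r[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])

    # The match may also end anywhere: take the cheapest cell in the last row.
    end = int(np.argmin(cost[n, 1:])) + 1

    # Backtrack to recover where the match starts in the reference.
    i, j = n, end
    start = end - 1
    while i > 0:
        start = j - 1
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return start, end, float(cost[n, end])


start, end, distance = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
print(start, end, distance)  # 2 5 0.0
```

The full table costs time proportional to query length times reference length, which is the expense that seeding and coarsening aim to cut on long raw signals.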




□ Expansion, Exploitation and Extinction: Niche Construction in Ephemeral Landscapes:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/09/489096.full.pdf

The authors developed an Interacting Particle System (IPS) to study the effect of niche construction on metapopulation dynamics in ephemeral landscapes. Using finite scaling theory, they find a divergence in the qualitative behavior at the extinction threshold between analytic (mean field) and numerical (IPS) results when niche construction is confined to a small area in the spatial model.




□ simuG: a general-purpose genome simulator:

>> https://www.biorxiv.org/content/biorxiv/early/2018/12/09/491498.full.pdf

simuG is a lightweight tool for simulating the full spectrum of genomic variants. Its simplicity and versatility make it a unique general-purpose genome simulator for a wide range of simulation-based applications.