lens, align.

Long is the time, but the true comes to pass.

almost no birds can be heard.

2015-08-14 05:01:29 | Science News


□ GOTHiC: a simple probabilistic model to resolve complex biases and to identify real interactions in Hi-C data:

>> http://biorxiv.org/content/early/2015/07/27/023317

The GOTHiC paper is now up on bioRxiv. The method was used in this March's paper on long-range interactions in the pluripotency regulatory circuitry (http://genome.cshlp.org/content/early/2015/03/07/gr.185272.114.full.pdf …).

The model corrects biases of known and unknown origin and yields a p-value for each interaction, providing a reliable threshold based on significance. The authors demonstrate this experimentally by testing the method against a random ligation dataset.

GOTHiC calculates p-values that allow the identification of true genomic interactions and the removal of artefactual interactions with a well-controlled false discovery rate.

Self-ligation read pairs are filtered out by distance.

$\mathrm{pval}_{j,h} = \mathrm{Binom}(N,\ 2 \cdot \mathrm{cov}_j \cdot \mathrm{cov}_h)$

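Random ligation artifacts and biases are removed.

The probability that a read pair is the consequence of a spurious ligation between two sites j and h is estimated as follows, where $\mathrm{cov}_j$ is the relative coverage of site j:

$p_{j,h} = 2 \cdot \mathrm{cov}_j \cdot \mathrm{cov}_h$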

In addition to a p-value, an observed-over-expected ratio can be easily calculated.

$\mathrm{Obs\_exp\_ratio} = n_{j,h} / (p_{j,h} \cdot N)$
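The formulas above can be tried out with a few lines of Python. This is only a sketch under stated assumptions, not GOTHiC's own code: the function names are made up for illustration, and the p-value is taken to be the upper tail P(X >= n) of a Binomial(N, p) distribution, which the notes above do not spell out.

```python
import math

def spurious_ligation_prob(rel_cov_j, rel_cov_h):
    """p_{j,h} = 2 * rel_coverage_j * rel_coverage_h, as in the notes above."""
    return 2.0 * rel_cov_j * rel_cov_h

def binomial_upper_tail(n_obs, n_total, p):
    """P(X >= n_obs) for X ~ Binomial(n_total, p), summed term by term in log space.

    Assumption: the p-value is this upper tail; the notes only write Binom(N, p).
    """
    if n_obs <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for k in range(n_obs, n_total + 1):
        log_term = (math.lgamma(n_total + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_total - k + 1)
                    + k * math.log(p) + (n_total - k) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def obs_exp_ratio(n_obs, n_total, p):
    """Observed read pairs divided by the expected count p * N."""
    return n_obs / (p * n_total)

# Toy numbers (hypothetical): N = 10000 read pairs, 12 of them link sites j and h.
p_jh = spurious_ligation_prob(0.01, 0.02)
pval = binomial_upper_tail(12, 10000, p_jh)
ratio = obs_exp_ratio(12, 10000, p_jh)
```

With these toy numbers the expected count is about 4 read pairs, so 12 observed pairs give a ratio near 3 and a small p-value. The term-by-term sum is fine for small examples; a real data set would call for a library survival function instead.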






□ A Visual Introduction to Machine Learning:

>> http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

Overfitting happens when some boundaries are based on distinctions that don't make a difference. You can see whether a model overfits by having test data flow through the model.
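A toy illustration of that check, assuming nothing from the linked article beyond the idea itself: a 1-nearest-neighbour classifier memorizes noisy training labels, so it scores perfectly on the data it was built from and noticeably worse on held-out test data. The data generator and all names here are hypothetical.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points x in [0, 1); the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def one_nn_predict(train, x):
    """Predict the label of the closest training point (pure memorization)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, data):
    """Fraction of points in `data` whose label the 1-NN model gets right."""
    hits = sum(one_nn_predict(train, x) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(train, test)    # held-out data exposes the memorized noise
```

The gap between `train_acc` and `test_acc` is the sign of overfitting: the model drew boundaries around label noise that does not generalize.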




□ Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees:

>> http://biorxiv.org/content/early/2015/07/29/023465

The Λ-coalescent is a continuous-time Markov process in which the times between events are independent exponential random variables with different rates. Hybrid-Lambda checks the distances in coalescent units between the root and all nodes, and prints warning messages if the ultrametric assumption is violated.
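The ultrametric check can be sketched as follows. This is not Hybrid-Lambda's code: the nested-tuple tree representation, the function names, and the tolerance are assumptions made for illustration, and the check is restricted to tips (a tree is ultrametric when every tip is the same distance from the root).

```python
def tip_depths(node, depth=0.0):
    """Yield root-to-tip distances.

    A node is (branch_length, children); a tip has an empty children list.
    """
    length, children = node
    depth += length
    if not children:
        yield depth
    else:
        for child in children:
            yield from tip_depths(child, depth)

def is_ultrametric(root, tol=1e-9):
    """True if all tips are equidistant from the root, within `tol`."""
    depths = list(tip_depths(root))
    return max(depths) - min(depths) <= tol

# ((A:0.5, B:0.5):1.0, C:1.5) has every tip at distance 1.5 from the root.
good_tree = (0.0, [(1.0, [(0.5, []), (0.5, [])]), (1.5, [])])
# Lengthening one tip branch breaks the assumption.
bad_tree = (0.0, [(1.0, [(0.5, []), (0.7, [])]), (1.5, [])])
```

A caller would emit a warning when `is_ultrametric` returns False, which mirrors the behaviour described above.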






□ YC Backs Cofactor Genomics To Pursue RNA Testing, Which Could Offer Better Diagnoses Than DNA

>> http://techcrunch.com/2015/07/31/cofactor-genomics/

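I have been tracking Cofactor Genomics for five years. The kit it is developing under an NIH grant is one of the few effective methods for isolating circRNA, which is technically difficult. Cofactor is also seen as a strong candidate to take the initiative in the market for RNA diagnostic pipelines.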

Cofactor’s RNA-seq gets rid of the ambiguity by incorporating synthetic spike-in controls from the very first step.



Cofactor’s business model has set up a scenario where we control, assess, and change all of the variables. Cofactor is somewhat unique compared to a traditional biotech, due in large part to having its roots in service.






□ Lockheed Martin Launches Healthcare Technology Alliance with Illumina, Others

>> https://www.genomeweb.com/informatics/lockheed-martin-launches-healthcare-technology-alliance-illumina-others

Lockheed Martin and Illumina are collaborating to develop methods for genomics on a national scale. Lockheed Martin's experience in systems integration, data analytics, and large-scale information systems complements Illumina's genomic sequencing and analysis in an offering to countries that are beginning to integrate genomics into their national health systems.


price history
upper: Lockheed Martin (LMT) analyst ratings mean score: 2.38
lower: Illumina (ILMN) mean score: 1.59






□ Illumina Signs Agreement to Acquire GenoLogics, Leader in Genomics Laboratory Information Management System Market:

>> http://www.illumina.com/company/news-center/press-releases/press-release-details.html?newsid=2075626






□ Cellular Research Introduces Whole Transcriptome Single Cell Precise Assays:

>> http://www.businesswire.com/news/home/20150729005450/en/Cellular-Research-Introduces-Transcriptome-Single-Cell-Precise#.VcKbUHj2BE5

Precise™ Assays are the first targeted RNA-seq product capable of absolute and direct molecular counting of transcripts. Precise Assays combine a robust design pipeline, a streamlined RNA-seq library prep protocol, and a bioinformatics tool into a simple, turnkey solution for performing multiplexed gene expression analysis on 80-120 genes in hundreds of samples.




□ 10X Genomics Releases Linked-Read Data from National Institute of Standards and Technology, NIST:

>> http://10xgenomics.com/releases/10x-genomics-releases-linked-read-data-national-institute-standards-and-technology-nist

Four Linked-Read data sets generated by GemCode™ have been submitted to NIST and are also publicly available for download. GemCode delivers valuable long-range information, using current short-read sequencers to generate a powerful new data type: Linked-Reads.






□ precisionFDA: A community approach for submitting & evaluating diagnostic tests by DNAnexus:

>> http://blog.dnanexus.com/2015-08-05-precisionfda-a-community-approach-for-submitting-evaluating-diagnostic-tests/

precisionFDA is a new approach for evaluating bioinformatics workflows and an integral part of the FDA’s work on better understanding diagnostic tests. precisionFDA will provide open source reference applications, reference datasets, and cloud-based informatics and data management resources.



□ Advancing precision medicine by enabling a collaborative informatics community:

>> http://blogs.fda.gov/fdavoice/index.php/2015/08/advancing-precision-medicine-by-enabling-a-collaborative-informatics-community/

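In the US, deliverables around NGS and medical-statistics platforms have suddenly started reaching the market, and with government involvement at that. The consolidation of pipelines and lab managers is also striking.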






□ CoMEt: a statistical approach to identify combinations of mutually exclusive alterations:

>> http://www.genomebiology.com/2015/16/1/160

CoMEt uses a Markov chain Monte Carlo (MCMC) algorithm to sample collections M, containing t sets of k alterations, in proportion to the weight $\Phi(M)^{-\alpha}$.

In the degenerate case of perfect exclusivity (no sample with more than one alteration in M), there are no more extreme tables to enumerate, and the algorithm needs only to evaluate the hypergeometric probability of Eq. (2) for this single table.
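For intuition, the perfect-exclusivity case can be worked through for a pair of alterations (k = 2), where the table is an ordinary 2x2 contingency table. This is a sketch under that assumption only; it does not reproduce the paper's Eq. (2) for general k, and the function name is hypothetical.

```python
from math import comb

def exclusivity_pvalue_perfect(n_samples, n_alt_a, n_alt_b):
    """Hypergeometric probability of zero co-occurrences for two alterations.

    With the margins fixed (n_alt_a samples carry A, n_alt_b carry B), the table
    with no sample carrying both is the most extreme one, so its probability is
    the whole one-sided p-value and no other table has to be enumerated.
    """
    return comb(n_samples - n_alt_a, n_alt_b) / comb(n_samples, n_alt_b)

# Toy example: 10 samples, A altered in 3, B altered in 4, never together.
p_exclusive = exclusivity_pvalue_perfect(10, 3, 4)
```

Here the result is C(7,4)/C(10,4) = 35/210, the chance that all four B-altered samples fall among the seven samples without A.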




□ Diagnostic biases in translational bioinformatics:

>> http://www.biomedcentral.com/1755-8794/8/46

The paper identifies three types of diagnostic bias in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias. DCA-SVM diagnosis provides a generic way to overcome the label skewness bias, thanks to its powerful feature extraction capability. It also benefits machine learning by adding new results to kernel-based learning for omics data.

Derivative component analysis (DCA) based support vector machines (DCA-SVM) overcome the label skewness bias by extracting true signals from the latent characteristics of the input data. The true signals share the same dimensionality as the original data but capture its essential characteristics.

Kernels: a Gaussian radial basis function kernel, $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$, and a multilayer perceptron kernel ('mlp'), $k(x, x') = \tanh\left((x \cdot x') - 1\right)$.
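The two kernels are easy to evaluate directly. A minimal sketch, assuming the standard Gaussian form with a negative exponent and plain Python lists as vectors; the function names are illustrative and are not taken from the paper's code.

```python
import math

def gaussian_rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mlp_kernel(x, y):
    """k(x, y) = tanh((x . y) - 1), the 'mlp' kernel as written above."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(dot - 1.0)

k_same = gaussian_rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
k_near = gaussian_rbf_kernel([0.0, 0.0], [1.0, 0.0])  # unit distance apart
k_mlp = mlp_kernel([1.0, 0.0], [1.0, 0.0])            # dot product of 1
```

The Gaussian kernel equals 1 for identical points and decays with distance, while the mlp kernel depends only on the dot product.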




□ ChromNet: Learning the human chromatin network from all ENCODE ChIP-seq data

>> http://biorxiv.org/content/early/2015/08/04/023911

The GroupGM learning algorithm allows seamless integration of all datasets, comprising 223 transcription factors and 14 histone marks from 105 cell types, without requiring manual removal of potential redundancies.

While learning a Markov random field is challenging for over 100 datasets, relying on $\Sigma^{-1}$ allows ChromNet to learn a network from more than a thousand datasets across millions of samples in only a few minutes.




□ ARM-seq: AlkB-facilitated RNA methylation sequencing reveals a complex landscape of modified tRNA fragments:

>> http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3508.html




□ Graphical Fragment Assembly (GFA) Format Specification for graphs arising in sequence assembly:

>> https://github.com/pmelsted/GFA-spec

DNA sequence assembly is often represented as a graph. There are multiple types: de Bruijn graph, overlap graph, unitig graph, and string graph. Each vertex is a sequence and each arc is an overlap. Because DNA sequences have two strands, an arc may have four directions.

Although we can describe an assembly graph with bidirected arcs, it is easier and more explicit to describe links between the ends of segments. We uniquely label the 5'-end and the 3'-end of each segment. The following shows an assembly graph with 7 segments in GFA



The CIGAR can describe symmetric overlaps (e.g. 5M), assembly gaps (10N), gapped overlaps, open-end alignments (1M1D2M1S), or unaligned overlaps.
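To make the format concrete, here is a small reader for segments, links, and CIGAR strings. It is a sketch that assumes the tab-separated record layout of GFA 1 ("S name sequence" and "L from orient to orient CIGAR"); the draft specification linked above may differ in details, and the example graph is invented.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string such as '1M1D2M1S' into (length, operation) pairs.

    '*' (overlap not specified) is returned as an empty list.
    """
    if cigar == "*":
        return []
    if not re.fullmatch(r"(\d+[A-Z])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Z])", cigar)]

def parse_gfa(text):
    """Collect segments (S lines) and links (L lines); other lines are skipped."""
    segments = {}
    links = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            _, src, src_orient, dst, dst_orient, cigar = fields[:6]
            links.append((src, src_orient, dst, dst_orient, parse_cigar(cigar)))
    return segments, links

# Two segments whose ends overlap by three bases (GTT).
EXAMPLE = "S\t1\tACGTT\nS\t2\tGTTCA\nL\t1\t+\t2\t+\t3M\n"
segments, links = parse_gfa(EXAMPLE)
```

Each link names the two segments, the orientation of each, and the overlap, so the four possible arc directions correspond to the four +/- combinations.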






□ Deep learning for regulatory genomics:

>> http://www.nature.com/nbt/journal/v33/n8/full/nbt.3313.html

The multiple learning layers can capture multiple levels of information processing and abstraction within cells. The explicit representations at each layer can reveal insights about the biological structures




□ High speed BLASTN: an accelerated MegaBLAST search tool:

>> http://nar.oxfordjournals.org/content/early/2015/08/06/nar.gkv784.full

Experiments conducted on a 12-core server show that HS-BLASTN can be 22 times faster than MegaBLAST and exhibits better parallel performance. HS-BLASTN finds all the w-seeds (w ≥ k) and, at the same time, checks whether a set of k-seeds is contained in the w-seeds.




□ BioWardrobe: an integrated platform for analysis of epigenomics and transcriptomics data without the programming:

>> http://www.genomebiology.com/2015/16/1/158

Researchers can use BioWardrobe to upload data from a sequencing core or a public database and promptly receive quality control data, view the results in the browser, and perform some of the analysis without the assistance of bioinformaticians.




□ Crowdsourced geometric morphometrics enable rapid large-scale collection and analysis of phenotypic data:

>> http://biorxiv.org/content/early/2015/07/28/023382

BAMM estimates the location of rate shifts in either diversification or character evolution using a transdimensional (reversible jump) MCMC that samples a variety of models of lineage diversification and trait evolution.

Natural language processing of the scientific literature could potentially be used for automatic extraction of morphological characters using DeepDive, but it may require impractically large corpus sizes. MTurk workers could improve the results by confirming the presence of a leaf in the image segment and measuring the leaf size to ground-truth the algorithm results.




□ High-order neural networks and kernel methods for peptide-MHC binding prediction:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/07/25/bioinformatics.btv371.abstract

Nonlinear high-order machine learning methods, including high-order neural networks (with possible deep extensions) and high-order kernel support vector machines, are used to predict major histocompatibility complex-peptide binding.






GitXiv:
Help open sourcing science with #GitXiv's 1st competition: replicate Google's Deep QA paper:

>> https://github.com/GitXiv/DeepQA




□ Pandas Releasing the GIL: the Global Interpreter Lock is a mutex that prevents multiple native threads from executing Python bytecode in parallel

>> http://www.continuum.io/blog/pandas-releasing-the-gil




ChengxiYe:
The source code of DBG2OLC, together with a new consensus module, can be found here.
https://sourceforge.net/p/dbg2olc/
https://sourceforge.net/p/sparc-consensus/




□ New SMRT-BS Method to Revolutionize Quantitative, Multiplexed Targeted Bisulfite Sequencing for Methylation Analysis:

>> http://www.whatisepigenetics.com/new-smrt-bs-method-to-revolutionize-quantitative-multiplexed-targeted-bisulfite-sequencing-for-methylation-analysis/

Bismark was used for alignment, and the generated SAM files were subjected to read filtering and CpG methylation quantitation using an in-house developed Python script.


□ HiTMAP: a freely available program to analyze SMRT-BS and other high-throughput bisulfite sequencing data, to be released by October 2015:

>> http://benpullman.com/hitmap/

HiTMAP is a comprehensive web tool that takes raw amplicon bisulfite sequence data and demultiplexes against sample barcodes, aligns sequencing reads to in silico converted genomic reference sequences, quantitates CpG methylation levels, and exports resulting methylation data for both individual CpG sites and amplicon regions. HiTMAP eliminates the need for manual data manipulation, local computational resources and expertise, and provides an efficient mechanism.
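Two of those steps, in silico conversion of the reference and CpG methylation quantitation, can be illustrated with a toy example. This is not HiTMAP's code: the function names are hypothetical, only the forward strand is handled, and the reads are assumed to be ungapped and already aligned to the full reference.

```python
def bisulfite_convert(seq):
    """In silico bisulfite conversion of the forward strand: every C becomes T."""
    return seq.upper().replace("C", "T")

def cpg_positions(ref):
    """0-based positions of the C in each CpG dinucleotide of the reference."""
    ref = ref.upper()
    return [i for i in range(len(ref) - 1) if ref[i:i + 2] == "CG"]

def cpg_methylation(ref, reads):
    """Methylation level per CpG site: C calls / (C calls + T calls).

    Assumption: every read spans the whole reference with no indels, so
    read[pos] lines up with ref[pos].
    """
    levels = {}
    for pos in cpg_positions(ref):
        c_calls = sum(1 for read in reads if read[pos] == "C")
        t_calls = sum(1 for read in reads if read[pos] == "T")
        if c_calls + t_calls:
            levels[pos] = c_calls / (c_calls + t_calls)
    return levels

REF = "ACGTTCGA"
READS = ["ACGTTTGA", "ATGTTTGA", "ACGTTCGA", "ACGTTTGA"]
converted = bisulfite_convert(REF)
levels = cpg_methylation(REF, READS)
```

A cytosine that still reads as C after bisulfite treatment was protected by methylation, while an unmethylated one reads as T, which is why the C fraction at each CpG site estimates its methylation level.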






□ RARseq: Development of Transcriptomic Markers for Population Analysis Using Restriction Site Associated RNA-Seq

>> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0134855

RARseq utilizes mRNA transcriptome in lieu of genomic DNA in the GBS/RADseq protocol.




□ NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites:

>> http://www.biomedcentral.com/1471-2164/16/597

nanoCAGE-XL is the first publicly available protocol adapting nanoCAGE for the HiSeq-2000 sequencing platform, making TSS sequencing of low input samples practical where significant depth of coverage is required.




□ emnlp2015: Conference on Empirical Methods in Natural Language Processing: list of papers:

>> http://www.emnlp2015.org/accepted-papers.html






□ Navigating the massive world of reddit: using backbone networks to map user interests:

>> https://peerj.com/articles/cs-4/






□ Composing Music With Recurrent Neural Networks:

>> http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/

A composition algorithm built on deep learning. It is beautiful, but something keeps getting ground away..

It uses a "biaxial RNN": each recurrent layer transforms inputs to outputs and also sends recurrent connections along one of its two axes (time or note). Long Short-Term Memory (LSTM) nodes replace normal nodes; each one keeps a “memory cell” value that is passed down for multiple time steps.
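The role of the memory cell can be seen in a single scalar LSTM node. This is a sketch of the standard LSTM update, not the blog post's biaxial implementation; the parameter layout and the gate values are chosen by hand for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM node.

    `params` maps a gate name to (w_x, w_h, bias). Returns (h, c): the new
    output and the new memory cell value.
    """
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))
    write = sigmoid(pre_activation("input"))
    output = sigmoid(pre_activation("output"))
    candidate = math.tanh(pre_activation("candidate"))

    c = forget * c_prev + write * candidate  # keep old memory, add new content
    h = output * math.tanh(c)
    return h, c

# Hand-picked parameters: forget gate almost fully open, input gate almost shut.
PARAMS = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.5
for x in [0.3, -1.0, 2.0, 0.7, -0.2]:
    h, c = lstm_step(x, h, c, PARAMS)
```

With the forget gate near 1 and the input gate near 0, the cell value stays essentially at 1.5 across all five steps regardless of the inputs, which is the sense in which the memory is passed down through time.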




□ Evolution of Deep learning models:

>> http://www.opengardensblog.futuretext.com/archives/2015/07/evolution-of-deep-learning-models.html

Deep neural networks are able to accurately gauge electricity load demand across the grid. Deep learning techniques have synergies amongst themselves. DBNs and DNNs can be used in conjunction i.e. Deep Belief Net.




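□ A critical perspective is of course important, but the design philosophy of machine learning and deep learning is, in the first place, the pursuit of an ideal: beyond asking how far a data-driven approach can go and what impact the convenience of data available to anyone will have, it tries to remove the bias of human knowledge.

It goes without saying that Big Data as a buzzword, mocked by people who are in no position to use it, has no value. The fact is that there is a front line that pays this no mind, breaks hard problems down, and keeps publishing deliverables and results. And the fact is that none of the people who fixate on criticizing from a distance, citing only the failures, are at that level.

That neural networks are practical for biostatistics and bioinformatics has been said for 20 years (the phase transition came around 2011). For me, having watched the core technology advance for years, it is no buzz at all; if anything, it feels like a technology that arrived too late.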




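□ My interest in biology, mathematical statistics, topology, and environmental engineering owes a great deal to Michael Crichton's Jurassic Park, which I read as a child; or rather, to its sequel, "Lost World". "Lost World" likened the behavior of cellular automata and artificial life to the predator-prey relationship and struggle for survival between humans and dinosaurs revived through genetic engineering. All of those elements were stripped out of the film adaptation, though.

If the Game of Life on a fitness landscape can be analyzed mathematically, then what is the behavior of living populations in the real world modeled on? What does it mean that there is structure there? And if so, what do the equations reproduce? Crichton's novels were great for leading even a child to such thoughts. My questions have not changed to this day.

"Jurassic World" deals with genome editing, but if Michael Crichton were alive and writing a sequel to "Jurassic Park" today, he surely could not help weaving in the artificial intelligence approaches that have permeated biology: neural networks and deep learning.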




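□ An "exhaustive idea" does not exist, geometrically speaking. Discourse and concepts, through sharing and empathy, order the behavioral principles of individuals and groups, but whether an ideal of the form "it ought to be so" gets applied to the environment surrounding others is a matter of natural selection; where others place little weight on it, the environmental conditions they find themselves in and their origins play a large part.

Organizational and social institutions are not something granted unconditionally; one incorporates oneself into them through struggle. Their design must start from the premise that they act on those who differ in both environment and viewpoint, those furthest from the ideals and values one has in mind.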




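□ Economic activity is the act of extracting energy from disequilibrium. Wealth arises from biotech and medicine because going against the established providence of the game of life creates an imbalance, and the scale of cash flow grows in proportion. What matters is the resources that can tolerate the disequilibrium; this is not an analogy but the way things are.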