lens, align.

Lang ist Die Zeit, es ereignet sich aber Das Wahre. (Long is the time, yet what is true comes to pass.)

entry6.

2015-03-19 16:24:36 | Science News
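□ As a test case, I finished writing up the design of a customer-oriented database that combines API-driven Deep Learning with natural language processing. Handling the training dataset is difficult.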

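□ About the user-oriented database built on Deep Learning and natural language processing, for which I submitted a design proposal the other day: the goal is to raise development accuracy by combining an analysis of the correlation between market trends and tag calls with actual demand data. The bottleneck is that suggestions optimized for individual users interfere with this correlation, so the correlation itself becomes biased.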

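□ As long as the features of the data can be designed, running PCA during pre-processing to reduce dimensionality would work, but handing everything to a DNN is simply more convenient.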

□ Once sufficiently many layers have been learned the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.

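□ At the board meeting the week after next, I will announce the relocation of our base that comes with the joint venture in question. America!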







□ Finding long-range interactions in the genome with Capture Hi-C:

>> http://genome.cshlp.org/content/early/2015/03/07/gr.185272.114.full.pdf

an interaction-calling algorithm called GOTHiC, that accounts for biases in Hi-C experiments by considering that these will be represented by the total coverage of the interacting fragments. GOTHiC detected 317,271 genomic fragments engaged in 548,551 significant, reproducible interactions with 21,748 promoters in ESCs.






□ sense - agile data science: a collaborative platform to accelerate data science from exploration to production.

>> https://sense.io

Sense provides a powerful workbench supporting multiple tools - R, Python, Julia, Spark, Impala, Redshift, and more.




□ BayesPy: Variational Bayesian Inference in Python:

>> http://arxiv.org/pdf/1410.0870v2.pdf

BayesPy has two types of nodes: stochastic and deterministic. Stochastic nodes correspond to probability distributions and deterministic nodes correspond to functions.
Future plans include support for non-conjugate models and non-parametric models (e.g., Gaussian and Dirichlet processes).

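import numpy as np
from bayespy.nodes import Dirichlet, Categorical

# K (number of clusters) and N (number of observations) are assumed to be defined earlier
alpha = Dirichlet(1e-5 * np.ones(K), name='alpha')  # prior over cluster probabilities
Z = Categorical(alpha, plates=(N,), name='z')       # one cluster assignment per observation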







□ Artificial Neurons and Single-Layer Neural Networks: How Machine Learning Algorithms Work:

>> http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html

Stochastic gradient descent converges much faster than gradient descent since the updates are applied immediately after each training sample.

Perceptron Rule in Python:

for _ in range(self.epochs):
    errors = 0
    for xi, target in zip(X, y):
        # Learning rate times the prediction error for this sample
        update = self.eta * (target - self.predict(xi))
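The fragment above is only the inner loop of the perceptron rule. As a minimal, self-contained sketch of how it could sit inside a complete class (the class layout, the `net_input` helper, the `-1`/`+1` labels and the toy AND data are assumptions for illustration, not the article's exact code):

```python
class Perceptron:
    """Illustrative perceptron; names and defaults are assumptions."""

    def __init__(self, eta=1.0, epochs=10):
        self.eta = eta          # learning rate
        self.epochs = epochs    # number of passes over the training set

    def net_input(self, xi):
        # Weighted sum of the inputs plus the bias term stored in w_[0].
        return sum(w * x for w, x in zip(self.w_[1:], xi)) + self.w_[0]

    def predict(self, xi):
        # Unit step function with class labels +1 and -1.
        return 1 if self.net_input(xi) >= 0.0 else -1

    def fit(self, X, y):
        self.w_ = [0.0] * (1 + len(X[0]))
        self.errors_ = []       # misclassifications per epoch
        for _ in range(self.epochs):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                # Weights change immediately after each sample.
                self.w_[0] += update
                for j, x in enumerate(xi):
                    self.w_[j + 1] += update * x
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self


# Toy, linearly separable data: logical AND with -1/+1 labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]
model = Perceptron(eta=1.0, epochs=10).fit(X, y)
```

Because the weights are updated after every sample rather than after a full pass, `errors_` shows how quickly the misclassification count drops on separable data.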




□ Training an LSTM language model: recurrent neural networks (RNNs): a powerful model for sequential data

>> https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/practicals/practical6.pdf

The proposed GF-RNN is evaluated with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions by learning to gate these interactions.




□ In-depth introduction to machine learning in 15 hours of expert videos:

>> http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/






□ An Integrated Approach to Reconstructing Genome-Scale Transcriptional Regulatory Networks

>> http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004103

Transcriptional regulatory networks (TRNs) program cells to dynamically alter their gene expression in response to changing internal or environmental conditions. In this study, we develop a novel workflow for generating large-scale TRN models that integrates comparative genomics data, global gene expression analyses, and intrinsic properties of transcription factors (TFs).




□ SC3-seq: a method for highly parallel, quantitative measurement of single-cell gene expression: 1-cell level analysis

>> http://nar.oxfordjournals.org/content/early/2015/02/26/nar.gkv134.long

10 000-cell level versus 1-cell level: R² = 0.776–0.797; 1-cell level versus 1-cell level: R² = 0.677–0.708

SC3-seq exhibits a superior quantitative performance in that it does not underestimate the expression levels of relatively longer transcripts and it detects much larger numbers of transcripts with smaller sequence depth.

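PCA uses the prcomp function without scaling. DEGs across multiple groups are identified with ANOVA, and qvalue is used to compute p-values and false positives. Gene ontology analysis is run in the DAVID web tool. Let's try pipelining that part with Deep Learning. The 2D plots and graph mining could also be handled with a DCNN.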
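As a rough sketch of what unscaled PCA computes, the snippet below mean-centers the columns without rescaling them (the behaviour of R's `prcomp(x, center = TRUE, scale. = FALSE)`) and finds the leading principal axis by power iteration. Power iteration is swapped in here for simplicity; `prcomp` itself uses a singular value decomposition. The function name and the toy matrix are assumptions.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (unit vector, variance) of the leading principal axis.

    Columns are mean-centered but not rescaled.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix with divisor n - 1.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    return v, variance


# Toy matrix (rows = samples, columns = two hypothetical genes).
expression = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc1, var1 = first_principal_component(expression)
```

Without scaling, the column with the larger variance dominates the first component, which is why the choice between scaled and unscaled PCA matters for expression data.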




□ SimSeq: A Nonparametric Approach to Simulation of RNA-Sequence Datasets:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/02/26/bioinformatics.btv124.abstract

methods based on parametric modeling assumptions seem to perform better with respect to false discovery rate (FDR) control when data are simulated from parametric models rather than using more realistic nonparametric simulation strategy.




□ how to perform Kolmogorov-Smirnov statistic in GSEA in R?

>> http://buff.ly/17S2ZCF




□ CauseMap: fast inference of causality from complex time series:

>> https://peerj.com/articles/824/

CauseMap implements convergent cross mapping (CCM), a method for establishing causality from long time series data (≳25 observations). CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable.

Whitney’s Theorem tells us that the dimensionality of the full causal system is generically between (Emax - 1)/2 and Emax.

Causal inference from complex time series data. Convergent cross mapping. A causal system rests on the concept of directionality.
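The reconstruction step that Takens' Theorem justifies can be sketched as a time-delay embedding: each point of the reconstructed ("shadow") state space is a vector of lagged values of one variable. This is only the embedding, not CCM or CauseMap itself; the function name, the lag `tau` and the logistic-map toy series are assumptions.

```python
def delay_embed(series, E, tau=1):
    """Return vectors (x[t], x[t - tau], ..., x[t - (E - 1) * tau])."""
    start = (E - 1) * tau
    return [tuple(series[t - k * tau] for k in range(E))
            for t in range(start, len(series))]


# Toy series of 30 observations from the logistic map (illustrative parameters).
x = [0.4]
for _ in range(29):
    x.append(3.8 * x[-1] * (1.0 - x[-1]))

# Embedding dimension E = 3 gives one 3-vector per usable time point.
shadow = delay_embed(x, E=3, tau=1)
```

CCM then asks whether nearest neighbours on the shadow manifold of one variable predict the values of another; that cross-mapping step is not shown here.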




□ BGI Plans to Launch Two NGS Systems This Year Based on Complete Genomics Technology:

>> https://www.genomeweb.com/business-news/bgi-plans-launch-two-ngs-systems-year-based-complete-genomics-technology


□ BGI Sequencer Buzz:

>> https://storify.com/OmicsOmicsBlog/bgi-sequencer-buzz

BGI/Complete Genomics technology relied on sequencing-by-ligation, with the innovative "rolony" (aka nanoball) template amplification. It potentially offers accuracy advantage as ligases are extremely finicky about insisting on correct base-pairing at the ligation junction.




GigaScience:
Tuatara genome assembly v1.0 challenges: 120x coverage, estimated genome size 4.6GB. High GC content and 46% of the genome repetitive #G10K




□ Next Generation Sequencing (NGS) Markets 2015: information and financial information for over 100 companies

>> https://www.reportbuyer.com/product/2736754/next-generation-sequencing-ngs-markets-2015.html

Driven by the promise of clinical diagnostic applications, the DNA sequencing market is growing rapidly and companies are raising money.

Sequencing systems are starting to be developed and sold to clinical laboratories following the traditional in vitro diagnostic (IVD) model. Illumina's MiSeqDx was the first cleared IVD next generation sequencing system. Kalorama expects the market to grow from its current size of 2.2 billion to 5.6 billion.




□ smllmp:
Interesting comments on the "BioFabric - Network visualization w/o hairballs" thread https://news.ycombinator.com/item?id=9159522 #dataviz




□ “Data-processing and machine learning with Python” 

>> http://kachkach.com/data-processing-and-machine-learning-with-python/

Numerical variables
Categorical variables
Boolean variables

represent all these variables in the vector space model to train models.

Example: SVM

from sklearn import datasets
from sklearn import svm

# Use only the first two features of the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Training the model
clf = svm.SVC(kernel='rbf')
clf.fit(X, y)

# Doing predictions
new_data = [[4.85, 3.1], [5.61, 3.02], [6.63, 3.13]]
print(clf.predict(new_data))


Most scikit-learn classifiers have a score method that takes a list of inputs and target outputs. Alternatively, sklearn.metrics can calculate accuracy, precision/recall, etc.

TruncatedSVD implements a variant of singular value decomposition. When it is applied to term-document matrices (as returned by CountVectorizer or TfidfVectorizer), the transformation is known as latent semantic analysis (LSA), because it transforms such matrices to a “semantic” space of low dimensionality.





(sensitivities of RIEMS, Clinical PathoScope, Kraken & Megablast against the MetaPhlAn clade specific marker database )

□ RIEMS: a software pipeline for sensitive & comprehensive taxonomic classification of reads from metagenomics datasets

>> http://www.biomedcentral.com/1471-2105/16/69

Although various tools and strategies were published and are publicly accessible, the available capacities are not sufficient.




□ SFA-SPA: a suffix array based short peptide assembler for metagenomic data:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/03/01/bioinformatics.btv052.short

The improved computational efficiency is achieved using a suffix array data structure that allows for fast querying during the assembly process and a significant redesign of assembly steps that enables multi-threaded execution.




Clive_G_Brown:
@nanopore 1 PromethION produces about 1.5 Gb/s of raw data, similar to LHC, but that is raw. I think this 1Gb/s number must be processed




□ A genomic data viewer for iPad http://bit.ly/1aLJYn8 via @GenomeBiology




□ Now everybody can do their part to advance medical research.

>> https://www.apple.com/researchkit/

□ Apple Introduces ResearchKit, Giving Medical Researchers the Tools to Revolutionize Medical Studies

>> http://www.apple.com/pr/library/2015/03/09Apple-Introduces-ResearchKit-Giving-Medical-Researchers-the-Tools-to-Revolutionize-Medical-Studies.html

□ ResearchKit – how iPhone is transforming medical research

>> https://www.youtube.com/watch?v=VyY2qPb6c0c

Sagebio:
Learn more about @sagebio and our two clinical iPhone apps here: http://sagebase.org #AppleLive #AppleEvent

Developed by Sage Bionetworks and the University of Rochester, the Parkinson mPower app helps people living with Parkinson’s disease track their symptoms by recording activities using sensors in iPhone. These activities include a memory game, finger tapping, speaking & walking.

So ResearchKit was developed jointly with UCLA, Mount Sinai School of Medicine and LifeMap! SageBio is on the list too. The barrier to entry looks high.

Apple announces ResearchKit, a tool to support medical research activities and researchers:
http://www.apple.com/jp/pr/library/2015/03/09Apple-Introduces-ResearchKit-Giving-Medical-Researchers-the-Tools-to-Revolutionize-Medical-Studies.html

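"The world's leading medical research institutions" are seriously developing apps. There is already an app that can analyze DNA samples directly on the iPhone, and if the Deep Learning camp joins in, all the ingredients for a phase transition in genomics are in place. An Illumina sequencing app may appear before long.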




□ MetaCompass: a software package for comparative assembly of metagenomic reads

>> http://metacompass.cbcb.umd.edu




□ Metassembler: Merging and optimizing de novo genome assemblies:

>> http://biorxiv.org/content/early/2015/03/10/016352

The only data requirement is that at least one jumping library is available to evaluate the presence of compression/expansion mis-assemblies, although that data type need not have been used in any or all of the assemblies.

metassembly evaluated the presence of core eukaryotic genes using the CEGMA algorithm, as well as the concordance of the metassembly sequence with remapped pair-end and mate-pair reads using REAPR.

It was a product of the CEGMA pipeline.




□ Modelling Computational Resources for Next Generation Sequencing Bioinformatics Analysis of 16S rRNA Samples:

>> http://arxiv.org/pdf/1503.02974v1.pdf

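Modelling bioinformatics computational resources with multiple linear regression analysis. Accurate modelling of denoising and processing time for a pipeline using AmpliconNoise, Perseus and ChimeraSlayer. Parallelization scripts that take grid services such as DIAG as the standard.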

MLR models were able to accurately predict clock time for denoising sequences from a naturally assembled community dataset, but not an artificial community.




□ HISAT: a fast spliced aligner with low memory requirements

>> http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html

HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.




□ Scaffold assembly based on genome rearrangement analysis

>> http://www.sciencedirect.com/science/article/pii/S1476927115000225

a new method for scaffold assembly based on the analysis of gene orders and genome rearrangements in multiple related genomes (some or even all of which may be fragmented). Evaluation of the proposed method on artificially fragmented mammalian genomes demonstrates its high reliability.


□ 10X uses an expensive box and fancy microfluidics; Dovetail just some molecular biology steps. 10X also has an advantage for input material, requiring only 1 nanogram versus multiple micrograms for Dovetail.

10X and Dovetail are trying to tackle a similar space as those mapping companies, to resolve long range structure but via leveraging an Illumina sequencer.




□ DANN: a deep learning approach for annotating the pathogenicity of genetic variants

>> http://bioinformatics.oxfordjournals.org/content/early/2014/10/22/bioinformatics.btu703.abstract

DNNs can capture nonlinear relationships among features and are better suited than SVMs for problems w/ a large number of samples and features.




□ CosmosID Adds Cloud Option with the Release of Metagenomics App on Illumina's BaseSpace:

>> https://www.genomeweb.com/informatics/cosmosid-adds-cloud-option-release-metagenomics-app-illuminas-




□ Hidden meaning and 'speed limits' found within genetic code:

>> http://www.sciencedaily.com/releases/2015/03/150312173800.htm




□ Second-generation PLINK: rising to the challenge of larger and richer datasets:

>> http://www.gigasciencejournal.com/content/4/1/7

[sum of null hypothesis likelihoods of at-least-as-extreme tables] / [sum of null hypothesis likelihoods of all tables]
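The ratio above is the p-value definition of Fisher's exact test. A minimal sketch for a 2x2 table, assuming fixed margins and exact integer arithmetic (this is the textbook computation, not PLINK's optimized implementation; the function name is an assumption):

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]] with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Unnormalized null likelihood of every table, keyed by its top-left cell.
    weight = {k: comb(col1, k) * comb(n - col1, row1 - k)
              for k in range(lo, hi + 1)}
    observed = weight[a]
    # "At least as extreme" = tables no more likely than the observed one.
    extreme = sum(w for w in weight.values() if w <= observed)
    return extreme / sum(weight.values())


p_value = fisher_exact_2x2(3, 1, 1, 3)
```

Using integer binomial coefficients keeps the "at least as extreme" comparison free of floating-point ties, at the cost of speed on large counts.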




□ Hadoop as a Platform for Genomics - Strata 2015, San Jose:

>> http://www.slideshare.net/allenday/hadoop-as-a-platform-for-genomics-strata-2015-san-jose

Genome × Phenome Tensor

• Aggregating over individuals with matrix could ignore the correlations among genotypes and phenotypes

Aadhaar Biometric ID Creation

900MM people loaded in 4 years
1MM registrations/day
200+ trillion lookups/day
All built on MapR-DB (HBase)







□ Introducing DataFrames in Spark for Large Scale Data: RDD, Machine Learning, GraphX, Spark R

>> http://www.slideshare.net/databricks/introducing-dataframes-in-spark-for-large-scale-data-science







□ NoDe: a fast error-correction algorithm for pyrosequencing amplicon reads:

>> http://www.biomedcentral.com/1471-2105/16/88/abstract

The positive effect of NoDe in 16S rRNA studies was confirmed on the precision of the clustering of pyrosequencing reads in taxonomic units. In NoDe, the pre-cluster like algorithm is preceded by a machine learning approach identifying potentially erroneous nucleotides.




□ LINKS: Scaffolding genome assemblies with kilobase-long nanopore reads:

>> http://biorxiv.org/content/biorxiv/early/2015/03/13/016519.full.pdf

The predecessor of LINKS is the unpublished scaffolding engine in the SSAKE assembler and foundation of the SSPACE-LongRead scaffolder. LINKS has two main stages: contig pairing & scaffold layout. Cycling through k-mer pairs, those that are uniquely placed on contigs are identified.




□ SCIEX Announces OneOmics Collaborators:

>> http://www.businesswire.com/news/home/20150316005134/en/SCIEX-Announces-OneOmics-Collaborators

Advaita Bioinformatics, ISB and Yale Univ.: cloud integration of NGS applications is progressing. So Advaita has arrived.
OneOmics: SCIEX and Illumina have partnered to create the world's first multi-omics applications within a SWATH & BaseSpace cloud-computing environment.





AGBT 15.

2015-03-04 13:20:22 | Science News


□ The Future of Medicine Is Not In Your Hands (Yet)

>> http://goo.gl/Re1gXQ






NewUniverseD:
Astronomers Discover Record Breaking Quasar http://buff.ly/1wuLkqy






□ Scientists Find Fractal Patterns in Variable Stars

>> http://buff.ly/1wA7k3l

The "underlying nonchaotic attractor" is not dark matter and gravity.






□ Cubist Saturn

>> http://astronomynow.com/2015/02/24/cubist-saturn/

The view was obtained at a distance of approximately 2 million kilometres (1.2 million miles) from Saturn. Image scale is 11 kilometres (7 miles) per pixel.






□ The 16th annual Advances in Genome Biology and Technology (AGBT)

>> http://www.agbt.org
>> #AGBT15


□ 10X Genomics at AGBT:

>> http://www.bio-itworld.com/2015/2/25/10x-genomics-agbt.html

A technical summary of 10X Genomics. It has apparently been named the GemCode Platform. DNA samples are fragmented and assigned a 14-base molecular barcode. Visualization with Matrix View.

At AGBT, the 10X technology finally has a name: the GemCode Platform, including an instrument, chemistry kit, and informatics software. The DNA is fragmented into short-read libraries suitable for Illumina, with each fragment receiving a 14-base molecular barcode unique to its gem of origin. The GemCode Software works on “Linked Read Data”, where every short read can be binned back to its DNA molecule of origin.
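The idea of binning short reads back to a molecule of origin can be sketched as grouping reads by barcode. This is an illustration only, not the GemCode Software: the read layout (barcode as the first 14 bases of each sequence), the function name and the toy reads are all assumptions.

```python
from collections import defaultdict


def bin_by_barcode(reads, barcode_len=14):
    """Group (name, sequence) reads by the leading barcode_len bases.

    Assumes, hypothetically, that the barcode is a prefix of the read.
    """
    bins = defaultdict(list)
    for name, seq in reads:
        barcode, insert = seq[:barcode_len], seq[barcode_len:]
        bins[barcode].append((name, insert))
    return dict(bins)


# Toy reads: r1 and r3 share a barcode, so they land in the same bin.
reads = [
    ("r1", "ACGTACGTACGTAC" + "TTGACC"),
    ("r2", "GGGTTTAAACCCGG" + "CATGAA"),
    ("r3", "ACGTACGTACGTAC" + "GGATCC"),
]
linked = bin_by_barcode(reads)
```

Reads in the same bin can then be treated as linked, which is what makes long-range tasks such as phasing possible from short reads.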

>> https://vimeo.com/120429438


□ RM, using BWA-MEM + Lumpy / Localize HGAP + CA + Mummer / Whole genome assembly (done by DNAnexus in less than one day)


□ WRM: "World's fastest genome assembly" in 22hrs using FALCON+Daligner on @dnanexus. #PacBio #SMRTseq



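Among the comments on this year's AGBT, "they should play 'O Fortuna' before the 10x Genomics presentation" made me laugh, and it also conveys well the anticipation on site for the new platform.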



□ Linked Read Algorithms for Haplotype Phasing and Structural Variant Detection




□ CIViC database for clinical interpretation of variants in cancer, open source & community-driven:

>> https://civic.genome.wustl.edu/

obigriffith:
#CIViCdb provides a fully open source and open access solution to knowledge curation bottleneck for precision medicine. #AGBT15


CIgenomics:
#AGBT15 software demos were good; @GenomOncology @GenoLogics @obigriffith with CIViC; and finally @scilifelab almost ready for 2D RNA-seq




□ Megabase-scale deletion using CRISPR/Cas9 to generate a fully haploid human cell line:

>> http://genome.cshlp.org/content/24/12/2059.full




PacBio:
Seeing the Genome in a New Light (Sunshine?) with SMRT Sequencing

>> http://blog.pacificbiosciences.com/2015/02/agbt-2015-seeing-genome-in-new-light.html?m=1





gene changes in space

□ LM: Genomic portrait of a space neuron, some of the abundant transcripts in neurons are non-coding RNAs, but down-regulated in space #AGBT15

Leonid Moroz: "Neurons evolved more than once, 9-12 ways to make a brain. Access 3.5 billion years of experiments already done in evolution"

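Space-Seq: the concept of tracking how the zero-gravity environment of space affects the genome. If sequencing principles shift from bases toward a molecular basis, this could, for example, also serve as an indicator for measuring mutual toxicity at the genetic level should unknown extraterrestrial life be discovered.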




□ Evan Macosko to present "Drop-Seq", transformative new single-cell seq technology, at 3 pm at #AGBT15. Not yet tested at zero-gravity tho.

□ Drop-Seq: Single Cell RNA-Seq on a Massive Scale using DNA Barcode Beads and Droplet Microfluidics:

>> http://weitzlab.seas.harvard.edu/research/current/oni-basu




□ G&T-seq: Separation and parallel sequencing of the genomes and transcriptomes of single cells

G&T-seq provides whole genome amplified genomic DNA and full-length transcript sequence and with automation, 96 samples can be processed in parallel.




□ iGenomics: Alignment and variant calling on the iPhone.

>> http://schatzlab.cshl.edu/iGenomics/tutorial/

Handles sequencing data from Illumina, Ion Torrent, PacBio and MinION on the iPhone.






□ GenoPharmix is offering free visualizations for your data based on d3.

>> http://genopharmix.com/genopharm-v0.2/solutions.html

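A relationship-extraction method based on the Deep Learning algorithm "Tuatara GS1", solutions that apply Cognitive Biomimicry, and insights delivered through text mining and visualization. It is an experimental service, but I want to watch whether it can work as an outsourcing business.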

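Services that tie Deep Learning algorithms to a Graph Server for analysis may become mainstream in genomics as well. Following Pinker's theory, extracting relationships from data too complex for human cognition is useful for hypothesis generation. The open question is whether it can be offered as profitable hardware.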

In silico emulation of biomimetic object-association strategies has proven to be very effective in relationship discovery and deep learning, leading to new hypotheses and new discoveries based on pushing the boundaries of data science. What the algorithm set accomplishes is similar to pLSI (probabilistic latent semantic indexing), also known as probabilistic latent semantic analysis (PLSA).

The methods & algorithm sets were primarily forged for Life Sciences work in the area of Genomic pathway prediction/analysis at Berkeley Labs.

Probabilistic approaches in high-dimensional vector space designed for the purpose of mimicking portions of human cognition principles via CTM.

Biomimetic high-dimensional vector space algorithm sets that generate relationships between objects are based on CTM and auto-association.






□ Investigation of gene-gene interactions in dose-response studies w/ Bayesian nonparametrics:

>> http://www.biodatamining.org/content/pdf/s13040-015-0039-3.pdf

MANOVA and the novel Bayesian framework present a trade-off between computational complexity and model flexibility. Bayesian posterior probabilities are computed to assess how likely each SNP is to be involved in determining drug-response.

the Bayesian neural network can be employed on a smaller subset of SNPs to explore a richer model space.

As with any MCMC method, sub-optimal parameter values will result in a chain failing to converge, even in Bayesian neural network framework.






□ Differential co-expression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scale-free structure

>> http://www.biodatamining.org/content/pdf/s13040-015-0040-x.pdf

A novel simulation strategy generates microarray case–control data w/ embedded differential co-expression networks and an underlying correlation structure based on scale-free or Erdos-Renyi (ER) random networks.






□ Quantitative and logic modelling of molecular and gene networks:

>> http://www.nature.com/nrg/journal/v16/n3/full/nrg3885.html

Hybrid approaches will become essential for further progress in synthetic biology and in the development of virtual organisms.

Network inference methods: The four main approaches to infer networks from data include correlation (part a), information theoretic (part b), Bayesian inference (part c) and differential equations (part d).




□ Prometheus Project: integrated data platform that enables the world to fulfill the promise of genomics for medicine

>> https://www.broadinstitute.org/dsde/




infoecho:
I am always curious about the relationship between "Better machine learning" and "Better learning about machines" :)




□ Data intensive biology and data provenance graphs - Gianluigi Zanetti:

>> http://youtu.be/3Ftd3NBzaZ8

"Actionable" data provenance graph.

・Full tracing of the graph of operations performed.
・Large scale (10^3 genomes) comparisons usage.






tranSMART_org:
In town for #TRICON? Come see us at #booth115 for the latest on the tranSMART platform for translational research.

>> http://transmartfoundation.org


□ tranSMART: A collaborative approach to develop a multi-omics data analytics platform for translational research

>> http://www.sciencedirect.com/science/article/pii/S2212066114000350

To fully utilize the strengths of tranSMART, several functionalities are required that can be complemented by the strong algorithmic capabilities included in Genedata Analyst.


□ Life Science and IT Organizations Invest in the tranSMART Foundation to Advance Translational Medicine Research:

>> http://www.marketwatch.com/story/life-science-and-it-organizations-invest-in-the-transmart-foundation-to-advance-translational-medicine-research-2015-01-14






□ Measuring semantic similarities by combining gene ontology annotations and gene co-function networks:

>> http://www.biomedcentral.com/content/pdf/s12859-015-0474-7.pdf

NETSIM calculates the functional distance between a pair of gene sets that are annotated to a pair of GO terms. Second, it calculates GO term similarity based on the annotations to the common parent term, but propagates only the annotations to the terms that lie on the paths from the two GO terms to the common parent term.

Ga ∩ Gb ≠ ∅, and U(ta, tb, p) = Gp. Therefore, D(ta, tb) = 1 and S(ta, tb, p) = 2IC(p)/(IC(ta) + IC(tb)) × (1 - |Gp|/|G|)
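The special case quoted above can be evaluated numerically. The sketch below covers only that case (D(ta, tb) = 1 and U(ta, tb, p) = Gp) and assumes the common definition IC(t) = -log(|Gt| / |G|), which the quoted line does not state; the function names and gene counts are hypothetical.

```python
import math


def information_content(n_annotated, n_total):
    # Assumed definition: IC(t) = -log(|G_t| / |G|).
    return -math.log(n_annotated / n_total)


def netsim_special_case(n_ta, n_tb, n_p, n_total):
    """S(ta, tb, p) = 2 IC(p) / (IC(ta) + IC(tb)) * (1 - |Gp| / |G|)."""
    ic_a = information_content(n_ta, n_total)
    ic_b = information_content(n_tb, n_total)
    ic_p = information_content(n_p, n_total)
    return 2.0 * ic_p / (ic_a + ic_b) * (1.0 - n_p / n_total)


# Hypothetical counts: 10 genes on each term, 100 on the common parent, 1000 total.
similarity = netsim_special_case(10, 10, 100, 1000)
```

The factor (1 - |Gp|/|G|) pushes the similarity toward zero when the common parent annotates most of the genome, so very general parents contribute little.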





(different enrichment measures on a small region of GO DAG. rectangles contain subset of genes annotated by each node)

□ A Bayesian extension of the hypergeometric test for functional enrichment analysis:

>> http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3954234/

A Bayesian approach based on the non-central hypergeometric model to address the limitations of the traditional hypergeometric P-value. The Gene Ontology dependence structure is incorporated through a prior on non-centrality parameters.

A Bayesian approach to the hypergeometric model for Gene Ontology. The likelihood function contains no redundant information.
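For reference, the traditional hypergeometric P-value that the Bayesian approach extends can be sketched as an upper-tail probability. This is the standard central test, not the paper's non-central model; the function and parameter names are assumptions.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_genes, n_in_term, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement.

    n_genes: size of the gene universe
    n_in_term: genes annotated to the GO term
    n_selected: size of the gene list under study
    n_overlap: genes in the list that carry the annotation
    """
    upper = min(n_in_term, n_selected)
    total = comb(n_genes, n_selected)
    tail = sum(comb(n_in_term, k) * comb(n_genes - n_in_term, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Toy universe: 8 genes, 4 annotated, 4 selected, 3 of them annotated.
p_enrich = hypergeom_enrichment_pvalue(8, 4, 4, 3)
```

Each GO term is tested in isolation here, which is exactly the limitation a prior over the ontology's dependence structure is meant to address.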




□ Patent: Long fragment de novo assembly using short reads

>> http://www.freepatentsonline.com/y2015/0057947.html

A patent obtained by Radoje of Complete Genomics. Wait, isn't this...?

A kmer index can include labels indicating an origin of each of the nucleic acid molecules that include each kmer, memory addresses of the reads that correspond to each kmer in the index, and a position in each of the mate pairs that includes the kmer.





□ Correcting Illumina sequencing errors: extended background:

>> http://lh3.github.io/2015/02/13/comments-on-illumina-error-correction/
>> http://arxiv.org/pdf/1502.03744v1.pdf

Input: K-mer size k, set H of trusted k-mers, and one string S
Output: Set of corrected positions and bases changed to

Function CORRECTERRORS(k, H, S)
begin
    Q ← HEAPINIT()
    HEAPPUSH(Q, (k - 2, S[0, k - 2], ∅, 0))
    while Q is not empty do
        (i, W, C, p) ← HEAPPOPBEST(Q)
        i ← i + 1
        if i = |S| then return C
        N ← {(i, A), (i, C), (i, G), (i, T)}
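The pseudocode takes a set H of trusted k-mers as input. One common way to obtain such a set, shown here purely as an assumption since the excerpt does not say how H is built, is to keep k-mers that occur at least some minimum number of times across the reads. The function names, the threshold and the toy reads are hypothetical.

```python
from collections import Counter


def trusted_kmers(reads, k, min_count=2):
    """Return the set of k-mers seen at least min_count times across reads."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def untrusted_positions(S, k, H):
    """Start positions in S whose k-mer is absent from the trusted set H."""
    return [i for i in range(len(S) - k + 1) if S[i:i + k] not in H]


# Toy reads: the third read differs from the others by one base.
reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]
H = trusted_kmers(reads, k=4)
suspect = untrusted_positions("ACGTTCGT", 4, H)
```

A run of consecutive untrusted k-mers brackets the likely error, which is where a search over the four candidate bases such as the one in the pseudocode would start.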




□ Improved data analysis for the MinION nanopore sequencer:

>> http://www.nature.com/articles/nmeth.3290.epdf

Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%.




□ Minimum Information for Reporting Next Generation Sequence Genotyping (MIRING):

>> http://biorxiv.org/content/early/2015/02/16/015230

The data recorded in a MIRING message are essential for the systematic traceability of a NGS genotyping result. MIRING incl 5 categories of structured information: message annotation/reference context/full genotype/consensus sequence/novel polymorphism






□ Deep Neural Nets as a Method for Quantitative Structure–Activity Relationships:

>> http://pubs.acs.org/doi/abs/10.1021/ci500747n

□ Multi-task Neural Networks for QSAR Predictions: http://arxiv.org/abs/1406.1231

Apparently this applies the technique that won 1st place in the Merck competition.

neural nets with multi-tasking can lead to significantly improved results over baselines generated with random forests. Neural networks are powerful non-linear models for classification, regression, or dimensionality reduction.
The most direct application of neural networks to QSAR modeling is to train a neural net on data from a single assay using vectors of molecular descriptors as input and recorded activities as training labels.






□ Architecture of the DNN used to predict AS patterns: Deep learning of the tissue-regulated splicing code (MKK Leung)

It contains 3 hidden layers, with hidden variables that jointly represent genomic features and cellular context.




□ Deep Learning: Learning deep representations for single cell genomics: http://www.ndm.ox.ac.uk/principal-investigators/project/deep-learning-learning-deep-representations-for-single-cell-genomics … Nuffield Department of Medicine




□ GraphSAW: web-based system for graphical analysis of drug interactions and side effects using pharmaceutical and molecular data

>> http://www.biomedcentral.com/1472-

using GraphSAW which analyzes multi-medications and evaluates with regards to pharmaceutical and molecular adverse drug reactions.




□ DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis:

>> http://bioinformatics.oxfordjournals.org/content/31/4/608.abstract

Enrichment analyses incl hypergeometric model and gene set enrichment analysis are implemented to support discovering disease associations.




□ DIME: A Novel Framework for De Novo Metagenomic Sequence Assembly:

>> http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.0251

Implementing MapReduce for metagenome assembly makes sequence processing possible at close to the theoretical speed.

Two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, were tested in comparison with Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes.
A multilevel k-way graph partitioning algorithm, KMETIS, performs best when the number of parts is more than 8, as stated in the literature.




□ RAPTR-SV: a hybrid method for the detection of structural variants:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/02/16/bioinformatics.btv086.short

RAPTR-SV had superior sensitivity and precision, as it recovered 66.4% of simulated tandem duplications with a precision of 99.2%.




□ The long road from Data to Wisdom, and from DNA to Pathogen:

>> http://microbe.net/2015/02/17/the-long-road-from-data-to-wisdom-and-from-dna-to-pathogen/

The two most important aspects relevant to public health are that these DNA fragments, “if truly present,” are present at extremely low levels, and that there are no reported cases of either of these pathogens.





(Overview of the flow of the StringTie algorithm, compared to Cufflinks and Traph. )

□ StringTie enables improved reconstruction of a transcriptome from RNA-seq reads:

>> http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3122.html

StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory.
StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph.






□ Deterministic Chaos and the Evolution of Meaning:

>> http://bjps.oxfordjournals.org/content/63/3/547.full

Deterministic chaos and the evolution of meaning: non-convergent adaptive dynamics and zero-sum signalling games

A new explanation for the evolution or spontaneous emergence of meaning: non-convergent adaptive dynamics. In a zero-sum strategic interaction, information transmission is sustained indefinitely by the replicator dynamic. The key to the persistence of this out-of-equilibrium information transfer is deterministic chaos.




□ SNN-Cliq: a novel algorithm that clusters single cell transcriptomes:

>> http://bioinformatics.oxfordjournals.org/content/early/2015/02/10/bioinformatics.btv088.abstract

SNN-Cliq utilizes the concept of shared nearest neighbor, which shows advantages in handling high dimensional data, to define similarities between data points (cells) and achieves clustering by a graph theory-based algorithm.
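The shared-nearest-neighbor idea can be illustrated with a simplified similarity that just counts how many of their k nearest neighbours two points have in common. This is a sketch of the concept under that simplification, not SNN-Cliq's actual similarity definition or its graph clustering step; the function names and toy "cells" are assumptions.

```python
import math


def knn_sets(points, k):
    """For each point, the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(points, k):
    """Symmetric matrix counting the neighbours shared by each pair of points."""
    nn = knn_sets(points, k)
    n = len(points)
    return [[len(nn[i] & nn[j]) if i != j else 0 for j in range(n)]
            for i in range(n)]


# Two well-separated toy groups of "cells" in a 2-D expression space.
cells = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
         (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
sim = snn_similarity(cells, k=2)
```

Because the similarity depends on neighbour overlap rather than raw distance, it is less sensitive to the way distances concentrate in high-dimensional data; a graph built from `sim` would then be the input to a clustering step.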