(Almost all of Japan at dawn. Photo by Reid Wiseman)
Time is not our friend.
Don't stop running.
Just keep reaching.
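□ Bioinformatics is a methodological field, but it only casts the shadow of the data; it does not suggest laws. In data-driven work, methodology comes first. In model-driven work, laws come first. The key is reconciling the boundary between the two.
□ Out of necessity, I am studying statistical language models based on Dependency-based Compositional Semantics, but teaching material is far too scarce. I do grasp the concept: relying on the latent structure of meaning and covariates, it improves annotation performance without λ-calculus. http://cs.stanford.edu/~pliang/papers/dcs-acl2011.pdf …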
□ DIAMOND: open-source software that aligns 20,000x faster than BLASTX, for metagenomics and data-intensive evolution
>> http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3176.html …
□ Deep Learning Methods with Recursive Perceptual Representations using layers of Linear Support Vector Machines:
>> http://www.eecs.berkeley.edu/~jiayq/assets/pdf/nips12_rsvm.pdf …
Uses linear SVMs as the base building blocks and employs a random non-linear projection to add flexibility to the model.
□ MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning:
>> http://dnaresearch.oxfordjournals.org/content/early/2014/11/27/dnares.dsu041.full …
A Support Vector Machine (SVM) is used to learn the classification model. If chimeric nodes can be identified correctly, the de Bruijn graph can be disconnected appropriately by splitting the chimeric nodes.
MetaVelvet-SL outperformed the original MetaVelvet, IDBA-UD, Ray Meta, and Omega in reconstructing accurate, longer assemblies with higher N50 scores.
A feature of MetaVelvet-SL is that the expected coverage used to extract the unique nodes is calculated for each sub-graph.
MetaVelvet-SL classifies every node in a de Bruijn graph constructed from mixed short reads of multiple species into four types by employing supervised machine learning.
Connecting MetaPhlAn and MetaVelvet-SL makes it possible to generate a classification model and assemble automatically, learning a model of chimeric nodes.
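For reference, N50 (the metric in the comparison above) is the length of the contig at which the length-sorted contigs first cover half of the total assembly length. A minimal Python sketch of that definition; the function name `n50` and the sample lengths are illustrative and not taken from the MetaVelvet-SL paper:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        # Walk from the longest contig down until half the assembly is covered.
        running += length
        if running >= half_total:
            return length


# Illustrative contig lengths (made up for this example).
example = [2, 2, 2, 3, 3, 4, 8, 8]
result = n50(example)
```

A higher N50 means more of the assembly sits in long contigs, but it says nothing about correctness on its own, which is presumably why the comparison also stresses accuracy.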
□ Deep Learning vs. Neural Network Learning: Whiteboard Walkthrough:
>> https://www.mapr.com/blog/deep-learning-vs-neural-network-learning-whiteboard-walkthrough#.VHm7eIv2BE4 …
□ KirkDBorne:
Segmenting a customer database using K-means Clustering: http://www.randalscottking.com/2014/11/building-a-data-science-lab-rapidminer-part-3/ … #MachineLearning #DataScience #BigData #Analytics
Argyle Data launches #MachineLearning-based Hadoop with @hortonworks for fraud #analytics: http://www.fiercebigdata.com/story/argyle-data-partners-hortonworks-machine-learning-based-hadoop-native-fraud/2014-11-26 … via http://www.fiercebigdata.com/offer/share?swyn_id=NDA4MTc1NTczNDUS1 …
□ MaxSSmap: a GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence:
>> http://www.biomedcentral.com/content/pdf/1471-2164-15-969.pdf …
The meta-methods achieve high accuracy and low error at all settings, but at the cost of increased runtime compared to NextGenMap and BWA. The data structure stores the query in a profile format and so occupies a total of (readlength + 16) × 4 × 5 bytes: the 4 accounts for the number of bytes in a float, the 5 for the bases A, C, G, T, and N, and the 16 for additional space used by the look-ahead strategy and to eliminate if-statements in the code.
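The footprint formula above is easy to check numerically. A small Python sketch; the constant and function names are mine, only the three factors come from the paper's description:

```python
BYTES_PER_FLOAT = 4   # each profile entry is a float
NUM_SYMBOLS = 5       # A, C, G, T, and N
LOOKAHEAD_PAD = 16    # extra slots for look-ahead and branch elimination


def profile_bytes(read_length):
    """Bytes occupied by one query stored in profile format."""
    if read_length < 0:
        raise ValueError("read_length must be non-negative")
    return (read_length + LOOKAHEAD_PAD) * BYTES_PER_FLOAT * NUM_SYMBOLS


footprint_100bp = profile_bytes(100)
```

For a 100 bp read this gives (100 + 16) × 4 × 5 = 2,320 bytes per query.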
□ Biological Data Science November 5 - 8, 2014 (New CSHL meeting)
>> http://meetings.cshl.edu/meetings/2014/data14.shtml
>> #biodata14
□ Lightning fast genomics with Spark, Adam and Scala: a framework based on Apache Spark and the Parquet storage format
ADAM API
Scala classes generated from Avro
Data loaded as RDDs (Spark’s Resilient Distributed Datasets)
functions on RDDs (write to HDFS)
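// Load the variants stored on HDFS as an RDD
val adamVariants: RDD[VariantContext] = sparkContext.adamVCFLoad(cf0nHdfs, dict = None)
// Flatten each variant context into its genotypes
val gts: RDD[Genotype] = adamVariants.flatMap(p => p.genotypes)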
□ morgantaschuk:
MM: ADAM platform abstracts difficulties of distributed computing, run code anywhere, uses off the shelf hardware, open source #biodata14
MM: Avro IDL for defining data representation, multiple language support. Parquet column file format http://parquet.io #biodata14
MK: MetaCRAM is first de novo, parallel, CRAM-like software specialized for FASTA-format metagenomic read compression #biodata14 http://t.co/o3oDkAZsMx
Victor Felix with Open Science Data Framework (OSDF): a system for organizing, accessing, and querying scientific data http://t.co/SzpsYJWkzI
□ rdmelamed:
#biodata14 -ers: “trust-centric framework” for genomic data sharing from @erlichya http://bit.ly/1uwyhsQ
□ Bionimbus Protected Data Cloud for secure exchange and analysis of genomic data https://bionimbus-pdc.opensciencedatacloud.org
□ jim_dowling:
I'm giving a talk on BiobankCloud at #biodata14 this week. Great talk from David Haussler on graph-based models of reference genome. #CA4GH
□ JasonJPitt:
AD: STARtools to perform conversion, quantification, fusions, and ASE in real-time with alignment. Very cool #biodata14
□ STARtools: Ultra-fast comprehensive RNA-seq analysis suite:
>> https://github.com/alexdobin/STAR
For the most sensitive novel junction discovery, run STAR in 2-pass mode: run the first pass of STAR mapping with the usual parameters, collect the junctions detected in the first pass, and use them as "annotated" junctions for the second-pass mapping.
□ genomeresearch:
DChurch: Single haplotype assembly of the human genome from a hydatidiform mole @genomeresearch http://tinyurl.com/mnntm3w #biodata14
□ DanEvans0:
#biodata14 Support Vector Machine algorithms are the new black. My poster is feeling poorly dressed right now.
□ jxtx:
Great to see deep learning at #biodata14, for those interested in dAs, http://deeplearning.net/tutorial/dA.html …
When the inputs are bit vectors or vectors of bit probabilities, the reconstruction error is the cross-entropy of the reconstruction:
L(x, z) = -\sum_{k=1}^{d} [x_k \log z_k + (1 - x_k) \log(1 - z_k)]
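The loss above can be evaluated directly. A minimal Python sketch, assuming x and z are equal-length sequences of values in [0, 1]; the function name and the `eps` clipping (to avoid log(0)) are my additions, not part of the tutorial's formula:

```python
import math


def reconstruction_cross_entropy(x, z, eps=1e-12):
    """Cross-entropy L(x, z) between an input x and its reconstruction z."""
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    total = 0.0
    for x_k, z_k in zip(x, z):
        # Clip z_k away from 0 and 1 so both logarithms stay finite.
        z_k = min(max(z_k, eps), 1.0 - eps)
        total += x_k * math.log(z_k) + (1.0 - x_k) * math.log(1.0 - z_k)
    return -total


# An uninformative reconstruction (all 0.5) costs log(2) per dimension.
loss_uninformative = reconstruction_cross_entropy([1, 0], [0.5, 0.5])
# A perfect reconstruction costs (almost) nothing.
loss_perfect = reconstruction_cross_entropy([1, 0], [1.0, 0.0])
```

The loss is zero only when z reproduces x exactly and grows without bound as a reconstructed bit probability approaches the wrong extreme.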
□ sminot:
GR: Machine learning clinical semantics from 2 million electronic clinical notes #biodata14
Is aggregating multiple analyses at different scales for peak mapping similar to box counting in fractal analysis? #biodata14
Kana Shimizu: Privacy Preserving Similarity Search in Biomedical Data by Homomorphic encryption #biodata14
□ Lifted-ElGamal Cryptosystem: https://github.com/aistcrypt/Lifted-ElGamal/blob/master/readme-en.md …
□ E-MEM: efficient computation of maximal exact matches for very large genomes:
>> http://bioinformatics.oxfordjournals.org/content/early/2014/11/14/bioinformatics.btu687.short …
□ ViQuaS: An improved reconstruction pipeline for viral quasispecies spectra generated by next-generation sequencing:
>> http://bioinformatics.oxfordjournals.org/content/early/2014/11/13/bioinformatics.btu754.abstract …
ViQuaS outperformed the previously published methods ShoRAH, QuRe, and PredictHaplo, with improvements of at least 3.1-53.9% in recall, 0-12.1% in precision, and 0-38.2% in F-score for strain sequence assembly, and improvements of at least 0.006-0.143 in KL-divergence.
□ ISCB Africa ASBCB Conference on Bioinformatics in Dar es Salaam, Tanzania, March 9-11, 2015.
>> http://www.iscb.org/iscbafrica2015
□ SplitMEM: A graphical algorithm for pan-genome analysis with suffix skips:
>> http://bioinformatics.oxfordjournals.org/content/early/2014/11/13/bioinformatics.btu756.abstract …
The paper draws on deep topological relationships between suffix trees and compressed de Bruijn graphs and introduces an algorithm that directly constructs the compressed de Bruijn graph in time and space linear in the total number of genomes for a given maximum genome size.
new MerVertex_t(ST->m_suffixArray[memNode->m_SA_start] + offset, length);
for (int i = 1; i < memNode->m_SA_end - memNode->m_SA_start + 1; i++)
□ GenPlay Multi-Genome: a tool to analyze multiple human genomes in a graphical interface for RNA-/ChIP-/TimEX-seq, developed at the Albert Einstein College of Medicine of Yeshiva University.
>> http://bioinformatics.oxfordjournals.org/content/early/2014/08/31/bioinformatics.btu588.short …
>> https://github.com/JulienLajugie/GenPlay
GenPlay employs Gaussian filters, peak finders, signal saturation, island finders, and scatter plots to depict the distribution of genes and the distances between them.
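Yeshiva University, where I recently did some work in New York, has released a sequencing visualization platform.
By the way, this is what the place looks like. It has a partnership with the NY Genome Center in the bioinformatics business.
Albert Einstein College of Medicine in New York City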
□ Statistical Significance of Variables Driving Systematic Variation in High-Dimensional Data:
>> http://bioinformatics.oxfordjournals.org/content/early/2014/10/21/bioinformatics.btu674.full.pdf …
Calculate m observed F-statistics F_1, ..., F_m, testing H_0: γ_i = 0 vs. H_1: γ_i ≠ 0 from the model
An evaluation pipeline covers 16 simulation scenarios to assess statistical accuracy. Anti-conservative bias among 500 KS test p-values is evaluated with another application of the KS test (a "double KS test").
The conventional F test resulted in an anti-conservative bias among the null P values, with a double KS test P value of 8.73×10^-20, while the proposed method produced a correct joint null P value distribution with a double KS test P value of 0.352.
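The idea of the double KS test can be sketched in a few lines of Python. This is only an illustration under my own assumptions: both levels compare against Uniform(0, 1), and the p-value comes from the asymptotic Kolmogorov series rather than an exact distribution. The function names are hypothetical and the paper's actual procedure may differ in detail.

```python
import math


def ks_uniform(values):
    """Return (D, p) for a one-sample KS test against Uniform(0, 1).

    The p-value uses the asymptotic Kolmogorov series, an approximation
    chosen for this sketch.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    d = 0.0
    for i, x in enumerate(xs):
        # Largest gap between the empirical CDF and F(x) = x, checked
        # just after and just before each sorted point.
        d = max(d, (i + 1) / n - x, x - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series converges slowly here and the true p-value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def double_ks(pvalue_sets):
    """KS-test each set of null p-values, then KS-test the resulting
    first-level KS p-values once more."""
    first_level = [ks_uniform(s)[1] for s in pvalue_sets]
    return ks_uniform(first_level)


# Evenly spread p-values look uniform; cubing them piles mass near zero,
# which mimics an anti-conservative null distribution.
grid = [(i + 0.5) / 500 for i in range(500)]
d_ok, p_ok = ks_uniform(grid)
d_bad, p_bad = ks_uniform([g ** 3 for g in grid])
```

If the null p-values are valid, the first-level KS p-values should themselves be uniform, so a very small second-level p-value (as reported for the conventional F test) signals a systematic departure across simulations.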
□ LFQC: a lossless compression algorithm for FASTQ files:
>> http://bioinformatics.oxfordjournals.org/content/early/2014/11/23/bioinformatics.btu701.short …
The improvement obtained is up to 225% compared to other algorithms (gzip, bzip2, fastqz, fqzcomp, G-SQZ, SCALCE, Quip, DSRC, DSRC-LZ etc.)
□ bam.iobio: a web-based, real-time, sequence alignment file inspector:
>> http://www.nature.com/nmeth/journal/v11/n12/full/nmeth.3174.html …
□ CLCA: Maximum Common Molecular Substructure Queries within the MetRxn Database
>> http://dx.doi.org/10.1021/ci5003922 …
Canonical Labeling for Clique Approximation (CLCA) has polynomial run-time complexity and is used to quickly generate atom maps for all the reactions in MetRxn.
□ ORCID_Org:
New Functionality! ORCID Record Auto Updates. Several organizations involved in completing the information loop.
>> http://orcid.org/blog/2014/11/21/new-functionality-friday-auto-update-your-orcid-record …
□ ORCID_Org:
Overview of ORCID meeting in Tokyo: learn about NII, JST, NIMS, KISTI, KAMJE, HKBU, NTNU, JpGU ORCID projects. http://orcid.org/blog/2014/11/18/tokyo-november-orcid-outreach-meeting …
□ Power Laws from Linear Neuronal Cable Theory: Power Spectral Densities of the Soma Potential, Soma Membrane Current
>> http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003928 …
These power-law exponents are found for arbitrary combinations of uncorrelated and correlated noisy input current. The significance of this finding goes beyond neuroscience, as it demonstrates how 1/f^x power laws with a wide range of values for the power-law exponent may arise from a simple, linear physics equation. The cable equation, which describes the electrical properties of membranes, transfers white-noise current input into 'colored' 1/f^x noise, where x may have any half-numbered value within the interval from 1/2 to 3 for the different measurement modalities.
□ Brain paper on Neurology and Psychiatry in Babylon featured on today's OUP blog http://bit.ly/1sGIIWy
□ FORMAInc:
Global Health Network launches open access process-map to guide scientific collaborations http://buff.ly/1upwWDq
□ As reported in detail in January, Clarity LIMS, the protocol integration manager, has begun supporting the HiSeq Ten. Illumina Selects GenoLogics' LIMS Software to Support HiSeq
>> http://www.genomeweb.com/informatics/illumina-selects-genologics-lims-software-support-hiseq-customers
Illumina will be using GenoLogics’ LIMS to support its collaboration with Genomics England to sequence 100,000 genomes.
□ Why is Hadoop not used a lot in bio-informatics?
>> https://www.biostars.org/p/115260/
For some large on-demand services, Hadoop from massive cloud computing providers is hugely advantageous over the traditional computing model. Hadoop may also do a better job for certain bioinformatics tasks (gVCF merging and de novo assembly). A good example is the ADAM format, a Hadoop-friendly replacement for BAM, and its associated tools.
Some pipelines are able to call variants from 1 billion raw reads in 24 hours with multiple CPU cores. Although Hadoop frequently saves wall-clock time due to its scalability, at times it wastes CPU cycles on its extra layer. In a production setting, the total CPU time across many jobs matters more than the wall-clock time of a single job. Some argue that the compute-close-to-data model of Hadoop is better, but for many analyses we only read through the data once. The data transferred over the network is then the same as when dispatching data in the Hadoop model.
□ MinION USB stick gene sequencer finally comes to market:
>> http://www.extremetech.com/extreme/190409-minion-usb-stick-gene-sequencer-finally-comes-to-market …
□ BigDataScript: a scripting language for data pipelines: simplifies implementation of complex bioinformatics pipelines
>> http://bioinformatics.oxfordjournals.org/content/early/2014/10/01/bioinformatics.btu595.full …
The Go executable invokes the main BDS component, written in Java, which performs lexing, parsing, and compilation to an AST, and then runs the AST.
An implementation of a sequencing data analysis pipeline with BDS: (i) maps reads to a reference genome using BWA, (ii) calls variants using GATK's HaplotypeCaller, and (iii) annotates variants using SnpEff and SnpSift. The pipeline makes efficient use of computational resources by making sure tasks are parallelized whenever possible.
□ Tangram: a comprehensive toolbox for mobile element insertion detection:
>>