====== Lecture Notes for April 09, 2010 ======

Add to these lecture notes with any notes you have!

===== Notes before main lecture =====

Take the fix mode script from /projects/compbio/bin/scripts and replace the protein user group with the BME 235 user group.

Next week we will have a reference genome (POG) to use for testing the tools on. For the most part POG is done; however, there is still some uncertainty with 8 SNPs left. It is definitely past the MIAMI standard at this point.

Note about sequencing platform quality scores: most platforms are trying to use the phred quality score, so the quality score is comparable between platforms and runs.

It can be informative, once reads are mapped, to look at the quality scores for reads with observed errors.

//Pog// assembly is down to only 8 SNPs & one potentially variable insert.

===== Main lecture: Assembler graphs =====

Types of assembler graphs:
  * Overlap graph
  * de Bruijn graph

The difference is "What are the nodes?"
  * Overlap: reads
  * de Bruijn: k-mers (usually fixed k, k <= length(read))

=== Overlap graphs ===

  * read -> read (a directed graph)

<code>
A _______________
        | | | | | |
        __________________ B
</code>

The problem is the direction of the reads when aligning:
  * 4 different edge scenarios:
    * -> -> (A -> B)
    * -> <- (A -> B' or B -> A')
    * <- -> (B' -> A or A' -> B)
    * <- <- (B -> A)
  * 3 different edge types:
    * same dir: A to B / B to A
    * tail-to-tail: A' to B
    * head-to-head: A to B'

Need to have some tolerance for error because the reads are noisy.

=== de Bruijn graphs ===

  * kmer -> kmer -> kmer -> kmer …

<code>
|----------|
 |----------|
  |----------|
   |----------|
    …
</code>

No different than a count of (k+1)-mers.

Ways to handle representing the graph:
  * Direct addressing (becomes useless for assembly, need to map to unique places)
  * Hashing (key = size of kmer, have to go to 20…25 mer for the keys to start to be unique)

May run into problems with RAM on the computational nodes.
  * ~16 GB per core: was that 4 cores sharing 16 GB, or each core with 16 GB?
  * The files /proc/meminfo and /proc/cpuinfo contain information about the node

With overlap graph:
  * A -> B
  * A -> C
  * A -> D

<code>
|----------| A
      |----------| B
      |----------| C
      |----------| D
</code>

Don't know where to go, or which copy of the repeat we are currently in.

In the ideal situation for a de Bruijn graph:

<code>
kmer -> kmer -> kmer -> kmer -> kmer (done!)
</code>

Realistically, there are issues:

Spurs:

<code>
kmer -> kmer -> kmer -> kmer -> kmer
            \-> kmer -> kmer -> kmer (off to nowhere)
</code>

Collapse bubbles:

<code>
     /-> kmer -> kmer -> kmer -\
kmer -> kmer -> kmer -> kmer -> kmer
</code>

Other issues:

Loop:

<code>
kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer
            \- kmers <-/
</code>

Take the loop?

Multiple paths:

<code>
A                       B
 \                     /
  ------------------->
 /                     \
B                       A'
</code>

Which path to take?

If you have clean data, you can disambiguate some issues. The largest bias usually comes from PCR used for amplification.

Need to collapse the graph (both overlap and de Bruijn) to assemble the reads.
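
The de Bruijn construction and the collapsing step above can be sketched in a few lines of Python. This is only an illustration, not how any particular assembler does it: the function names ''build_de_bruijn'' and ''collapse_linear_paths'' are made up for these notes, the reads are assumed to be error-free and all on one strand (reverse complements are ignored), and a closed loop with no branch point would not be reported. Edges are stored in a hash (dict) keyed by k-mer, and each edge weight is the count of the corresponding (k+1)-mer.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are k-mers; every (k+1)-mer in a read adds an edge prefix -> suffix."""
    edges = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k):
            kp1 = read[i:i + k + 1]
            # Edge weight is simply the number of times this (k+1)-mer was seen.
            edges[kp1[:-1]][kp1[1:]] += 1
    return edges


def collapse_linear_paths(edges):
    """Merge chains of one-in/one-out nodes into contigs (hypothetical helper)."""
    indeg = defaultdict(int)
    for src, dsts in edges.items():
        for dst in dsts:
            indeg[dst] += 1
    nodes = set(edges) | set(indeg)

    def is_simple(node):
        # Exactly one predecessor and exactly one successor.
        return indeg[node] == 1 and len(edges.get(node, {})) == 1

    contigs = []
    for start in nodes:
        if is_simple(start):
            continue  # contigs only start at ends or branch points
        for nxt in edges.get(start, {}):
            contig = start
            while True:
                contig += nxt[-1]  # each step adds one new base
                if not is_simple(nxt):
                    break
                nxt = next(iter(edges[nxt]))
            contigs.append(contig)
    return sorted(contigs)


# Three overlapping reads from one sequence, plus one read that creates a spur.
reads = ["ATGGCGT", "GGCGTGC", "GTGCA", "GGCGA"]
graph = build_de_bruijn(reads, k=3)
print(collapse_linear_paths(graph))
# ['ATGGCG', 'GCGA', 'GCGTGCA']
```

The short contig ''GCGA'' is the spur: its edge weight is 1, while the competing edge out of ''GCG'' has weight 2, which is the kind of evidence that could be used to trim it. Without the fourth read the whole graph collapses to the single contig ''ATGGCGTGCA''.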


, 2010/04/10 04:22

Hi All,

At today's lecture Sol mentioned a recent paper describing FASTQ, a new standard for including base quality info with sequence data. The paper is:

  P.J.A. Cock, C.J. Fields, N. Goto, M.L. Heuer and P.M. Rice.
  The Sanger FASTQ file format for sequences with quality scores,
  and the Solexa/Illumina FASTQ variants.
  Nucleic Acids Research. 38(6):1767-1771 (2010).

Essentially, PHRED quality scores are treated as indexes into the ASCII table so they can be represented as single characters (that align nicely with their bases).
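
To make the encoding concrete, here is a small Python sketch. It assumes the Sanger variant described in the paper, where the character is the ASCII code of the PHRED score plus an offset of 33; the function names are just for illustration. The Illumina 1.3+ variant uses an offset of 64 instead, and the older Solexa variant uses a differently defined score, so it is not covered here.

```python
def phred_to_char(q, offset=33):
    """Encode one PHRED score as a single ASCII character (Sanger offset assumed)."""
    return chr(q + offset)


def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string back into a list of PHRED scores."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """PHRED definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


print(decode_quality("II5!"))   # [40, 40, 20, 0]
print(phred_to_char(40))        # I
print(error_probability(20))    # roughly 0.01, i.e. a 1 in 100 chance the base is wrong
```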


