Add to these lecture notes with any notes you have!
Take fix mode script from /projects/compbio/bin/scripts and replace protein user group with BME 235 user group.
Next week we will have a reference genome (Pyrobaculum oguniense, aka “Pog”) to use for testing the tools. For the most part Pog is done; however, there is still some uncertainty about the 8 remaining SNPs. It is definitely past the MIAMI standard at this point. (The Pog assembly is down to only 8 SNPs and one potentially variable insert.)
Note about sequencing platform quality scores: most platforms are trying to use the Phred quality score[1], so the quality score is theoretically comparable between the platforms and runs (although calibration causes scores to vary between runs and instruments nonetheless).
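For reference, the Phred scale relates a quality score Q to the probability P that the base call is wrong by Q = -10 log10(P). A minimal sketch of the conversion (the function names are made up for illustration, not taken from any sequencing toolkit):

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Q10 is 1 error in 10, Q20 is 1 in 100, Q30 is 1 in 1000, and so on.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Because the formula is the same everywhere, differences between platforms come from how well each instrument's reported Q matches its real error rate, which is the calibration issue noted above.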
It can be informative, once reads are mapped, to look at the quality scores for reads with observed errors.
Lior Pachter (from UC Berkeley) is visiting on Monday to speak about the Bowtie/TopHat/Cufflinks algorithms. (Bowtie: mapping; TopHat/Cufflinks: find splice junctions, predict spliced transcripts. Bowtie is used in a lot of the assembly algorithms.)
Types of assembler graphs:
The main difference between them is the answer to “What are the nodes?”
* read → * read (a directed graph)
A _______________ | | | | | | __________________ B
The problem with edges between contig nodes is in defining direction of the reads when aligning:
Need to have some tolerance for error because the reads are noisy. When creating read overlaps, if you require 100% identity, you’ll miss a lot of data. Overlaps also include the read ends, where quality falls off, so you need an “overlap quality score”.
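A rough sketch of an error-tolerant suffix/prefix overlap test. The function name, the minimum overlap length, and the mismatch threshold are all illustrative assumptions. It counts substitutions only, so it ignores indels and base qualities, which a real overlapper would have to handle:

```python
def best_overlap(a, b, min_len=20, max_mismatch_frac=0.05):
    """Longest suffix of `a` that matches a prefix of `b` within the tolerance.

    Returns (overlap_length, mismatches), or None if no overlap qualifies.
    Substitutions only; indels are ignored in this sketch.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch_frac * length:
            return length, mismatches
    return None

# One substitution in a 9-base overlap is accepted at 15% tolerance,
# but rejected when exact identity is required.
print(best_overlap("GGGGGACGTACGTA", "ACGTTCGTACCCCC", min_len=8, max_mismatch_frac=0.15))
print(best_overlap("GGGGGACGTACGTA", "ACGTTCGTACCCCC", min_len=8, max_mismatch_frac=0.0))
```

An “overlap quality score” could replace the plain mismatch count by weighting each mismatch with the Phred scores of the two bases involved, so that disagreements at low-quality read ends cost less.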
Can’t do all-vs-all searches (n^2 algorithms are not a good idea with billions of reads). So how do you decide which reads to try overlapping? Most algorithms apply a BLAST-like filter before trying to align edges (~n log n).
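One way such a filter can work is to index every read by its exact k-mers and only attempt alignment for pairs that share a seed. The sketch below uses a hash table rather than sorting, ignores reverse complements, and the function name and value of k are illustrative assumptions, not any particular assembler's method:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k=12):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for j in range(len(read) - k + 1):
            index[read[j:j + k]].add(i)
    pairs = set()
    for ids in index.values():
        # Every pair of reads sharing this seed is a candidate for alignment.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ACGTACGTAA", "GTACGTAATT", "CCCCCCCCCC"]
print(candidate_pairs(reads, k=6))
```

A seed that occurs in very many reads (from a repeat, say) still generates a quadratic number of pairs, so a practical filter would probably need to cap or skip over-represented k-mers.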
Side Note: For transcriptome libraries, if done properly, reads should have known strandedness, so they shouldn’t be run through algorithms that treat strandedness as arbitrary (story about problems with a prominent yeast microarray transcriptome analysis incorrectly finding a lot of “antisense” mRNAs due to library prep error).
* kmer → * kmer → * kmer → * kmer …
|----------| |-----------| |----------| |----------| …
The edges are no different than a count of (k+1)-mers.
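To make the (k+1)-mer remark concrete: each (k+1)-mer in a read is an edge from its first k bases to its last k bases, so counting (k+1)-mers gives the edge list with coverage. A minimal sketch (the function name is hypothetical; reverse complements and error handling are left out):

```python
from collections import Counter

def de_bruijn_edges(reads, k):
    """Count (k+1)-mers; each one is an edge from its k-mer prefix to its k-mer suffix."""
    edge_counts = Counter()
    for read in reads:
        for i in range(len(read) - k):
            kp1 = read[i:i + k + 1]
            edge_counts[(kp1[:-1], kp1[1:])] += 1
    return edge_counts

for (src, dst), count in de_bruijn_edges(["ACGTA", "CGTAC"], k=3).items():
    print(src, "->", dst, count)
```

The count on each edge is the read depth that the later cleanup steps (spur trimming, bubble collapsing, repeat copy-number guesses) rely on.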
Ways to handle representing the graph:
May run into problems with RAM on the computational nodes.
With overlap graph:
A → B
A → C
A → D
|----------| A |----------| B |----------| C |----------| D
Don't know where to go next, or which copy of the repeat you are currently in.
In ideal situation for de Bruijn graph:
kmer -> kmer -> kmer -> kmer -> kmer (done!)
Realistically, there are issues:
What if A→B and A→C and A→D BUT A→B and A→C are inconsistent with each other? … A becomes “end of contig”, because you aren’t sure where to go next. Also end of contig if there are no more edges from the node.
kmer -> kmer -> kmer -> kmer -> kmer \-> kmer -> kmer -> kmer (off to nowhere)
Path diverges but does not reconverge, resulting in source/sink dead-ends (these are likely due to read errors).
/-> kmer -> kmer -> kmer -\ kmer -> kmer -> kmer -> kmer -> kmer
The path splits due to a SNP but then converges. This can happen with real SNPs, read error SNPs, and real repeats which differ by a SNP or two.
kmer -> kmer -> kmer -> kmer -> kmer -> kmer -> kmer \- kmers <-/
Tandem repeats will generate a circle, but have edges in and out; hard to disambiguate copy number though. If the data is really clean (i.e. in/out edges are ~10 read-depth with low SD, and inside circle has ~20 read-depth with low SD), we can guess that there might be 2 copies of the repeat, but this is not highly reliable.
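The depth-ratio guess can be written down as follows. This is only a sketch of the arithmetic: the function name and the coefficient-of-variation cutoff used to decide whether the data are "clean" are assumptions made for illustration.

```python
from statistics import mean, pstdev

def estimate_copy_number(flank_depths, repeat_depths, max_cv=0.2):
    """Guess repeat copy number from depth inside the cycle vs. on the in/out edges.

    Returns (ratio, rounded_copies, clean), where `clean` is True only if both
    depth sets have a coefficient of variation at or below `max_cv`.
    """
    flank_mean, repeat_mean = mean(flank_depths), mean(repeat_depths)
    ratio = repeat_mean / flank_mean
    clean = (pstdev(flank_depths) / flank_mean <= max_cv
             and pstdev(repeat_depths) / repeat_mean <= max_cv)
    return ratio, round(ratio), clean

# In/out edges at ~10x and the cycle at ~20x suggest two copies.
print(estimate_copy_number([10, 9, 11, 10], [20, 21, 19, 20]))
```

When `clean` is False, the rounded copy number should probably not be trusted at all, in line with the caveat that this estimate is not highly reliable.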
A B \ / -------------------> / \ B A'
Which path to take?
If you have clean data, you can disambiguate some issues. Largest bias usually comes from PCR for amplification.
Algorithms (both overlap and de Bruijn) need to collapse bubbles and trim spurs.
Spurs: Discard if their read count is low.
Bubbles: Tricky, because they can represent real, divergent paths.
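A simplified sketch of spur trimming on a graph stored as a dictionary of edge counts. The function name, the length limit, and the count threshold are illustrative assumptions. It only handles dead ends that terminate in a sink and does not attempt bubble collapsing:

```python
def trim_spurs(edges, max_len=3, min_count=2):
    """Remove short, low-coverage dead-end branches from a directed graph.

    `edges` maps (src, dst) -> read count. Returns a new dict without the spurs.
    Only forward dead ends (branches that end in a sink) are handled here.
    """
    edges = dict(edges)
    changed = True
    while changed:
        changed = False
        out_deg, in_edges = {}, {}
        for (src, dst) in edges:
            out_deg[src] = out_deg.get(src, 0) + 1
            in_edges.setdefault(dst, []).append((src, dst))
        sinks = [n for n in in_edges if out_deg.get(n, 0) == 0]
        for sink in sinks:
            path, node = [], sink
            # Walk backwards along unbranched edges until we reach a fork.
            while len(in_edges.get(node, [])) == 1 and len(path) < max_len:
                edge = in_edges[node][0]
                path.append(edge)
                node = edge[0]
                if out_deg.get(node, 0) > 1:
                    break
            at_fork = out_deg.get(node, 0) > 1
            if at_fork and path and all(edges[e] < min_count for e in path):
                for e in path:
                    del edges[e]
                changed = True
                break  # degrees are now stale; recompute before continuing
    return edges

graph = {("a", "b"): 10, ("b", "c"): 10, ("c", "d"): 10, ("d", "e"): 10,
         ("b", "x"): 1, ("x", "y"): 1}
print(trim_spurs(graph))
```

The same low-count test is not safe for bubbles, since a real SNP or a diverged repeat copy can produce a branch with reasonable depth, which is why they are the trickier case.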
Discussion
Hi All,
At today's lecture Sol mentioned a recent paper describing FASTQ, a new standard for including base quality info with sequence data. The paper is:
Essentially, PHRED quality scores (plus a fixed offset, so that every score maps to a printable character) are treated as indexes into the ASCII table, so each score can be represented as a single character (which aligns nicely with its base).
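A small sketch of the encoding, assuming the offset of 33 used by the Sanger variant of FASTQ (some older Illumina pipelines used 64 instead, so check which variant your files use). The function names are made up for illustration:

```python
def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(c) - offset for c in qual_string]

def encode_quality(scores, offset=33):
    """Convert integer Phred scores back into a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)

# One character per base, so the quality line has the same length as the sequence line.
print(decode_quality("!+5?I"))
print(encode_quality([0, 10, 20, 30, 40]))
```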
Enjoy!