Illumina sequencing technology
Administrative
We need to be better about documenting everything.
Need to have clear documentation of how each file was produced (and why)
Be more careful with metadata: the processing results (FastQC/preqc) should be clearly linked to the data set they were made from.
Need more discussion regarding results
Make a page for each data set that was collected, linking relevant information
The wiki needs more analysis of information and less information dumping. (Look at the old data for examples.)
Make sure everything you do (process, results, discussion, notes) ends up on the wiki! If it isn't up, it never happened.
High-throughput sequencing techniques
Background
Sanger sequencing
NOT high-throughput sequencing
Was the standard for decades
Huge problem: higher throughput basically meant buying lots more (expensive) machines
High-throughput sequencing
Everything in this class is based on the idea of a complex DNA sample. Basically you make the library without knowing anything about the DNA. Basic idea:
Get DNA (Complex DNA sample) -> Adapter Ligation -> PCR amplification & sequencing
The difference between different technologies is generally in the PCR amplification & sequencing steps.
We can't amplify everything together because there would be too much noise, so library molecules are physically separated first. Each technology does this in its own way. One approach is to dilute down to a single molecule and then amplify, but this has low throughput. Newer technologies have clever ways to physically partition molecules from each other and then amplify them all at the same time.
Illumina library construction
Sonicate to break DNA into small templates
Repair DNA to get blunt ends
If the shorter strand ends in a 3' end, you can extend it (5' to 3') until the end is blunt. Otherwise you have to remove the longer, overhanging end, because DNA can only be extended in one direction.
Ligate adaptors (P5, P7)
Fill in adaptors (they are designed to be extendable)
PCR amplification
Uses an indexing oligo on one end that has a barcode specific to the library, so that later you will know where the sequence came from
This step introduces lots of bias (more on this later)
Sequence
Attach primers to either side of the target segment
Note that DNA can only be extended on the 3’ end (There is an -OH on the ribose at the 3’ end).
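The library-specific barcode added by the indexing oligo during PCR amplification is what later lets each read be traced back to its library. A minimal sketch of that lookup is below. The barcode sequences, library names, and the function name demultiplex are hypothetical, and only exact barcode matches are handled (real pipelines usually tolerate a mismatch or two).

```python
def demultiplex(reads, barcode_to_library):
    """Group reads by library using exact matches on the index (barcode) read.

    reads: iterable of (index_sequence, read_sequence) pairs.
    barcode_to_library: dict mapping a barcode to a library name (hypothetical values).
    Reads whose barcode is not in the table go to "undetermined".
    """
    by_library = {}
    for index_seq, read_seq in reads:
        library = barcode_to_library.get(index_seq, "undetermined")
        by_library.setdefault(library, []).append(read_seq)
    return by_library


# Hypothetical barcodes and library names, for illustration only.
barcodes = {"ACGTAC": "slug_lib_A", "GGATCC": "slug_lib_B"}
example_reads = [
    ("ACGTAC", "TTTT"),
    ("GGATCC", "CCCC"),
    ("NNNNNN", "AAAA"),  # unreadable barcode
    ("ACGTAC", "GGGG"),
]
for library, seqs in demultiplex(example_reads, barcodes).items():
    print(library, len(seqs))
```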
PCR biases
Other notes
If you want a random subset of your reads, it generally works to take a contiguous block of them, as long as you skip the first million or so. This is easy using head and tail on Linux.
The main factor in seeing duplicates isn't how many rounds of PCR you do, it's how many unique molecules you started with (due to the orders of magnitude of each)
3 pg is roughly 3 gigabases, therefore 1 pg ≈ 1 Gb
2 pg = 1 banana slug genome
1 µg / 2 pg = 500,000 genomes
There are 4 million 500-mers in a banana slug genome (2 Gb / 500 bp)
4 million x 500,000 = 2,000,000,000,000 total fragments for the MiSeq run
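The arithmetic above can be checked with a short sketch, which also shows why the starting number of unique molecules dominates the duplicate rate. Assumptions not in the notes: the input mass is taken as 1 µg (the amount that gives the 500,000 genome copies quoted above), reads are modeled as uniform random draws with replacement from the pool of unique fragments (a Poisson approximation that ignores PCR bias), and the read count of 20 million is a hypothetical figure used only for illustration.

```python
import math

# Rules of thumb from the notes.
PG_PER_GB = 1.0        # 1 pg of DNA is roughly 1 Gb
GENOME_PG = 2.0        # one banana slug genome is roughly 2 pg
INPUT_PG = 1e6         # ASSUMPTION: 1 ug of input DNA = 1,000,000 pg
FRAGMENT_BP = 500      # fragment size used in the notes

genome_bp = GENOME_PG / PG_PER_GB * 1e9            # 2 Gb
genome_copies = INPUT_PG / GENOME_PG               # genomes in the input
fragments_per_genome = genome_bp / FRAGMENT_BP     # 500-mers per genome
unique_fragments = genome_copies * fragments_per_genome


def expected_duplicate_fraction(reads, unique_molecules):
    """Expected fraction of reads that duplicate an earlier read.

    Model (an assumption): each read is a uniform random draw, with
    replacement, from `unique_molecules` distinct fragments, so the
    expected number of distinct fragments seen is M * (1 - exp(-N / M)).
    """
    distinct = -unique_molecules * math.expm1(-reads / unique_molecules)
    return 1.0 - distinct / reads


HYPOTHETICAL_READS = 2e7  # illustrative read count, not from the notes
print(f"genome copies:        {genome_copies:,.0f}")
print(f"fragments per genome: {fragments_per_genome:,.0f}")
print(f"unique fragments:     {unique_fragments:,.0f}")
print(f"duplicate fraction:   {expected_duplicate_fraction(HYPOTHETICAL_READS, unique_fragments):.2e}")
```

Under this model the duplicate fraction depends only on the ratio of reads to unique molecules, so a library that starts from far fewer unique molecules shows many more duplicates at the same sequencing depth.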
FASTQ reads often come ordered by the tile on which they were sequenced
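Taking a contiguous block of reads after skipping the first million or so can be sketched as follows. This assumes standard four-line FASTQ records (no wrapped sequence lines); the function names are made up for this example. The same thing can be done with head and tail, as long as the line counts used there are multiples of four.

```python
import itertools


def fastq_records(lines):
    """Yield FASTQ records as 4-tuples: (header, sequence, separator, qualities).

    Assumes every record is exactly four lines; a truncated final record
    is silently dropped.
    """
    it = iter(lines)
    while True:
        record = tuple(itertools.islice(it, 4))
        if len(record) < 4:
            return
        yield record


def contiguous_subset(lines, skip, take):
    """Skip the first `skip` records, then return the next `take` records."""
    return list(itertools.islice(fastq_records(lines), skip, skip + take))


# Small in-memory demonstration with made-up reads.
demo_lines = []
for i in range(6):
    demo_lines += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

for header, seq, _, qual in contiguous_subset(demo_lines, skip=2, take=3):
    print(header.strip(), seq.strip(), qual.strip())
```

For a real run, an open file handle can be passed as lines, for example with skip=1_000_000 and take set to the number of reads wanted.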
Dilution parameter (critical!)
Too dilute and not enough colonies form
Too concentrated and colonies merge into each other, making sequencing impossible
Sequencing by synthesis
Add a new base (fluorescently tagged) and watch for the light
Quality is determined by how clearly the signal stands out from that of the other bases: the closer it is to another base's signal, the lower the quality
The “secret sauce” of Illumina sequencing: getting a polymerase that will accept the modified bases as legitimate, and that won't back up for error correction (i.e. has no proofreading activity)
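How the instrument turns signal separation into a quality value is not covered in these notes. What reaches the FASTQ file is a per-base Phred score Q, where the estimated error probability is 10^(-Q/10). The sketch below decodes a quality string, assuming the common Phred+33 ASCII encoding (older Illumina pipelines used an offset of 64); the function names are made up for this example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    ASSUMPTION: Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Made-up quality string: high quality at the start, dropping toward the end.
example = "IIII5555!!"
for q in sorted(set(phred_scores(example)), reverse=True):
    print(f"Q{q}: error probability {error_probability(q):.4g}")
```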
Limitations of Illumina sequencing