Illumina sequencing technology

Administrative

We need to be better about documenting everything.

Need to have clear documentation of how each file was produced (and why)
Be more careful with metadata. Specifically, the processing results (FastQC/preqc) should be clearly linked to the data set they were made from.
Need more discussion regarding results
Make a page for each data set that was collected, linking relevant information
- Should have detailed info about the data set and what was done to it, with discussion and interpretation of results, if applicable
The wiki needs more analysis of information and less information dumping. (Look at the old data for examples.)
Make sure everything you do (process, results, discussion, notes) ends up on the wiki! If it isn't up, it never happened.

High-throughput sequencing techniques

Background

Sanger sequencing
- NOT high-throughput sequencing
- Was the standard for decades
- Huge problem: higher throughput basically meant buying lots more (expensive) machines

High-throughput sequencing

Everything in this class is based on the idea of a complex DNA sample. Basically you make the library without knowing anything about the DNA. Basic idea:

  Get DNA (Complex DNA sample) -> Adapter Ligation -> PCR amplification & sequencing

The technologies generally differ in the PCR amplification & sequencing steps.

We can't amplify everything because there is too much noise, so library molecules are physically separated first. Each technology does this in its own way. One approach is to dilute down to a single molecule and then amplify, but this has low throughput. Newer technologies have clever ways to physically partition the molecules from each other and then amplify them all at the same time.
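The trade-off in the dilution approach can be illustrated with a small calculation. The sketch below assumes that molecules land in partitions independently, so the number per partition follows a Poisson distribution; that model and the function name `occupancy_probabilities` are illustrative assumptions, not something stated in these notes.

```python
import math


def occupancy_probabilities(mean_molecules_per_partition):
    """Return (P(empty), P(exactly one), P(two or more)) for one partition.

    Assumes a Poisson model: molecules are distributed independently
    and uniformly across partitions.
    """
    lam = mean_molecules_per_partition
    p_empty = math.exp(-lam)
    p_single = lam * math.exp(-lam)
    p_multiple = 1.0 - p_empty - p_single
    return p_empty, p_single, p_multiple


# Compare a few dilution levels (mean molecules per partition).
for lam in (0.1, 0.5, 1.0, 2.0):
    p0, p1, p_multi = occupancy_probabilities(lam)
    print(f"mean={lam}: empty={p0:.3f} single={p1:.3f} multiple={p_multi:.3f}")
```

Under this model, heavy dilution leaves most partitions empty (low throughput), while too little dilution puts several different molecules in the same partition (mixed signal).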

Illumina library construction

Sonicate to break DNA into small templates
Repair DNA to get blunt ends
1. If the recessed (shorter) strand has a free 3' end, it can be extended 5' to 3' to fill in the overhang. Otherwise the longer end has to be trimmed back, because DNA can only be extended in one direction.
Ligate adaptors (P5, P7)
Fill in adaptors (they are designed to be extendable)
PCR amplification
1. Uses an indexing oligo on one end that has a barcode specific to the library, so that later you will know where the sequence came from
2. This step introduces lots of bias (more on this later)
Sequence
1. Attach primers to either side of the target segment

Note that DNA can only be extended at the 3’ end (there is a free -OH on the deoxyribose at the 3’ end).
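The barcode added by the indexing oligo is what later lets each read be assigned back to its library. A minimal sketch of that assignment is below. The barcode sequences, library names, and the one-mismatch tolerance are all hypothetical choices for illustration; real demultiplexing software may work differently.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_library(index_read, barcodes, max_mismatches=1):
    """Return the library whose barcode matches the index read.

    A match is a barcode of the same length within max_mismatches.
    Returns None when no barcode matches or when more than one does.
    """
    hits = [
        name
        for name, barcode in barcodes.items()
        if len(barcode) == len(index_read)
        and hamming(index_read, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes for two libraries.
BARCODES = {"slug_lib_A": "ACGTAC", "slug_lib_B": "TGCATG"}

print(assign_library("ACGTAC", BARCODES))  # exact match
print(assign_library("ACGTAA", BARCODES))  # one sequencing error
print(assign_library("GGGGGG", BARCODES))  # matches nothing
```

Rejecting ambiguous reads, rather than guessing, keeps a sequencing error in the index from silently moving a read into the wrong library.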

PCR biases

Shorter fragments are amplified more than longer ones
GC content: anything that is especially GC-rich or -poor will be a problem to amplify
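To see which fragments are likely to be affected by GC bias, it can help to compute GC content directly. The sketch below does that and flags extremes. The 30% and 70% cut-offs are arbitrary assumptions for illustration; the notes do not give thresholds.

```python
def gc_fraction(seq):
    """Return the fraction of called bases (A, C, G, T) that are G or C.

    Ambiguous bases such as N are ignored. Returns None if the
    sequence contains no called bases.
    """
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


def is_gc_extreme(seq, low=0.30, high=0.70):
    """Flag sequences whose GC fraction falls outside [low, high].

    The default cut-offs are illustrative assumptions, not measured values.
    """
    gc = gc_fraction(seq)
    return gc is not None and (gc < low or gc > high)


for fragment in ("ATATATATAT", "ACGTACGTAC", "GCGCGGCCGC"):
    print(fragment, gc_fraction(fragment), is_gc_extreme(fragment))
```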

Other notes

If you want a random subset of your reads, it generally works to take a contiguous block of them, as long as you skip the first million or so. This is easy using head and tail on Linux.
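The same block selection can be sketched in Python. This assumes an uncompressed FASTQ file with exactly four lines per read (no wrapped sequence lines); the function names, the default counts, and the file names in the usage note are placeholders.

```python
from itertools import islice


def take_read_block(fastq_lines, skip_reads=1_000_000, n_reads=100_000):
    """Return an iterator over the lines of n_reads consecutive records.

    The first skip_reads records are skipped. Assumes four lines per
    FASTQ record.
    """
    start = 4 * skip_reads
    stop = start + 4 * n_reads
    return islice(fastq_lines, start, stop)


def write_read_block(in_path, out_path, skip_reads=1_000_000, n_reads=100_000):
    """Copy a contiguous block of reads from one FASTQ file to another."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(take_read_block(src, skip_reads, n_reads))
```

For example, `write_read_block("reads.fastq", "subset.fastq")` would keep reads 1,000,001 through 1,100,000. On the command line, `tail -n +4000001 reads.fastq | head -n 400000` selects the same lines.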
The main factor in seeing duplicates isn't how many rounds of PCR you do, it's how many unique molecules you started with (due to the orders of magnitude of each)
- 3 pg is roughly 3 gigabases, therefore 1 pg ≈ 1 Gb
- 2 pg = 1 banana slug genome
- 1 µg / 2 pg = 500,000 genomes
- There are 4 million 500-mers in a banana slug genome (2 Gb / 500 bp)
- 500,000 genomes × 4 million fragments = 2,000,000,000,000 total fragments for the MiSeq run
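The arithmetic above can be checked with a few lines of Python. The sketch assumes 1 µg of input DNA, which is the amount that reproduces the 500,000 genomes in the list. The duplicate estimate at the end is an added illustration: it assumes each read is drawn uniformly at random, with replacement, from the pool of unique molecules, and the read count used is a made-up example value.

```python
import math

PG_PER_GB = 1.0          # from the notes: 1 pg of DNA is roughly 1 Gb
GENOME_SIZE_GB = 2.0     # from the notes: one genome is about 2 pg
INPUT_DNA_PG = 1e6       # assumption: 1 microgram of input DNA
FRAGMENT_LENGTH = 500    # fragment size used in the notes

genome_mass_pg = GENOME_SIZE_GB * PG_PER_GB
genome_copies = INPUT_DNA_PG / genome_mass_pg
fragments_per_genome = GENOME_SIZE_GB * 1e9 / FRAGMENT_LENGTH
unique_fragments = genome_copies * fragments_per_genome


def expected_duplicate_fraction(n_reads, n_unique_molecules):
    """Expected fraction of reads that repeat an already-seen molecule.

    Assumes uniform sampling with replacement from n_unique_molecules.
    """
    x = n_reads / n_unique_molecules
    expected_distinct = -n_unique_molecules * math.expm1(-x)
    return 1.0 - expected_distinct / n_reads


print(f"genome copies:        {genome_copies:,.0f}")
print(f"fragments per genome: {fragments_per_genome:,.0f}")
print(f"unique fragments:     {unique_fragments:,.0f}")

# Hypothetical read count, chosen only to show the scale of the effect.
example_reads = 20_000_000
print(expected_duplicate_fraction(example_reads, unique_fragments))
```

With so many more unique molecules than reads, the expected duplicate fraction stays tiny no matter how much each molecule was amplified, which is the point made above about starting material.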
FASTQ files are often ordered by the tile on which each read was sequenced
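To check how reads are distributed across tiles, the tile number can be pulled out of each read header. The sketch below assumes the colon-separated header layout `@instrument:run:flowcell:lane:tile:x:y`, optionally followed by a space and more fields; older files use a different layout, so verify this against your own data. The example header is made up.

```python
from collections import Counter

# Hypothetical header in the assumed layout.
EXAMPLE_HEADER = "@M00001:12:000000000-A1B2C:1:1101:15589:1332 1:N:0:1"


def tile_of(header):
    """Extract the tile number from a FASTQ header line.

    Assumes the first whitespace-separated token has seven
    colon-separated fields with the tile in the fifth position.
    """
    fields = header.split()[0].split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    return int(fields[4])


def reads_per_tile(fastq_lines):
    """Count reads per tile, assuming four lines per FASTQ record."""
    counts = Counter()
    for line_number, line in enumerate(fastq_lines):
        if line_number % 4 == 0:  # header line of each record
            counts[tile_of(line)] += 1
    return counts


print(tile_of(EXAMPLE_HEADER))
```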
Dilution parameter (critical!)
- Too dilute and not enough colonies form
- Too concentrated and colonies merge into each other, making sequencing impossible
Sequencing by synthesis
- Add a new base (fluorescently tagged) and watch for the light
- Quality is determined by how close the signal is to that of another base
- The “secret sauce” of Illumina sequencing: getting a polymerase that will accept the modified bases as legitimate, and that won't back up for error correction
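The idea that quality depends on how close the signal is to that of another base can be made concrete with a toy base caller. The sketch takes one intensity per base for a single cycle, calls the brightest, and reports a purity score: brightest divided by the sum of the brightest and second brightest. The intensity values are invented, and this score is only an illustration; how the instrument software actually computes quality scores is not covered in these notes.

```python
def call_base(intensities):
    """Call the brightest base and score how cleanly it stands out.

    intensities maps each base to a non-negative signal for one cycle.
    Returns (base, purity), where purity is brightest / (brightest +
    second brightest): 1.0 is a clean call, 0.5 is a coin flip.
    Returns purity 0.0 if there is no signal at all.
    """
    ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
    best_base, best_signal = ranked[0]
    second_signal = ranked[1][1]
    total = best_signal + second_signal
    purity = best_signal / total if total > 0 else 0.0
    return best_base, purity


# Invented intensities: one clean cycle and one ambiguous cycle.
clean = {"A": 900.0, "C": 50.0, "G": 30.0, "T": 20.0}
ambiguous = {"A": 500.0, "C": 480.0, "G": 40.0, "T": 30.0}

print(call_base(clean))
print(call_base(ambiguous))
```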

Limitations of Illumina sequencing

Slow run time
- Each step involves real live chemistry
- A 2×100 run typically takes a week to 10 days
Short reads (~150 nt)
Fluorophore overlap
Out-of-phase accumulation
- If one template messes up slightly (by not incorporating a base at one step), then it will be out of phase forever. The number of templates getting out of phase increases with time, so the signal goes down and the noise goes up. This is why quality scores drop off at the ends of reads, and what limits the length of reads, since at some point it's just not worth continuing anymore.
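The accumulation described above can be modelled very simply. If a fixed fraction of the in-phase templates in a cluster fails to incorporate a base at each cycle, and never recovers, the in-phase fraction after n cycles is (1 - rate)^n. The sketch below uses that model. The 0.5% per-cycle rate and the 50% cut-off are made-up values for illustration, and the model ignores templates that run ahead.

```python
def in_phase_fraction(cycle, dephase_rate):
    """Fraction of templates still in phase after a number of cycles.

    Assumes each in-phase template independently falls out of phase
    with probability dephase_rate per cycle and never recovers.
    """
    return (1.0 - dephase_rate) ** cycle


def max_useful_cycles(dephase_rate, min_in_phase=0.5):
    """Last cycle at which the in-phase fraction is still >= min_in_phase."""
    if not 0.0 < dephase_rate < 1.0:
        raise ValueError("dephase_rate must be strictly between 0 and 1")
    cycle = 0
    while in_phase_fraction(cycle + 1, dephase_rate) >= min_in_phase:
        cycle += 1
    return cycle


# Hypothetical per-cycle dephasing rate of 0.5%.
for cycle in (1, 50, 100, 150):
    print(cycle, round(in_phase_fraction(cycle, 0.005), 3))
print(max_useful_cycles(0.005))
```

Even a small per-cycle failure rate compounds, so the usable read length is bounded by how quickly the in-phase signal decays.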


Banana Slug Genomics
