lecture_notes:04-28-2010


John St. John's lecture on EULER-SR and Celera; Michael Cusack's lecture on MIRA

Misc Notes:

campusrocks is broken!

Pog has 2 repeats: ~1k & 1.1k
use makefiles, not shell scripts!

SOLiD data formats:
.csfasta = colorspace with numbers
.de = double-encoded colorspace: the color digits are rewritten as letters (0123 → ACGT), but they are still colors, not bases! very confusing.
.fa is the real basespace

Kevin, while mapping reads to join the newbler contigs, found a bug in the Python script that maps the SOLiD reads. It was detected because there were no joining reads for the two contigs that joined the extrachromosomal reads. The cause was a sign error in one of the tests in the script. He re-did the colorspace mapping on the newbler5 assembly. There may still be a bug, since one gap is covered by 10 thousand reads, whereas the one on the other side is covered by only 200 reads. He will be looking to see if there is another bug. If you have mate-pair data, it's good to have software that checks for correct answers: use the Pog mate-pair data to compare the results of the other assembly tools.

Euler-SR

SR == Short Reads

Euler-SR is a short-read De Bruijn Graph assembler that can use long reads and mate-pairs.
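
As a reminder of what a De Bruijn graph is, here is a toy sketch: each distinct k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases. This is only the textbook construction, not Euler-SR's code; it ignores reverse complements, error correction, and mate-pairs, and the function name is made up.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:      # one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

With reads ACGT and CGTA and k=3, the two reads share the k-mer CGT, so they collapse into one path AC → CG → GT → TA. That merging of overlapping reads without pairwise alignment is the point of the approach.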

euler-sr-assembly1/
Ran on 454 data with the Sanger data concatenated into one file.

Have to set up env vars.
No make install options.
Things are mixed up.
You have to run it from where you installed it:
${EUSRC}

It ran well the first time (it ran, at least)

${EUSRC}/assembly/Assemble.pl pogreads.fasta 25

result:
~2k contigs whose total length is about 2x the genome length… suspicious
are contigs overlapping?
find out:
check contig-blat_strict_match (blat alignment to reference genome)
look for “Q name” (contigs) which match to the same “T start” positions on the reference genome
answer: yes, they appear to overlap a lot; the double coverage is because contigs totally overlap
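
The overlap check above can be scripted instead of eyeballed. The sketch below assumes the blat output is in the standard tab-separated PSL layout (Q name in column 10, T name in column 14, T start and T end in columns 16 and 17, counting from 1); the function name is made up, and treating contig-blat_strict_match as a PSL file is an assumption.

```python
def overlapping_contigs(psl_lines):
    """Return (contig_a, contig_b, overlap_bp) for hits that overlap on the target."""
    hits = []
    for line in psl_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 21 or not f[0].isdigit():
            continue                  # skip the PSL header and blank lines
        # 0-based PSL columns: 9 = Q name, 13 = T name, 15 = T start, 16 = T end
        hits.append((f[13], int(f[15]), int(f[16]), f[9]))
    hits.sort()
    pairs = []
    for i, (t_name, t_start, t_end, q_name) in enumerate(hits):
        for other_t, o_start, o_end, o_name in hits[i + 1:]:
            if other_t != t_name or o_start >= t_end:
                break                 # sorted by start: no later hit can overlap
            if o_name != q_name:
                pairs.append((q_name, o_name, min(t_end, o_end) - o_start))
    return pairs
```

Usage would be something like overlapping_contigs(open("contig-blat_strict_match")). Summing the overlap_bp values gives a rough idea of how much of the 2x length comes from contigs covering the same reference positions.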

There is one 91k contig.

Things to try to improve the run:
- longer k-mers, increasing to 31 should be easy
- increase frequency threshold (help make up for read errors, maybe?)
- throw out the tiny contigs, reduce your cutoff.

It does have an option for some simple quality filtering of the reads
(if quality data such as fastq is used?)
-minmult: looks at how many things map to this area;
if fewer than this many things do, throw it out.
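
A generic way to picture a multiplicity cutoff like this is to count k-mers across all reads and keep only those seen often enough. The sketch below is that generic idea only; how Euler-SR actually applies -minmult internally is not covered in these notes, and the function and parameter names are made up.

```python
from collections import Counter

def solid_kmers(reads, k, min_mult):
    """Return the k-mers that occur at least min_mult times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_mult}
```

With reads ACGTA, ACGTT, ACGAA, k=3 and min_mult=2, only ACG and CGT survive; the k-mers seen once (likely read errors, or just low coverage) are dropped. Raising the threshold removes more errors but also starts removing real low-coverage sequence.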

Error-correct reads, construct repeat graph,
simplify repeat graph with mate-reads.
Error correction by threading.
Tries to make minimal corrections to the beginnings of reads,
uses those to make the k-mers. Later threads the full read length through.

“Error Correction via threading”
- took reads that “they couldn’t make error free”
- made contigs out of these
- tried to map them back to the “error-free” contigs
- perhaps this is where it went wrong?

Mate reads.
Multiple paths of similar length are hard to disambiguate.
You can use multiple matepairs and bootstrap analysis.
Use the paths with the highest probability.
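
One way to picture "highest probability" plus bootstrapping: score each candidate path by how well the insert sizes it implies fit the library's insert-size distribution, then resample the mate pairs to see how stable the winner is. This is a sketch under assumptions (normal insert sizes, made-up function names and inputs), not a description of what Euler-SR does.

```python
import random

def best_path(path_lengths, mate_offsets, mean_insert, sd_insert):
    """Pick the path whose implied insert sizes best fit a normal model.

    path_lengths: candidate path lengths (bp) between the two anchor points.
    mate_offsets: for each mate pair, the bases the pair spans outside the path.
    """
    def log_likelihood(length):
        # log of a normal density, dropping terms that are equal for all paths
        return sum(-0.5 * ((off + length - mean_insert) / sd_insert) ** 2
                   for off in mate_offsets)
    scores = [log_likelihood(length) for length in path_lengths]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def bootstrap_support(path_lengths, mate_offsets, mean_insert, sd_insert,
                      rounds=200, seed=0):
    """Fraction of bootstrap resamples of the mate pairs in which each path wins."""
    rng = random.Random(seed)
    wins = [0] * len(path_lengths)
    for _ in range(rounds):
        sample = [rng.choice(mate_offsets) for _ in mate_offsets]
        wins[best_path(path_lengths, sample, mean_insert, sd_insert)[0]] += 1
    return [w / rounds for w in wins]
```

When two paths have nearly the same length, their scores are nearly equal and the bootstrap support splits between them, which is the "hard to disambiguate" case from the lecture.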

Pog repeats aside:
There are several large homologous regions on opposite strands
in the Pog data that act as a kind of repeat.
They are at both ends of the area that inverts.
Inversion happens by homologous matching, then swapping of the two strands.
Like a sloppy integrase.

Solid data.
Used the regular base-space data in colorspace_input.fa (not double-encoded).
Tried to run on just the SOLiD data… started on Sunday, but still running (Wed)

Celera Assembler:

needs qual info (need this from Sanger reads, too)
… so can't run unless you have the .qual files

seemed to have a script to convert Illumina → their format… but not released yet

result:
with 454 data alone: 386 contigs
(newbler: ~40 contigs)

took about 50min
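
Contig counts like these are easier to compare across assemblers alongside total length and N50. The helper below is a small sketch for that; the function name is made up, the 500 bp default cutoff is just a common convention, and N50 is not a statistic reported in the lecture.

```python
def contig_summary(lengths, min_len=500):
    """Summarize contig lengths: count, total, largest, and N50 above min_len."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    running, n50 = 0, 0
    for n in kept:
        running += n
        if running * 2 >= total:     # reached half of the assembled bases
            n50 = n
            break
    return {"count": len(kept), "total": total,
            "largest": kept[0] if kept else 0, "n50": n50}
```

A total well above the expected genome size points at redundant contigs, and none of these numbers say whether the contigs are correct, so they should be read together with a mapping to the reference.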

Mira

needs datafile named pog_in.[format].fa
sff_extract script to create .qual files

created 30 contigs >=500 (largest contig 640k)
but… upon mapping to the reference genome,
it turns out that while it is making big contigs, it is producing a chimeric assembly, in which the contigs join genomic regions that are not truly adjacent. It's getting bigger contigs because it's joining them incorrectly!
This is very bad; worse even than a lot of small contigs.
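
A rough way to flag chimeric contigs from the reference mapping: if different pieces of one contig align to the reference at inconsistent offsets (or on different strands), the contig probably joins regions that are not adjacent. This is a sketch under assumptions; the function name, the input layout, and the 1000 bp tolerance are all made up, and a real inversion or a circular reference would also trigger it.

```python
def looks_chimeric(blocks, tolerance=1000):
    """Flag a contig whose aligned pieces imply inconsistent reference offsets.

    blocks: (q_start, t_start, strand) tuples for one contig's alignments to a
    single reference sequence, with q_start in the orientation of the alignment.
    """
    if len(blocks) < 2:
        return False                  # one piece cannot disagree with itself
    if len({strand for _, _, strand in blocks}) > 1:
        return True                   # pieces align to different strands
    offsets = [t_start - q_start for q_start, t_start, _ in blocks]
    return max(offsets) - min(offsets) > tolerance
```

For example, pieces at (0, 10000, "+") and (5000, 15000, "+") are consistent, while (0, 10000, "+") and (5000, 400000, "+") are flagged. Flagged contigs still need a look by hand before being called misassemblies.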

lecture_notes/04-28-2010.1272777767.txt.gz · Last modified: 2010/05/01 22:22 by galt