lecture_notes:04-30-2010

lecture_notes:04-30-2010 [2010/05/02 22:05]
jstjohn
lecture_notes:04-30-2010 [2010/05/05 20:53]
jmagasin
Line 303: Line 303:
 Even just getting the right answer within a factor of 10 was encouraging.
 +
 +
 +===== Second set of lecture notes =====
 +
 +From Jonathan Magasin.
 +
 +Today's lecture covered the files generated by Kevin's map-colorspace5
 +script, which will be necessary for the homework assignment: ordering
 +the POG contigs.  Also covered: how to roughly estimate the banana slug
 +genome size.
 +
 +==== Homework assignment: assemble POG ====
 +
 +From the newer newbler we have a different set of contigs, thirty-one
 +of them.  We also have mate pairs from SOLiD mapped to newbler
 +assembly 5.  The assignment is to order and orient the contigs.
 +Kevin has posted his solution in the assembly directory
 +(assemblies/Pog/map-colorspace5).
 +
 +=== 'delete' and 'invert' files ===
 +
 +The trim9 files are output from Kevin's map-colorspace5 script.  Most
 +of them are not for us.  The 'delete' files are for mate pairs where
 +both reads mapped but not at the appropriate distance: closer than 350
 +bases, or more than 5000 bases apart.  These files can be studied to
 +find massive deletions.
 +
 +The 'invert' files are for mate pairs whose reads were on opposite
 +strands (at any distance) and are useful for studying inversions.
 +However, inversions are not so useful here because they will not be
 +within a single contig.
 +
 +The 'between contigs' files are for mate pairs whose reads mapped to
 +different contigs.  Only cases with unique mappings are in these
 +files.  fixme: What file(s) is this?
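
The three categories above can be sketched as a small classifier. This is only an illustration of the rules as stated in the notes, not Kevin's script: the `MatePair` record, the category labels, and the order in which the tests are applied are all assumptions. Only the 350 and 5000 base thresholds come from the lecture.

```python
from dataclasses import dataclass

MIN_DISTANCE = 350    # closer than this is out of range (from the notes)
MAX_DISTANCE = 5000   # farther apart than this is out of range (from the notes)


@dataclass
class MatePair:
    # Hypothetical record layout; the real script's data structures may differ.
    f3_contig: str
    f3_pos: int
    f3_strand: str    # '+' or '-'
    r3_contig: str
    r3_pos: int
    r3_strand: str    # '+' or '-'


def classify(pair: MatePair) -> str:
    """Return a category label for a mate pair in which both reads mapped uniquely."""
    # Reads on different contigs: candidate evidence for joining contigs.
    if pair.f3_contig != pair.r3_contig:
        return "between"
    # Same contig, opposite strands, at any distance.
    if pair.f3_strand != pair.r3_strand:
        return "invert"
    # Same contig, same strand, but too close together or too far apart.
    distance = abs(pair.f3_pos - pair.r3_pos)
    if distance < MIN_DISTANCE or distance > MAX_DISTANCE:
        return "delete"
    return "ok"
```

Checking the contig first and the strand second is a design choice made for this sketch; the notes do not say how the script resolves a pair that fits more than one category.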
 +
 +=== trim9.out ===
 +
 +The trim9.out file is a summary of what happened during mapping.
 +The kmers are colorspace kmers, and the kmer lengths reported are the
 +lengths after trimming off nine.  'uniquely mapped' means one mapping
 +for the R3 read and one for the F3 read.  Note that some of the summary
 +stats (the 'wrong range' line) reflect bugs that have since been
 +fixed.  [Kevin reran the script on Friday.]
 +
 +Compared to the earlier mapping, Kevin has increased the aggressiveness.
 +
 +The deletions ('wrong range' line) appear to be from reads taken from
 +clones with multiple adapters.
 +
 +The forward and backward counts of mapped reads were checked for
 +(algorithmic) strand bias.
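
The notes do not say how the strand check was done. One simple way to look at forward and backward counts, offered here only as an assumed approach, is a z-score against the expectation that each mapped read is equally likely to land on either strand. The function name and the example counts are hypothetical.

```python
import math


def strand_bias_z(forward: int, backward: int) -> float:
    """z-score of the forward count under a 50/50 strand model (normal approximation)."""
    total = forward + backward
    if total == 0:
        raise ValueError("no mapped reads")
    # With no bias, forward ~ Binomial(total, 0.5):
    # mean = total / 2, standard deviation = sqrt(total) / 2.
    return (forward - total / 2) / (math.sqrt(total) / 2)


# Made-up counts for illustration only.
print(strand_bias_z(5100, 4900))
```

A value far from zero (say beyond about 3 in either direction) would suggest the mapper treats the two strands differently.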
 +
 +Tonight Kevin [reran] the mapping script.  The new output will
 +include error rates by position.  Sol requested these be called
 +mismatches rather than errors.  Kevin said they are estimates of
 +sequencing error.  They are in colorspace and exclude indels.  (In
 +reality they are differences between the reads and the assembly, not
 +errors.  But for all practical purposes, when there is a mismatch
 +between a read and the assembly, the error is in the read.)
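
A per-position mismatch rate of this kind can be sketched as follows. This assumes each mapped read is available as a colorspace string alongside the colorspace string of the assembly at the position where it mapped; that input format is an assumption, and the real script's output may be computed differently. Because the comparison is ungapped, indels are excluded by construction.

```python
def mismatch_rate_by_position(alignments):
    """Fraction of reads that disagree with the assembly at each read position.

    alignments: iterable of (read_colors, assembly_colors) string pairs,
    compared position by position without gaps (hypothetical input format).
    """
    mismatches = []
    totals = []
    for read, assembly in alignments:
        for i, (read_color, assembly_color) in enumerate(zip(read, assembly)):
            # Grow the counters the first time a position is seen.
            if i == len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_color != assembly_color:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]
```

Reads of different lengths are handled by counting each position only for the reads that reach it.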
 +
 +=== trim9.joins ===
 +
 +The data we'll use is not in trim9.out but in trim9.joins.
 +trim9.joins is computed from trim9-cross.rdb (which has the reads that
 +crossed a contig boundary).
 +
 +Looking at the first line: how many reads support contig 22 following
 +contig 3?  34K.  The contig lengths are there to show whether a contig
 +is short enough that mate pairs might span it.  E.g., which orderings
 +are consistent with the listed contig orders: 2->3->4, or maybe 3 is
 +very short, so we have a mate pair that jumps over 3.
 +
 +A minus sign appearing before a contig name means to take the
 +reverse-complement of that contig.  (And contig3 followed by contig22
 +is the same as -contig22 followed by -contig3.)
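
The equivalence in the parenthetical can be made mechanical: reverse-complementing a join flips the sign of both contigs and swaps their order. The sketch below uses hypothetical helper names and assumes contig names are plain strings with an optional leading minus sign.

```python
def flip(contig: str) -> str:
    """Toggle the leading minus sign (reverse-complement marker) on a contig name."""
    return contig[1:] if contig.startswith("-") else "-" + contig


def reverse_join(first: str, second: str) -> tuple:
    """The same join as read from the opposite strand."""
    return flip(second), flip(first)


def canonical_join(first: str, second: str) -> tuple:
    """Pick one fixed representative of a join and its reverse-complement form."""
    return min((first, second), reverse_join(first, second))


# Both spellings of the join from the notes reduce to the same key.
assert canonical_join("contig3", "contig22") == canonical_join("-contig22", "-contig3")
```

Using a canonical key like this makes it possible to add together the counts for lines that describe the same join from opposite strands.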
 +
 +The last number on each line is the expected length of the gap between
 +the contigs if the mate pairs were the optimal length (optimal is the
 +peak of the length distribution).  It is very rough.
 +
 +The file is sorted from most counts to fewest.  The stuff at the bottom
 +of the file is noise.  Only about the first fifty lines are helpful.
 +
 +Long mate pairs are very helpful for disambiguating, and we wish we had
 +them for H. pylori.
 +
 +=== Your task ===
 +
 +Take the trim9.joins file and try to order and orient the contigs,
 +knowing that some of them are duplicated, and that there is a virus in
 +the mix and there should be one large chromosome.
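
One possible first pass at the task, not Kevin's posted solution, is to walk the joins from best supported to least and accept a join only if the two contig ends it connects are still free. The sketch assumes the joins have already been parsed into (first, second, count) tuples with signed contig names; the actual column layout of trim9.joins is not described in these notes, and the counts in the test data are made up. It deliberately ignores duplicated contigs and the virus, so its output is a starting point to check by hand.

```python
def base_and_sign(oriented: str) -> tuple:
    """Split '-contig3' into ('contig3', '-') and 'contig3' into ('contig3', '+')."""
    if oriented.startswith("-"):
        return oriented[1:], "-"
    return oriented, "+"


def exit_end(oriented: str) -> tuple:
    """Physical end of the contig that the join leaves from."""
    name, sign = base_and_sign(oriented)
    return name, ("tail" if sign == "+" else "head")


def entry_end(oriented: str) -> tuple:
    """Physical end of the contig that the join arrives at."""
    name, sign = base_and_sign(oriented)
    return name, ("head" if sign == "+" else "tail")


def greedy_joins(joins):
    """Accept joins in order of decreasing read support, using each contig end once."""
    used = set()
    accepted = []
    for first, second, count in sorted(joins, key=lambda j: -j[2]):
        a, b = exit_end(first), entry_end(second)
        # Skip self-joins and joins that reuse an end that is already connected.
        if a[0] == b[0] or a in used or b in used:
            continue
        used.update((a, b))
        accepted.append((first, second, count))
    return accepted
```

Because a join and its reverse-complement spelling touch the same two physical ends, the second spelling is skipped automatically once the first has been accepted. A duplicated contig would need its ends to be used more than once, which this sketch does not allow.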
 +
  
lecture_notes/04-30-2010.txt · Last modified: 2010/05/05 21:05 by jmagasin