Why a mate-pair library?: long-distance information to span contig gaps

Problems:
* low-complexity data
* Lucigen mate-pair kit is not very user friendly
* only 5-10% of your DNA gets circularized the way it's supposed to
* only 10% of the circular DNA contains junctions
* less than 1 ng of every µg of input DNA ends up as usable data

How to compensate:
* start with lots of DNA
* more efficient molecular biology
* use Tn5 (from Chris Vollmers)
  * recognizes and loads a specific sequence (an adapter)
  * cuts the DNA and attaches an adapter in the same step
  * very efficient - all sheared DNA has adapters
  * add a biotinylated linker to the end of the adapters
* 2x75 data

Other issues:
* Lots of linker near the beginning of the reads
  * those reads need to be filtered out
  * we want AT LEAST 30 bp of non-linker sequence at the beginning
* For Tn5 data, the linker sequence is more likely to be farther into the read
  * that's a good thing! We almost always have at least 30 bp before the linker

What to do about reads where you don't see any linker?
* might want to throw them out, because we're not confident that they're actually mate pairs
* that throws out tons of data if you are sequencing less than 2x300

How to avoid chimeric circular DNA?
* can't run it on a gel (circular DNA smears)
* adjust insert size to ~4 kb - chimeras are large and unlikely to circularize properly

2x75 data with a long linker (60 bp): we probably won't read all the way through the linker, but we'll see bits of it.
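
The linker filtering rules above can be sketched roughly as follows. This is only an illustration, not the pipeline actually used: the `LINKER` sequence is a made-up 60 bp placeholder, the function names and the `min_overlap` cutoff are assumptions, and matching is exact, whereas a real filter would need to tolerate sequencing errors. It keeps a read only if at least 30 bp of non-linker sequence precede the linker, flags reads with no visible linker separately so they can be kept or discarded, and also catches reads that end partway through the linker (the 2x75 case).

```python
# Hypothetical 60 bp linker; a placeholder, NOT a real kit sequence.
LINKER = (
    "GCTAGGATCC" "ATCGTACGGA" "TTCAGCCTAG"
    "GACTTGCAAC" "CGTTAGCATG" "CAGTCCGATA"
)


def find_linker_start(read, linker, min_overlap=10):
    """Return the 0-based index where the linker begins in the read, or None.

    Tries a full-length exact match first, then looks for a read that ends
    partway through the linker (a linker prefix at the end of the read).
    min_overlap is an assumed cutoff to avoid chance matches of a few bases.
    """
    pos = read.find(linker)
    if pos != -1:
        return pos
    longest = min(len(linker) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(linker[:k]):
            return len(read) - k
    return None


def classify_read(read, linker, min_prefix=30, min_overlap=10):
    """Return (label, usable_sequence) for a single read.

    "keep"      : linker found, at least min_prefix bp of non-linker before it
    "too_short" : linker found too close to the start of the read
    "no_linker" : no linker seen; not confidently a mate pair
    """
    start = find_linker_start(read, linker, min_overlap)
    if start is None:
        return ("no_linker", read)
    if start < min_prefix:
        return ("too_short", read[:start])
    return ("keep", read[:start])


if __name__ == "__main__":
    # 75 bp read: 40 bp of insert, then the first 35 bp of the linker.
    partial = "T" * 40 + LINKER[:35]
    print(classify_read(partial, LINKER))
```

Reads labeled `no_linker` are left for the caller to handle, since whether to discard them depends on read length: as noted above, throwing them out costs a lot of data for anything shorter than 2x300.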