Testing adaptors and looking for PCR duplicates

I have been trying to figure out what is going on with my library construction and how to improve it. First I did a direct test of my adaptors against Alyssa's. I did a side-by-side comparison using 4 of my normalized DNA extractions, making a total of 8 libraries. I used all the same reagents and made the libraries at the same time, so the only difference between the two sets of four was the adaptors. I first ran them on the Bioanalyzer without doing the individual well cleanup, because we had decided to drop that step. However, the large amount of primer left in the samples made it hard for the Bioanalyzer to analyze them. This initial Bioanalyzer run made it seem like there was a big difference between the two sets of adaptors, with mine doing much better in terms of yield. I wanted a better comparison, so I cleaned up the libraries and ran them again. Once I did the individual library cleanups, it no longer looked like there was much of a difference between the two sets of adaptors. However, Hudson Alpha will now allow dual-indexed barcodes for HiSeqX sequencing, so I might use my adaptors for the next set of libraries just to see if it makes any difference.

I next wanted to look into the PCR duplicates issue a bit more. I used the PCR duplicates program from the paper Alan had emailed me. I ran it on a few of the SP2017 samples, as those will be much less confounded with microbial sequences than the 2016 samples. Here are the results for the few samples I have run so far:
Clone                 Total Marked Dups      Actual PCR Dups
April_2017_DBunk_6    17.19%                 15.20%
April_2017_DBunk_152  20.17%                 18.48%
April_2017_DBunk_103  18.83%                 16.98%
April_2017_D8_131     20.63%                 18.85%

So, at least from these four samples, not all the marked PCR dups are actual PCR dups, but the majority of them are. The gap between marked and actual dups is a bit under 2 percentage points per sample, and I am not sure if that is enough to be worried about. Thoughts? I also haven't tried making the VCF without getting rid of the marked PCR dups to see if that makes any difference. Should we try that?
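To put a number on the gap, here is a minimal sketch in Python that recomputes it from the table above. The percentages are copied straight from the table. The one assumption is that both columns are percentages of the same read total, which is what makes dividing one by the other meaningful. The function names are just for illustration.

```python
# Percentages copied from the table above: (total marked dups, actual PCR dups).
# Assumption: both columns are percentages of the same total number of reads.
samples = {
    "April_2017_DBunk_6": (17.19, 15.20),
    "April_2017_DBunk_152": (20.17, 18.48),
    "April_2017_DBunk_103": (18.83, 16.98),
    "April_2017_D8_131": (20.63, 18.85),
}


def excess_marked(marked, actual):
    """Percentage points of marked duplicates that are not actual PCR duplicates."""
    return marked - actual


def fraction_actual(marked, actual):
    """Share of the marked duplicates that are actual PCR duplicates."""
    return actual / marked


gaps = {name: excess_marked(m, a) for name, (m, a) in samples.items()}
mean_gap = sum(gaps.values()) / len(gaps)

for name, (m, a) in samples.items():
    share = fraction_actual(m, a)
    print(f"{name}: gap {gaps[name]:.2f} points, {share:.1%} of marked dups are actual")
print(f"mean gap: {mean_gap:.2f} points")
```

For these four samples the gap comes out between about 1.7 and 2.0 percentage points, so roughly nine out of ten marked duplicates are actual PCR duplicates.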

Then I wanted to go back and take a more careful look at the PCR duplicate rates in the 2016 and 2017 sampling. Initially I had only looked at duplicate rates in one sample per plate (one from 2016, one from 2017) and at the distribution of duplicates across scaffolds. I now realize that was not such a good idea, because there can be a good bit of variation among samples. I can look at the overall duplicate rate of each sample, but that does not work well for the 2016 samples, because the microbial scaffolds have a really high duplicate rate and skew the overall percentage. If I look at the overall PCR duplicate rate for the 2017 samples using the marked reads in the bam files, I find that samples range from 12.43-27.21% duplicates, with an average of 18.39%. In order to directly compare the 2016 and 2017 samples, I decided to focus on a single scaffold. I went with scaffold 951, which is one of the two largest D84A Good scaffolds. For the 2016 samples, the PCR duplicate rate for scaffold 951 (as marked in the bam file) ranges from 11.14-44.23%, with an average of 23.72%.




For the 2017 samples, the PCR duplicate rate for scaffold 951 ranges from 12.21-31.44%, with an average of 19.49%.
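For reference, the per-scaffold rate "as marked in the bam file" can be computed from the SAM flag field, where bit 0x400 means the read was marked as a duplicate. The following is a minimal sketch in plain Python, not the exact commands I ran. It assumes the input is SAM-format text, and it counts only primary, mapped alignments (that filter is my assumption about what a sensible denominator is). The scaffold names and reads in the example are made up.

```python
from collections import defaultdict

# Bits of the SAM FLAG field (from the SAM specification).
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def duplicate_rate_by_scaffold(sam_lines):
    """Return {scaffold: percent of primary mapped reads flagged as duplicates}."""
    totals = defaultdict(int)
    dups = defaultdict(int)
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2 is FLAG
        scaffold = fields[2]    # column 3 is the reference (scaffold) name
        # Assumption: leave unmapped, secondary and supplementary records out.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        totals[scaffold] += 1
        if flag & DUPLICATE:
            dups[scaffold] += 1
    return {s: 100.0 * dups[s] / totals[s] for s in totals}


# Made-up records, only to show the expected input shape.
toy_sam = [
    "@SQ\tSN:scaffold_951\tLN:3000000",
    "r1\t0\tscaffold_951\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t1024\tscaffold_951\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tscaffold_951\t400\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t1040\tscaffold_951\t900\t60\t50M\t*\t0\t0\t*\t*",
    "r5\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r6\t0\tscaffold_1032\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r7\t2048\tscaffold_1032\t700\t60\t20M30H\t*\t0\t0\t*\t*",
]
rates = duplicate_rate_by_scaffold(toy_sam)
print(rates)  # {'scaffold_951': 50.0, 'scaffold_1032': 0.0}
```

On real data the lines could come from the text output of `samtools view` on each bam file, read from standard input, with the result looked up for the scaffold of interest.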



Thus there does appear to be an overall lower PCR duplicate rate and less variation in the 2017 libraries than the 2016 libraries. And I did do less PCR in the 2017 libraries. So, in contrast to what I had been thinking, the number of rounds of PCR does probably influence the rate of PCR duplicates. We will see what this next plate of sequencing looks like. I only did one more round of PCR for this second SP2017 plate than the first SP2017 plate, so hopefully there aren't too many more duplicates.

So, digging a bit more into the PCR duplicates. Here is the distribution of PCR duplicates for the other largest scaffold (1032) in 2016 (average is 21.12%).


This looks pretty similar to the distribution seen for scaffold 951. And in fact they are highly correlated.
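One way to quantify "highly correlated" is a Pearson correlation between the per-sample duplicate rates on the two scaffolds. The sketch below shows that calculation in plain Python. The numbers in it are invented for illustration and are not my actual per-sample values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented duplicate rates (%) for the same five samples on two scaffolds.
scaffold_951_rates = [14.0, 19.5, 23.0, 31.0, 42.0]
scaffold_1032_rates = [12.5, 17.0, 21.5, 27.0, 38.5]

print(f"r = {pearson_r(scaffold_951_rates, scaffold_1032_rates):.3f}")
```

The samples have to be in the same order in both lists, so that each pair of values belongs to one library.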


We see a similar story when we look at the 2017 samples (average is 16.36%).


So this seems to be something that is happening to a sample as a whole, regardless of scaffold. Though maybe I should test some smaller scaffolds too? These seemed like the most robust scaffolds to use, as they were the largest.

Just to compare the differences between 2016 and 2017 again.


I wondered if the quality of the original DNA extraction might have any influence. I looked at how the DNA concentration prior to normalization for each sample was related to PCR duplicates. Here is the data looking at DNA concentration prior to normalization versus duplicates for scaffold 1032 for the 2016 data.


And for the 2017 data.


Not a super obvious pattern. Though in general the samples with the highest PCR duplicate rates are on the lower ends of DNA concentration.
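A simple check of that impression is to compare the median DNA concentration of the samples with the highest duplicate rates against the median for everything else. This is a rough sketch of that comparison, not something I have run on the real data. The choice of the top 25% as "highest" is arbitrary, and the values in the example are made up.

```python
from statistics import median


def compare_high_dup_concentration(records, top_fraction=0.25):
    """Compare DNA concentration of high-duplicate samples with the rest.

    records: list of (dna_concentration, duplicate_rate) pairs, one per sample.
    Returns (median concentration of top group, median concentration of the rest).
    """
    if not records:
        raise ValueError("no samples given")
    # Sort samples from highest to lowest duplicate rate.
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    if k >= len(ranked):
        raise ValueError("not enough samples to form both groups")
    top = [conc for conc, _ in ranked[:k]]
    rest = [conc for conc, _ in ranked[k:]]
    return median(top), median(rest)


# Made-up (concentration, duplicate rate %) pairs for eight samples.
toy_records = [
    (5, 30), (6, 28), (20, 15), (25, 14),
    (10, 18), (30, 12), (15, 16), (40, 11),
]
top_median, rest_median = compare_high_dup_concentration(toy_records)
print(top_median, rest_median)  # 5.5 22.5
```

If the high-duplicate group consistently has a lower median concentration in both years, that would back up the pattern I think I am seeing in the plots.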

Ok, so I looked at a somewhat smaller scaffold. Scaffolds 1032 and 951 are about 4 and 3 Mb respectively, while 1813 is more like 1 Mb. Here is the distribution of PCR dups for 1813 in 2016 (average is 20.25%).


This distribution looks like the one for 1032, shifted a bit to the left compared to 951.

Here is the 2017 data (average is 16%).


Again, this distribution looks like 1032.
