Looking at read depth per site using pileup

After spot checking the distribution of read depth per site in individual libraries, we wanted to do more thorough quality checking by looking at stats for all the libraries. Here I am using mpileup data. To make the mpileup files, I randomly subsampled 0.5% of the sites from all the TRUE TRUE D84A contigs that were over 2.5 kb in length. Thus there should be minimal influence of microbial DNA.
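As a rough illustration of the subsampling step, here is a minimal sketch. It assumes a dictionary of contig lengths is available and samples 0.5% of positions within each contig that passes the length cutoff. The function name, the contig names, and the choice to sample per contig rather than across all sites at once are all assumptions for the example, not a record of what was actually run.

```python
import random

def subsample_sites(contig_lengths, min_length=2500, fraction=0.005, seed=1):
    """Return (contig, 1-based position) pairs for a random subset of sites.

    Hypothetical helper: samples `fraction` of the positions in every
    contig longer than `min_length`.
    """
    rng = random.Random(seed)
    sites = []
    for contig, length in sorted(contig_lengths.items()):
        if length <= min_length:
            continue  # keep only contigs over the length cutoff
        n_keep = round(length * fraction)
        positions = rng.sample(range(1, length + 1), n_keep)
        sites.extend((contig, pos) for pos in sorted(positions))
    return sites

# Made-up contig lengths, for illustration only.
contig_lengths = {"contig_a": 10000, "contig_b": 2000, "contig_c": 4000}
sites = subsample_sites(contig_lengths)
```

The resulting list could be written out as a two-column positions file and used to restrict the pileup to those sites.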

I first looked at median read depth for all the clones. The graph is sorted by median read depth and colored by year/plate, where red is the SpFall2016 libraries and turquoise is the Sp2017 libraries. You can see from this that six of the SpFall2016 libraries have very low read depths.
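For reference, a minimal sketch of pulling per-site depth out of single-sample mpileup text and taking the median. It assumes the standard tab-separated layout with depth in the fourth column, and that zero-depth sites are present in the file. The example lines are made up.

```python
import statistics

def depths_from_mpileup(lines):
    """Collect the depth column (4th field) from single-sample mpileup lines."""
    depths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        depths.append(int(fields[3]))
    return depths

# Made-up mpileup lines: contig, position, reference base, depth, bases, qualities.
example_lines = [
    "contig_a\t101\tA\t3\t...\tIII",
    "contig_a\t250\tC\t0\t*\t*",
    "contig_a\t612\tG\t5\t..,..\tIIIII",
    "contig_a\t907\tT\t4\t.,.,\tIIII",
    "contig_a\t1333\tA\t2\t,.\tII",
]
depths = depths_from_mpileup(example_lines)
median_depth = statistics.median(depths)
```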

I next simulated a Poisson distribution for each sample, using the number of sites for that sample and its median read depth as parameters. Below are some examples of observed versus simulated read depth distributions for a couple of clones. You can see that overall the observed distributions are wider than the simulated ones. Is this a problem? Why is this?
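The simulation step can be sketched roughly as below. This is a stdlib-only illustration that draws Poisson values with Knuth's multiplication method, which may not match the tool actually used. The function names and example parameters are assumptions. It also computes the variance-to-mean ratio, which is 1 for a true Poisson distribution, as one way to put a number on how much wider an observed distribution is.

```python
import math
import random
import statistics

def simulate_poisson_depths(n_sites, median_depth, seed=1):
    """Draw n_sites Poisson values with mean median_depth (Knuth's method).

    Only suitable for modest means: exp(-mean) underflows for very large values.
    """
    rng = random.Random(seed)
    threshold = math.exp(-median_depth)
    draws = []
    for _ in range(n_sites):
        count, product = 0, rng.random()
        # Multiply uniforms until the product drops to the threshold or below.
        while product > threshold:
            count += 1
            product *= rng.random()
        draws.append(count)
    return draws

def dispersion_index(depths):
    """Variance-to-mean ratio; values well above 1 indicate overdispersion."""
    return statistics.pvariance(depths) / statistics.mean(depths)

# Illustrative parameters only, not taken from any real library.
simulated = simulate_poisson_depths(n_sites=20000, median_depth=10)
sim_mean = statistics.mean(simulated)
sim_dispersion = dispersion_index(simulated)
```

Running `dispersion_index` on the observed depths for a clone and comparing it with the simulated value gives a single summary of the extra spread.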

Then I subtracted the percent expected zeros from the percent observed zeros. The bar graph is still sorted by total read depth, so you can easily cross-reference between graphs.
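A minimal sketch of the zero comparison follows. It takes the expected fraction of zeros from the closed-form Poisson probability of zero, exp(-lambda), with lambda set to the median depth, rather than from a simulated sample. That substitution, the function name, and the example depths are assumptions.

```python
import math
import statistics

def excess_zero_percent(depths, median_depth):
    """Percent observed zeros minus percent zeros expected under Poisson."""
    observed = 100.0 * sum(1 for d in depths if d == 0) / len(depths)
    expected = 100.0 * math.exp(-median_depth)  # Poisson P(X = 0)
    return observed - expected

# Made-up depths for illustration.
depths = [0, 0, 1, 2, 3, 4, 2, 1, 3, 2]
median_depth = statistics.median(depths)
excess = excess_zero_percent(depths, median_depth)
```

A positive value means more zero-depth sites than the Poisson model predicts, and a negative value means fewer.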

First, here is the graph with all clones. You can see that the six libraries with very low read depth also have far fewer zeros than expected. I think this is a side effect of the median read depth being so low.

Next I dropped those six lowest libraries to get a better picture of what was happening. You can see that the three libraries with a median read depth of 4 or so are also missing zeros, but not nearly to the same extent as the lowest libraries. The green bar with a marked excess of zeros is DBarb. I think this is probably due to mapping issues, since DBarb is divergent from D. pulex.

I looked at the graphs of the observed and simulated read depth distributions for DBarb to get a better idea of what was going on. You can see that the observed DBarb distribution is actually bimodal, with a large peak at 0. Again, I imagine this is due to mapping bias.