Posts

Showing posts from November, 2017

Looking at read depth per site using pileup

Image
After doing some spot checking on individual libraries for the distribution of read depth per site, we wanted to do some more thorough quality checking by looking at stats for all the libraries. For what I am looking at here I am using mpileup data. For making the mpileup files I randomly subsampled 0.5% of the sites from all the TRUE TRUE D84A contigs that were over 2.5kb in length. Thus this there should be minimal influence of microbial DNA. I first looked at median read depth for all the clones. This is sorted by median read depth, and according to year/plate, where red is the SpFall2016 libraries, and turquoise is the Sp2017 libraries. You can see from this that six of the SpFall2016 libraries have very low read depths. I next worked on simulating poisson distributions for each sample. I used the number of sites for each sample and the median read depth as parameters. Here are some examples below of distributions of observed versus simulated read depth for a couple of cl...

Distribution of Read Depth Per Site when all individuals are combined

Image
Just wanted to go ahead and put this graph up because I think it looks really nice! This is looking at the distribution of read depth per site using the VCF file for all clones sequenced so far and just combining all clones. Since it is from the VCF file there should not be PCR duplicates, right? This is also only looking at the "good" scaffolds. Here is the code in case you want to check what I did. genofile <- seqOpen("/mnt/spicy_3/Karen/Sp2017/SpFall2016Sp2017B.seq.gds") snps <- data.table(variant.ids = seqGetData(genofile, "variant.id"), chr = seqGetData(genofile, "chromosome"), pos = seqGetData(genofile, "position"), dp = seqGetData(genofile, "annotation/info/DP")) TCOPA42 <- fread("TCOPA42merge.csv") colnames(TCOPA42) <- c("Num","chr","RefContigx","RefLengthx","Evaluex","MatchLenx","NumMatchx","P...

Per Site Distributions of Read Depth

Image
Now I am going to look at per site distributions of read depth. To start with I am looking at per site distributions of read depth for two Sp2017 clones. This way I am minimizing any confounding aspects of microbes. I chose two clones to start with, one, DBunk 147, was estimated to have a relatively low rate of PCR duplicates (~12%), where as the other, DBunk 132, was estimated to have a relatively high rate of PCR duplicates (~21%). First I just graphed the log10 read depth per site. Here is DBunk 147:  I graphed log10 read depth because there is quite the tail. Here is DBunk 132: I then wanted to look at the read depth distribution without doing log10, so I choose to only look at the distribution with sites that had read depths of 50 or less. Here is DBunk 147: And here is DBunk 132: From this, it looks like DBunk 147 is a better library. That DBunk 132 does not have as much coverage. Maybe it being a lower quality library is leading to high PCR duplicate...

Comparing Read Depth between 2016 and first plate of 2017

Image
I realized that I did not really know how the average read depth per individual compared between 2016 and 2017. So I checked it out using the VCF files built from the 2016 and 2017 data separately. I then pulled out average read depth (DP) per scaffold and graphed that against length, separating out scaffolds based on blast information. Here is the data for 2016. You can see average log10 read depth for the good Daphnia scaffolds falls out a little above 2.5 (~2.65). This equates to an average read depth of 446 per scaffold, which gives about 8X per individual (55 individuals). Here is the data for 2017. You can see that average log10 read depth for the good Daphnia scaffolds falls out a little above 3 (~3.15). This equates to an average read depth of 1400 per scaffold, which gives about 15X per individual (92 individuals). This is better than the 2016 individuals. Part of that is due to doing a bit more sequencing, but also due to the fact that we are not losing reads to micro...

Testing adaptors and looking for PCR duplicates

Image
Doing some work trying to figure out what is going on with my library construction and how to try to improve things. First I did a direct test of mine and Alyssa's adaptors. I did a side by side comparison, using 4 of my normalized DNA extractions, making a total of 8 libraries. I used all the same reagents and made them at the same time, with the only difference between the two sets of four being the adaptors. I first ran them on the Bioanalyzer without doing the individual well cleanup, because we had decided to drop the individual well cleanup. However, the large amount of primers left in the samples made it hard for the Bioanalyzer to actually analyze the samples. This initial Bioanalzyer run made it seem like there was a big difference between mine and Alyssa's adaptors, with mine doing much better in terms of yield. However, I wanted to clean up the libraries and try again to get a better comparison. Once I did the individual library cleanups, it no longer looked like the...

Messing around with the mitochondria

Image
OK, just a bit more procrastination. I wanted to try one more time to look at mitochondria SNPs. Mapping to the PA42 mitochondria was seeming a bit messy, so I wanted to try a different approach. I looked for D84A scaffolds that had decent blast lengths to the PA42 mitochondrion (over 1kb). I then took the sequence of those D84A scaffolds and blasted them to the nucleotide database using blastn on NCBI's website. I found three D84A scaffolds that had blast hit lengths to PA42 of over 1kb, and when blasted on NCBI had really good matches to Daphnia mitochondria. These three scaffolds vary in length from 2407-3314, and sum to 8246bp, which is a little over 50% of the size of the mitochondria in PA42. I then extracted the SNPs from just those contigs using SNPrelate and did a PCA on all the 20162017 data (without D Barb). After filtering for maf=0.15 and missing=0.5, I had 15 SNPs. I did not do LD pruning. Here is the result of the PCA, with the size of the circle matching the frequen...

Phenotyping and Survival by SuperClone

Image
We had been wanting to look at whether ability to phenotype a clone, or survival of a clone, covaries with super clone assignment. Here are the results from some initial work on the data in excel (I know you hate excel, but it was easiest for me for an initial look.) Here is data for looking at whether clones were phenotypes or not based on super clone. I am only including data from 2017 D8 and DBunk. The two bars on the far right are looking at the other/unique clones from the two ponds. I grouped them all together here. Now looking at whether the clones are still alive, or are dead/missing. Now looking at all 2017 data for survival. From this there certainly seems to be variation among the super clones in terms of whether we have been able to phenotype them and whether they are alive. Super clone A, overall seems to be doing better than super clone B (those are the two main clones in D8). Super clone D from D Bunk did badly across the board, while H and I did bet...

SpFall2016 and Sp2017(Plate 1) working with data 11/1/17

Image
First input the VCF file to R (VCF does not include D Barb, so should only have D. pulex). Use SNPRelate to make input file. Then subset SNPs to only include SNPs that are in good D84A contigs (TRUE TRUE). This resulted in 2,721,457 SNPs. Next, filter SNPs based on minor allele frequency of 0.15 and and missing rate of 0.5. Also slide.max.bp of 500bp and LD threshold of 0.1. This cuts down the number of SNPs to 120,767. Running a PCA with the 120,767 SNPs results in PC1 and PC2 explaining 25.56 and 18.10 percent of the variance respectively. Initial PCA graphed looks like: What becomes clear from this is that there are several "super clones" where clones/jars that we have sequenced are the same clone. Thus we need a way to assign individual clones to super clones. So next I did an Identity-By-State analysis using SNPRelate. I then sorted the individuals by pond and season/year. This is the result. I am not sure how to get the labels on the x-axis, so for now I did it...