SpFall2016 and Sp2017 (Plate 1): working with data, 11/1/17
First, I read the VCF file into R (the VCF does not include D Barb, so it should only have D. pulex) and used SNPRelate to make the input file. I then subset the SNPs to include only those on good D84A contigs (TRUE TRUE), which left 2,721,457 SNPs. Next, I filtered the SNPs with a minor allele frequency threshold of 0.15 and a missing rate threshold of 0.5, along with a slide.max.bp of 500 bp and an LD threshold of 0.1. This cut the number of SNPs down to 120,767.
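As a rough sketch of the filtering steps above (not the exact script used here), the SNPRelate calls might look like the following. The helper name `prune_snps`, its argument names, and the `keep_snp_ids` vector (standing in for the SNPs on good D84A contigs) are hypothetical; only the filter values come from the text.

```r
library(SNPRelate)

# Hypothetical helper: names and arguments are illustrative assumptions.
prune_snps <- function(vcf_path, gds_path, keep_snp_ids = NULL) {
  # Convert the VCF to SNPRelate's GDS format (biallelic SNPs only).
  snpgdsVCF2GDS(vcf_path, gds_path, method = "biallelic.only")
  genofile <- snpgdsOpen(gds_path)
  on.exit(snpgdsClose(genofile))

  # Apply the MAF, missing-rate, and LD filters with the values from the text.
  pruned <- snpgdsLDpruning(
    genofile,
    snp.id        = keep_snp_ids,  # NULL means "start from all SNPs"
    autosome.only = FALSE,         # contigs are not numbered autosomes
    maf           = 0.15,
    missing.rate  = 0.5,
    slide.max.bp  = 500,
    ld.threshold  = 0.1
  )

  # snpgdsLDpruning() returns one vector of SNP IDs per chromosome/contig;
  # flatten them into a single vector.
  unlist(unname(pruned))
}
```

The returned vector can be passed as `snp.id` to later SNPRelate calls. To my knowledge the pruning starts from a random position by default, so calling `set.seed()` first should make the retained SNP set reproducible.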
Running a PCA with the 120,767 SNPs results in PC1 and PC2 explaining 25.56 and 18.10 percent of the variance respectively.
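For reference, here is a minimal sketch of how the percent variance per PC can be pulled out of a SNPRelate PCA. It runs on the small example GDS file bundled with SNPRelate rather than on the Daphnia data, so the numbers it prints will not match the ones quoted above.

```r
library(SNPRelate)

# Assumption: the bundled example data stand in for the real GDS file.
genofile <- snpgdsOpen(snpgdsExampleFileName())

# autosome.only = FALSE keeps SNPs on every chromosome/contig.
pca <- snpgdsPCA(genofile, autosome.only = FALSE, num.thread = 1)

# varprop holds the proportion of variance per eigenvector.
pc_percent <- pca$varprop * 100
print(round(pc_percent[1:2], 2))

snpgdsClose(genofile)
```

With the real data, the pruned SNP IDs would be passed via `snp.id` so that the PCA only uses the filtered set.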
Initial PCA graphed looks like:
What becomes clear from this is that there are several "super clones": clones/jars that we sequenced separately turn out to be the same clone. Thus we need a way to assign individual clones to super clones. So next I did an Identity-By-State analysis using SNPRelate and sorted the individuals by pond and season/year. This is the result. I am not sure how to get the labels on the x-axis, so for now I added them by hand.
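One possible way to get the axis labels without adding them by hand is to draw the matrix with base R `image()` and then call `axis()` with the sorted sample labels. The sketch below uses made-up data: `ibs` stands in for the `$ibs` matrix returned by `snpgdsIBS()`, and `meta` is a hypothetical sample sheet with pond and season columns.

```r
# Toy stand-ins (assumptions, not real data).
set.seed(1)
n <- 6
meta <- data.frame(
  sample = paste0("clone", 1:n),
  pond   = c("D8", "DBunk", "D8", "D10", "DBunk", "D10"),
  season = factor(
    c("Fall2016", "Spring2017", "Spring2016", "Fall2016", "Fall2016", "Spring2017"),
    levels = c("Spring2016", "Fall2016", "Spring2017")  # chronological order
  )
)
ibs <- matrix(runif(n * n, 0.7, 1), n, n)
ibs <- (ibs + t(ibs)) / 2          # make the toy matrix symmetric
diag(ibs) <- 1
dimnames(ibs) <- list(meta$sample, meta$sample)

# Sort samples by pond, then by season within pond.
ord <- order(meta$pond, meta$season)
ibs_sorted <- ibs[ord, ord]
axis_labels <- paste(meta$pond, meta$season, sep = "_")[ord]

# Draw the heatmap without default axes, then add labelled axes.
par(mar = c(8, 8, 1, 1))           # extra margin for rotated labels
image(seq_len(n), seq_len(n), ibs_sorted, axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(n), labels = axis_labels, las = 2, cex.axis = 0.7)
axis(2, at = seq_len(n), labels = axis_labels, las = 2, cex.axis = 0.7)
```

Making `season` a factor with explicit levels keeps the seasons in chronological rather than alphabetical order.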
Next I performed cluster analysis using snpgdsHCluster, and then determined groups of individuals using snpgdsCutTree. I tried a z.threshold of 5, 3, and 2.
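A minimal sketch of this step, again on SNPRelate's bundled example data rather than the Daphnia GDS file, so the group counts are only illustrative:

```r
library(SNPRelate)

# Assumption: the bundled example data stand in for the real GDS file.
genofile <- snpgdsOpen(snpgdsExampleFileName())

# Pairwise identity-by-state, then hierarchical clustering on it.
ibs <- snpgdsIBS(genofile, autosome.only = FALSE, num.thread = 1)
hc  <- snpgdsHCluster(ibs)

# Cut the tree at each z.threshold tried in the text and count the groups.
set.seed(100)                      # snpgdsCutTree() uses permutations
thresholds <- c(5, 3, 2)
n_groups <- sapply(thresholds, function(z) {
  rv <- snpgdsCutTree(hc, z.threshold = z, verbose = FALSE)
  nlevels(rv$samp.group)           # samp.group is a factor of assignments
})
names(n_groups) <- paste0("z=", thresholds)
print(n_groups)

snpgdsClose(genofile)
```

The per-sample assignments for a given threshold are in `rv$samp.group`, which can be matched to the PCA through `rv$sample.id`.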
From this, I decided to move ahead with a z.threshold of 2. But when I colored the clones in the PCA according to the groups assigned this way, the assignments did not look very good. Some groups were tightly clustered in the PCA, but others were not.
Alan then suggested trying the corrplot package in R, to graph individuals according to their correlation, and then draw rectangles around groups. In corrplot you choose the number of rectangles to draw, and it then draws those according to the correlation. So by tuning the number of rectangles you ask it to draw, you can change the threshold correlation value for where a group is defined. After playing around with this for a while I decided to go with 70 rectangles (or groups).
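To turn the rectangles into actual group assignments, one option is to repeat the clustering that corrplot appears to do internally: as far as I understand it, with `order = "hclust"` it clusters on `1 - correlation` and cuts the tree into `addrect` groups. The sketch below uses simulated genotypes with three known clones. The data, the sample names, and `k = 3` are assumptions for illustration (the analysis above used 70).

```r
# Simulate 9 samples drawn from 3 underlying clones (toy data).
set.seed(42)
n_snps   <- 200
base     <- replicate(3, sample(0:2, n_snps, replace = TRUE))  # 200 x 3
clone_of <- rep(1:3, times = c(4, 3, 2))                       # true clone per sample
geno <- base[, clone_of] + matrix(rnorm(n_snps * 9, sd = 0.1), n_snps, 9)
colnames(geno) <- paste0("s", 1:9)

# Sample-by-sample correlation matrix.
corr <- cor(geno)

# Same idea as addrect = k: cluster on 1 - correlation, cut into k groups.
k <- 3
tree   <- hclust(as.dist(1 - corr), method = "complete")
groups <- cutree(tree, k = k)
print(groups)

# Draw the matching plot only if corrplot is installed.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr, order = "hclust", hclust.method = "complete",
                     addrect = k)
}
```

Raising or lowering `k` moves the effective correlation cutoff, which is the tuning described above.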
Though 67 also looked good. (Hard to tell the differences, look closely.)
Looking at the groupings from addrect 70, the large groupings match fairly well with what I assigned looking at the initial PCA analysis. I then assigned all the individuals to groups according to the addrect 70, and then graphed the PCA coloring individuals according to those groups (dropping all the unique clones to make visualization easier). This is the outcome. I think it looks like the assignment to groups/superclones is working pretty well.
Here is the graph with all the unique clones included, but just assigned one color (group O). Again, looks pretty good overall.
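A small sketch of how the group labels could be turned into plotting categories, with every group that has only one member lumped into "O". The `groups` vector and the "G" prefix are made up for illustration. In practice `groups` would be the per-sample assignment from the clustering step.

```r
# Hypothetical group assignment per sample.
groups <- c(s1 = 1, s2 = 1, s3 = 2, s4 = 3, s5 = 3, s6 = 3, s7 = 4)

# Number of samples in each sample's group.
group_size <- ave(seq_along(groups), groups, FUN = length)

# Groups with more than one member keep a label; singletons become "O".
superclone <- ifelse(group_size > 1, paste0("G", groups), "O")
names(superclone) <- names(groups)
print(superclone)

# Dropping the unique clones for the cleaner version of the plot.
shared_only <- superclone[superclone != "O"]
```

The `superclone` vector can then be used as the color factor when plotting the PCA.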
Here is a PCA focusing on just D10, D8, and DBunk, broken out by population and year/season, with points again colored by clone (or as unique clone (O)).
I know this is going to be hard to see, but I tried to do the same thing as above for all ponds. You can see that the single individual from DLily has been assigned to the same clone as three individuals from DRamps. There is also an individual from DBunk that has been assigned to one of the clones in D8.
Now each clone is only represented in the PCA once per pond/season, but the circles are sized according to frequency of that clone.
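Here is a sketch of the collapsing step on a made-up table: one row per clone per pond/season, with the point size scaled by that clone's frequency in the pond/season. The column names, the toy PC values, and the size scaling are assumptions. Unique clones get their own `clone_id` (`u1`, `u2`) so they are not merged into one point.

```r
# Toy per-sample table (hypothetical values).
samples <- data.frame(
  pond     = c("D8", "D8", "D8", "D8", "DBunk", "DBunk", "DBunk"),
  season   = c("Spring2017", "Spring2017", "Spring2017", "Spring2017",
               "Fall2016", "Fall2016", "Fall2016"),
  clone_id = c("A", "A", "A", "u1", "B", "B", "u2"),
  PC1      = c(0.10, 0.11, 0.09, -0.20, 0.30, 0.31, -0.05),
  PC2      = c(0.02, 0.01, 0.03, 0.15, -0.10, -0.12, 0.20)
)
samples$n <- 1

# Mean PC position and sample count per clone within each pond/season.
centers <- aggregate(cbind(PC1, PC2) ~ pond + season + clone_id,
                     data = samples, FUN = mean)
counts  <- aggregate(n ~ pond + season + clone_id, data = samples, FUN = sum)
per_clone <- merge(centers, counts, by = c("pond", "season", "clone_id"))

# Frequency of each clone within its pond/season.
per_clone$freq <- per_clone$n /
  ave(per_clone$n, per_clone$pond, per_clone$season, FUN = sum)

# One circle per clone, sized by frequency.
plot(per_clone$PC1, per_clone$PC2,
     cex = 1 + 4 * per_clone$freq,
     pch = 21, bg = as.integer(factor(per_clone$clone_id)),
     xlab = "PC1", ylab = "PC2")
```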
Same graph but with all ponds.
Next I redid the PCA with only a single rep from each clone per pond/season. This PCA was based on 35,009 SNVs. PC1 explained 23.10% of the variance and PC2 explained 12.59% of the variance.
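Picking the single representative could be sketched like this. The table is made up, and taking the first sample listed for each clone is an assumption. Some other rule, such as the sample with the least missing data, might be preferable.

```r
# Toy per-sample table (hypothetical IDs and assignments).
samples <- data.frame(
  sample_id = paste0("s", 1:7),
  pond      = c("D8", "D8", "D8", "D8", "DBunk", "DBunk", "DBunk"),
  season    = c("Spring2017", "Spring2017", "Spring2017", "Spring2017",
                "Fall2016", "Fall2016", "Fall2016"),
  clone_id  = c("A", "A", "A", "u1", "B", "B", "u2")
)

# Keep the first sample seen for each pond/season/clone combination.
is_rep  <- !duplicated(samples[, c("pond", "season", "clone_id")])
rep_ids <- samples$sample_id[is_rep]
print(rep_ids)
```

The resulting IDs can be passed as `sample.id` to `snpgdsPCA()` (and to any SNP filtering that is redone on the subset) so that each clone contributes once per pond/season.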
Here is the new PCA of all the ponds by season/year with points colored by super clone and sized by frequency.
And separated by ponds.
From these, it looks like some clones that were originally classed as unique now appear almost identical to super clones. Is this a problem?


I think that it looks really good. I'm not sure that I understand your last question though.
I was a little concerned that it now looks like some of the unique clones are falling right on top of some of the super clones in the updated PCA. So are they really different/unique clones? We can talk in person if this doesn't make sense.
One way to think about it might be to ask: how similar should samples of a superclone look given read depth? i.e., if two samples are truly identical, they will have less than 100% IBS if read depth is on the low end.
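A back-of-the-envelope version of that idea, under strong simplifying assumptions that are mine and not from the data: every site has the same depth, there are no sequencing errors, homozygous sites are always called correctly, and a heterozygous site is miscalled as homozygous only when all reads happen to carry the same allele. IBS is taken as the average fraction of shared alleles.

```r
# Expected IBS between two truly identical samples under the toy model.
# `depth` and `het_fraction` (share of sites that are truly heterozygous)
# are hypothetical inputs.
expected_ibs <- function(depth, het_fraction) {
  p_miss <- 0.5^(depth - 1)   # P(all reads at a het site show one allele)
  p_het  <- 1 - p_miss        # P(het site is called het)
  p_hom  <- p_miss / 2        # P(called hom for one specific allele)

  # Allele sharing at a true het site, over both samples' calls:
  #   het/het -> 1, het/hom -> 0.5, same hom -> 1, opposite hom -> 0
  ibs_het_site <- p_het^2 * 1 +
    2 * p_het * p_miss * 0.5 +
    2 * p_hom^2 * 1

  # Homozygous sites always match in this model.
  (1 - het_fraction) + het_fraction * ibs_het_site
}

# Expected IBS for identical samples at a few depths, 30% het sites.
depths <- c(2, 4, 8, 16)
print(round(sapply(depths, expected_ibs, het_fraction = 0.3), 4))
```

Comparing the observed IBS within a superclone to a curve like this, at the actual read depths, is one way to judge whether a "unique" clone sitting on top of a super clone is plausibly the same clone.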