Looking at fixed differences between super clones A and B

I have looked at fixed differences between super clones A and B before, but I realized I had been starting from a pruned set of SNPs: the set I had been using for PCA, which was filtered on linkage and the like. As a result I was underestimating the number of SNPs fixed between A and B, so I went back and started with all the SNPs. I first looked at SNPs that are fixed between super clones A and B, regardless of their frequency in any other pond or super clone. I found 28,237 SNPs that are homozygous alternate in super clone A and homozygous reference in super clone B, and 54,000 SNPs that are homozygous reference in super clone A and homozygous alternate in super clone B. There are also 83,789 SNPs that are heterozygous in super clone A and homozygous reference in super clone B, and 63,908 SNPs that are homozygous reference in super clone A and heterozygous in super clone B. So there are quite a number of SNPs that differ between these super clones. One other thing: there are more fixed alternate SNPs in super clone B than in A. Perhaps this suggests that the reference genome clone is more similar to super clone A than to B?
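The tally above can be sketched roughly as follows. This is only an illustration, assuming each super clone has been collapsed to one consensus genotype per SNP, coded as alternate-allele dosage (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate). The function name, category labels, and toy sites are all hypothetical.

```python
from collections import Counter

# Genotypes are alternate-allele dosage: 0 = hom ref, 1 = het, 2 = hom alt.
# Assumption: one consensus genotype per super clone per site.

def classify_site(geno_a, geno_b):
    """Label one SNP by the consensus genotypes of super clones A and B."""
    labels = {
        (2, 0): "fixed_alt_in_A",
        (0, 2): "fixed_alt_in_B",
        (1, 0): "het_in_A_ref_in_B",
        (0, 1): "ref_in_A_het_in_B",
    }
    # Anything else (shared genotypes, het vs. hom alt, ...) is not tallied here.
    return labels.get((geno_a, geno_b), "other")

# Hypothetical toy sites: (scaffold, position, genotype in A, genotype in B).
sites = [
    ("scaffold_1", 101, 0, 2),
    ("scaffold_1", 250, 2, 0),
    ("scaffold_2", 77, 1, 0),
    ("scaffold_2", 300, 0, 1),
    ("scaffold_3", 12, 2, 2),
]

counts = Counter(classify_site(a, b) for _, _, a, b in sites)
print(counts)
```

Keeping the two fixed categories separate makes it easy to compare how many SNPs are alternate in each super clone.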

I was then curious about where these SNPs occur across the genome. I noticed that most of the SNPs fall on a subset of scaffolds. Not surprisingly, these tend to be the largest scaffolds. So I graphed the number of SNPs per scaffold against scaffold length. I figured the two should be correlated if the SNPs are more or less randomly distributed, but that there might also be some outliers. I have graphed them separately depending on whether the SNPs are alternate in super clone A or B (just for bookkeeping purposes). Here are the graphs.
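One simple way to put a number on "outlier" in a plot like this is to compare each scaffold's SNP count with what its length alone would predict. The sketch below assumes SNPs are spread evenly at the genome-wide rate and flags scaffolds above an arbitrary fold threshold. The function name, the threshold, and the toy numbers are hypothetical, and this is not necessarily how the graphs were made.

```python
def flag_dense_scaffolds(snp_counts, lengths, fold=2.0):
    """Return scaffolds whose SNP count exceeds `fold` times the length-based expectation."""
    total_snps = sum(snp_counts.values())
    total_bp = sum(lengths[s] for s in snp_counts)
    genome_rate = total_snps / total_bp  # SNPs per bp if SNPs were spread evenly
    flagged = {}
    for scaffold, n in snp_counts.items():
        expected = genome_rate * lengths[scaffold]
        if n > fold * expected:
            # Store observed / expected so the size of the excess is visible.
            flagged[scaffold] = round(n / expected, 2)
    return flagged

# Hypothetical toy data: scaffold lengths in bp and SNP counts per scaffold.
lengths = {"s1": 1_000_000, "s2": 500_000, "s3": 250_000, "s4": 250_000}
snp_counts = {"s1": 500, "s2": 250, "s3": 125, "s4": 1125}

print(flag_dense_scaffolds(snp_counts, lengths))
```

Note that a very dense scaffold inflates the genome-wide rate it is compared against, so a median-based rate or a regression residual may be a more robust choice.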

Alternate in super clone B
Nothing looks like much of an outlier among the SNPs that are alternate in super clone B.

Alternate in super clone A

Here I noticed several scaffolds that looked like they had a high number of SNPs relative to their lengths. I then checked which PA42 scaffolds they BLAST to, and whether those scaffolds are mapped to chromosomes. All three of the scaffolds circled in red map to chromosome 10. I also looked at the PA42 genetic map, and these three scaffolds appear to sit side by side, so they are part of one chunk of chromosome 10. Here is the number of SNPs on each of these scaffolds.

On scaffold 2175 (chr 10) there are 2,543 SNPs over 480,652 bp (roughly 1 SNP every 189 bp).
On scaffold 1982 (chr 10) there are 1,537 SNPs over 273,906 bp (roughly 1 SNP every 178 bp).
On scaffold 1927 (chr 10) there are 1,149 SNPs over 326,645 bp (roughly 1 SNP every 284 bp).

Together they add up to 5,229 SNPs, which is ~19% of the SNPs that are alternate in super clone A and reference in super clone B.
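The arithmetic behind these numbers can be checked in a few lines. The SNP counts and scaffold lengths are the ones listed above. The denominator of 28,237 is an assumption: it is the fixed-difference count from earlier in the post that reproduces the ~19% figure.

```python
# (SNP count, scaffold length in bp) for the three chromosome 10 scaffolds.
scaffolds = {
    "2175": (2543, 480652),
    "1982": (1537, 273906),
    "1927": (1149, 326645),
}
# Assumed denominator: SNPs fixed alternate in super clone A, reference in B.
fixed_alt_in_a = 28237

for name, (n_snps, length_bp) in scaffolds.items():
    print(f"scaffold {name}: 1 SNP every {length_bp / n_snps:.0f} bp")

total = sum(n for n, _ in scaffolds.values())
share = 100 * total / fixed_alt_in_a
print(f"total = {total}, share = {share:.1f}%")
```

This gives spacings of 189, 178, and 284 bp, a total of 5,229 SNPs, and a share of 18.5%, which rounds to the ~19% quoted.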

I was then curious about how these SNPs are distributed along those scaffolds, and how that distribution compares to that of SNPs that are alternate in super clone B and reference in super clone A.

Distribution of SNPs that are alternate in super clone A along scaffold 2175 

Distribution of SNPs that are alternate in super clone B along scaffold 2175 

Distribution of SNPs that are alternate in super clone A along scaffold 1982 

Distribution of SNPs that are alternate in super clone B along scaffold 1982 

Distribution of SNPs that are alternate in super clone A along scaffold 1927 

Distribution of SNPs that are alternate in super clone B along scaffold 1927 

What I take from this is that the SNPs are not evenly distributed across the scaffolds. Some of the gaps, where there are 0 SNPs, are due to strings of Ns in the reference genome, but some of the gaps are real. I am still trying to figure out how best to deal with that. Also, SNPs that are alternate in super clone A are concentrated on these three scaffolds, whereas we see far fewer SNPs that are alternate in super clone B there (despite there being a higher number of alternate super clone B SNPs across the genome as a whole).
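One possible way to separate the N gaps from the real gaps is to count SNPs in windows and divide by the number of non-N bases in each window, so that a window made entirely of Ns is reported as missing rather than as zero density. A minimal sketch, assuming the scaffold sequence is available as a string and SNP positions are 1-based; the function name, window size, and toy sequence are hypothetical.

```python
def window_density(sequence, snp_positions, window=1000):
    """Count SNPs per window and report SNPs per callable (non-N) base.

    sequence: reference scaffold sequence as a string.
    snp_positions: 1-based SNP coordinates on that scaffold.
    Returns a list of (window_start, n_snps, callable_bp, density_or_None).
    """
    n_windows = (len(sequence) + window - 1) // window
    snps = [0] * n_windows
    for pos in snp_positions:
        snps[(pos - 1) // window] += 1
    rows = []
    for i in range(n_windows):
        chunk = sequence[i * window:(i + 1) * window].upper()
        callable_bp = len(chunk) - chunk.count("N")
        # A window with no callable bases has undefined density, not zero.
        density = snps[i] / callable_bp if callable_bp else None
        rows.append((i * window + 1, snps[i], callable_bp, density))
    return rows

# Hypothetical toy scaffold: 20 bp of sequence, a 20 bp run of Ns, 20 bp of sequence.
seq = "ACGT" * 5 + "N" * 20 + "ACGT" * 5
rows = window_density(seq, [3, 7, 45, 50, 58], window=20)
for row in rows:
    print(row)
```

In the toy example the middle window comes back with a density of None, while a window with sequence but no SNPs would come back as 0.0, which is the distinction that matters here.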

I then looked at the frequency of these SNPs in the other super clones. First, as a sanity check, I looked at super clones A and B themselves.

Just as we should see, we find that the SNPs are almost 100% fixed alternate in super clone A:
And almost 100% fixed reference in super clone B:

In super clone C (which is in D10_2016), we see that about 1/3 of the SNPs are present (fixed alternate), while 2/3 are not (fixed reference).

In super clone D, a D Bunk super clone, we find basically none of the SNPs (a small number are at 50% frequency). We see similar distributions for most of the other super clones in D8, D Bunk, D Cat, and D Ramps. Only two super clones show a different distribution.

Super clone H, another D Bunk super clone, has some of the SNPs, though not too many.

Super clone I, also a D Bunk super clone, has more than half of the SNPs.
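The per-super-clone summaries above boil down to computing an alternate-allele frequency at each SNP and then counting how many SNPs are near 0, near 1, or in between. A rough sketch, assuming dosage-coded genotypes (0, 1, 2) for the individuals in one super clone, with None for missing calls; the function names, the tolerance, and the toy genotypes are hypothetical.

```python
def alt_frequency(genotypes):
    """Alternate-allele frequency at one SNP from dosage-coded genotypes (0, 1, 2).

    Missing calls are given as None and skipped; returns None if nothing was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))

def summarize(freqs, tol=0.05):
    """Count SNPs that are near-fixed reference, near-fixed alternate, or intermediate."""
    called = [f for f in freqs if f is not None]
    fixed_ref = sum(f <= tol for f in called)
    fixed_alt = sum(f >= 1 - tol for f in called)
    return {
        "fixed_ref": fixed_ref,
        "fixed_alt": fixed_alt,
        "intermediate": len(called) - fixed_ref - fixed_alt,
        "no_data": len(freqs) - len(called),
    }

# Hypothetical toy genotypes: one row per SNP, one column per individual.
per_snp_genotypes = [
    [2, 2, 2],
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [None, None, None],
    [2, 2, None],
]
freqs = [alt_frequency(g) for g in per_snp_genotypes]
summary = summarize(freqs)
print(summary)
```

A SNP that is heterozygous in every individual of a clonal lineage shows up at a frequency of 0.5, which is how the "small number at 50% frequency" would appear in a summary like this.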

I was also curious about the frequency of these SNPs in the one D Barb individual I have sequence for. I thought maybe these SNPs were coming from D Barb. For about half of the SNPs I don't have sequence. Of the other half, a little over half are present (alternate) in D Barb, while the rest are reference. So I am not sure what this means.

I then checked whether the distribution of the SNPs varied: perhaps the SNPs that are present in D Barb fall only in some portions of the chromosome 10 chunk. But this doesn't really appear to be the case. I see distributions similar to those above for the SNPs fixed between super clones A and B, and the distribution of the SNPs that are fixed alternate in D Barb (so present) doesn't look that different from the distribution of those that are not present in D Barb (reference).

Distribution of D Barb SNPs (alternate) on scaffold 2175

Distribution of D Barb SNPs (reference) on scaffold 2175

Distribution of D Barb SNPs (alternate) on scaffold 1982

Distribution of D Barb SNPs (reference) on scaffold 1982

Distribution of D Barb SNPs (alternate) on scaffold 1927

Distribution of D Barb SNPs (reference) on scaffold 1927

I next wanted to look at the frequency of these SNPs in the ponds, starting with D8 across time.

D8 2012 (pool)

D8 2016 (artificial pool)

D8 2017 April (artificial pool)

D8 2017 May (pool)
In 2012 it looks like some of these SNPs were fixed in the population, some were not, and some were heterozygous, or perhaps present in some clones and not others. In D8 2016 we see very few of these SNPs, which fits with the finding that most of the 2016 clones were related to super clone B; only 1 individual we have sequenced from 2016 looks similar to super clone A. In 2017 we again see the shift from a large percentage of super clone A individuals to mostly B individuals.

Next looking at the other ponds near D8.

D Bunk 2017 April (artificial pool)

D Bunk 2017 May (pool)
This distribution in D Bunk fits with the fact that I had super clones in D Bunk that had none of the SNPs present, and super clones that had some of the SNPs present.

D Oil 2017 May (pool)
The D Oil distribution is interesting in that the SNP frequencies are roughly centered around 0.5.

Next I looked at D10.
D10 2012 (pool)

D10 2016 (artificial pool)
This is interesting. The 2012 D10 pool data look like the D Oil distribution, with frequencies centered around 0.5, whereas in the 2016 (artificial pool) data the SNPs look either mostly fixed or mostly absent. What does this mean?

OK, now looking at the ponds near D8 that are mixed (species-wise) or are a different species.

D Mud 2017 May (pool) - thought to be a mix of the two species.

D Barb 2017 May (pool) - unknown species

D Oak 2017 May (pool) - unknown species (same as D Barb).

Honestly, I am not sure exactly what to make of these.

Finally, I started looking at the more distant ponds. I have only looked at a few so far, but they all looked like this:
So they actually look pretty similar to the D Barb/D Oak distributions. Why is this? Is it because a set of these alternate SNPs in super clone A are actually old SNPs that were present before these species diverged, and were then lost in super clone B (and in the reference genome clone)? Or is something else going on?

The conclusion of all this is that I think something interesting is going on with chromosome 10. There appears to be elevated divergence on chromosome 10, with a large number of SNPs fixed between super clones A and B. But where did these SNPs come from? What is their evolutionary history? What are these SNPs doing? Are they in genes? Could these SNPs explain some of the functional differences between super clones A and B?







