Figuring out how to map outgroups
Trying to decide how to deal with mapping the more distant species. Specifically Obtusa and Simocephalus.
A couple of different considerations.
1) Obtusa is not that divergent from Pulex. Simocephalus is VERY divergent. So maybe divergent mapping and using a different mapper is more of a concern for Simocephalus than Obtusa.
2) What are we using the outgroups for? Why do we want to map them? For Obtusa, which is less divergent, we have two reasons. 1) We want to use them as an outgroup to polarize SNPs in the Pulex dataset. 2) We want to construct a pseudoreference genome for Obstusa to use for competitive mapping with pooled data.
Where I am right now: I looked at the input fastq file size, the final bam file size, and the number of reads mapped in that final bam (using samtools flagstat), and compared 12 Obtusa samples and 16 Pulex samples from the same plate of libraries. As you can see in the graph below, there is a positive correlation between incoming fastq size and number of reads mapped (as expected). The slope of the Obtusa correlation is shallower than that of the Pulex, which is also to be expected since we know Obstusa is divergent, and it will naturally not map as well.
Seeing this, I was a bit concerned and tried a different mapper, NextGenMap, on a couple of Obtusa samples to see how it compared. However, I was dismayed to find that the final output bam files are even smaller and have substantially fewer (only 30% compared to original mapping pipeline) reads mapped. So, it doesn't seem like it is working very well. It could be a matter of optimizing parameters, but I am not sure.
Also, in hindsight, when looking at the graph below, the Obtusa aren't doing too much worse than the Pulex. And if we primarily want to use them as an outgroup for polarizing SNPs and for making the pseudoreference for competitive mapping, maybe the mapping is of sufficient quality. I am thinking that it is ok to use the current mapping pipeline for Obtusa for our purposes. Do you agree?
Now, the two Simocephalus individuals are an entirely different matter. They had substantially fewer reads map, like an order of magnitude. So clearly bwa-mem is not working for them. But what exactly do we want the Simo data for? I can work on finding a better mapper for them, like trying Stampy. But I don't think the Simo data is going to be holding us up the same way as Obtusa.
So here is what I propose:
1) We move ahead with the bwa-mem original mapping pipeline for Obtusa.
2) I use that data to make a VCF that combines Pulex and Obtusa so we have a VCF that contains the outgroup and all the Pulex samples.
3) I also make a VCF using just the Obtusa samples to then construct the pseudoreference Obtusa genome for the pooled mapping.
4) I work on finding a mapping solution for Simo, but that is lower priority.
Let me know if you agree/don't agree or have any comments.

Comments
Post a Comment