This is part of the "DNA Fingerprint" section of the Computational Biology course. The mechanics of how you'd apply the chi-square test in the scenario aren't explained, and I would have liked more information on it since it's such a useful tool in statistical analysis.
The scenario is:
- You are given 2 arrays: one with gene sequences from 10 cats with polydactyly (cases) and the other with gene sequences from 10 normal cats (controls). Each gene sequence is 113 bases long. How would you go about conducting a chi-square test to find the position and nucleotide change that is the most likely cause of polydactyly? This is what I'd like to discuss.
Read on if you'd like to understand the mechanics of the code from Brilliant.
- Sum up the number (frequency) of A, T, G and C nucleotides at each position, separately for the cases and for the controls. This produces 2 arrays, one for cases and one for controls. Each is a 2D NumPy array with 4 rows (one per nucleotide), and each row holds 113 frequencies (one per position). Each array is of the form:
[[113 frequencies of nucleotide A], [113 frequencies of nucleotide T], [113 frequencies of nucleotide G], [113 frequencies of nucleotide C]]
- Take the values for a given nucleotide and position from the cases array and from the controls array, and pass them to the "chisquare" function (from "scipy.stats") as a single array, as follows:
n, p = chisquare([cases[nucleotide, position], controls[nucleotide, position]])
- It returns the chi-square statistic as "n" and the p-value as "p".
- After looking at the documentation for the "chisquare" function, I found that when only a single array is passed in, the function uses the mean of the values in that array as the expected frequency for every category. Surely this cannot be correct, since you would want to use cases minus controls (observed minus expected) to calculate the chi-square statistic?
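The steps above can be sketched roughly as follows. This is a minimal illustration and not the course's actual code: the helper `count_nucleotides`, the constant `NUCLEOTIDES`, and the toy sequences are all made up for the example (4 short sequences per group instead of 10 sequences of 113 bases). The last few lines check by hand what "chisquare" does when it is given only the observed counts: the expected value for each entry is the mean of the two counts.

```python
import numpy as np
from scipy.stats import chisquare

# Row order assumed to match the array layout described above (A, T, G, C).
NUCLEOTIDES = "ATGC"


def count_nucleotides(sequences):
    """Return a 4 x L array of counts: one row per nucleotide, one column per position."""
    length = len(sequences[0])
    counts = np.zeros((len(NUCLEOTIDES), length), dtype=int)
    for seq in sequences:
        for pos, base in enumerate(seq):
            counts[NUCLEOTIDES.index(base), pos] += 1
    return counts


# Toy data, invented for illustration only.
case_seqs = ["ATG", "ATG", "ATG", "ACG"]
control_seqs = ["ACG", "ACG", "ACG", "ATG"]

cases = count_nucleotides(case_seqs)
controls = count_nucleotides(control_seqs)

# Compare how often T appears at position 1 in cases versus controls.
nucleotide, position = NUCLEOTIDES.index("T"), 1
observed = np.array([cases[nucleotide, position], controls[nucleotide, position]])

# Same call as in the post: a single array, no explicit expected frequencies.
n, p = chisquare(observed)

# Reproduce the statistic by hand, using the mean of the observed counts
# as the expected frequency for both entries.
expected = np.full(observed.shape, observed.mean())
manual_n = ((observed - expected) ** 2 / expected).sum()

print(observed, n, p, manual_n)
```

Here the observed counts are 3 (cases) and 1 (controls), the expected value for both is 2, and the hand-computed statistic matches the one returned by "chisquare". Read this way, the single-array call tests whether the count is split evenly between the two groups, which seems to rely on both groups having the same number of cats. Note that a nucleotide that never occurs at a position in either group gives an expected value of 0, so such entries would need to be skipped when looping over all positions.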
Comments
@Samarth Satish hi Samarth, you're right, we basically employ the χ2 test in computational biology but don't stop to explain it. As it happens, our new course on statistical methods is set to publish in the next few months and it treats the t-test, the χ2 test, and ANOVA from the ground up. I can email you when it's released if you like.
@Josh Silverman. That's good to hear. I'll be sure to jump on that course to clear my doubts. Thank you very much for offering to email me, but I've selected the option to notify me on the course page itself.