An R test
These simple questions are designed to test applicants on their expertise in R, self-learning, working under pressure, presenting their results, and explaining their work. If you get help from other people or the internet, please include the person's name and the link to the web page in your answers.
Write a recursive function that computes n!.
Deliverables: The source code of the function.
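For orientation only, a recursive definition follows the identity n! = n * (n-1)! with 0! = 1. The sketch below is one possible shape of such a function; the name fact() and the input checks are assumptions, not part of the test.

```r
# Minimal sketch: recursive factorial for a single non-negative whole number.
# The name 'fact' is hypothetical; base R already provides factorial().
fact <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n == floor(n))
  if (n <= 1) {
    return(1)          # base case: 0! = 1! = 1
  }
  n * fact(n - 1)      # recursive step: n! = n * (n - 1)!
}

fact(5)
```

Deep recursion is limited by R's expression nesting limit, so this form suits small n only.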
    library(limma)
    library(GEOquery)
    gset <- getGEO("GSE59259", GSEMatrix = TRUE)
    type <- c(rep("N", 8), rep("H", 8))
    des <- ????
    Data <- log(exprs(gset[[1]]) + 0.001)
    fit <- lmFit(Data, des)
    ## ! ...
- The rest of the script follows the instruction in Section 3.2 (Sample limma Session) with appropriate modifications.
- Your script should save the list of differentially expressed (DE) genes in CSV format in a file named “de.csv”. The output file should have 2 columns: gene names and the corresponding adjusted p-values (see the “adj.P.Val” column of the top table).
- Use the pheatmap function to plot the expression of the top 5 DE genes.
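The two-column output described above can be sketched with base R alone. In this illustration the data frame tt is a hypothetical stand-in for a limma top table, the gene names are made up, and the 0.05 cutoff is an assumption, since the test does not fix one.

```r
# Hypothetical stand-in for a limma top table: gene names are the row names
# and "adj.P.Val" holds the adjusted p-values.
tt <- data.frame(
  logFC     = c(2.1, -1.8, 1.5, 0.2),
  adj.P.Val = c(0.001, 0.004, 0.03, 0.6),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Keep genes below an assumed cutoff of 0.05.
de <- tt[tt$adj.P.Val < 0.05, , drop = FALSE]

# Build the two required columns and write them without row names.
out <- data.frame(gene = rownames(de), adj.P.Val = de$adj.P.Val)
write.csv(out, "de.csv", row.names = FALSE)
```

Passing row.names = FALSE keeps the file to exactly two columns; otherwise write.csv() adds a third column of row labels.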
Deliverables: The number of DE genes, the de.csv file, your script, heatmap.png, and the approximate number of hours it took you to do the test. Additionally, write a short description of what you did in 5-10 sentences. The description should indicate your proficiency in English writing. To test your ability to communicate with a biologist who has little or no background in programming, write the summary of your results in a paragraph titled “conclusion”.
- You need to be able to orally explain all parts of your script including the above lines. Be prepared to explain the input, output, and process done by each function you use.
- If you are not familiar with gene expression and have little idea what limma does, try to learn the needed concepts from the web, e.g., Wikipedia. You can use information from tutorials, books, papers, experts in the field, your friends, etc. However, make sure you can explain your work thoroughly.
- Follow the suggested file formats on the members' page.
A) Using the maftools and TCGAbiolinks packages, determine the 3 most frequently mutated genes in liver cancer. Which of these 3 mutations is the most predictive of survival? To answer this question, write a function that takes a gene name as input and saves KM plots in png format. Add the p-value as a legend in the plot. Deliverables are similar to question 2.
B) Let's define the impact of a set of genes to be the p-value of a log-rank test that determines whether that gene set is associated with survival. Specifically, the null hypothesis of the log-rank test is that when all of these genes are mutated together, the survival does not change. That is, we compare the cases that have mutations in all of these genes with the rest of the cases to compute a p-value. A small p-value indicates a significant difference between the two survival curves (KM plots) corresponding to these two groups.
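As a hedged illustration of the impact defined above, the sketch below computes a log-rank p-value with survdiff() from the survival package on a small made-up data set. The data frame toy, its column names, and the helper name impact() are hypothetical; real input would come from the clinical and mutation data.

```r
library(survival)

# Made-up data: 'time' is follow-up time, 'status' is 1 for death and 0 for
# censored, 'all_mutated' is TRUE when every gene in the set is mutated.
toy <- data.frame(
  time        = c(5, 8, 12, 20, 25, 30, 42, 50, 61, 75),
  status      = c(1, 1, 1, 1, 0, 1, 0, 1, 0, 0),
  all_mutated = c(TRUE, TRUE, TRUE, TRUE, FALSE,
                  TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Hypothetical helper: log-rank p-value comparing the two groups.
impact <- function(d) {
  fit <- survdiff(Surv(time, status) ~ all_mutated, data = d)
  # The log-rank statistic is chi-squared with (number of groups - 1) df.
  pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
}

p <- impact(toy)
print(p)
```

The test is only defined when both groups are non-empty, so a gene set that is never mutated together in any case needs separate handling.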
Write a function most.impact() that takes as input two integers, k1 and n1, and, among the n1 most frequently mutated genes, finds the k1 genes with the best impact. Your function should return the names of the best k1 genes (i.e., the set of genes with the best log-rank p-value) and also their impact. Run your function for k1=3 and n1=3, 10, and 100. What is the biological interpretation of your results?
Hint: Solution 1: Use the utils::c?m?n() function, where you need to guess the question marks.
Solution 2: Use another R function that uses utils::c?m?n().
Deliverables are similar to question 2. In addition, you need to guess the above question marks and copy the line of code on which c?m?n() is used into a short paragraph titled “Question marks”.
Bonus: Implement the utils::c?m?n() function yourself using dynamic programming. Compare the running time of your implementation with that of the utils implementation, using inputs large enough to require at least a couple of minutes.
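A running-time comparison of this kind can be set up with system.time(). The sketch below only illustrates the timing pattern: my_cumsum() and the built-in cumsum() are stand-ins chosen for brevity and are unrelated to the function in the bonus question, and the input size is an arbitrary assumption.

```r
# Hypothetical stand-in for "your implementation": a hand-written loop.
my_cumsum <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total
  }
  out
}

x <- as.numeric(seq_len(1e6))

# Time both versions on the same input and keep their results.
t_mine <- system.time(res_mine <- my_cumsum(x))
t_base <- system.time(res_base <- cumsum(x))

# Confirm the two implementations agree before comparing their speed.
stopifnot(isTRUE(all.equal(res_mine, res_base)))

# Elapsed wall-clock seconds for each version.
print(c(mine = t_mine[["elapsed"]], base = t_base[["elapsed"]]))
```

Checking that both results match before reading the timings avoids comparing a fast but incorrect implementation against a correct one.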